Skip to content

OpenAI

geneva.udfs.openai.openai_embedding_udf

openai_embedding_udf(
    column: str = "text",
    model: str = DEFAULT_OPENAI_EMBEDDING_MODEL,
    output_dimensionality: int | None = None,
    normalize: bool = False,
    api_key_env: str = "OPENAI_API_KEY",
    version: str | None = None,
    dimension: int | None = None,
) -> UDF

Return an OpenAI embedding UDF with the API key captured at call time.

The API key is read from os.environ[api_key_env] at call time and serialized with the UDF. On remote workers the key is available without cluster-level env_vars configuration.

Parameters:

  • column (str, default: 'text' ) –

    Name of the input column containing text to embed. Defaults to "text".

  • model (str, default: DEFAULT_OPENAI_EMBEDDING_MODEL ) –

    OpenAI embedding model identifier (default text-embedding-3-small).

  • output_dimensionality (int | None, default: None ) –

    Optional reduced output dimensionality. When specified the API returns truncated embeddings (only supported by text-embedding-3-* models). If None, the model's full dimensionality is used.

  • normalize (bool, default: False ) –

    Whether to L2-normalise the embeddings. Defaults to False because OpenAI embedding models return pre-normalized vectors.

  • api_key_env (str, default: 'OPENAI_API_KEY' ) –

    Environment variable that holds the API key (default OPENAI_API_KEY).

  • version (str | None, default: None ) –

    Explicit version string for the UDF so that key rotation does not change the UDF hash and trigger a re-backfill.

  • dimension (int | None, default: None ) –

    Optional pre-specified embedding dimension. If None (default), the dimension is looked up from a built-in table of known models (or determined from output_dimensionality if set). If provided, model loading is deferred until UDF execution.

  • Requires

    pip install 'geneva[udf-text-openai]'

Returns:

  • UDF

    A UDF instance ready to be registered with a Geneva dataset.

Examples:

Embed text documents:

>>> udf = openai_embedding_udf(column="body")
>>> table.add_columns({"embedding": udf})

Use a reduced dimensionality:

>>> udf = openai_embedding_udf(
...     column="body",
...     output_dimensionality=256,
... )

geneva.udfs.openai.openai_udf

openai_udf(
    column: str,
    prompt: str,
    model: str = "gpt-5-mini",
    mime_type: str | None = None,
    api_key_env: str = "OPENAI_API_KEY",
    version: str | None = None,
) -> UDF

Return an OpenAI Chat Completions UDF with the API key captured at call time.

The API key is read from os.environ[api_key_env] at call time and serialized with the UDF. On remote workers the key is available without cluster-level env_vars configuration.

Supports both text and binary (e.g. image) columns. For text columns the prompt is prepended to each value. For binary columns the raw bytes are sent as a base64 image_url content part alongside the prompt. The column type is detected at runtime from the Arrow array; pass mime_type when the column contains binary data.

Parameters:

  • column (str) –

    Name of the input column.

  • prompt (str) –

    Instruction sent to OpenAI alongside each row's value.

  • model (str, default: 'gpt-5-mini' ) –

    OpenAI model identifier (default gpt-5-mini).

  • mime_type (str | None, default: None ) –

    MIME type for binary columns. Required when the input column contains binary data; ignored for text columns.

    Supported types:

    • Imageimage/jpeg, image/png, image/webp, image/gif (docs <https://platform.openai.com/docs/guides/images-vision>_)
  • api_key_env (str, default: 'OPENAI_API_KEY' ) –

    Environment variable that holds the API key (default OPENAI_API_KEY).

  • version (str | None, default: None ) –

    Explicit version string for the UDF so that key rotation does not change the UDF hash and trigger a re-backfill.

  • Requires

    pip install 'geneva[udf-text-openai]'

Returns:

  • UDF

    A UDF instance ready to be registered with a Geneva dataset.

Examples:

Caption images with a one-sentence description:

>>> udf = openai_udf(
...     column="image",
...     prompt="Provide a 1 sentence description of the scene",
...     mime_type="image/jpeg",
... )
>>> table.add_columns({"caption": udf})

Summarise text documents:

>>> udf = openai_udf(
...     column="body",
...     prompt="Summarise this document in 3 bullet points",
... )