OpenAI
geneva.udfs.openai.openai_embedding_udf
openai_embedding_udf(
column: str = "text",
model: str = DEFAULT_OPENAI_EMBEDDING_MODEL,
output_dimensionality: int | None = None,
normalize: bool = False,
api_key_env: str = "OPENAI_API_KEY",
version: str | None = None,
dimension: int | None = None,
) -> UDF
Return an OpenAI embedding UDF with the API key captured at call time.
The API key is read from os.environ[api_key_env] at call time and
serialized with the UDF. On remote workers the key is available without
cluster-level env_vars configuration.
Parameters:
-
column(str, default:'text') –Name of the input column containing text to embed. Defaults to
"text". -
model(str, default:DEFAULT_OPENAI_EMBEDDING_MODEL) –OpenAI embedding model identifier (default
text-embedding-3-small). -
output_dimensionality(int | None, default:None) –Optional reduced output dimensionality. When specified the API returns truncated embeddings (only supported by
text-embedding-3-*models). If None, the model's full dimensionality is used. -
normalize(bool, default:False) –Whether to L2-normalise the embeddings. Defaults to
Falsebecause OpenAI embedding models return pre-normalized vectors. -
api_key_env(str, default:'OPENAI_API_KEY') –Environment variable that holds the API key (default
OPENAI_API_KEY). -
version(str | None, default:None) –Explicit version string for the UDF so that key rotation does not change the UDF hash and trigger a re-backfill.
-
dimension(int | None, default:None) –Optional pre-specified embedding dimension. If None (default), the dimension is looked up from a built-in table of known models (or determined from output_dimensionality if set). If provided, model loading is deferred until UDF execution.
-
Requires–pip install 'geneva[udf-text-openai]'
Returns:
-
UDF–A UDF instance ready to be registered with a Geneva dataset.
Examples:
Embed text documents:
Use a reduced dimensionality:
geneva.udfs.openai.openai_udf
openai_udf(
column: str,
prompt: str,
model: str = "gpt-5-mini",
mime_type: str | None = None,
api_key_env: str = "OPENAI_API_KEY",
version: str | None = None,
) -> UDF
Return an OpenAI Chat Completions UDF with the API key captured at call time.
The API key is read from os.environ[api_key_env] at call time and
serialized with the UDF. On remote workers the key is available without
cluster-level env_vars configuration.
Supports both text and binary (e.g. image) columns. For text columns
the prompt is prepended to each value. For binary columns the raw bytes
are sent as a base64 image_url content part alongside the prompt.
The column type is detected at runtime from the Arrow array; pass
mime_type when the column contains binary data.
Parameters:
-
column(str) –Name of the input column.
-
prompt(str) –Instruction sent to OpenAI alongside each row's value.
-
model(str, default:'gpt-5-mini') –OpenAI model identifier (default
gpt-5-mini). -
mime_type(str | None, default:None) –MIME type for binary columns. Required when the input column contains binary data; ignored for text columns.
Supported types:
- Image —
image/jpeg,image/png,image/webp,image/gif(docs <https://platform.openai.com/docs/guides/images-vision>_)
- Image —
-
api_key_env(str, default:'OPENAI_API_KEY') –Environment variable that holds the API key (default
OPENAI_API_KEY). -
version(str | None, default:None) –Explicit version string for the UDF so that key rotation does not change the UDF hash and trigger a re-backfill.
-
Requires–pip install 'geneva[udf-text-openai]'
Returns:
-
UDF–A UDF instance ready to be registered with a Geneva dataset.
Examples:
Caption images with a one-sentence description:
>>> udf = openai_udf(
... column="image",
... prompt="Provide a 1 sentence description of the scene",
... mime_type="image/jpeg",
... )
>>> table.add_columns({"caption": udf})
Summarise text documents: