Embeddings
geneva.udfs.embeddings.sentence_transformer_udf
sentence_transformer_udf(
    model: str = DEFAULT_SENTENCE_TRANSFORMER_MODEL,
    column: str = DEFAULT_SENTENCE_TRANSFORMER_COLUMN,
    normalize: bool = True,
    num_gpus: float = 0.0,
    trust_remote_code: bool = False,
    dimension: int | None = None,
) -> UDF
Return a stateful sentence-transformers embedding UDF.
Parameters:
- model (str, default: DEFAULT_SENTENCE_TRANSFORMER_MODEL) – The model used for embedding. By default, it uses sentence-transformers/all-MiniLM-L6-v2 from the Hugging Face Hub.
- column (str, default: DEFAULT_SENTENCE_TRANSFORMER_COLUMN) – Name of the column to embed. By default, it uses text.
- normalize (bool, default: True) – Whether to L2-normalize the generated embeddings. Defaults to True.
- num_gpus (float, default: 0.0) – Fractional GPU allocation requested for the UDF. The default of 0 keeps execution on CPU; positive values request CUDA.
- trust_remote_code (bool, default: False) – Whether to trust remote code when loading the model. Defaults to False, as recommended by sentence-transformers.
- dimension (int | None, default: None) – Optional pre-specified embedding dimension. If None (default), the model is loaded eagerly to determine the dimension. If provided, model loading is deferred until UDF execution. Use this for lazy loading when the model is not available at UDF definition time (e.g., in manifest upload scripts).
Returns:
- UDF – A UDF instance that can be registered with a Geneva dataset.
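
As a rough illustration of the normalize and dimension parameters, the sketch below shows what L2 normalization does to a vector and how the UDF might be built with lazy loading. The helper names l2_normalize and make_lazy_udf are hypothetical, the zero-vector handling is an assumption rather than documented library behavior, and make_lazy_udf is only defined here, not called, because it needs geneva installed.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit Euclidean length.

    Assumption: zero vectors are returned unchanged; the library may
    handle that case differently.
    """
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def make_lazy_udf():
    """Build the UDF without loading the model at definition time.

    Requires geneva to be installed; the import is kept local so the rest
    of this sketch runs without it.
    """
    from geneva.udfs.embeddings import sentence_transformer_udf

    return sentence_transformer_udf(
        model="sentence-transformers/all-MiniLM-L6-v2",
        column="text",
        normalize=True,
        # all-MiniLM-L6-v2 produces 384-dimensional vectors; passing the
        # dimension defers model loading until the UDF executes.
        dimension=384,
    )


unit = l2_normalize([3.0, 4.0])
print(unit)  # [0.6, 0.8]
```

After normalization every non-zero vector has length 1, so a dot product between two embeddings equals their cosine similarity.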
geneva.udfs.embeddings.gemini_embedding_udf
gemini_embedding_udf(
    column: str = "text",
    model: str = DEFAULT_GEMINI_EMBEDDING_MODEL,
    task_type: str | None = None,
    output_dimensionality: int | None = None,
    normalize: bool = False,
    api_key_env: str = "GEMINI_API_KEY",
    version: str | None = None,
    dimension: int | None = None,
) -> UDF
Return a Gemini embedding UDF with the API key captured at call time.
The API key is read from os.environ[api_key_env] at call time and
serialized with the UDF. On remote workers the key is available without
cluster-level env_vars configuration.
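
The capture-at-call-time behavior described above can be pictured with a plain closure. This is a conceptual sketch only, not Geneva's implementation: make_embedder, describe, and the DEMO_GEMINI_KEY variable are hypothetical names, and the inner function merely stands in for a real API request.

```python
import os


def make_embedder(api_key_env="GEMINI_API_KEY"):
    # The key is read once, when the factory is called.
    # os.environ[...] raises KeyError if the variable is unset.
    api_key = os.environ[api_key_env]

    def describe(text):
        # Stand-in for an API request; it uses the captured key, not the
        # environment of whatever process ends up running it.
        return f"embed {text!r} with key ending in {api_key[-4:]}"

    return describe


os.environ["DEMO_GEMINI_KEY"] = "placeholder-key-1234"
embedder = make_embedder("DEMO_GEMINI_KEY")

# Removing the variable afterwards does not affect the captured key.
del os.environ["DEMO_GEMINI_KEY"]
print(embedder("hello"))  # embed 'hello' with key ending in 1234
```

One consequence worth keeping in mind: if the key travels with the serialized UDF, anything that can read the serialized UDF can presumably read the key as well.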
Parameters:
- column (str, default: 'text') – Name of the input column containing text to embed. Defaults to "text".
- model (str, default: DEFAULT_GEMINI_EMBEDDING_MODEL) – Gemini embedding model identifier (default gemini-embedding-001).
- task_type (str | None, default: None) – Optional task-type hint for the embedding model. One of RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING, QUESTION_ANSWERING, FACT_VERIFICATION. If None, the API default is used.
- output_dimensionality (int | None, default: None) – Optional reduced output dimensionality. When specified, the API returns truncated embeddings (Matryoshka Representation Learning). If None, the model's full dimensionality is used (768 for gemini-embedding-001).
- normalize (bool, default: False) – Whether to L2-normalize the embeddings. Defaults to False because Gemini embedding models return pre-normalized vectors.
- api_key_env (str, default: 'GEMINI_API_KEY') – Environment variable that holds the API key (default GEMINI_API_KEY).
- version (str | None, default: None) – Explicit version string for the UDF so that key rotation does not change the UDF hash and trigger a re-backfill.
- dimension (int | None, default: None) – Optional pre-specified embedding dimension. If None (default), the dimension is looked up from a built-in table of known models (or determined from output_dimensionality if set). If provided, model loading is deferred until UDF execution.
Returns:
- UDF – A UDF instance ready to be registered with a Geneva dataset.
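
The interplay between dimension, output_dimensionality, and the built-in model table can be sketched as a small resolution function. This is an illustration under stated assumptions, not the library's code: resolve_dimension, KNOWN_DIMENSIONS, and make_gemini_udf are hypothetical names, the table holds only the one value quoted in this reference, and raising ValueError for unknown models is a guess at reasonable behavior. make_gemini_udf is defined but not called, since it needs geneva installed and an API key in the environment.

```python
# Assumption: a one-entry stand-in for the built-in table of known models,
# using the value stated in this reference.
KNOWN_DIMENSIONS = {"gemini-embedding-001": 768}


def resolve_dimension(model, dimension=None, output_dimensionality=None):
    """Pick the embedding dimension without calling the API."""
    if dimension is not None:
        # An explicit dimension is used as given.
        return dimension
    if output_dimensionality is not None:
        # Truncated embeddings have the requested length.
        return output_dimensionality
    try:
        return KNOWN_DIMENSIONS[model]
    except KeyError:
        # Assumption: unknown models require an explicit dimension.
        raise ValueError(
            f"unknown model {model!r}; pass dimension explicitly"
        ) from None


def make_gemini_udf():
    """Example call; requires geneva and GEMINI_API_KEY in the environment."""
    from geneva.udfs.embeddings import gemini_embedding_udf

    return gemini_embedding_udf(
        column="text",
        task_type="RETRIEVAL_DOCUMENT",
        output_dimensionality=256,
        version="v1",  # hypothetical version label
    )


print(resolve_dimension("gemini-embedding-001"))  # 768
print(resolve_dimension("gemini-embedding-001", output_dimensionality=256))  # 256
```

Pinning version in the call is meant to mirror the note above: the UDF hash then stays stable when the captured API key is rotated.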