Embeddings

geneva.udfs.embeddings.sentence_transformer_udf

sentence_transformer_udf(
    model: str = DEFAULT_SENTENCE_TRANSFORMER_MODEL,
    column: str = DEFAULT_SENTENCE_TRANSFORMER_COLUMN,
    normalize: bool = True,
    num_gpus: float = 0.0,
    trust_remote_code: bool = False,
    dimension: int | None = None,
) -> UDF

Return a stateful sentence-transformers embedding UDF.

Parameters:

  • model (str, default: DEFAULT_SENTENCE_TRANSFORMER_MODEL ) –

    The model used for embedding. By default, it uses sentence-transformers/all-MiniLM-L6-v2 from the Hugging Face Hub.

  • column (str, default: DEFAULT_SENTENCE_TRANSFORMER_COLUMN ) –

    Name of the input column containing the text to embed. Defaults to text.

  • normalize (bool, default: True ) –

    Whether to L2-normalise the generated embeddings. Defaults to True.

  • num_gpus (float, default: 0.0 ) –

    Fractional GPU allocation requested for the UDF. Must be >= 0. The default, 0.0, keeps execution on CPU; positive values request CUDA.

  • trust_remote_code (bool, default: False ) –

    Whether to trust remote code when loading the model. Defaults to False as recommended by sentence-transformers.

  • dimension (int | None, default: None ) –

    Optional pre-specified embedding dimension. If None (default), the model is loaded eagerly to determine the dimension. If provided, model loading is deferred until UDF execution. Use this for lazy loading when the model is not available at UDF definition time (e.g., in manifest upload scripts).

Returns:

  • UDF

    A UDF instance that can be registered with a Geneva dataset.
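
As an illustration of the lazy-loading path described for dimension, the following is a minimal sketch rather than a definitive recipe. It passes only the parameters listed in the signature above. The helper name build_minilm_udf is hypothetical, and the value 384 is the known output size of all-MiniLM-L6-v2; for any other model the dimension must be supplied by the caller.

```python
def build_minilm_udf():
    """Build the embedding UDF without loading the model at definition time."""
    # Imported inside the function so this module can still be imported
    # in environments where geneva is not installed.
    from geneva.udfs.embeddings import sentence_transformer_udf

    return sentence_transformer_udf(
        model="sentence-transformers/all-MiniLM-L6-v2",
        column="text",
        normalize=True,
        num_gpus=0.0,  # 0.0 keeps execution on CPU
        # Supplying the dimension defers model loading until UDF execution.
        dimension=384,  # all-MiniLM-L6-v2 produces 384-dimensional vectors
    )
```

Calling build_minilm_udf() requires geneva to be installed. Registering the returned UDF with a dataset is left out because this page does not describe that step.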

geneva.udfs.embeddings.gemini_embedding_udf

gemini_embedding_udf(
    column: str = "text",
    model: str = DEFAULT_GEMINI_EMBEDDING_MODEL,
    task_type: str | None = None,
    output_dimensionality: int | None = None,
    normalize: bool = False,
    api_key_env: str = "GEMINI_API_KEY",
    version: str | None = None,
    dimension: int | None = None,
) -> UDF

Return a Gemini embedding UDF with the API key captured at call time.

The API key is read from os.environ[api_key_env] when gemini_embedding_udf is called, and it is serialized with the UDF. Remote workers therefore have the key available without any cluster-level env_vars configuration.
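
The capture-then-serialize behaviour can be pictured with a small standalone sketch. It does not use Geneva's actual serialization; capture_udf_state and restore_on_worker are hypothetical helpers, and a plain dict stands in for the UDF's state. The point it illustrates is that the environment lookup happens once, on the machine that builds the UDF.

```python
import os
import pickle


def capture_udf_state(api_key_env: str = "GEMINI_API_KEY") -> bytes:
    """Read the key now and return serialized state, as a UDF factory might."""
    state = {"api_key": os.environ[api_key_env], "api_key_env": api_key_env}
    return pickle.dumps(state)


def restore_on_worker(payload: bytes) -> dict:
    """Rebuild the state on a worker; no environment lookup happens here."""
    return pickle.loads(payload)


os.environ["GEMINI_API_KEY"] = "example-key"  # placeholder value for the demo
payload = capture_udf_state()

# Simulate a remote worker that never had the variable set.
del os.environ["GEMINI_API_KEY"]
worker_state = restore_on_worker(payload)
print(worker_state["api_key"])
```

One consequence of this design is that the serialized payload contains the key itself, so it deserves the same care as the key.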

Parameters:

  • column (str, default: 'text' ) –

    Name of the input column containing text to embed. Defaults to "text".

  • model (str, default: DEFAULT_GEMINI_EMBEDDING_MODEL ) –

    Gemini embedding model identifier (default gemini-embedding-001).

  • task_type (str | None, default: None ) –

    Optional task-type hint for the embedding model. One of RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING, QUESTION_ANSWERING, FACT_VERIFICATION. If None, the API default is used.

  • output_dimensionality (int | None, default: None ) –

    Optional reduced output dimensionality. When specified the API returns truncated embeddings (Matryoshka Representation Learning). If None, the model's full dimensionality is used (768 for gemini-embedding-001).

  • normalize (bool, default: False ) –

    Whether to L2-normalise the embeddings. Defaults to False because Gemini embedding models return pre-normalized vectors.

  • api_key_env (str, default: 'GEMINI_API_KEY' ) –

    Environment variable that holds the API key (default GEMINI_API_KEY).

  • version (str | None, default: None ) –

    Explicit version string for the UDF. Setting it keeps the UDF hash stable across key rotation, so rotating the key does not trigger a re-backfill.

  • dimension (int | None, default: None ) –

    Optional pre-specified embedding dimension. If None (default), the dimension is looked up from a built-in table of known models (or determined from output_dimensionality if set). If provided, model loading is deferred until UDF execution.
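
The interaction between output_dimensionality and normalize is easier to see with numbers. The sketch below is plain Python and calls neither Geneva nor the Gemini API; l2_normalize and truncate are hypothetical helpers. It shows that cutting a unit-length vector down to its leading components generally leaves a vector shorter than unit length. Whether the API renormalizes truncated output is not stated here, so treat this as the arithmetic only.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale vec to unit Euclidean length; the zero vector is returned as-is."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def truncate(vec: list[float], dim: int) -> list[float]:
    """Keep only the first dim components of vec."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    return vec[:dim]


def length(vec: list[float]) -> float:
    """Euclidean length of vec."""
    return math.sqrt(sum(x * x for x in vec))


full = l2_normalize([3.0, 4.0, 12.0])  # original length is 13
short = truncate(full, 2)              # leading two components only
renormalized = l2_normalize(short)     # back to unit length
```

Here full has length 1, short has length 5/13, and renormalized is again unit length, which is the effect normalize=True would have on a vector that arrives unnormalized.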

Returns:

  • UDF

    A UDF instance ready to be registered with a Geneva dataset.
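
A possible way to call the factory, using only the parameters documented above, is sketched below. The helper name build_gemini_udf, the version string "v1", and the choice of 256 dimensions are illustrative assumptions, not recommendations from the library.

```python
def build_gemini_udf():
    """Build a Gemini embedding UDF for document retrieval."""
    # Imported inside the function so this module can still be imported
    # in environments where geneva is not installed.
    from geneva.udfs.embeddings import gemini_embedding_udf

    # The environment variable named by api_key_env must be set when this
    # function runs, because the key is read at call time.
    return gemini_embedding_udf(
        column="text",
        task_type="RETRIEVAL_DOCUMENT",
        output_dimensionality=256,  # illustrative reduced size
        normalize=True,             # renormalize the reduced vectors
        api_key_env="GEMINI_API_KEY",
        version="v1",               # hypothetical label, kept fixed across key rotation
    )
```

Leaving model at its default selects gemini-embedding-001. Calling build_gemini_udf() requires geneva to be installed and GEMINI_API_KEY to be present in the environment.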