sign_language_translator.models.text_embedding package

Submodules

Module contents

class sign_language_translator.models.text_embedding.TextEmbeddingModel[source]

Bases: ABC

Abstract class for text embedding models.

embed(text: str) -> torch.Tensor: Embeds text into a vector.

abstract embed(text: str) → Tensor[source]

Embeds text into a vector.

Parameters: text (str) – Text to embed.
Returns: A vector representation of the text.
Return type: torch.Tensor
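
The abstract-method contract described above can be illustrated with a minimal sketch. It assumes nothing about the library internals: TextEmbedder and LengthEmbedder are hypothetical stand-ins (not part of sign_language_translator), and a plain list of floats takes the place of torch.Tensor so the snippet runs with the standard library alone.

```python
from abc import ABC, abstractmethod
from typing import List


class TextEmbedder(ABC):
    """Hypothetical stand-in for TextEmbeddingModel (assumption, not the real class)."""

    @abstractmethod
    def embed(self, text: str) -> List[float]:
        """Embed text into a vector."""


class LengthEmbedder(TextEmbedder):
    """Toy subclass: the 'embedding' is just [character count, token count]."""

    def embed(self, text: str) -> List[float]:
        tokens = text.split()
        return [float(len(text)), float(len(tokens))]


def can_instantiate(cls) -> bool:
    """Return True if cls() succeeds, False if abstract methods block it."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

A subclass becomes instantiable only once it overrides embed; the abstract base itself cannot be constructed.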

class sign_language_translator.models.text_embedding.VectorLookupModel(tokens: List[str], vectors: Tensor, alignment_matrix: Tensor | None = None, description: str = '')[source]

Bases: TextEmbeddingModel

VectorLookupModel class extends TextEmbeddingModel to provide text embedding based on pre-defined token vectors.

- index_to_token

A list containing tokens in the same order as the vectors.

Type: List[str]

- known_tokens

A frozenset containing unique known tokens.

Type: frozenset

- token_to_index

A dictionary mapping tokens to their corresponding indices.

Type: Dict[str, int]

- vectors

A 2D tensor representing the token vectors.

Type: torch.Tensor

- update(self, tokens: List[str], vectors: torch.Tensor) -> None: Updates existing tokens and the hash table with new vectors.

- embed(self, text: str, pre_normalize=False, post_normalize=False, align=False, tokenizer: Callable[[str], Iterable[str]] = lambda x: x.split()) -> torch.Tensor: Returns the pretrained embedding vector for a token, or the average embedding of its sub-tokens.

- __getitem__(self, token: str) -> torch.Tensor: Returns the vector for a specific token.

- save(self, path: str): Saves the model state (tokens and vectors) to a file.

- load(cls, path: str): Loads a saved model state (tokens and vectors) from a file.

Example:

.. code-block:: python

    from sign_language_translator.models import VectorLookupModel
    import torch

    tokens = ["example", "text"]
    # float vectors, so that averaging yields fractional values
    vectors = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    model = VectorLookupModel(tokens, vectors)

    embedding = model.embed("example text")  # [2.5, 3.5, 4.5]

    model.update(["hello"], torch.tensor([[7.0, 8.0, 9.0]]))

    model.save("model.pt")
    loaded_model = VectorLookupModel.load("model.pt")

embed(text: str, pre_normalize=False, post_normalize=False, align=False, tokenizer: Callable[[str], Iterable[str]] = lambda x: x.split()) → Tensor[source]

Embeds the given text into a vector representation by lookup or averaging pre-computed embeddings.

Parameters:

text (str) – The input text to embed. It can be a single entry in the model vocabulary or a string of tokens from that vocabulary. If the text is unknown, a zero vector is returned.
pre_normalize (bool, optional) – Whether to normalize the vectors of tokens in the text before averaging. Defaults to False.
post_normalize (bool, optional) – Whether to normalize the vector after embedding. Defaults to False.
align (bool, optional) – Whether to transform the final vector using the alignment matrix. Defaults to False.
tokenizer (Callable[[str], Iterable[str]], optional) – A callable function to tokenize the text. Only used if the text is not present in the model vocabulary. Defaults to splitting on whitespace.

Returns:

The embedded vector representation of the input text.

Return type:

torch.Tensor
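
The lookup-or-average behaviour described above can be sketched in pure Python. This is an illustration under stated assumptions, not the library's implementation: embed_by_lookup and l2_normalize are hypothetical helpers, plain lists stand in for torch.Tensor, "normalize" is taken to mean L2 normalization, and unknown sub-tokens are simply skipped (the source only says that unknown text yields a zero vector). The align step is left out because the source does not state how the alignment matrix is applied.

```python
import math
from typing import Callable, Dict, Iterable, List


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def embed_by_lookup(
    text: str,
    table: Dict[str, List[float]],
    dim: int,
    pre_normalize: bool = False,
    post_normalize: bool = False,
    tokenizer: Callable[[str], Iterable[str]] = lambda x: x.split(),
) -> List[float]:
    """Hypothetical sketch: direct lookup first, otherwise average the sub-token vectors."""
    if text in table:
        found = [table[text]]
    else:
        # Assumption: sub-tokens missing from the table are skipped.
        found = [table[token] for token in tokenizer(text) if token in table]

    if not found:
        # Nothing known at all: fall back to a zero vector.
        return [0.0] * dim

    if pre_normalize:
        found = [l2_normalize(vector) for vector in found]

    # Component-wise mean over the vectors that were found.
    mean = [sum(column) / len(found) for column in zip(*found)]
    return l2_normalize(mean) if post_normalize else mean
```

With a table mapping "example" to [1, 2, 3] and "text" to [4, 5, 6], the text "example text" is not itself a key, so it is tokenized and the two vectors are averaged, which matches the [2.5, 3.5, 4.5] result quoted in the class example.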

classmethod load(path: str)[source]

Load a VectorLookupModel from a saved checkpoint. If the path ends with ‘.zip’ the file will be decompressed.

Parameters: path (str) – The path to the saved checkpoint.
Returns: The loaded VectorLookupModel instance.
Return type: VectorLookupModel

property normalized_vectors[source]

save(path: str)[source]

Serialize the tokens list and corresponding vectors to a file. If the path ends with ‘.zip’ the file will be compressed.

Parameters: path (str) – The path to save the model file.
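
The rule that a path ending in '.zip' triggers compression on save and decompression on load can be sketched as follows. The library's actual on-disk format is not described in this page, so everything here is an assumption made for illustration: save_table and load_table are hypothetical helpers, JSON is used as the payload, and "model.json" is an invented archive member name.

```python
import json
import zipfile
from typing import Dict, List


def save_table(table: Dict[str, List[float]], path: str) -> None:
    """Hypothetical sketch: serialize tokens and vectors, zipping if path ends with '.zip'."""
    payload = json.dumps({"tokens": list(table), "vectors": list(table.values())})
    if path.endswith(".zip"):
        with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
            archive.writestr("model.json", payload)
    else:
        with open(path, "w", encoding="utf-8") as file:
            file.write(payload)


def load_table(path: str) -> Dict[str, List[float]]:
    """Hypothetical sketch: inverse of save_table, unzipping if path ends with '.zip'."""
    if path.endswith(".zip"):
        with zipfile.ZipFile(path) as archive:
            payload = archive.read("model.json").decode("utf-8")
    else:
        with open(path, encoding="utf-8") as file:
            payload = file.read()
    state = json.loads(payload)
    return dict(zip(state["tokens"], state["vectors"]))
```

Keeping the tokens list and the vectors list in matching order mirrors the index_to_token attribute, whose order follows the rows of vectors.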

similar(vector: Tensor, k: int = 1) → Tuple[List[str], List[float]][source]

Find the k most similar tokens to the given vector.

Parameters:

vector (torch.Tensor) – The 1D vector for which to find similar tokens.
k (int, optional) – The number of similar tokens to return. Defaults to 1.

Returns:

A tuple containing the k most similar tokens and their corresponding cosine similarities.

Return type:

Tuple[List[str], List[float]]
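
A top-k search by cosine similarity, as described above, can be sketched in pure Python. This is a minimal illustration rather than the library's code: cosine_similarity and most_similar are hypothetical helpers, plain lists stand in for torch.Tensor, and a zero-length vector is assumed to have similarity 0 with everything.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    vector: Sequence[float], table: Dict[str, List[float]], k: int = 1
) -> Tuple[List[str], List[float]]:
    """Hypothetical sketch: return the k tokens whose vectors are closest to `vector`."""
    scored = sorted(
        ((cosine_similarity(vector, stored), token) for token, stored in table.items()),
        reverse=True,  # highest similarity first
    )
    top = scored[:k]
    return [token for _, token in top], [score for score, _ in top]
```

The return value follows the documented shape: a list of tokens paired with a list of their similarity scores, both ordered from most to least similar.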

property tokens_array[source]

update(tokens: List[str], vectors: Tensor) → None[source]

Update the vector lookup model with new tokens and their corresponding vectors.

Parameters:

tokens (List[str]) – The list of new tokens to be added or updated.
vectors (torch.Tensor) – The tensor of corresponding vectors for the new tokens.
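alignment_matrix (Optional[torch.Tensor], optional) – A 2D tensor used to transform the final vectors (e.g. an orthogonal matrix that aligns the word vectors with the embedding space of another language or model). Defaults to None. Per the signatures above, this is a constructor argument rather than an argument of update().
description (str, optional) – A description of the model. Defaults to “”. Per the signatures above, this is a constructor argument rather than an argument of update().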

Raises:

ValueError – If the dimensions of the new vectors do not match the dimensions of the existing vectors.
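
The update semantics described above (overwrite vectors of known tokens, append new tokens, reject mismatched dimensions) can be sketched with a small pure-Python class. TinyLookup is a hypothetical stand-in, not the library's VectorLookupModel: it stores vectors as nested lists instead of a 2D tensor, and validating every vector before changing any state is a design choice of this sketch.

```python
from typing import Dict, FrozenSet, List


class TinyLookup:
    """Hypothetical sketch of a token-to-vector table with an update() method."""

    def __init__(self, tokens: List[str], vectors: List[List[float]]) -> None:
        self.index_to_token: List[str] = []
        self.token_to_index: Dict[str, int] = {}
        self.vectors: List[List[float]] = []
        self.update(tokens, vectors)

    @property
    def known_tokens(self) -> FrozenSet[str]:
        return frozenset(self.token_to_index)

    def update(self, tokens: List[str], vectors: List[List[float]]) -> None:
        if len(tokens) != len(vectors):
            raise ValueError("need exactly one vector per token")

        # Validate all dimensions first so a failed update leaves the table untouched.
        dim = len(self.vectors[0]) if self.vectors else None
        for vector in vectors:
            if dim is None:
                dim = len(vector)
            if len(vector) != dim:
                raise ValueError(f"expected vectors of size {dim}, got {len(vector)}")

        for token, vector in zip(tokens, vectors):
            if token in self.token_to_index:
                # Known token: overwrite its row in place.
                self.vectors[self.token_to_index[token]] = list(vector)
            else:
                # New token: append a row and record its index.
                self.token_to_index[token] = len(self.index_to_token)
                self.index_to_token.append(token)
                self.vectors.append(list(vector))
```

Because index_to_token and vectors grow together, the position of a token in the list always equals the row index of its vector.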