sign_language_translator.models.text_embedding package

Submodules

Module contents

class sign_language_translator.models.text_embedding.TextEmbeddingModel

Bases: ABC

Abstract class for text embedding models.

embed(text: str) -> torch.Tensor: Embeds text into a vector.

abstract embed(text: str) -> Tensor

Embeds text into a vector.

Parameters:

text (str) – Text to embed.

Returns:

A vector representation of a text.

Return type:

torch.Tensor
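As a rough illustration of the contract above, a subclass only has to implement `embed`. The sketch below makes several assumptions: it defines a local stand-in base class (`TextEmbeddingSketch`) instead of importing the library, it returns plain lists of floats rather than a `torch.Tensor`, and the `CharCountEmbedding` class with its letter-counting features is purely hypothetical.

```python
from abc import ABC, abstractmethod
from typing import List


class TextEmbeddingSketch(ABC):
    """Local stand-in mirroring the documented interface (not the library class)."""

    @abstractmethod
    def embed(self, text: str) -> List[float]:
        """Embed text into a vector."""


class CharCountEmbedding(TextEmbeddingSketch):
    """Hypothetical toy model: counts of vowels, other letters and digits."""

    def embed(self, text: str) -> List[float]:
        vowels = sum(ch in "aeiou" for ch in text.lower())
        letters = sum(ch.isalpha() for ch in text)
        digits = sum(ch.isdigit() for ch in text)
        return [float(vowels), float(letters - vowels), float(digits)]


def can_instantiate(cls) -> bool:
    """Return True if cls can be constructed without arguments."""
    try:
        cls()
        return True
    except TypeError:
        # Raised by ABC when an abstract method is left unimplemented.
        return False
```

Here `can_instantiate(TextEmbeddingSketch)` is false because `embed` is abstract, while the concrete subclass can be created and called like any other embedding model.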

class sign_language_translator.models.text_embedding.VectorLookupModel(tokens: List[str], vectors: Tensor, alignment_matrix: Tensor | None = None, description: str = '')

Bases: TextEmbeddingModel

VectorLookupModel class extends TextEmbeddingModel to provide text embedding based on pre-defined token vectors.

- index_to_token

A list containing tokens in the same order as the vectors.

Type:

List[str]

- known_tokens

A frozenset containing unique known tokens.

Type:

frozenset

- token_to_index

A dictionary mapping tokens to their corresponding indices.

Type:

Dict[str, int]

- vectors

A 2D tensor representing the token vectors.

Type:

torch.Tensor

- update(self, tokens: List[str], vectors: torch.Tensor) -> None: Updates existing tokens & hash-table with new vectors.

- embed(self, text: str, pre_normalize=False, post_normalize=False, align=False, tokenizer: Callable[[str], Iterable[str]] = lambda x: x.split()) -> torch.Tensor: Returns the pretrained embedding vector for a token, or the average embedding of its sub-tokens.

- __getitem__(self, token: str) -> torch.Tensor: Returns the vector for a specific token.

- save(self, path: str): Saves the model state (tokens & vectors) to a file.

- load(cls, path: str): Loads a saved model state (tokens & vectors) from a file.

Example:

.. code-block:: python

    from sign_language_translator.models import VectorLookupModel
    import torch

    tokens = ["example", "text"]
    # Float vectors, so that averaging yields fractional values.
    vectors = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    model = VectorLookupModel(tokens, vectors)

    embedding = model.embed("example text")  # [2.5, 3.5, 4.5]

    model.update(["hello"], torch.tensor([[7.0, 8.0, 9.0]]))

    model.save("model.pt")
    loaded_model = VectorLookupModel.load("model.pt")

embed(text: str, pre_normalize=False, post_normalize=False, align=False, tokenizer: Callable[[str], Iterable[str]] = lambda x: x.split()) -> Tensor

Embeds the given text into a vector representation by lookup or averaging pre-computed embeddings.

Parameters:
  • text (str) – The input text to embed. It can be a single token from the model vocabulary or a string made up of several such tokens. If the text is unknown, a zero vector is returned.

  • pre_normalize (bool, optional) – Whether to normalize the vectors of tokens in the text before averaging. Defaults to False.

  • post_normalize (bool, optional) – Whether to normalize the vector after embedding. Defaults to False.

  • align (bool, optional) – Whether to transform the final vector using the alignment matrix. Defaults to False.

  • tokenizer (Callable[[str], Iterable[str]], optional) – A callable function to tokenize the text. Only used if the text is not present in the model vocabulary. Defaults to splitting on whitespace.

Returns:

The embedded vector representation of the input text.

Return type:

torch.Tensor
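The lookup-or-average behaviour described by these parameters can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: vectors are plain lists instead of tensors, the names `embed_sketch`, `table` and `dim` are hypothetical, unknown sub-tokens are simply skipped, the alignment matrix is passed in directly and applied to a row vector, and alignment happens before the final normalization.

```python
import math
from typing import Callable, Dict, Iterable, List, Optional

Vector = List[float]


def _normalize(v: Vector) -> Vector:
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def embed_sketch(
    text: str,
    table: Dict[str, Vector],
    dim: int,
    pre_normalize: bool = False,
    post_normalize: bool = False,
    alignment_matrix: Optional[List[Vector]] = None,
    tokenizer: Callable[[str], Iterable[str]] = lambda x: x.split(),
) -> Vector:
    if text in table:
        # The whole text is a known token: direct lookup, tokenizer unused.
        vectors = [table[text]]
    else:
        # Otherwise tokenize and keep the known sub-tokens (assumption).
        vectors = [table[t] for t in tokenizer(text) if t in table]
    if not vectors:
        return [0.0] * dim  # nothing known: zero vector
    if pre_normalize:
        vectors = [_normalize(v) for v in vectors]
    # Component-wise average of the collected vectors.
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    if alignment_matrix is not None:
        # Row-vector convention (assumed): result[j] = sum_i mean[i] * A[i][j].
        n_out = len(alignment_matrix[0])
        mean = [
            sum(m * row[j] for m, row in zip(mean, alignment_matrix))
            for j in range(n_out)
        ]
    if post_normalize:
        mean = _normalize(mean)
    return mean
```

With `pre_normalize=True` every token contributes equally to the direction of the average regardless of its magnitude, whereas `post_normalize=True` only rescales the final result.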

classmethod load(path: str)

Load a VectorLookupModel from a saved checkpoint. If the path ends with ‘.zip’ the file will be decompressed.

Parameters:

path (str) – The path to the saved checkpoint.

Returns:

The loaded VectorLookupModel instance.

Return type:

VectorLookupModel

property normalized_vectors

save(path: str)

Serialize the tokens list and corresponding vectors to a file. If the path ends with ‘.zip’ the file will be compressed.

Parameters:

path (str) – The path to save the model file.
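The suffix-based compression described for save and load can be illustrated with the standard library alone. The library's actual on-disk format is not documented here, so everything below is an assumption made for illustration: JSON is an arbitrary serialization choice, and `save_sketch`, `load_sketch`, `roundtrip` and the archive member name `model.json` are hypothetical.

```python
import json
import os
import tempfile
import zipfile
from typing import Dict, List


def save_sketch(path: str, tokens: List[str], vectors: List[List[float]]) -> None:
    """Write tokens and vectors to path, compressing if it ends with '.zip'."""
    payload = json.dumps({"tokens": tokens, "vectors": vectors})
    if path.endswith(".zip"):
        with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            zf.writestr("model.json", payload)
    else:
        with open(path, "w", encoding="utf-8") as f:
            f.write(payload)


def load_sketch(path: str) -> Dict[str, list]:
    """Read back what save_sketch wrote, decompressing '.zip' files."""
    if path.endswith(".zip"):
        with zipfile.ZipFile(path) as zf:
            return json.loads(zf.read("model.json").decode("utf-8"))
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def roundtrip(suffix: str) -> Dict[str, list]:
    """Save to a temporary file with the given suffix and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model" + suffix)
        save_sketch(path, ["example", "text"], [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
        return load_sketch(path)
```

Both `roundtrip(".zip")` and `roundtrip(".json")` return the same dictionary, so callers only choose compression through the file name.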

similar(vector: Tensor, k: int = 1) -> Tuple[List[str], List[float]]

Find the k most similar tokens to the given vector.

Parameters:
  • vector (torch.Tensor) – The 1D vector for which to find similar tokens.

  • k (int, optional) – The number of similar tokens to return. Defaults to 1.

Returns:

A tuple containing the k most similar tokens and their corresponding cosine similarities.

Return type:

Tuple[List[str], List[float]]
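A top-k cosine-similarity search of this kind can be sketched as follows. It is a plain-Python illustration rather than the library's code: `similar_sketch` and `cosine` are hypothetical helpers, vectors are lists of floats, and the similarity with a zero vector is defined as 0.0 by assumption.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def similar_sketch(
    vector: Sequence[float],
    tokens: Sequence[str],
    vectors: Sequence[Sequence[float]],
    k: int = 1,
) -> Tuple[List[str], List[float]]:
    """Return the k tokens most similar to vector, with their similarities."""
    scored = sorted(
        ((cosine(vector, v), token) for token, v in zip(tokens, vectors)),
        key=lambda pair: pair[0],
        reverse=True,  # highest similarity first
    )[:k]
    return [token for _, token in scored], [score for score, _ in scored]
```

When the stored vectors are already unit length, the similarity reduces to a dot product, which is presumably the motivation for a cached set of normalized vectors.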

property tokens_array

update(tokens: List[str], vectors: Tensor) -> None

Update the vector lookup model with new tokens and their corresponding vectors.

Parameters:
  • tokens (List[str]) – The list of new tokens to be added or updated.

  • vectors (torch.Tensor) – The tensor of corresponding vectors for the new tokens.

  • alignment_matrix (Optional[torch.Tensor], optional) – Constructor argument rather than an argument of update(). A 2D tensor used to transform the final vectors (e.g. an orthogonal matrix that aligns the word vectors with the embedding space of another language or model). Defaults to None.

  • description (str, optional) – Constructor argument rather than an argument of update(). A description of the model. Defaults to ''.

Raises:

ValueError – If the dimensions of the new vectors do not match the dimensions of the existing vectors.
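The update semantics described above (overwrite known tokens, append new ones, reject mismatched dimensions) can be sketched with a list and a dictionary. This is a simplified stand-in under stated assumptions: `LookupTableSketch` is a hypothetical class, vectors are plain lists rather than a 2D tensor, and all validation is done before any change is applied.

```python
from typing import Dict, List


class LookupTableSketch:
    """Hypothetical token-to-vector table kept in insertion order."""

    def __init__(self, tokens: List[str], vectors: List[List[float]]) -> None:
        self.index_to_token: List[str] = list(tokens)
        self.vectors: List[List[float]] = [list(v) for v in vectors]
        self.token_to_index: Dict[str, int] = {
            token: i for i, token in enumerate(self.index_to_token)
        }

    def update(self, tokens: List[str], vectors: List[List[float]]) -> None:
        if len(tokens) != len(vectors):
            raise ValueError("tokens and vectors must have the same length")
        # Validate every new vector first so a failure leaves the table intact.
        if self.vectors:
            dim = len(self.vectors[0])
            for v in vectors:
                if len(v) != dim:
                    raise ValueError(f"expected dimension {dim}, got {len(v)}")
        for token, vector in zip(tokens, vectors):
            if token in self.token_to_index:
                # Known token: overwrite its vector in place.
                self.vectors[self.token_to_index[token]] = list(vector)
            else:
                # New token: append and record its index.
                self.token_to_index[token] = len(self.index_to_token)
                self.index_to_token.append(token)
                self.vectors.append(list(vector))
```

Checking dimensions up front means a rejected call never leaves the token list and the vector list out of step with each other.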