
snagbreac/russian-reverse-dictionary-semsearch

This is a sentence-transformers model: it maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

This particular model has been trained specifically on Russian definition-word pairs from dictionary data and crosswords. As such, it can be used as a reverse dictionary when plugged into a semantic search pipeline as the encoder. It might be useful for other tasks, but I haven't tested that, so I wouldn't know.

Usage (Sentence-Transformers)

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('snagbreac/russian-reverse-dictionary-semsearch')
embeddings = model.encode(sentences)
print(embeddings)

Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('snagbreac/russian-reverse-dictionary-semsearch')
model = AutoModel.from_pretrained('snagbreac/russian-reverse-dictionary-semsearch')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)

Usage as a Russian reverse dictionary with semantic search

For semantic search, you will need an additional list of Russian lemmas. I used a custom list compiled from Zaliznyak dictionary data and Russian Wiktionary data (with pymorphy2 morphological analysis), which I unfortunately cannot publish; however, any list will do as long as it is large enough and known to contain only Russian-language lemmas.

Other than using lemmas specifically as the search corpus, the process is standard semantic search. First, encode all the lemmas in your list as described above (use a GPU if possible, otherwise this step may take a while). Then encode the query the same way and compare it against every lemma in the corpus, returning only the most similar entries (how many to return is up to you; I'd recommend 100). SBERT has excellent code snippets for semantic search that you can use for this; a minimal sketch follows below.
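
Below is a minimal sketch of this pipeline using the sentence-transformers semantic search utility. The lemma file name and the query are placeholders; substitute your own lemma list.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('snagbreac/russian-reverse-dictionary-semsearch')

# Load the lemma list (one lemma per line; 'lemmas.txt' is a placeholder file name)
with open('lemmas.txt', encoding='utf-8') as f:
    lemmas = [line.strip() for line in f if line.strip()]

# Encode the whole lemma corpus once (use a GPU if available, this can take a while)
lemma_embeddings = model.encode(lemmas, convert_to_tensor=True, show_progress_bar=True)

# Encode the query (the description of the word you are trying to find)
query = "холодное время года"  # example query: "the cold season of the year"
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the 100 most similar lemmas by cosine similarity
hits = util.semantic_search(query_embedding, lemma_embeddings, top_k=100)[0]
for hit in hits[:10]:
    print(lemmas[hit['corpus_id']], round(hit['score'], 3))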

Evaluation Results

This model was evaluated on the test dataset I collected (it's here) using fairly standard reverse dictionary evaluation metrics: the median rank of the correct answer, the standard deviation of the correct answer's rank (some people use rank variance instead; I can't imagine this matters much), and the proportion of test cases in which the correct answer appeared in the top 1/10/100 results (a.k.a. TopN). (See here for, to my knowledge, the first appearance of these metrics in the wild.)
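
As a rough sketch, assuming ranks is a list holding the 1-based rank of the correct answer for every test case, these metrics can be computed like this:

import statistics

def reverse_dictionary_metrics(ranks):
    # ranks: 1-based rank of the correct lemma for each test query
    return {
        'median_rank': statistics.median(ranks),
        'rank_std': statistics.pstdev(ranks),  # population standard deviation of the rank
        'top1': sum(r <= 1 for r in ranks) / len(ranks),
        'top10': sum(r <= 10 for r in ranks) / len(ranks),
        'top100': sum(r <= 100 for r in ranks) / len(ranks),
    }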

Accounting for the split in the test dataset, the results were as follows:

| Test dataset subsection | Med | Dev | Top1 | Top10 | Top100 |
|-------------------------|-----|-----|------|-------|--------|
| Dictionary definitions  | 8   | 423 | 0.29 | 0.51  | 0.68   |
| User descriptions       | 6   | 243 | 0.21 | 0.57  | 0.85   |

We can compare this to the baselines of rubert-tiny2 encodings:

  1. Without any fine-tuning, just the raw encodings:

| Test dataset subsection | Med  | Dev | Top1 | Top10 | Top100 |
|-------------------------|------|-----|------|-------|--------|
| Dictionary definitions  | 1000 | 426 | 0.03 | 0.13  | 0.22   |
| User descriptions       | 479  | 445 | 0.02 | 0.12  | 0.32   |

  2. With fine-tuning on the training dataset:

| Test dataset subsection | Med | Dev | Top1 | Top10 | Top100 |
|-------------------------|-----|-----|------|-------|--------|
| Dictionary definitions  | 50  | 466 | 0.25 | 0.41  | 0.56   |
| User descriptions       | 18  | 336 | 0.06 | 0.43  | 0.68   |

In short, fine-tuning alone already improves greatly on the raw baseline, and our model outperforms the fine-tuned baseline further still.

Training

The model was trained with the parameters:

DataLoader:

torch.utils.data.dataloader.DataLoader of length 17457 with parameters:

{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss:

sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:

{'scale': 20.0, 'similarity_fct': 'cos_sim'}

Parameters of the fit()-Method:

{
    "epochs": 3,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.TranslationEvaluator.TranslationEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 100,
    "weight_decay": 0.01
}
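
For reference, a minimal sketch of a comparable training setup with sentence-transformers is shown below. It assumes the base model is cointegrated/rubert-tiny2 and uses two placeholder definition-word pairs in place of the real training data; the evaluator is omitted for brevity.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('cointegrated/rubert-tiny2')  # assumed base model

# Placeholder examples: each pairs a definition (query) with its target word
train_examples = [
    InputExample(texts=['холодное время года', 'зима']),
    InputExample(texts=['небесное тело, вращающееся вокруг звезды', 'планета']),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    optimizer_params={'lr': 2e-5},
    weight_decay=0.01,
    max_grad_norm=1,
)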

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

Citing & Authors

