DeepPavlov Russian fastText Static Embeddings

This repository provides a Sentence-Transformers StaticEmbedding conversion of DeepPavlov's ft_native_300_ru_wiki_lenta_lower_case Russian fastText vectors.

The source vectors are unchanged. The model adds an explicit Normalize module after StaticEmbedding, so model.encode(...) returns L2-normalized sentence embeddings by default.
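
Because the output is L2-normalized, the dot product of two embeddings equals their cosine similarity. The sketch below illustrates that property in plain Python on toy 3-dimensional vectors, not on real model output. The helper names (l2_normalize, dot, cosine) are illustrative and are not part of Sentence-Transformers.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for pooled sentence embeddings.
u = [3.0, 4.0, 0.0]
v = [1.0, 2.0, 2.0]

u_hat, v_hat = l2_normalize(u), l2_normalize(v)

print(math.sqrt(dot(u_hat, u_hat)))      # length of the normalized vector
print(dot(u_hat, v_hat), cosine(u, v))   # the two values agree up to rounding
```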

Source

DeepPavlov publishes Russian word vectors trained on Russian Wikipedia and Lenta.ru corpora.

Source docs: https://docs.deeppavlov.ai/en/0.0.6.5/intro/pretrained_vectors.html
Source .vec: https://files.deeppavlov.ai/embeddings/ft_native_300_ru_wiki_lenta_lower_case/ft_native_300_ru_wiki_lenta_lower_case.vec
Source model: ft_native_300_ru_wiki_lenta_lower_case
Source preprocessing: nltk word_tokenize + lowercasing
Training family: fastText skip-gram, 300 dimensions
License: Apache 2.0

This conversion uses only the vocabulary of the text .vec file. It does not include the subword-based OOV vector generation available from the fastText .bin format, so out-of-vocabulary tokens map to [UNK].
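
A fixed-vocabulary lookup of this kind can be sketched in a few lines. The example below is a toy illustration under several assumptions, not the conversion code: the two-token table and its values are made up, whitespace splitting stands in for nltk word_tokenize, the [UNK] vector is a zero placeholder, and token vectors are combined by mean pooling. The helper names load_vec and embed are hypothetical.

```python
import io

# Tiny in-memory stand-in for a fastText .vec file: a "<vocab_size> <dim>" header,
# then one "<token> <v1> ... <v_dim>" line per token. The values are made up.
TOY_VEC = """2 2
погода 1.0 0.0
хорошая 0.0 1.0
"""


def load_vec(handle):
    """Parse the fastText text format into a {token: vector} dict."""
    vocab_size, dim = map(int, handle.readline().split())
    table = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        values = [float(x) for x in parts[1:] if x]  # tolerate a trailing space
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for token {parts[0]!r}")
        table[parts[0]] = values
    if len(table) != vocab_size:
        raise ValueError("header vocabulary size does not match the file body")
    return table, dim


def embed(tokens, table, unk="[UNK]"):
    """Mean-pool token vectors; tokens missing from the table fall back to `unk`."""
    rows = [table.get(tok, table[unk]) for tok in tokens] or [table[unk]]
    return [sum(row[i] for row in rows) / len(rows) for i in range(len(table[unk]))]


table, dim = load_vec(io.StringIO(TOY_VEC))
table["[UNK]"] = [0.0] * dim  # assumption: placeholder vector for unknown tokens

# Lowercase first, matching the source preprocessing; "сегодня" is not in the toy table.
tokens = "Хорошая погода сегодня".lower().split()
print(embed(tokens, table))
```

With no subword fallback, a misspelled or rare word contributes only the shared [UNK] vector, so texts with many such tokens lose most of their signal.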

Usage

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("BorisTM/deeppavlov_ft_native_300_ru_wiki_lenta_lower_case")

sentences = [
    "Сегодня хорошая погода.",
    "На улице солнечно.",
    "Команда выиграла матч.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 300)
print(np.linalg.norm(embeddings, axis=1))  # close to 1.0

similarities = model.similarity(embeddings, embeddings)
print(similarities)

Results

Results on MTEB (rus, v1.1), evaluated with normalized embeddings. Scores are percentages.

Task	`ft_native_300_ru_wiki_lenta_lower_case`
Mean (Task, 23 tasks)	42.85
Mean (Task Type)	39.86
GeoreviewClassification	36.23
HeadlineClassification	80.17
InappropriatenessClassification	56.21
KinopoiskClassification	45.29
MassiveIntentClassification	51.04
MassiveScenarioClassification	57.92
RuReviewsClassification	51.09
RuSciBenchGRNTIClassification	48.91
RuSciBenchOECDClassification	39.99
GeoreviewClusteringP2P	37.85
RuSciBenchGRNTIClusteringP2P	47.35
RuSciBenchOECDClusteringP2P	42.36
CEDRClassification	33.80
SensitiveTopicsClassification	20.83
TERRa	52.46
MIRACLReranking	17.06
RuBQReranking	43.99
MIRACLRetrievalHardNegatives.v2	9.68
RiaNewsRetrievalHardNegatives.v2	28.58
RuBQRetrieval	18.07
RUParaPhraserSTS	56.50
STS22	52.00
RuSTSBenchmarkSTS	58.09
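
The two aggregate rows can be recomputed from the per-task scores. The sketch below copies the scores from the table and groups them by task type. The grouping is an assumption based on the usual MTEB task categories (for example, CEDRClassification and SensitiveTopicsClassification as multilabel classification, TERRa as pair classification) and is not stated in this card.

```python
from statistics import mean

# Per-task scores copied from the results table, grouped by assumed MTEB task type.
scores = {
    "Classification": [36.23, 80.17, 56.21, 45.29, 51.04, 57.92, 51.09, 48.91, 39.99],
    "Clustering": [37.85, 47.35, 42.36],
    "MultilabelClassification": [33.80, 20.83],
    "PairClassification": [52.46],
    "Reranking": [17.06, 43.99],
    "Retrieval": [9.68, 28.58, 18.07],
    "STS": [56.50, 52.00, 58.09],
}

# Mean (Task): every task carries equal weight.
all_scores = [s for group in scores.values() for s in group]
mean_task = mean(all_scores)

# Mean (Task Type): average within each type first, then across types.
mean_task_type = mean(mean(group) for group in scores.values())

print(len(all_scores), f"{mean_task:.2f}", f"{mean_task_type:.2f}")
# 23 42.85 39.86
```

The type-level mean is lower because the weak retrieval and reranking groups weigh as much as the nine-task classification group.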

Evaluation artifacts for this update are stored locally in the article workspace under data/metrics/deeppavlov_fasttext_mteb_rus_v1_1_summary.csv and data/metrics/deeppavlov_fasttext_mteb_rus_v1_1_task_scores.csv.


Safetensors: model size 0.5B params, tensor type F32