DeepPavlov Russian fastText Static Embeddings

This repository provides a Sentence-Transformers StaticEmbedding conversion of DeepPavlov's ft_native_300_ru_wiki_lenta_lower_case Russian fastText vectors.

The source vectors are unchanged. The model adds an explicit Normalize module after StaticEmbedding, so model.encode(...) returns L2-normalized sentence embeddings by default.
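As a rough illustration of what this two-step pipeline computes, the sketch below averages per-token vectors and then L2-normalizes the result. It is not the converted model itself: the three-dimensional toy vectors and the helper names `mean_pool` and `l2_normalize` are hypothetical, and mean pooling is assumed here as the way StaticEmbedding aggregates token vectors.

```python
import math

# Hypothetical 3-dimensional toy vectors; the real model stores 300-dimensional ones.
TOY_VECTORS = {
    "хорошая": [3.0, 0.0, 0.0],
    "погода": [0.0, 4.0, 0.0],
}


def mean_pool(tokens, vectors):
    """Average the vectors of the given tokens, dimension by dimension."""
    rows = [vectors[token] for token in tokens]
    return [sum(column) / len(rows) for column in zip(*rows)]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0.0 else [x / norm for x in vec]


pooled = mean_pool(["хорошая", "погода"], TOY_VECTORS)
sentence_embedding = l2_normalize(pooled)
print(pooled)
print(sentence_embedding)
```

Because every output has unit length, a plain dot product between two such embeddings equals their cosine similarity.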

Source

DeepPavlov publishes Russian word vectors trained on Russian Wikipedia and Lenta.ru corpora.

This conversion uses the text .vec vocabulary. It does not include fastText .bin subword OOV generation; out-of-vocabulary tokens map to [UNK].
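The practical consequence of a fixed .vec vocabulary can be sketched as follows. The snippet parses a tiny in-memory sample in the fastText text layout (a "count dim" header line followed by one "word v1 ... v_dim" line per entry) and maps any token outside that vocabulary to [UNK]. The sample data and the helper names `load_vec` and `to_vocab_token` are hypothetical, and the real tokenizer's splitting rules are not reproduced here.

```python
import io

UNK = "[UNK]"

# Hypothetical two-word sample in fastText .vec text layout (header: "count dim").
SAMPLE_VEC = "2 3\nпогода 0.1 0.2 0.3\nматч 0.4 0.5 0.6\n"


def load_vec(handle):
    """Read a .vec-style text stream into a {word: [float, ...]} dictionary."""
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != count:
        raise ValueError(f"header announced {count} words, found {len(vectors)}")
    return vectors


def to_vocab_token(token, vectors):
    """Keep in-vocabulary tokens; replace everything else with the [UNK] marker."""
    return token if token in vectors else UNK


vectors = load_vec(io.StringIO(SAMPLE_VEC))
mapped = [to_vocab_token(t, vectors) for t in ["погода", "погодка"]]
print(mapped)
```

With fastText .bin models, an unseen form such as "погодка" would still get a vector composed from character n-grams; with the .vec vocabulary alone it falls back to [UNK].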

Usage

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("BorisTM/deeppavlov_ft_native_300_ru_wiki_lenta_lower_case")

sentences = [
    "Сегодня хорошая погода.",
    "На улице солнечно.",
    "Команда выиграла матч.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 300)
print(np.linalg.norm(embeddings, axis=1))  # close to 1.0

similarities = model.similarity(embeddings, embeddings)
print(similarities)

Results

Results on MTEB (rus, v1.1), evaluated with normalized embeddings. Scores are percentages.

| Task | ft_native_300_ru_wiki_lenta_lower_case |
|---|---|
| Mean (Task, 23 tasks) | 42.85 |
| Mean (Task Type) | 39.86 |
| GeoreviewClassification | 36.23 |
| HeadlineClassification | 80.17 |
| InappropriatenessClassification | 56.21 |
| KinopoiskClassification | 45.29 |
| MassiveIntentClassification | 51.04 |
| MassiveScenarioClassification | 57.92 |
| RuReviewsClassification | 51.09 |
| RuSciBenchGRNTIClassification | 48.91 |
| RuSciBenchOECDClassification | 39.99 |
| GeoreviewClusteringP2P | 37.85 |
| RuSciBenchGRNTIClusteringP2P | 47.35 |
| RuSciBenchOECDClusteringP2P | 42.36 |
| CEDRClassification | 33.80 |
| SensitiveTopicsClassification | 20.83 |
| TERRa | 52.46 |
| MIRACLReranking | 17.06 |
| RuBQReranking | 43.99 |
| MIRACLRetrievalHardNegatives.v2 | 9.68 |
| RiaNewsRetrievalHardNegatives.v2 | 28.58 |
| RuBQRetrieval | 18.07 |
| RUParaPhraserSTS | 56.50 |
| STS22 | 52.00 |
| RuSTSBenchmarkSTS | 58.09 |
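The two summary rows can be reproduced from the per-task scores. The sketch below assumes the usual MTEB task-type grouping (CEDRClassification and SensitiveTopicsClassification as multilabel classification, TERRa as pair classification); that grouping is not stated in the results themselves, so treat it as an assumption. "Mean (Task)" averages all 23 scores directly, while "Mean (Task Type)" first averages within each type and then averages the seven type means.

```python
from statistics import mean

# Per-task scores copied from the results above; the grouping by task type is assumed.
SCORES_BY_TYPE = {
    "Classification": [36.23, 80.17, 56.21, 45.29, 51.04, 57.92, 51.09, 48.91, 39.99],
    "Clustering": [37.85, 47.35, 42.36],
    "MultilabelClassification": [33.80, 20.83],
    "PairClassification": [52.46],
    "Reranking": [17.06, 43.99],
    "Retrieval": [9.68, 28.58, 18.07],
    "STS": [56.50, 52.00, 58.09],
}

# Flat list of all task scores, regardless of type.
all_scores = [score for scores in SCORES_BY_TYPE.values() for score in scores]

mean_task = mean(all_scores)
mean_task_type = mean([mean(scores) for scores in SCORES_BY_TYPE.values()])

print(f"Mean (Task, {len(all_scores)} tasks): {mean_task:.2f}")
print(f"Mean (Task Type): {mean_task_type:.2f}")
```

The task-type mean is lower than the task mean because the weakest categories (retrieval, reranking, multilabel classification) contain few tasks and therefore carry more weight per task once each type counts equally.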

Evaluation artifacts for this update are stored locally in the article workspace under data/metrics/deeppavlov_fasttext_mteb_rus_v1_1_summary.csv and data/metrics/deeppavlov_fasttext_mteb_rus_v1_1_task_scores.csv.

Model size: 0.5B params (Safetensors, F32 tensors)