DeepPavlov Russian fastText Static Embeddings
This repository provides a Sentence-Transformers StaticEmbedding conversion of DeepPavlov's ft_native_300_ru_wiki_lenta_lower_case Russian fastText vectors.
The source vectors are unchanged. The model adds an explicit Normalize module after StaticEmbedding, so model.encode(...) returns L2-normalized sentence embeddings by default.
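To illustrate what L2 normalization implies for downstream similarity, here is a minimal NumPy sketch. It does not call the library's Normalize module; the `l2_normalize` helper and the random stand-in vectors are assumptions made purely for illustration. It shows that once rows have unit length, a plain dot product equals cosine similarity.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm (hypothetical helper, not the library's code)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)


# Random stand-ins for three 300-dimensional sentence vectors (not real model output).
rng = np.random.default_rng(0)
raw = rng.normal(size=(3, 300))

unit = l2_normalize(raw)

# Cosine similarity computed the long way on the raw vectors.
raw_norms = np.linalg.norm(raw, axis=1)
cosine = (raw @ raw.T) / np.outer(raw_norms, raw_norms)

# On unit-length rows, the dot product gives the same matrix.
dot = unit @ unit.T

print(np.allclose(np.linalg.norm(unit, axis=1), 1.0))
print(np.allclose(dot, cosine))
```

In practice this means embeddings from this model can be compared with a dot product without a separate normalization step.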
Source
DeepPavlov publishes Russian word vectors trained on Russian Wikipedia and Lenta.ru corpora.
- Source docs: https://docs.deeppavlov.ai/en/0.0.6.5/intro/pretrained_vectors.html
- Source .vec: https://files.deeppavlov.ai/embeddings/ft_native_300_ru_wiki_lenta_lower_case/ft_native_300_ru_wiki_lenta_lower_case.vec
- Source model: ft_native_300_ru_wiki_lenta_lower_case
- Source preprocessing: nltk word_tokenize + lowercasing
- Training family: fastText skip-gram, 300 dimensions
- License: Apache 2.0
This conversion uses the text .vec vocabulary. It does not include fastText .bin subword OOV generation; out-of-vocabulary tokens map to [UNK].
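The vocabulary-only behaviour can be sketched with a toy example. The snippet below parses the fastText text format (a header line with vocabulary size and dimension, then one token per line followed by its values) and falls back to a shared unknown vector for tokens that are missing. Several parts are assumptions for illustration only: the toy vectors are made up, whitespace splitting stands in for nltk word_tokenize, the unknown vector is set to zeros, and tokens are averaged before normalization. The repository's actual [UNK] vector and pooling are not described here.

```python
import io

import numpy as np

# Made-up 4-dimensional .vec content: "<vocab_size> <dim>", then "<token> <values...>".
TOY_VEC = """3 4
погода 0.1 0.2 0.3 0.4
хорошая 0.0 0.5 0.0 0.5
сегодня 0.3 0.3 0.1 0.1
"""


def load_vec(handle):
    """Read a fastText-style text file into a {token: vector} dict."""
    _vocab_size, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]], dtype=np.float32)
    return vectors, dim


def embed(sentence, vectors, dim):
    """Lowercase, split on whitespace, average token vectors, L2-normalize."""
    unk = np.zeros(dim, dtype=np.float32)  # assumption: zero vector for unknown tokens
    tokens = sentence.lower().split()      # assumption: stand-in for nltk word_tokenize
    if not tokens:
        return unk, []
    oov = [tok for tok in tokens if tok not in vectors]
    pooled = np.mean([vectors.get(tok, unk) for tok in tokens], axis=0)
    norm = np.linalg.norm(pooled)
    return (pooled / norm if norm > 0 else pooled), oov


vectors, dim = load_vec(io.StringIO(TOY_VEC))
embedding, oov_tokens = embed("Сегодня хорошая ПОГОДА зима", vectors, dim)
print(oov_tokens)  # ['зима']
```

Because there is no subword fallback, misspellings and rare word forms all collapse to the same unknown entry, so it can be worth checking the out-of-vocabulary rate on your own data.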
Usage
pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("BorisTM/deeppavlov_ft_native_300_ru_wiki_lenta_lower_case")
sentences = [
"Сегодня хорошая погода.",
"На улице солнечно.",
"Команда выиграла матч.",
]
embeddings = model.encode(sentences)
print(embeddings.shape) # (3, 300)
print(np.linalg.norm(embeddings, axis=1)) # close to 1.0
similarities = model.similarity(embeddings, embeddings)
print(similarities)
Results
Results on MTEB (rus, v1.1), evaluated with normalized embeddings. Scores are percentages.
| Task | ft_native_300_ru_wiki_lenta_lower_case |
|---|---|
| Mean (Task, 23 tasks) | 42.85 |
| Mean (Task Type) | 39.86 |
| GeoreviewClassification | 36.23 |
| HeadlineClassification | 80.17 |
| InappropriatenessClassification | 56.21 |
| KinopoiskClassification | 45.29 |
| MassiveIntentClassification | 51.04 |
| MassiveScenarioClassification | 57.92 |
| RuReviewsClassification | 51.09 |
| RuSciBenchGRNTIClassification | 48.91 |
| RuSciBenchOECDClassification | 39.99 |
| GeoreviewClusteringP2P | 37.85 |
| RuSciBenchGRNTIClusteringP2P | 47.35 |
| RuSciBenchOECDClusteringP2P | 42.36 |
| CEDRClassification | 33.80 |
| SensitiveTopicsClassification | 20.83 |
| TERRa | 52.46 |
| MIRACLReranking | 17.06 |
| RuBQReranking | 43.99 |
| MIRACLRetrievalHardNegatives.v2 | 9.68 |
| RiaNewsRetrievalHardNegatives.v2 | 28.58 |
| RuBQRetrieval | 18.07 |
| RUParaPhraserSTS | 56.50 |
| STS22 | 52.00 |
| RuSTSBenchmarkSTS | 58.09 |
Evaluation artifacts for this update are stored locally in the article workspace under data/metrics/deeppavlov_fasttext_mteb_rus_v1_1_summary.csv and data/metrics/deeppavlov_fasttext_mteb_rus_v1_1_task_scores.csv.
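The two summary rows of the table can be reproduced from the per-task scores. The sketch below assumes the usual MTEB grouping of these tasks into seven task types (CEDRClassification and SensitiveTopicsClassification as multilabel classification, TERRa as pair classification). That grouping is not stated in this card, so treat it as an assumption. Mean (Task) averages all 23 scores directly, while Mean (Task Type) averages the per-type means.

```python
from statistics import mean

# Scores copied from the table; the grouping into task types is an assumption.
SCORES_BY_TYPE = {
    "Classification": {
        "GeoreviewClassification": 36.23,
        "HeadlineClassification": 80.17,
        "InappropriatenessClassification": 56.21,
        "KinopoiskClassification": 45.29,
        "MassiveIntentClassification": 51.04,
        "MassiveScenarioClassification": 57.92,
        "RuReviewsClassification": 51.09,
        "RuSciBenchGRNTIClassification": 48.91,
        "RuSciBenchOECDClassification": 39.99,
    },
    "Clustering": {
        "GeoreviewClusteringP2P": 37.85,
        "RuSciBenchGRNTIClusteringP2P": 47.35,
        "RuSciBenchOECDClusteringP2P": 42.36,
    },
    "MultilabelClassification": {
        "CEDRClassification": 33.80,
        "SensitiveTopicsClassification": 20.83,
    },
    "PairClassification": {"TERRa": 52.46},
    "Reranking": {"MIRACLReranking": 17.06, "RuBQReranking": 43.99},
    "Retrieval": {
        "MIRACLRetrievalHardNegatives.v2": 9.68,
        "RiaNewsRetrievalHardNegatives.v2": 28.58,
        "RuBQRetrieval": 18.07,
    },
    "STS": {"RUParaPhraserSTS": 56.50, "STS22": 52.00, "RuSTSBenchmarkSTS": 58.09},
}

# Mean (Task): every task counts equally.
all_scores = [s for tasks in SCORES_BY_TYPE.values() for s in tasks.values()]
mean_task = mean(all_scores)

# Mean (Task Type): average within each type first, then across types.
mean_task_type = mean(mean(tasks.values()) for tasks in SCORES_BY_TYPE.values())

print(len(all_scores))          # 23
print(f"{mean_task:.2f}")       # 42.85
print(f"{mean_task_type:.2f}")  # 39.86
```

The task-type mean is lower because the retrieval, reranking, and multilabel groups score poorly and carry the same weight as the nine-task classification group.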