---
license: mit
language:
- ru
- en
tags:
- transformers
- sentence-transformers
---

# Model Card for ru-en-RoSBERTa

ru-en-RoSBERTa is a general-purpose text embedding model for Russian. It is based on [ruRoBERTa](https://huggingface.co/ai-forever/ruRoberta-large) and fine-tuned on ~4M pairs of supervised, synthetic and unsupervised data in Russian and English. The tokenizer supports some English tokens from the [RoBERTa](https://huggingface.co/FacebookAI/roberta-large) tokenizer.

For more details about the model, please refer to our [article](https://arxiv.org/abs/2408.12503).

## Usage

The model can be used as is, with a prefix added to each input text. CLS pooling is recommended. The choice of prefix and pooling depends on the task.

We use the following basic rules to choose a prefix:
- `"search_query: "` and `"search_document: "` prefixes are for answer or relevant paragraph retrieval
- `"classification: "` prefix is for symmetric paraphrasing-related tasks (STS, NLI, Bitext Mining)
- `"clustering: "` prefix is for any tasks that rely on thematic features (topic classification, title-body retrieval)

To better tailor the model to your needs, you can fine-tune it on relevant high-quality Russian and English datasets.

Below are examples of encoding texts with the Transformers and SentenceTransformers libraries.

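Before the library examples, the prefix rules above can be captured in a small helper. This is only a sketch: `TASK_PREFIXES` and `add_prefix` are hypothetical names introduced here for illustration and are not part of the model or of any library; only the four prefix strings come from this card.

```python
# Hypothetical helper, not part of the model's API: maps a task name to one
# of the prefixes listed in this card.
TASK_PREFIXES = {
    "query": "search_query: ",
    "document": "search_document: ",
    "classification": "classification: ",
    "clustering": "clustering: ",
}


def add_prefix(texts, task):
    """Prepend the prefix for `task` to every text in `texts`."""
    try:
        prefix = TASK_PREFIXES[task]
    except KeyError:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(TASK_PREFIXES)}"
        ) from None
    return [prefix + text for text in texts]


queries = add_prefix(["Сколько программистов нужно, чтобы вкрутить лампочку?"], "query")
print(queries[0])
```

The returned strings can be passed directly to the tokenizer or to `model.encode` in the examples that follow.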
### Transformers

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def pool(hidden_state, mask, pooling_method="cls"):
    if pooling_method == "mean":
        # average token vectors, ignoring padding positions
        s = torch.sum(hidden_state * mask.unsqueeze(-1).float(), dim=1)
        d = mask.sum(dim=1, keepdim=True).float()
        return s / d
    elif pooling_method == "cls":
        # take the vector of the first token
        return hidden_state[:, 0]
    raise ValueError(f"unknown pooling method: {pooling_method}")


inputs = [
    # first text of each pair
    "classification: Он нам и не нужон ваш Интернет!",
    "clustering: В Ярославской области разрешили работу бань, но без посетителей",
    "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?",
    # second text of each pair
    "classification: What a time to be alive!",
    "clustering: Ярославским баням разрешили работать без посетителей",
    "search_document: Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование.",
]

tokenizer = AutoTokenizer.from_pretrained("ai-forever/ru-en-RoSBERTa")
model = AutoModel.from_pretrained("ai-forever/ru-en-RoSBERTa")

tokenized_inputs = tokenizer(inputs, max_length=512, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**tokenized_inputs)

embeddings = pool(
    outputs.last_hidden_state,
    tokenized_inputs["attention_mask"],
    pooling_method="cls"  # or try "mean"
)

embeddings = F.normalize(embeddings, p=2, dim=1)

# similarity of each text in the first half with its counterpart in the second half
sim_scores = embeddings[:3] @ embeddings[3:].T
print(sim_scores.diag().tolist())
# [0.4796873927116394, 0.9409002065658569, 0.7761015892028809]
```

### SentenceTransformers

```python
from sentence_transformers import SentenceTransformer

inputs = [
    # first text of each pair
    "classification: Он нам и не нужон ваш Интернет!",
    "clustering: В Ярославской области разрешили работу бань, но без посетителей",
    "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?",
    # second text of each pair
    "classification: What a time to be alive!",
    "clustering: Ярославским баням разрешили работать без посетителей",
    "search_document: Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование.",
]

# loads model with CLS pooling
model = SentenceTransformer("ai-forever/ru-en-RoSBERTa")

# embeddings are normalized by default
embeddings = model.encode(inputs, convert_to_tensor=True)

sim_scores = embeddings[:3] @ embeddings[3:].T
print(sim_scores.diag().tolist())
# [0.47968706488609314, 0.940900444984436, 0.7761018872261047]
```

## Citation

```
@misc{snegirev2024russianfocusedembeddersexplorationrumteb,
      title={The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design},
      author={Artem Snegirev and Maria Tikhonova and Anna Maksimova and Alena Fenogenova and Alexander Abramov},
      year={2024},
      eprint={2408.12503},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.12503},
}
```

## Limitations

The model is designed to process texts in Russian; the quality in English is unknown. Maximum input text length is limited to 512 tokens.
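
For texts longer than the 512-token limit, one possible workaround is to split the token ids into overlapping windows and encode each window separately. The sketch below is an assumption, not a method described in this card: `chunk_token_ids` is a hypothetical helper, the default of 510 tokens assumes the tokenizer adds two special tokens per sequence, and the overlap of 64 is arbitrary. Any prefix re-added to each chunk also counts toward the limit.

```python
def chunk_token_ids(token_ids, max_tokens=510, overlap=64):
    """Split a list of token ids into overlapping windows of at most `max_tokens`.

    Hypothetical helper: 510 assumes two positions are reserved for special
    tokens; reduce it further if a prefix is prepended to every chunk.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_tokens])
        # stop once the current window already reaches the end of the text
        if start + max_tokens >= len(token_ids):
            break
    return chunks


# stand-in ids; in practice these would come from the tokenizer
ids = list(range(1200))
chunks = chunk_token_ids(ids)
print([len(c) for c in chunks])
```

How the per-chunk embeddings are combined afterwards (for example, averaging them or keeping the best-matching chunk) is a design choice this card does not specify.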