--- license: apache-2.0 language: - en - ru datasets: - unicamp-dl/mmarco --- # Model for English and Russian This is a truncated version of [jeffwan/mmarco-mMiniLMv2-L12-H384-v1](https://huggingface.co/jeffwan/mmarco-mMiniLMv2-L12-H384-v1). This model has only English and Russian tokens left in the vocabulary. Thus making it twice as small as the original model while producing the same embeddings. # Cross-Encoder for multilingual MS Marco This model was trained on the [MMARCO](https://hf.co/unicamp-dl/mmarco) dataset. It is a machine translated version of MS MARCO using Google Translate. It was translated to 14 languages. In our experiments, we observed that it performs also well for other languages. As a base model, we used the [multilingual MiniLMv2](https://huggingface.co/nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large) model. The model can be used for Information Retrieval: Given a query, encode the query will all possible passages (e.g. retrieved with ElasticSearch). Then sort the passages in a decreasing order. See [SBERT.net Retrieve & Re-rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html) for more details. The training code is available here: [SBERT.net Training MS Marco](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/ms_marco) ## Usage with SentenceTransformers The usage becomes easy when you have [SentenceTransformers](https://www.sbert.net/) installed. Then, you can use the pre-trained models like this: ```python from sentence_transformers import CrossEncoder model = CrossEncoder('model_name') scores = model.predict([('Query', 'Paragraph1'), ('Query', 'Paragraph2') , ('Query', 'Paragraph3')]) ``` ## Usage with Transformers ```python from transformers import AutoTokenizer, AutoModelForSequenceClassification import torch model = AutoModelForSequenceClassification.from_pretrained('model_name') tokenizer = AutoTokenizer.from_pretrained('model_name') features = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'], ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'], padding=True, truncation=True, return_tensors="pt") model.eval() with torch.no_grad(): scores = model(**features).logits print(scores) ``` The model has been truncated in [this notebook](https://colab.research.google.com/drive/19IFjWpJpxQie1gtHSvDeoKk7CQtpy6bT?usp=sharing).