---
language:
- multilingual
- af
- sq
- ar
- an
- hy
- ast
- az
- ba
- eu
- bar
- be
- bn
- inc
- bs
- br
- bg
- my
- ca
- ceb
- ce
- zh
- cv
- hr
- cs
- da
- nl
- en
- et
- fi
- fr
- gl
- ka
- de
- el
- gu
- ht
- he
- hi
- hu
- is
- io
- id
- ga
- it
- ja
- jv
- kn
- kk
- ky
- ko
- la
- lv
- lt
- roa
- nds
- lm
- mk
- mg
- ms
- ml
- mr
- min
- ne
- new
- nb
- nn
- oc
- fa
- pms
- pl
- pt
- pa
- ro
- ru
- sco
- sr
- hr
- scn
- sk
- sl
- aze
- es
- su
- sw
- sv
- tl
- tg
- ta
- tt
- te
- tr
- uk
- ud
- uz
- vi
- vo
- war
- cy
- fry
- pnb
- yo
thumbnail: https://amberoad.de/images/logo_text.png
tags:
- msmarco
- multilingual
- passage reranking
license: apache-2.0
datasets:
- msmarco
metrics:
- MRR
widget:
- query: What is a corporation?
  passage: A company is incorporated in a specific nation, often within the bounds of a smaller subset of that nation, such as a state or province. The corporation is then governed by the laws of incorporation in that state. A corporation may issue stock, either private or public, or may be classified as a non-stock corporation. If stock is issued, the corporation will usually be governed by its shareholders, either directly or indirectly.
---

# Passage Reranking Multilingual BERT 🔃 🌍

## Model description

**Input:** Supports over 100 languages. See the [list of supported languages](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages) for all of them.

**Purpose:** This module takes a search query [1] and a passage [2] and calculates whether the passage matches the query. It can be used to improve Elasticsearch results and boosts relevancy by up to 100%.

**Architecture:** On top of BERT there is a densely connected NN that takes the 768-dimensional [CLS] token as input and produces the output ([Arxiv](https://arxiv.org/abs/1901.04085)).

**Output:** A single value between -10 and 10. Better-matching query-passage pairs tend to have a higher score.
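
Since higher scores indicate better-matching pairs, reranking amounts to scoring every candidate passage against the query and sorting in descending order. The following is a minimal sketch of that step, not the model's own code: `rerank` and `toy_overlap_score` are hypothetical names, and the word-overlap scorer is only a stand-in so the example runs without downloading anything. In practice `score_fn` would wrap a forward pass of this model.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    passages: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score each passage against the query and sort best-first."""
    scored = [(passage, score_fn(query, passage)) for passage in passages]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer (assumption): number of shared lowercase words.

    This is NOT how the model scores pairs; it only makes the sketch runnable.
    """
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return float(len(query_words & passage_words))


candidates = [
    "Bananas are rich in potassium.",
    "A corporation is governed by its shareholders.",
    "What time is it?",
]

ranked = rerank("what is a corporation", candidates, toy_overlap_score)
for passage, score in ranked:
    print(f"{score:.1f}  {passage}")
```

Keeping the scorer as a parameter means the sorting logic stays the same whether the scores come from this model, from BM25, or from a test stub.
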
## Intended uses & limitations

Both query [1] and passage [2] have to fit in 512 tokens.
As you normally want to rerank the first few dozen search results, keep in mind the inference time of approximately 300 ms/query.

#### How to use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer that matches the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained("amberoad/bert-multilingual-passage-reranking-msmarco")

# Load the fine-tuned reranking model
model = AutoModelForSequenceClassification.from_pretrained("amberoad/bert-multilingual-passage-reranking-msmarco")
```

This model can be used as a drop-in replacement in the [Nboost Library](https://github.com/koursaros-ai/nboost).
Through this you can directly improve your Elasticsearch results without any coding.

## Training data

This model is trained using the [**Microsoft MS Marco Dataset**](https://microsoft.github.io/msmarco/ "Microsoft MS Marco"). This training dataset contains approximately 400M tuples of a query, relevant and non-relevant passages. All datasets used for training and evaluation are listed in this [table](https://github.com/microsoft/MSMARCO-Passage-Ranking#data-information-and-formating). The dataset used for training is called *Train Triples Large*, while the evaluation was made on *Top 1000 Dev*. There are 6,900 queries in total in the development dataset, where each query is mapped to the top 1,000 passages retrieved using BM25 from the MS MARCO corpus.

## Training procedure

The training is performed the same way as stated in this [README](https://github.com/nyu-dl/dl4marco-bert "NYU Github"). See their excellent paper on [Arxiv](https://arxiv.org/abs/1901.04085).

We changed the BERT model from an English-only one to the default BERT Multilingual uncased model from [Google](https://huggingface.co/bert-base-multilingual-uncased).

Training ran for 400,000 steps, which took 12 hours on a TPU V3-8.

## Eval results

We see nearly the same performance as the English-only model on the English [Bing Queries Dataset](http://www.msmarco.org/).
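
The metadata of this card lists MRR (mean reciprocal rank) as the metric. As a rough sketch, assuming the paired numbers reported in the results table (for example 0.29 vs 0.18) are MRR values for the reranked and baseline rankings, which the card does not state explicitly, the metric and the relative boost could be computed as follows. The function names are hypothetical, and the cutoff of 10 is an assumption borrowed from the MRR@10 convention commonly used with MS MARCO.

```python
from typing import List


def mean_reciprocal_rank(ranked_relevance: List[List[bool]], cutoff: int = 10) -> float:
    """Average of 1/rank of the first relevant passage per query.

    ranked_relevance holds one list per query, in ranked order, where True
    marks a relevant passage. Queries with no relevant passage within the
    cutoff contribute 0.
    """
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for flags in ranked_relevance:
        for rank, is_relevant in enumerate(flags[:cutoff], start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


def relative_boost(reranked: float, baseline: float) -> float:
    """Relative improvement of the reranked metric over the baseline."""
    return (reranked - baseline) / baseline


# Three toy queries: first hit at rank 2, at rank 1, and no hit at all.
toy_runs = [[False, True], [True], [False, False]]
print(mean_reciprocal_rank(toy_runs))             # (1/2 + 1 + 0) / 3
print(round(relative_boost(0.29, 0.18) * 100))    # percent gain for 0.29 vs 0.18
```
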
Although the training data is English only, internal tests on private data showed far higher accuracy in German than all other available models.

Fine-tuned Models | Dependency | Eval Set | Search Boost | Speed on GPU
----------------- | ---------- | -------- | ------------ | ------------
**`amberoad/Multilingual-uncased-MSMARCO`** (This Model) | PyTorch | bing queries | **+61%** (0.29 vs 0.18) | ~300 ms/query
`nboost/pt-tinybert-msmarco` | PyTorch | bing queries | **+45%** (0.26 vs 0.18) | ~50 ms/query
`nboost/pt-bert-base-uncased-msmarco` | PyTorch | bing queries | **+62%** (0.29 vs 0.18) | ~300 ms/query
`nboost/pt-bert-large-msmarco` | PyTorch | bing queries | **+77%** (0.32 vs 0.18) | -
`nboost/pt-biobert-base-msmarco` | PyTorch | biomed | **+66%** (0.17 vs 0.10) | ~300 ms/query

This table is taken from [nboost](https://github.com/koursaros-ai/nboost) and extended by the first line.

## Contact Infos

![](https://amberoad.de/images/logo_text.png)

Amberoad is a company focussing on Search and Business Intelligence.
We provide you:
* Advanced Internal Company Search Engines through NLP
* External Search Engines: Find Competitors, Customers, Suppliers

**Get in Contact now to benefit from our Expertise:**

The training and evaluation were performed by [**Philipp Reissel**](https://reissel.eu/) and [**Igli Manaj**](https://github.com/iglimanaj)

[![Amberoad](https://i.stack.imgur.com/gVE0j.png) Linkedin](https://de.linkedin.com/company/amberoad) | [Homepage](https://de.linkedin.com/company/amberoad) | [Email](info@amberoad.de)