prashantkumarbarman committed on
Commit
efa587c
1 Parent(s): c093cc1

Pushing the model to Hugging Face hub

README.md ADDED
@@ -0,0 +1,210 @@
+ ---
+ language:
+ - multilingual
+ - af
+ - sq
+ - ar
+ - an
+ - hy
+ - ast
+ - az
+ - ba
+ - eu
+ - bar
+ - be
+ - bn
+ - inc
+ - bs
+ - br
+ - bg
+ - my
+ - ca
+ - ceb
+ - ce
+ - zh
+ - cv
+ - hr
+ - cs
+ - da
+ - nl
+ - en
+ - et
+ - fi
+ - fr
+ - gl
+ - ka
+ - de
+ - el
+ - gu
+ - ht
+ - he
+ - hi
+ - hu
+ - is
+ - io
+ - id
+ - ga
+ - it
+ - ja
+ - jv
+ - kn
+ - kk
+ - ky
+ - ko
+ - la
+ - lv
+ - lt
+ - roa
+ - nds
+ - lm
+ - mk
+ - mg
+ - ms
+ - ml
+ - mr
+ - min
+ - ne
+ - new
+ - nb
+ - nn
+ - oc
+ - fa
+ - pms
+ - pl
+ - pt
+ - pa
+ - ro
+ - ru
+ - sco
+ - sr
+ - hr
+ - scn
+ - sk
+ - sl
+ - aze
+ - es
+ - su
+ - sw
+ - sv
+ - tl
+ - tg
+ - ta
+ - tt
+ - te
+ - tr
+ - uk
+ - ud
+ - uz
+ - vi
+ - vo
+ - war
+ - cy
+ - fry
+ - pnb
+ - yo
+ thumbnail: https://amberoad.de/images/logo_text.png
+ tags:
+ - msmarco
+ - multilingual
+ - passage reranking
+ license: apache-2.0
+ datasets:
+ - msmarco
+ metrics:
+ - MRR
+ widget:
+ - query: What is a corporation?
+   passage: A company is incorporated in a specific nation, often within the bounds
+     of a smaller subset of that nation, such as a state or province. The corporation
+     is then governed by the laws of incorporation in that state. A corporation may
+     issue stock, either private or public, or may be classified as a non-stock corporation.
+     If stock is issued, the corporation will usually be governed by its shareholders,
+     either directly or indirectly.
+ ---
+
+ # Passage Reranking Multilingual BERT 🔃 🌍
+
+
+ ## Model description
+ **Input:** Supports over 100 languages. See the [list of supported languages](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages) for all available ones.
+
+ **Purpose:** This module takes a search query [1] and a passage [2] and calculates whether the passage matches the query.
+ It can be used to improve Elasticsearch results and boosts relevancy by up to 100%.
+
+ **Architecture:** On top of BERT there is a densely connected NN which takes the 768-dimensional [CLS] token as input and provides the output, as sketched below ([Arxiv](https://arxiv.org/abs/1901.04085)).
+
+ **Output:** A single value between -10 and 10. Better matching query/passage pairs tend to have a higher score.
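+
+ Below is a minimal sketch of that head (a dense layer on top of the pooled [CLS] representation), assuming the 768-dimensional hidden size and the `BertForSequenceClassification` architecture listed in `config.json`; the class and variable names are illustrative only and this is not the original training code:
+
+ ```python
+ import torch.nn as nn
+ from transformers import BertModel
+
+
+ class RerankingHeadSketch(nn.Module):
+     """Illustrative sketch: BERT encoder plus a dense head on the [CLS] vector."""
+
+     def __init__(self, encoder_name="bert-base-multilingual-uncased"):
+         super().__init__()
+         self.bert = BertModel.from_pretrained(encoder_name)
+         # 768-dimensional pooled [CLS] vector -> 2 logits (non-relevant, relevant)
+         self.classifier = nn.Linear(768, 2)
+
+     def forward(self, input_ids, attention_mask, token_type_ids):
+         outputs = self.bert(input_ids=input_ids,
+                             attention_mask=attention_mask,
+                             token_type_ids=token_type_ids)
+         return self.classifier(outputs.pooler_output)
+ ```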
+
+
+ ## Intended uses & limitations
+ Both query [1] and passage [2] together have to fit within 512 tokens.
+ As you will normally want to rerank only the first few dozen search results, keep in mind the inference time of approximately 300 ms per query.
+
+ #### How to use
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ tokenizer = AutoTokenizer.from_pretrained("amberoad/bert-multilingual-passage-reranking-msmarco")
+ model = AutoModelForSequenceClassification.from_pretrained("amberoad/bert-multilingual-passage-reranking-msmarco")
+ ```
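+
+ A minimal scoring example building on the snippet above; treating the second logit as the relevance score is an assumption based on the `BertForSequenceClassification` head in `config.json`, not something stated elsewhere in this card:
+
+ ```python
+ import torch
+
+ query = "What is a corporation?"
+ passage = "A company is incorporated in a specific nation and governed by the laws of that state."
+
+ # Query and passage are encoded together as one pair and must fit into 512 tokens.
+ inputs = tokenizer(query, passage, return_tensors="pt", truncation=True, max_length=512)
+
+ with torch.no_grad():
+     logits = model(**inputs).logits  # shape: (1, 2)
+
+ # Higher value = better match (assumed: index 1 is the "relevant" logit).
+ score = logits[0, 1].item()
+ print(score)
+ ```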
+
+ This model can be used as a drop-in replacement in the [Nboost Library](https://github.com/koursaros-ai/nboost).
+ Through this you can directly improve your Elasticsearch results without any coding.
+
+
+ ## Training data
+
+ This model is trained using the [**Microsoft MS Marco Dataset**](https://microsoft.github.io/msmarco/ "Microsoft MS Marco"). This training dataset contains approximately 400M tuples of a query together with relevant and non-relevant passages. All datasets used for training and evaluation are listed in this [table](https://github.com/microsoft/MSMARCO-Passage-Ranking#data-information-and-formating). The dataset used for training is called *Train Triples Large*, while the evaluation was made on *Top 1000 Dev*. There are 6,900 queries in total in the development dataset, where each query is mapped to the top 1,000 passages retrieved using BM25 from the MS MARCO corpus.
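+
+ Each line in the *Train Triples* files is, as far as we know, a tab-separated triple of query, relevant passage and non-relevant passage; the snippet below only illustrates that assumed layout (the file name is a placeholder), it is not part of the official tooling:
+
+ ```python
+ # Assumed layout per line: query \t relevant passage \t non-relevant passage
+ with open("train_triples.tsv", encoding="utf-8") as f:
+     for line in f:
+         query, positive, negative = line.rstrip("\n").split("\t")
+         # (query, positive) is a relevant pair, (query, negative) a non-relevant one
+         break  # just inspect the first triple
+ ```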
+
+ ## Training procedure
+
+ The training is performed the same way as stated in this [README](https://github.com/nyu-dl/dl4marco-bert "NYU Github"). See their excellent paper on [Arxiv](https://arxiv.org/abs/1901.04085).
+
+ We changed the BERT model from an English-only one to the default multilingual uncased BERT model from [Google](https://huggingface.co/bert-base-multilingual-uncased).
+
+ Training was done for 400,000 steps. This equaled 12 hours on a TPU v3-8.
+
+
+ ## Eval results
+
+ We see nearly the same performance as the English-only model on the English [Bing Queries Dataset](http://www.msmarco.org/). Although the training data is English only, internal tests on private data showed a far higher accuracy in German than all other available models.
+
+
+ Fine-tuned Models | Dependency | Eval Set | Search Boost | Speed on GPU
+ ----------------- | ---------- | -------- | ------------ | ------------
+ **`amberoad/Multilingual-uncased-MSMARCO`** (This Model) | <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-blue"/> | <a href='http://www.msmarco.org/'>bing queries</a> | **+61%** <sub><sup>(0.29 vs 0.18)</sup></sub> | ~300 ms/query
+ `nboost/pt-tinybert-msmarco` | <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-red"/> | <a href='http://www.msmarco.org/'>bing queries</a> | **+45%** <sub><sup>(0.26 vs 0.18)</sup></sub> | ~50 ms/query
+ `nboost/pt-bert-base-uncased-msmarco` | <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-red"/> | <a href='http://www.msmarco.org/'>bing queries</a> | **+62%** <sub><sup>(0.29 vs 0.18)</sup></sub> | ~300 ms/query
+ `nboost/pt-bert-large-msmarco` | <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-red"/> | <a href='http://www.msmarco.org/'>bing queries</a> | **+77%** <sub><sup>(0.32 vs 0.18)</sup></sub> | -
+ `nboost/pt-biobert-base-msmarco` | <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-red"/> | <a href='https://github.com/naver/biobert-pretrained'>biomed</a> | **+66%** <sub><sup>(0.17 vs 0.10)</sup></sub> | ~300 ms/query
+
+ This table is taken from [nboost](https://github.com/koursaros-ai/nboost) and extended by the first line.
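+
+ The boost percentages correspond to the MRR@10 figures in parentheses (e.g. 0.29 vs the 0.18 BM25 baseline is roughly +61%). As a rough reference for how that metric is computed, here is a small sketch of MRR@10; the input layout (ranked passage ids per query plus a set of relevant ids) is an assumption for illustration, not the official MS MARCO evaluation script:
+
+ ```python
+ def mrr_at_10(ranked_ids_per_query, relevant_ids_per_query):
+     """Mean Reciprocal Rank with a cutoff of 10 passages per query."""
+     total = 0.0
+     for ranked_ids, relevant_ids in zip(ranked_ids_per_query, relevant_ids_per_query):
+         for rank, passage_id in enumerate(ranked_ids[:10], start=1):
+             if passage_id in relevant_ids:
+                 total += 1.0 / rank
+                 break  # only the first relevant hit counts
+     return total / len(ranked_ids_per_query)
+
+
+ # Toy example: the only relevant passage sits at rank 2 -> MRR@10 = 0.5
+ print(mrr_at_10([["p7", "p3", "p9"]], [{"p3"}]))
+ ```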
+
+
+
+ ## Contact Info
+
+ ![](https://amberoad.de/images/logo_text.png)
+
+ Amberoad is a company focusing on Search and Business Intelligence.
+ We provide you:
+ * Advanced Internal Company Search Engines through NLP
+ * External Search Engines: Find Competitors, Customers, Suppliers
+
+ **Get in contact now to benefit from our expertise:**
+
+ The training and evaluation were performed by [**Philipp Reissel**](https://reissel.eu/) and [**Igli Manaj**](https://github.com/iglimanaj).
+
+ [![Amberoad](https://i.stack.imgur.com/gVE0j.png) Linkedin](https://de.linkedin.com/company/amberoad) | <svg xmlns="http://www.w3.org/2000/svg" x="0px" y="0px"
+ width="32" height="32"
+ viewBox="0 0 172 172"
+ style=" fill:#000000;"><g fill="none" fill-rule="nonzero" stroke="none" stroke-width="1" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="10" stroke-dasharray="" stroke-dashoffset="0" font-family="none" font-weight="none" font-size="none" text-anchor="none" style="mix-blend-mode: normal"><path d="M0,172v-172h172v172z" fill="none"></path><g fill="#e67e22"><path d="M37.625,21.5v86h96.75v-86h-5.375zM48.375,32.25h10.75v10.75h-10.75zM69.875,32.25h10.75v10.75h-10.75zM91.375,32.25h32.25v10.75h-32.25zM48.375,53.75h75.25v43h-75.25zM80.625,112.875v17.61572c-1.61558,0.93921 -2.94506,2.2687 -3.88428,3.88428h-49.86572v10.75h49.86572c1.8612,3.20153 5.28744,5.375 9.25928,5.375c3.97183,0 7.39808,-2.17347 9.25928,-5.375h49.86572v-10.75h-49.86572c-0.93921,-1.61558 -2.2687,-2.94506 -3.88428,-3.88428v-17.61572z"></path></g></g></svg>[Homepage](https://de.linkedin.com/company/amberoad) | [Email](mailto:info@amberoad.de)
+
+
config.json ADDED
@@ -0,0 +1,25 @@
+ {"architectures": [
+     "BertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "directionality": "bidi",
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "pooler_fc_size": 768,
+   "pooler_num_attention_heads": 12,
+   "pooler_num_fc_layers": 3,
+   "pooler_size_per_head": 128,
+   "pooler_type": "first_token_transform",
+   "type_vocab_size": 2,
+   "vocab_size": 105879
+ }
flax_model.msgpack ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1d4cd912eb99c7d8d5a9e3a58a8cdecd47fda4fcf59bd0fec8a0b06b1584b099
+ size 669439034
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:188287a61bb87387f5a2783ccfffb4f649ceed50d8fc7bbff4a7cb964105cbc1
+ size 669478888
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:782ba55476d1387470ab4c0b8ecdf05b544fc86f484952044aa57469332a63ec
+ size 669702896
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"special_tokens_map_file": null, "full_tokenizer_file": null}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff