setu4993 committed
Commit a86d3da
1 Parent(s): 54a891b

Add model card

Files changed (1): README.md (+218, -0)
README.md ADDED

---
language:
- af
- am
- ar
- as
- az
- be
- bg
- bn
- bo
- bs
- ca
- ceb
- co
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- haw
- he
- hi
- hmn
- hr
- ht
- hu
- hy
- id
- ig
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lb
- lo
- lt
- lv
- mg
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- ne
- nl
- no
- ny
- or
- pa
- pl
- pt
- ro
- ru
- rw
- si
- sk
- sl
- sm
- sn
- so
- sq
- sr
- st
- su
- sv
- sw
- ta
- te
- tg
- th
- tk
- tl
- tr
- tt
- ug
- uk
- ur
- uz
- vi
- wo
- xh
- yi
- yo
- zh
- zu
tags:
- bert
- sentence_embedding
- multilingual
- google
license: apache-2.0
datasets:
- CommonCrawl
- Wikipedia
---

# LaBSE

## Model description

Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained to produce sentence embeddings for 109 languages. Pre-training combines masked language modeling with translation language modeling. The model is useful for obtaining multilingual sentence embeddings and for bi-text retrieval.

- Model: [HuggingFace's model hub](https://huggingface.co/setu4993/LaBSE).
- Paper: [arXiv](https://arxiv.org/abs/2007.01852).
- Original model: [TensorFlow Hub](https://tfhub.dev/google/LaBSE/1).
- Blog post: [Google AI Blog](https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html).

## Usage

Using the model:

```python
import torch
from transformers import BertModel, BertTokenizerFast


tokenizer = BertTokenizerFast.from_pretrained("setu4993/LaBSE")
model = BertModel.from_pretrained("setu4993/LaBSE")
model = model.eval()

english_sentences = [
    "dog",
    "Puppies are nice.",
    "I enjoy taking long walks along the beach with my dog.",
]
english_inputs = tokenizer(english_sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    english_outputs = model(**english_inputs)
```
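
Inference runs on CPU by default. A minimal sketch for moving the model and the tokenized inputs to a GPU when one is available (an optional step, not required for the examples that follow):

```python
# Optional GPU setup: falls back to CPU when CUDA is unavailable.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
english_inputs = english_inputs.to(device)

with torch.no_grad():
    english_outputs = model(**english_inputs)
```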

To get the sentence embeddings, use the pooler output:

```python
english_embeddings = english_outputs.pooler_output
```
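
As a quick sanity check, the pooler output holds one vector per input sentence. A minimal sketch, assuming the three English sentences above and LaBSE's BERT-base hidden size of 768:

```python
# One 768-dimensional embedding per input sentence.
print(english_embeddings.shape)  # expected: torch.Size([3, 768])
```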

Output for other languages:

```python
italian_sentences = [
    "cane",
    "I cuccioli sono carini.",
    "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.",
]
japanese_sentences = ["犬", "子犬はいいです", "私は犬と一緒にビーチを散歩するのが好きです"]
italian_inputs = tokenizer(italian_sentences, return_tensors="pt", padding=True)
japanese_inputs = tokenizer(japanese_sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    italian_outputs = model(**italian_inputs)
    japanese_outputs = model(**japanese_inputs)

italian_embeddings = italian_outputs.pooler_output
japanese_embeddings = japanese_outputs.pooler_output
```

To compute the similarity between sentences, L2-normalize the embeddings before taking the dot product:

```python
import torch.nn.functional as F


def similarity(embeddings_1, embeddings_2):
    # After L2 normalization, the dot product equals cosine similarity.
    normalized_embeddings_1 = F.normalize(embeddings_1, p=2)
    normalized_embeddings_2 = F.normalize(embeddings_2, p=2)
    return torch.matmul(
        normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1)
    )


print(similarity(english_embeddings, italian_embeddings))
print(similarity(english_embeddings, japanese_embeddings))
print(similarity(italian_embeddings, japanese_embeddings))
```
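
Row *i*, column *j* of the returned matrix scores sentence *i* of the first batch against sentence *j* of the second, so bi-text retrieval reduces to an argmax over rows. A minimal sketch using only the variables defined above:

```python
# For each English sentence, retrieve the closest Italian sentence.
scores = similarity(english_embeddings, italian_embeddings)
for i, j in enumerate(scores.argmax(dim=1).tolist()):
    print(english_sentences[i], "->", italian_sentences[j])
```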

## Details

Details about the data, training, evaluation and performance metrics are available in the [original paper](https://arxiv.org/abs/2007.01852).

### BibTeX entry and citation info

```bibtex
@misc{feng2020languageagnostic,
    title={Language-agnostic BERT Sentence Embedding},
    author={Fangxiaoyu Feng and Yinfei Yang and Daniel Cer and Naveen Arivazhagan and Wei Wang},
    year={2020},
    eprint={2007.01852},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```