setu4993 committed on
Commit
5c08f03
1 Parent(s): 52ce802

Add model cards, re-export model

Files changed (3)
  1. README.md +220 -0
  2. config.json +1 -4
  3. tf_model.h5 +1 -1
README.md CHANGED
@@ -1,3 +1,223 @@
  ---
+ language:
+ - af
+ - am
+ - ar
+ - as
+ - az
+ - be
+ - bg
+ - bn
+ - bo
+ - bs
+ - ca
+ - ceb
+ - co
+ - cs
+ - cy
+ - da
+ - de
+ - el
+ - en
+ - eo
+ - es
+ - et
+ - eu
+ - fa
+ - fi
+ - fr
+ - fy
+ - ga
+ - gd
+ - gl
+ - gu
+ - ha
+ - haw
+ - he
+ - hi
+ - hmn
+ - hr
+ - ht
+ - hu
+ - hy
+ - id
+ - ig
+ - is
+ - it
+ - ja
+ - jv
+ - ka
+ - kk
+ - km
+ - kn
+ - ko
+ - ku
+ - ky
+ - la
+ - lb
+ - lo
+ - lt
+ - lv
+ - mg
+ - mi
+ - mk
+ - ml
+ - mn
+ - mr
+ - ms
+ - mt
+ - my
+ - ne
+ - nl
+ - no
+ - ny
+ - or
+ - pa
+ - pl
+ - pt
+ - ro
+ - ru
+ - rw
+ - si
+ - sk
+ - sl
+ - sm
+ - sn
+ - so
+ - sq
+ - sr
+ - st
+ - su
+ - sv
+ - sw
+ - ta
+ - te
+ - tg
+ - th
+ - tk
+ - tl
+ - tr
+ - tt
+ - ug
+ - uk
+ - ur
+ - uz
+ - vi
+ - wo
+ - xh
+ - yi
+ - yo
+ - zh
+ - zu
+ tags:
+ - bert
+ - sentence_embedding
+ - multilingual
+ - google
+ - sentence-similarity
+ - lealla
+ - labse
  license: apache-2.0
+ datasets:
+ - CommonCrawl
+ - Wikipedia
  ---
+
+ # LEALLA-base
+
+ ## Model description
+
+ LEALLA is a collection of lightweight language-agnostic sentence embedding models supporting 109 languages, distilled from [LaBSE](https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html). The model is useful for getting multilingual sentence embeddings and for bi-text retrieval.
+
+ - Model: [HuggingFace's model hub](https://huggingface.co/setu4993/LEALLA-base).
+ - Paper: [arXiv](https://arxiv.org/abs/2302.08387).
+ - Original model: [TensorFlow Hub](https://tfhub.dev/google/LEALLA/LEALLA-base/1).
+ - Conversion from TensorFlow to PyTorch: [GitHub](https://github.com/setu4993/convert-labse-tf-pt).
+
+ This model is migrated from the v1 model on the TF Hub. The embeddings produced by both versions of the model are [equivalent](https://github.com/setu4993/convert-labse-tf-pt/blob/c0d4fbce789b0709a9664464f032d2e9f5368a86/tests/test_conversion_lealla.py#L31), though for some languages (like Japanese) the LEALLA models appear to require higher tolerances when comparing embeddings and similarities.
+
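+ To sanity-check that equivalence for a given language, a comparison along these lines can be used. This is a minimal sketch: `reference_embeddings` stands in for the TF Hub model's output on the same sentences, and the `atol` value is only illustrative.
+
+ ```python
+ import torch
+
+
+ def embeddings_close(embeddings, reference_embeddings, atol=1e-3):
+     # Relaxed absolute tolerance; some languages (e.g. Japanese) show
+     # slightly larger numerical differences between the two exports.
+     return torch.allclose(embeddings, reference_embeddings, atol=atol)
+ ```
+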
+ ## Usage
+
+ Using the model:
+
+ ```python
+ import torch
+ from transformers import BertModel, BertTokenizerFast
+
+
+ tokenizer = BertTokenizerFast.from_pretrained("setu4993/LEALLA-base")
+ model = BertModel.from_pretrained("setu4993/LEALLA-base")
+ model = model.eval()
+
+ english_sentences = [
+     "dog",
+     "Puppies are nice.",
+     "I enjoy taking long walks along the beach with my dog.",
+ ]
+ english_inputs = tokenizer(english_sentences, return_tensors="pt", padding=True)
+
+ with torch.no_grad():
+     english_outputs = model(**english_inputs)
+ ```
+
+ To get the sentence embeddings, use the pooler output:
+
+ ```python
+ english_embeddings = english_outputs.pooler_output
+ ```
+
+ Output for other languages:
+
+ ```python
+ italian_sentences = [
+     "cane",
+     "I cuccioli sono carini.",
+     "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.",
+ ]
+ japanese_sentences = ["犬", "子犬はいいです", "私は犬と一緒にビーチを散歩するのが好きです"]
+ italian_inputs = tokenizer(italian_sentences, return_tensors="pt", padding=True)
+ japanese_inputs = tokenizer(japanese_sentences, return_tensors="pt", padding=True)
+
+ with torch.no_grad():
+     italian_outputs = model(**italian_inputs)
+     japanese_outputs = model(**japanese_inputs)
+
+ italian_embeddings = italian_outputs.pooler_output
+ japanese_embeddings = japanese_outputs.pooler_output
+ ```
+
+ For similarity between sentences, L2-normalizing the embeddings is recommended before computing the similarity:
+
+ ```python
+ import torch.nn.functional as F
+
+
+ def similarity(embeddings_1, embeddings_2):
+     normalized_embeddings_1 = F.normalize(embeddings_1, p=2)
+     normalized_embeddings_2 = F.normalize(embeddings_2, p=2)
+     return torch.matmul(
+         normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1)
+     )
+
+
+ print(similarity(english_embeddings, italian_embeddings))
+ print(similarity(english_embeddings, japanese_embeddings))
+ print(similarity(italian_embeddings, japanese_embeddings))
+ ```
+
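+ As a rough sketch of the bi-text retrieval use case mentioned above, the same similarity matrix can be used to pick, for each source sentence, its closest counterpart in another language (the helper name `retrieve_translations` below is only illustrative):
+
+ ```python
+ def retrieve_translations(source_embeddings, target_embeddings):
+     # For each source sentence, pick the target sentence with the highest
+     # cosine similarity (the embeddings are L2-normalized inside `similarity`,
+     # so the dot product equals the cosine similarity).
+     scores = similarity(source_embeddings, target_embeddings)
+     return torch.argmax(scores, dim=1)
+
+
+ # Index of the best-matching Italian sentence for each English sentence.
+ print(retrieve_translations(english_embeddings, italian_embeddings))
+ ```
+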
+ ## Details
+
+ Details about data, training, evaluation and performance metrics are available in the [original paper](https://arxiv.org/abs/2302.08387).
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @misc{mao2023lealla,
+     title={LEALLA: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation},
+     author={Zhuoyuan Mao and Tetsuji Nakagawa},
+     year={2023},
+     eprint={2302.08387},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
config.json CHANGED
@@ -1,8 +1,5 @@
  {
- "_name_or_path": "/Users/setu/Models/huggingface/setu4993/LEALLA-base",
- "architectures": [
-   "BertModel"
- ],
+ "architectures": ["BertModel"],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
tf_model.h5 CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c7ead4056bb7a75809b98973569f9eeab19de03edb97d7054bee86a97615175e
+ oid sha256:c295cd806afc9516e4f2a99192e1a0437d17163f4a82f05107ac7a5e7f91a882
  size 428702448