patrickvonplaten committed
Commit b2935c8
1 parent: a86d3da

allow flax

Files changed (9)
  1. .gitattributes +1 -0
  2. LICENSE +0 -201
  3. README.md +0 -218
  4. config.json +0 -21
  5. pytorch_model.bin +0 -3
  6. special_tokens_map.json +0 -7
  7. tf_model.h5 +0 -3
  8. tokenizer_config.json +0 -13
  9. vocab.txt +0 -0
.gitattributes CHANGED
@@ -6,3 +6,4 @@
  *.tar.gz filter=lfs diff=lfs merge=lfs -text
  *.ot filter=lfs diff=lfs merge=lfs -text
  *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
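
This is the substantive change behind the "allow flax" commit message: Flax/JAX checkpoints in `transformers` are serialized as `flax_model.msgpack`, so tracking `*.msgpack` with Git LFS allows a Flax weight file to be pushed to this repository. Below is a minimal loading sketch, illustrative only: it assumes `transformers` is installed with Flax support and that a `flax_model.msgpack` checkpoint is (or will be) present; the repository id is the one referenced in the README further down.

```python
# Sketch only: assumes transformers with Flax/JAX support and that the
# repository contains a flax_model.msgpack checkpoint.
from transformers import BertTokenizerFast, FlaxBertModel

tokenizer = BertTokenizerFast.from_pretrained("setu4993/LaBSE")
model = FlaxBertModel.from_pretrained("setu4993/LaBSE")

sentences = ["dog", "Puppies are nice."]
inputs = tokenizer(sentences, return_tensors="np", padding=True)

outputs = model(**inputs)
embeddings = outputs.pooler_output  # (batch_size, 768) sentence embeddings
```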
LICENSE DELETED
@@ -1,201 +0,0 @@
- Apache License
- Version 2.0, January 2004
- http://www.apache.org/licenses/
-
- TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
-
- 1. Definitions.
-
- "License" shall mean the terms and conditions for use, reproduction,
- and distribution as defined by Sections 1 through 9 of this document.
-
- "Licensor" shall mean the copyright owner or entity authorized by
- the copyright owner that is granting the License.
-
- "Legal Entity" shall mean the union of the acting entity and all
- other entities that control, are controlled by, or are under common
- control with that entity. For the purposes of this definition,
- "control" means (i) the power, direct or indirect, to cause the
- direction or management of such entity, whether by contract or
- otherwise, or (ii) ownership of fifty percent (50%) or more of the
- outstanding shares, or (iii) beneficial ownership of such entity.
-
- "You" (or "Your") shall mean an individual or Legal Entity
- exercising permissions granted by this License.
-
- "Source" form shall mean the preferred form for making modifications,
- including but not limited to software source code, documentation
- source, and configuration files.
-
- "Object" form shall mean any form resulting from mechanical
- transformation or translation of a Source form, including but
- not limited to compiled object code, generated documentation,
- and conversions to other media types.
-
- "Work" shall mean the work of authorship, whether in Source or
- Object form, made available under the License, as indicated by a
- copyright notice that is included in or attached to the work
- (an example is provided in the Appendix below).
-
- "Derivative Works" shall mean any work, whether in Source or Object
- form, that is based on (or derived from) the Work and for which the
- editorial revisions, annotations, elaborations, or other modifications
- represent, as a whole, an original work of authorship. For the purposes
- of this License, Derivative Works shall not include works that remain
- separable from, or merely link (or bind by name) to the interfaces of,
- the Work and Derivative Works thereof.
-
- "Contribution" shall mean any work of authorship, including
- the original version of the Work and any modifications or additions
- to that Work or Derivative Works thereof, that is intentionally
- submitted to Licensor for inclusion in the Work by the copyright owner
- or by an individual or Legal Entity authorized to submit on behalf of
- the copyright owner. For the purposes of this definition, "submitted"
- means any form of electronic, verbal, or written communication sent
- to the Licensor or its representatives, including but not limited to
- communication on electronic mailing lists, source code control systems,
- and issue tracking systems that are managed by, or on behalf of, the
- Licensor for the purpose of discussing and improving the Work, but
- excluding communication that is conspicuously marked or otherwise
- designated in writing by the copyright owner as "Not a Contribution."
-
- "Contributor" shall mean Licensor and any individual or Legal Entity
- on behalf of whom a Contribution has been received by Licensor and
- subsequently incorporated within the Work.
-
- 2. Grant of Copyright License. Subject to the terms and conditions of
- this License, each Contributor hereby grants to You a perpetual,
- worldwide, non-exclusive, no-charge, royalty-free, irrevocable
- copyright license to reproduce, prepare Derivative Works of,
- publicly display, publicly perform, sublicense, and distribute the
- Work and such Derivative Works in Source or Object form.
-
- 3. Grant of Patent License. Subject to the terms and conditions of
- this License, each Contributor hereby grants to You a perpetual,
- worldwide, non-exclusive, no-charge, royalty-free, irrevocable
- (except as stated in this section) patent license to make, have made,
- use, offer to sell, sell, import, and otherwise transfer the Work,
- where such license applies only to those patent claims licensable
- by such Contributor that are necessarily infringed by their
- Contribution(s) alone or by combination of their Contribution(s)
- with the Work to which such Contribution(s) was submitted. If You
- institute patent litigation against any entity (including a
- cross-claim or counterclaim in a lawsuit) alleging that the Work
- or a Contribution incorporated within the Work constitutes direct
- or contributory patent infringement, then any patent licenses
- granted to You under this License for that Work shall terminate
- as of the date such litigation is filed.
-
- 4. Redistribution. You may reproduce and distribute copies of the
- Work or Derivative Works thereof in any medium, with or without
- modifications, and in Source or Object form, provided that You
- meet the following conditions:
-
- (a) You must give any other recipients of the Work or
- Derivative Works a copy of this License; and
-
- (b) You must cause any modified files to carry prominent notices
- stating that You changed the files; and
-
- (c) You must retain, in the Source form of any Derivative Works
- that You distribute, all copyright, patent, trademark, and
- attribution notices from the Source form of the Work,
- excluding those notices that do not pertain to any part of
- the Derivative Works; and
-
- (d) If the Work includes a "NOTICE" text file as part of its
- distribution, then any Derivative Works that You distribute must
- include a readable copy of the attribution notices contained
- within such NOTICE file, excluding those notices that do not
- pertain to any part of the Derivative Works, in at least one
- of the following places: within a NOTICE text file distributed
- as part of the Derivative Works; within the Source form or
- documentation, if provided along with the Derivative Works; or,
- within a display generated by the Derivative Works, if and
- wherever such third-party notices normally appear. The contents
- of the NOTICE file are for informational purposes only and
- do not modify the License. You may add Your own attribution
- notices within Derivative Works that You distribute, alongside
- or as an addendum to the NOTICE text from the Work, provided
- that such additional attribution notices cannot be construed
- as modifying the License.
-
- You may add Your own copyright statement to Your modifications and
- may provide additional or different license terms and conditions
- for use, reproduction, or distribution of Your modifications, or
- for any such Derivative Works as a whole, provided Your use,
- reproduction, and distribution of the Work otherwise complies with
- the conditions stated in this License.
-
- 5. Submission of Contributions. Unless You explicitly state otherwise,
- any Contribution intentionally submitted for inclusion in the Work
- by You to the Licensor shall be under the terms and conditions of
- this License, without any additional terms or conditions.
- Notwithstanding the above, nothing herein shall supersede or modify
- the terms of any separate license agreement you may have executed
- with Licensor regarding such Contributions.
-
- 6. Trademarks. This License does not grant permission to use the trade
- names, trademarks, service marks, or product names of the Licensor,
- except as required for reasonable and customary use in describing the
- origin of the Work and reproducing the content of the NOTICE file.
-
- 7. Disclaimer of Warranty. Unless required by applicable law or
- agreed to in writing, Licensor provides the Work (and each
- Contributor provides its Contributions) on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
- implied, including, without limitation, any warranties or conditions
- of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
- PARTICULAR PURPOSE. You are solely responsible for determining the
- appropriateness of using or redistributing the Work and assume any
- risks associated with Your exercise of permissions under this License.
-
- 8. Limitation of Liability. In no event and under no legal theory,
- whether in tort (including negligence), contract, or otherwise,
- unless required by applicable law (such as deliberate and grossly
- negligent acts) or agreed to in writing, shall any Contributor be
- liable to You for damages, including any direct, indirect, special,
- incidental, or consequential damages of any character arising as a
- result of this License or out of the use or inability to use the
- Work (including but not limited to damages for loss of goodwill,
- work stoppage, computer failure or malfunction, or any and all
- other commercial damages or losses), even if such Contributor
- has been advised of the possibility of such damages.
-
- 9. Accepting Warranty or Additional Liability. While redistributing
- the Work or Derivative Works thereof, You may choose to offer,
- and charge a fee for, acceptance of support, warranty, indemnity,
- or other liability obligations and/or rights consistent with this
- License. However, in accepting such obligations, You may act only
- on Your own behalf and on Your sole responsibility, not on behalf
- of any other Contributor, and only if You agree to indemnify,
- defend, and hold each Contributor harmless for any liability
- incurred by, or claims asserted against, such Contributor by reason
- of your accepting any such warranty or additional liability.
-
- END OF TERMS AND CONDITIONS
-
- APPENDIX: How to apply the Apache License to your work.
-
- To apply the Apache License to your work, attach the following
- boilerplate notice, with the fields enclosed by brackets "[]"
- replaced with your own identifying information. (Don't include
- the brackets!) The text should be enclosed in the appropriate
- comment syntax for the file format. We also recommend that a
- file or class name and description of purpose be included on the
- same "printed page" as the copyright notice for easier
- identification within third-party archives.
-
- Copyright [2021] [Google, Inc.]
-
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
README.md DELETED
@@ -1,218 +0,0 @@
- ---
- language:
- - af
- - am
- - ar
- - as
- - az
- - be
- - bg
- - bn
- - bo
- - bs
- - ca
- - ceb
- - co
- - cs
- - cy
- - da
- - de
- - el
- - en
- - eo
- - es
- - et
- - eu
- - fa
- - fi
- - fr
- - fy
- - ga
- - gd
- - gl
- - gu
- - ha
- - haw
- - he
- - hi
- - hmn
- - hr
- - ht
- - hu
- - hy
- - id
- - ig
- - is
- - it
- - ja
- - jv
- - ka
- - kk
- - km
- - kn
- - ko
- - ku
- - ky
- - la
- - lb
- - lo
- - lt
- - lv
- - mg
- - mi
- - mk
- - ml
- - mn
- - mr
- - ms
- - mt
- - my
- - ne
- - nl
- - no
- - ny
- - or
- - pa
- - pl
- - pt
- - ro
- - ru
- - rw
- - si
- - sk
- - sl
- - sm
- - sn
- - so
- - sq
- - sr
- - st
- - su
- - sv
- - sw
- - ta
- - te
- - tg
- - th
- - tk
- - tl
- - tr
- - tt
- - ug
- - uk
- - ur
- - uz
- - vi
- - wo
- - xh
- - yi
- - yo
- - zh
- - zu
- tags:
- - bert
- - sentence_embedding
- - multilingual
- - google
- license: Apache-2.0
- datasets:
- - CommonCrawl
- - Wikipedia
- ---
-
- # LaBSE
-
- ## Model description
-
- Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages. The pre-training process combines masked language modeling with translation language modeling. The model is useful for getting multilingual sentence embeddings and for bi-text retrieval.
-
- - Model: [HuggingFace's model hub](https://huggingface.co/setu4993/LaBSE).
- - Paper: [arXiv](https://arxiv.org/abs/2007.01852).
- - Original model: [TensorFlow Hub](https://tfhub.dev/google/LaBSE/1).
- - Blog post: [Google AI Blog](https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html).
-
- ## Usage
-
- Using the model:
-
- ```python
- import torch
- from transformers import BertModel, BertTokenizerFast
-
-
- tokenizer = BertTokenizerFast.from_pretrained("setu4993/LaBSE")
- model = BertModel.from_pretrained("setu4993/LaBSE")
- model = model.eval()
-
- english_sentences = [
-     "dog",
-     "Puppies are nice.",
-     "I enjoy taking long walks along the beach with my dog.",
- ]
- english_inputs = tokenizer(english_sentences, return_tensors="pt", padding=True)
-
- with torch.no_grad():
-     english_outputs = model(**english_inputs)
- ```
-
- To get the sentence embeddings, use the pooler output:
-
- ```python
- english_embeddings = english_outputs.pooler_output
- ```
-
- Output for other languages:
-
- ```python
- italian_sentences = [
-     "cane",
-     "I cuccioli sono carini.",
-     "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.",
- ]
- japanese_sentences = ["犬", "子犬はいいです", "私は犬と一緒にビーチを散歩するのが好きです"]
- italian_inputs = tokenizer(italian_sentences, return_tensors="pt", padding=True)
- japanese_inputs = tokenizer(japanese_sentences, return_tensors="pt", padding=True)
-
- with torch.no_grad():
-     italian_outputs = model(**italian_inputs)
-     japanese_outputs = model(**japanese_inputs)
-
- italian_embeddings = italian_outputs.pooler_output
- japanese_embeddings = japanese_outputs.pooler_output
- ```
-
- For similarity between sentences, an L2-norm is recommended before calculating the similarity:
-
- ```python
- import torch.nn.functional as F
-
-
- def similarity(embeddings_1, embeddings_2):
-     normalized_embeddings_1 = F.normalize(embeddings_1, p=2)
-     normalized_embeddings_2 = F.normalize(embeddings_2, p=2)
-     return torch.matmul(
-         normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1)
-     )
-
-
- print(similarity(english_embeddings, italian_embeddings))
- print(similarity(english_embeddings, japanese_embeddings))
- print(similarity(italian_embeddings, japanese_embeddings))
- ```
-
- ## Details
-
- Details about data, training, evaluation and performance metrics are available in the [original paper](https://arxiv.org/abs/2007.01852).
-
- ### BibTeX entry and citation info
-
- ```bibtex
- @misc{feng2020languageagnostic,
-     title={Language-agnostic BERT Sentence Embedding},
-     author={Fangxiaoyu Feng and Yinfei Yang and Daniel Cer and Naveen Arivazhagan and Wei Wang},
-     year={2020},
-     eprint={2007.01852},
-     archivePrefix={arXiv},
-     primaryClass={cs.CL}
- }
- ```
config.json DELETED
@@ -1,21 +0,0 @@
- {
-   "architectures": [
-     "BertModel"
-   ],
-   "attention_probs_dropout_prob": 0.1,
-   "gradient_checkpointing": false,
-   "hidden_act": "gelu",
-   "hidden_dropout_prob": 0.1,
-   "hidden_size": 768,
-   "initializer_range": 0.02,
-   "intermediate_size": 3072,
-   "layer_norm_eps": 1e-12,
-   "max_position_embeddings": 512,
-   "model_type": "bert",
-   "num_attention_heads": 12,
-   "num_hidden_layers": 12,
-   "pad_token_id": 0,
-   "position_embedding_type": "absolute",
-   "type_vocab_size": 2,
-   "vocab_size": 501153
- }
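
For reference, the deleted configuration describes a standard 12-layer, 768-hidden BERT encoder with an enlarged 501,153-token multilingual vocabulary. The sketch below is illustrative only and assumes the Hugging Face `transformers` library; it simply rebuilds an equivalent configuration object from the values above.

```python
# Sketch only: recreates the deleted config.json values with transformers' BertConfig.
from transformers import BertConfig

config = BertConfig(
    vocab_size=501153,             # LaBSE's multilingual WordPiece vocabulary
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    type_vocab_size=2,
    initializer_range=0.02,
    layer_norm_eps=1e-12,
    pad_token_id=0,
)
```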
pytorch_model.bin DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:13fd994ca91dbc613c710478c3be07b87665a0afc2889ef5470c31157d28d979
- size 1883757089
special_tokens_map.json DELETED
@@ -1,7 +0,0 @@
- {
-   "unk_token": "[UNK]",
-   "sep_token": "[SEP]",
-   "pad_token": "[PAD]",
-   "cls_token": "[CLS]",
-   "mask_token": "[MASK]"
- }
tf_model.h5 DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:83e58c168e463f248cb182afe1f8b74129f490df7193c27e74ab44911bdf639b
- size 1883969304
tokenizer_config.json DELETED
@@ -1,13 +0,0 @@
- {
-   "do_lower_case": false,
-   "unk_token": "[UNK]",
-   "sep_token": "[SEP]",
-   "pad_token": "[PAD]",
-   "cls_token": "[CLS]",
-   "mask_token": "[MASK]",
-   "tokenize_chinese_chars": true,
-   "strip_accents": null,
-   "do_basic_tokenize": true,
-   "never_split": null,
-   "model_max_length": 512
- }
vocab.txt DELETED
The diff for this file is too large to render. See raw diff