Dani committed on
Commit 6663a43
1 Parent(s): efb0a99

fixed several vocab issues

Files changed (4)
  1. README.md +4 -13
  2. config.json +1 -1
  3. pytorch_model.bin +2 -2
  4. vocab.txt +13 -0
README.md CHANGED
@@ -4,24 +4,15 @@ license: apache-2.0
  datasets:
  - wikipedia
  widget:
- - text: "El español es un idioma muy [MASK] en el mundo."
+ - text: "Mi nombre es Juan y vivo en [MASK]."
  ---
 
  # DistilBERT base multilingual model Spanish subset (cased)
 
  This model is the Spanish extract of `distilbert-base-multilingual-cased` (https://huggingface.co/distilbert-base-multilingual-cased), a distilled version of the [BERT base multilingual model](bert-base-multilingual-cased). This model is cased: it does make a difference between english and English.
 
- It uses the extraction method proposed by Geotrend, which is described in https://github.com/Geotrend-research/smaller-transformers.
- Specifically, we've ran the following script:
+ It uses the extraction method proposed by Geotrend described in https://github.com/Geotrend-research/smaller-transformers.
 
- ```sh
- python reduce_model.py \
- --source_model distilbert-base-multilingual-cased \
- --vocab_file notebooks/selected_tokens/selected_es_tokens.txt \
- --output_model distilbert-base-es-multilingual-cased \
- --convert_to_tf False
- ```
-
- The resulting model has the same architecture as DistilmBERT: 6 layers, 768 dimension and 12 heads, with a total of **65M parameters** (compared to 134M parameters for DistilmBERT).
+ The resulting model has the same architecture as DistilmBERT: 6 layers, 768 dimension and 12 heads, with a total of **63M parameters** (compared to 134M parameters for DistilmBERT).
 
- The goal of this model is to reduce even further the size of the `distilbert-base-multilingual` multilingual model by selecting only most frequent tokens for Spanish, reducing the size of the embedding layer. For more details visit the paper from the Geotrend team: Load What You Need: Smaller Versions of Multilingual BERT.
+ The goal of this model is to reduce even further the size of the `distilbert-base-multilingual` multilingual model by selecting only most frequent tokens for Spanish, reducing the size of the embedding layer. For more details visit the paper from the Geotrend team: Load What You Need: Smaller Versions of Multilingual BERT.
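After this commit, the README's new widget sentence can be reproduced locally with the `transformers` fill-mask pipeline. A minimal sketch, assuming the model id `distilbert-base-es-multilingual-cased` (taken from the extraction script removed above; substitute the actual Hub repo id if it differs):

```python
from transformers import pipeline

# Assumption: the model id below matches this repository; adjust the
# namespace/name if the Hub id differs.
fill_mask = pipeline("fill-mask", model="distilbert-base-es-multilingual-cased")

# Same sentence as the widget example added in this commit.
for pred in fill_mask("Mi nombre es Juan y vivo en [MASK]."):
    print(pred["token_str"], round(pred["score"], 4))
```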
 
 
config.json CHANGED
@@ -18,5 +18,5 @@
   "seq_classif_dropout": 0.2,
   "sinusoidal_pos_embds": false,
   "tie_weights_": true,
-  "vocab_size": 26346
+  "vocab_size": 26360
  }
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:0a7e9034002f6027c9c3e2644bf743b008fc7081072839124abd6673e6740c5c
- size 255139145
+ oid sha256:02e8562d1e4f7f2fe58e9970fa28b3544b066591bc475777c823ab10adcd9af2
+ size 255182217
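pytorch_model.bin is stored as a Git LFS pointer, so only its oid and size change in this commit. A minimal sketch for checking a locally downloaded copy against the new pointer values (the local file path is an assumption):

```python
import hashlib

# Assumed local path to the downloaded weights file; adjust as needed.
path = "pytorch_model.bin"

# Expected values taken from the updated LFS pointer in this commit.
expected_oid = "02e8562d1e4f7f2fe58e9970fa28b3544b066591bc475777c823ab10adcd9af2"
expected_size = 255182217

digest = hashlib.sha256()
size = 0
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        digest.update(chunk)
        size += len(chunk)

print("size matches:  ", size == expected_size)
print("sha256 matches:", digest.hexdigest() == expected_oid)
```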
vocab.txt CHANGED
@@ -1,4 +1,17 @@
+ [PAD]
+ [unused1]
+ [unused2]
+ [unused3]
+ [unused4]
+ [unused5]
+ [unused6]
+ [unused7]
+ [unused8]
+ [unused9]
  [UNK]
+ [CLS]
+ [SEP]
+ [MASK]
  !
  "
  #
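Since the commit message says it "fixed several vocab issues", a quick consistency check across vocab.txt, config.json, and the embedding matrix can confirm the files now agree. A sketch assuming the files sit in a local clone of this repository and that the checkpoint uses standard DistilBertForMaskedLM parameter names:

```python
import json
import torch

# Assumed paths inside a local clone of this repository.
with open("vocab.txt", encoding="utf-8") as f:
    num_tokens = len(f.read().splitlines())

with open("config.json", encoding="utf-8") as f:
    vocab_size = json.load(f)["vocab_size"]

# Assumption: the checkpoint uses the standard DistilBertForMaskedLM key
# for the word-embedding matrix.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
embedding_rows = state_dict["distilbert.embeddings.word_embeddings.weight"].shape[0]

# After this commit, all three values should agree (config.json sets 26360).
print("vocab.txt tokens: ", num_tokens)
print("config vocab_size:", vocab_size)
print("embedding rows:   ", embedding_rows)
```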