gonzalez-agirre committed on
Commit 7aecd66
1 Parent(s): 698101b

Update README.md

Files changed (1)
  1. README.md +0 -16
README.md CHANGED
@@ -148,22 +148,6 @@ At the time of submission, no measures have been taken to estimate the bias and
 
 We adapted the original Falcon-7B model to Spanish and Catalan by swapping the tokenizer and adjusting the embedding layer. The adaptation procedure is explained in this [blog](https://medium.com/@mpamies247/ee1ebc70bc79).
 
- ### New vocabulary
- We trained a new BPE tokenizer for Catalan and Spanish (equal representation), shuffling a small amount of English into the mixture, since English is already in the model's training data.
- The resulting data has the following language distribution:
-
- |Language|%|
- |---|---|
- |En|16.84%|
- |Es|41.38%|
- |Ca|41.79%|
-
- This drastically reduced the number of tokens required to tokenize text in the target languages, while English tokenization shows a small increase.
-
- ### Embedding Layer Initialization
- To take full advantage of the English pre-training of the original Falcon model, we re-used the original model's embedding weights for the tokens shared between the two tokenizers (the new one and the old one). The remaining embedding weights are initialized to the mean of the original embedding matrix.
-
-
 ## Training
 
 ### Training data
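
The removed "New vocabulary" section describes training a BPE tokenizer on a Catalan/Spanish corpus with a small amount of English mixed in. A minimal sketch of that kind of training run, assuming the Hugging Face `tokenizers` library; the corpus file names, vocabulary size, and special token are illustrative assumptions, not values taken from this commit:

```python
# Sketch: train a byte-level BPE tokenizer on a mixed-language corpus.
# File names, vocab size, and the special token are assumptions.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Hypothetical corpus shards, mixed to roughly the Ca/Es/En split in the table.
files = ["ca_shard.txt", "es_shard.txt", "en_shard.txt"]

tokenizer.train(
    files=files,
    vocab_size=65024,                  # Falcon-7B's vocabulary size, assumed as the target
    min_frequency=2,
    special_tokens=["<|endoftext|>"],  # Falcon's end-of-text token
)

tokenizer.save("new_tokenizer.json")
```

The claimed effect (far fewer tokens per Spanish or Catalan text, a slight increase for English) can be checked by encoding the same sentences with the old and new tokenizers and comparing `len(encoding.ids)`.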
 
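The removed "Embedding Layer Initialization" section corresponds to a fairly common re-initialization recipe. A minimal sketch of it, assuming PyTorch and `transformers`; the new-tokenizer path is a placeholder, and the authors' exact matching logic may differ:

```python
# Sketch: keep pretrained embedding rows for tokens shared by the old and new
# tokenizers; initialize all other rows to the mean of the original matrix.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b")
old_tok = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
new_tok = AutoTokenizer.from_pretrained("path/to/new_tokenizer")  # placeholder path

old_emb = model.get_input_embeddings().weight.data   # (old_vocab_size, hidden_dim)
mean_vec = old_emb.mean(dim=0)                       # mean over all original rows

# Start every new row at the mean of the original embedding matrix ...
new_emb = mean_vec.unsqueeze(0).repeat(len(new_tok), 1)

# ... then copy the pretrained row for every token the two vocabularies share.
old_vocab = old_tok.get_vocab()                      # token string -> old id
for token, new_id in new_tok.get_vocab().items():
    old_id = old_vocab.get(token)
    if old_id is not None:
        new_emb[new_id] = old_emb[old_id]

# Resize the model to the new vocabulary and load the initialized matrix.
model.resize_token_embeddings(len(new_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
```

Because shared tokens keep their pretrained vectors, the adapted model retains much of the original English pre-training, while the mean-initialized rows give the new Catalan and Spanish tokens a neutral starting point for continued training.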