jpalomar committed on
Commit
89ca69b
1 Parent(s): 7b0cbfd

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -154,7 +154,7 @@ to be adapted before continuing its pre-training with data in the target languag
 
  ### Training data
 
- The training corpus is composed of 140B tokens gathered from web crawlings and public domain data. Most of the sources in Catalan have been obtained from the CATalog dataset, filtered with a minimum threshold of 0.6 and oversampling some of the sources it integrates to different extents.
+ The training corpus is composed of 140B tokens gathered from web crawlings and public domain data. Most of the sources in Catalan have been obtained from the [CATalog 1.0](https://huggingface.co/datasets/projecte-aina/CATalog) dataset, filtered with a minimum threshold of 0.6 and oversampling some of the sources it integrates to different extents.
 
  Dataset | Language | Words (per-epoch) | Epochs | Total Tokens |
  |---------------------|----------|--------------------|--------------|--------------|