Update README.md
README.md
CHANGED
@@ -154,7 +154,7 @@ to be adapted before continuing its pre-training with data in the target languag
 
 ### Training data
 
-The training corpus is composed of 140B tokens gathered from web crawlings and public domain data. Most of the sources in Catalan have been obtained from the CATalog dataset, filtered with a minimum threshold of 0.6 and oversampling some of the sources it integrates to different extents.
+The training corpus is composed of 140B tokens gathered from web crawlings and public domain data. Most of the sources in Catalan have been obtained from the [CATalog 1.0](https://huggingface.co/datasets/projecte-aina/CATalog) dataset, filtered with a minimum threshold of 0.6 and oversampling some of the sources it integrates to different extents.
 
 Dataset | Language | Words (per-epoch) | Epochs | Total Tokens |
 |---------------------|----------|--------------------|--------------|--------------|