projecte-aina/FLOR-6.3B


jpalomar committed on Feb 7

Commit 89ca69b • 1 Parent(s): 7b0cbfd

Update README.md

Files changed (1)

README.md +1 -1


@@ -154,7 +154,7 @@ to be adapted before continuing its pre-training with data in the target languag
 ### Training data
-The training corpus is composed of 140B tokens gathered from web crawlings and public domain data. Most of the sources in Catalan have been obtained from the CATalog dataset, filtered with a minimum threshold of 0.6 and oversampling some of the sources it integrates to different extents.
+The training corpus is composed of 140B tokens gathered from web crawlings and public domain data. Most of the sources in Catalan have been obtained from the [CATalog 1.0](https://huggingface.co/datasets/projecte-aina/CATalog) dataset, filtered with a minimum threshold of 0.6 and oversampling some of the sources it integrates to different extents.
 Dataset	| Language	| Words (per-epoch)	| Epochs	| Total Tokens |
 |---------------------|----------|--------------------|--------------|--------------|