fdelucaf committed on
Commit
5176ee4
1 Parent(s): f968ade

Update README.md

Files changed (1)
  1. README.md +7 -19
README.md CHANGED
@@ -12,8 +12,8 @@ library_name: fairseq
 ## Model description
 
 This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Spanish datasets,
-up to 92 million sentences. Additionally, the model is evaluated on several public datasets comprising 5 different domains (general, adminstrative, technology,
-biomedical, and news).
+up to 92 million sentences before cleaning and filtering. Additionally, the model is evaluated on several public datasets comprising 5 different domains
+(general, administrative, technology, biomedical, and news).
 
 ## Intended uses and limitations
 
@@ -51,21 +51,8 @@ However, we are well aware that our models may be biased. We intend to conduct r
 
 ### Training data
 
-The was trained on a combination of the following datasets:
-
-| Dataset           | Sentences      | Tokens            |
-|-------------------|----------------|-------------------|
-| DOGC v2           | 8.472.786      | 188.929.206       |
-| El Periodico      | 6.483.106      | 145.591.906       |
-| EuroParl          | 1.876.669      | 49.212.670        |
-| WikiMatrix        | 1.421.077      | 34.902.039        |
-| Wikimedia         | 335.955        | 8.682.025         |
-| QED               | 71.867         | 1.079.705         |
-| TED2020 v1        | 52.177         | 836.882           |
-| CCMatrix v1       | 56.103.820     | 1.064.182.320     |
-| MultiCCAligned v1 | 2.433.418      | 48.294.144        |
-| ParaCrawl         | 15.327.808     | 334.199.408       |
-| **Total**         | **92.578.683** | **1.875.910.305** |
+The model was trained on a combination of several datasets, totalling around 92 million parallel sentences before filtering and cleaning.
+The training data includes corpora collected from [Opus](https://opus.nlpl.eu/), internally created parallel datasets, and corpora from other sources.
 
 ### Training procedure
 
@@ -75,7 +62,7 @@ The was trained on a combination of the following datasets:
 cleaned using the clean-corpus-n.pl script from [moses](https://github.com/moses-smt/mosesdecoder), allowing sentences between 5 and 150 words.
 
 Before training, the punctuation is normalized using a modified version of the join-single-file.py script
-from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
+from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
 
 
 #### Tokenization
 
@@ -116,7 +103,8 @@ Weights were saved every 1000 updates and reported results are the average of th
 
 ### Variable and metrics
 
-We use the BLEU score for evaluation on test sets: [Flores-101](https://github.com/facebookresearch/flores),
+We use the BLEU score for evaluation on test sets:
+[Flores-101](https://github.com/facebookresearch/flores),
 [TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/),
 [United Nations](https://zenodo.org/record/3888414#.Y33-_tLMIW0),
 [Cybersecurity](https://elrc-share.eu/repository/browse/cyber-mt-test-set/2bd93faab98c11ec9c1a00155d026706b96a490ed3e140f0a29a80a08c46e91e/),