|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- projecte-aina/CA-EN_Parallel_Corpus |
|
language: |
|
- ca |
|
- en |
|
metrics: |
|
- bleu |
|
library_name: fairseq |
|
--- |
|
## Aina Project's Catalan-English machine translation model |
|
|
|
## Model description |
|
|
|
This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-English datasets
totalling about 11.5 million sentence pairs. Additionally, the model is evaluated on several public datasets covering five different domains (general,
administrative, technology, biomedical, and news).
|
|
|
## Intended uses and limitations |
|
|
|
You can use this model for machine translation from Catalan to English. |
|
|
|
## How to use |
|
|
|
### Usage |
|
Required libraries: |
|
|
|
```bash |
|
pip install ctranslate2 pyonmttok |
|
``` |
|
|
|
Translate a sentence using Python:
|
```python |
|
import ctranslate2
import pyonmttok
from huggingface_hub import snapshot_download

# Download the CTranslate2 model and the SentencePiece vocabulary from the Hub.
model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-ca-en", revision="main")

# Tokenize the source sentence with the included SentencePiece model.
tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path=model_dir + "/spm.model")
tokenized = tokenizer.tokenize("Benvingut al projecte Aina!")

# Translate the tokenized sentence and detokenize the result.
translator = ctranslate2.Translator(model_dir)
translated = translator.translate_batch([tokenized[0]])
print(tokenizer.detokenize(translated[0][0]["tokens"]))
|
``` |
|
|
|
## Limitations and bias |
|
At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and this model card will be updated once that work is completed.
|
|
|
## Training |
|
|
|
### Training data |
|
|
|
The model was trained on a combination of the following datasets: |
|
|
|
| Dataset             | Sentences      |
|---------------------|----------------|
| Global Voices       | 21,342         |
| Memòries Lliures    | 1,173,055      |
| WikiMatrix          | 1,205,908      |
| TED Talks           | 50,979         |
| Tatoeba             | 5,500          |
| CoVoST 2 ca-en      | 79,633         |
| CoVoST 2 en-ca      | 263,891        |
| Europarl            | 1,965,734      |
| JW300               | 97,081         |
| Crawled Generalitat | 38,595         |
| Opus Books          | 4,580          |
| CC Aligned          | 5,787,682     |
| COVID_Wikipedia     | 1,531          |
| EuroBooks           | 3,746          |
| Gnome               | 2,183          |
| KDE 4               | 144,153        |
| OpenSubtitles       | 427,913        |
| QED                 | 69,823         |
| Ubuntu              | 6,781          |
| Wikimedia           | 208,073        |
| **Total**           | **11,558,183** |
|
|
|
### Training procedure |
|
|
|
#### Data preparation
|
|
|
All datasets are concatenated and filtered using the [mBERT Gencata parallel filter](https://huggingface.co/projecte-aina/mbert-base-gencata).
Before training, the punctuation is normalized using a modified version of the join-single-file.py script from
[Softcatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
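As an illustration only, the sketch below shows one way the parallel filter could be applied with the Transformers library. It assumes the filter is exposed as a binary sequence-classification model over (Catalan, English) sentence pairs; the actual filtering script, model head, and decision threshold are not documented here.

```python
# Illustrative only: assumes the filter is a binary sequence-classification
# model that scores a (Catalan, English) pair, with the last label meaning
# "parallel"; the real filtering script, head type and threshold may differ.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

filter_id = "projecte-aina/mbert-base-gencata"
tokenizer = AutoTokenizer.from_pretrained(filter_id)
model = AutoModelForSequenceClassification.from_pretrained(filter_id)

def pair_score(ca_sentence: str, en_sentence: str) -> float:
    """Score how likely the two sentences are translations of each other."""
    inputs = tokenizer(ca_sentence, en_sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, -1].item()

# Keep a pair only if its score clears a (hypothetical) threshold.
print(pair_score("Benvingut al projecte Aina!", "Welcome to the Aina project!"))
```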
|
|
|
|
|
#### Tokenization |
|
|
|
All data is tokenized using [SentencePiece](https://github.com/google/sentencepiece), with a 50,000-token SentencePiece model learned from the combination of all filtered training data.
This model is included in the repository as `spm.model` and is the one loaded in the usage example above.
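As a reference, the sketch below shows how such a vocabulary can be learned with the SentencePiece Python API. The input path is a placeholder, and any options beyond the 50,000-token vocabulary size are assumptions.

```python
# A minimal sketch of learning the shared vocabulary; the input path and
# all options other than the 50,000-token vocabulary size are assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="filtered_train.ca-en.txt",  # placeholder: all filtered training text
    model_prefix="spm",                # writes spm.model and spm.vocab
    vocab_size=50000,
)

# The resulting spm.model is the file loaded by pyonmttok in the usage example.
sp = spm.SentencePieceProcessor(model_file="spm.model")
print(sp.encode("Benvingut al projecte Aina!", out_type=str))
```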
|
|
|
#### Hyperparameters |
|
|
|
The model is based on the Transformer-XLarge proposed by [Subramanian et al.](https://aclanthology.org/2021.wmt-1.18.pdf) |
|
The following hyperparameters were set in the Fairseq toolkit:
|
|
|
| Hyperparameter                      | Value                             |
|-------------------------------------|-----------------------------------|
| Architecture                        | transformer_vaswani_wmt_en_de_big |
| Embedding size                      | 1024                              |
| Feedforward size                    | 4096                              |
| Number of heads                     | 16                                |
| Encoder layers                      | 24                                |
| Decoder layers                      | 6                                 |
| Normalize before attention          | True                              |
| --share-decoder-input-output-embed  | True                              |
| --share-all-embeddings              | True                              |
| Effective batch size                | 96,000                            |
| Optimizer                           | Adam                              |
| Adam betas                          | (0.9, 0.98)                       |
| Clip norm                           | 0.0                               |
| Learning rate                       | 1e-3                              |
| LR scheduler                        | inverse sqrt                      |
| Warmup updates                      | 4000                              |
| Dropout                             | 0.1                               |
| Label smoothing                     | 0.1                               |
|
|
|
The model was trained for a total of 35,000 updates. Weights were saved every 1,000 updates, and the reported results are the average of the last 16 checkpoints.
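For reference, the sketch below shows roughly how the values above map onto a `fairseq-train` invocation. The data path is a placeholder, and the `--max-tokens`/`--update-freq` split of the 96,000 effective batch size is an assumption (it depends on the number of GPUs used); the remaining flags mirror the table and the update schedule.

```bash
# Illustrative only: the data path is a placeholder, and the
# --max-tokens/--update-freq split of the 96,000 effective batch
# depends on the number of GPUs used.
fairseq-train data-bin/ca-en \
    --arch transformer_vaswani_wmt_en_de_big \
    --encoder-layers 24 --decoder-layers 6 \
    --encoder-normalize-before --decoder-normalize-before \
    --share-all-embeddings --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 1e-3 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.1 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 12000 --update-freq 8 \
    --max-update 35000 --save-interval-updates 1000

# Average the last 16 update checkpoints with fairseq's bundled script.
python scripts/average_checkpoints.py \
    --inputs checkpoints \
    --num-update-checkpoints 16 \
    --output checkpoint.avg16.pt
```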
|
|
|
## Evaluation |
|
|
|
### Variables and metrics
|
|
|
We use the BLEU score for evaluation on the following test sets (a minimal scoring sketch follows the list):
|
|
|
[Spanish Constitution (TaCon)](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/), |
|
[United Nations](https://zenodo.org/record/3888414#.Y33-_tLMIW0), |
|
[European Commission](https://elrc-share.eu/repository/browse/european-commission-corpus/8a419b1758ea11ed9c1a00155d0267069bd085cae124481589b0858e5b274327/), |
|
[Flores-101](https://github.com/facebookresearch/flores), |
|
[Cybersecurity](https://elrc-share.eu/repository/browse/cyber-mt-test-set/2bd93faab98c11ec9c1a00155d026706b96a490ed3e140f0a29a80a08c46e91e/), |
|
[WMT19 biomedical test set](http://www.statmt.org/wmt19/biomedical-translation-task.html),

[WMT13 news test set](https://elrc-share.eu/repository/browse/catalan-wmt2013-machine-translation-shared-task-test-set/84a96139b98611ec9c1a00155d0267061a0aa1b62e2248e89aab4952f3c230fc/).
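As a reference, BLEU can be computed with [sacreBLEU](https://github.com/mjpost/sacrebleu). The sketch below is a minimal example; the file names are placeholders for plain-text files with one segment per line.

```python
# A minimal scoring sketch with sacreBLEU (pip install sacrebleu).
# File names are placeholders: hypotheses aligned line-by-line with references.
import sacrebleu

with open("hypotheses.en", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("references.en", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```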
|
|
|
### Evaluation results |
|
|
|
Below are the evaluation results for machine translation from Catalan to English, compared to [Softcatalà](https://www.softcatala.org/) and
[Google Translate](https://translate.google.es/?hl=es):
|
|
|
|
|
| Test set             | Softcatalà | Google Translate | aina-translator-ca-en |
|----------------------|------------|------------------|-----------------------|
| Spanish Constitution | 35.8       | **43.2**         | 40.3                  |
| United Nations       | 44.4       | **47.4**         | 44.8                  |
| European Commission  | 52.0       | **53.7**         | 53.1                  |
| Flores 101 dev       | 42.7       | **47.5**         | 46.1                  |
| Flores 101 devtest   | 42.5       | **46.9**         | 45.2                  |
| Cybersecurity        | 52.5       | **58.0**         | 54.2                  |
| WMT19 biomedical     | 18.3       | **23.4**         | 21.6                  |
| WMT13 news           | 37.8       | **39.8**         | 39.3                  |
| Average              | 39.2       | **45.0**         | 41.6                  |
|
|
|
## Additional information |
|
|
|
### Author |
|
The Language Technologies Unit from Barcelona Supercomputing Center. |
|
|
|
### Contact |
|
For further information, please send an email to <langtech@bsc.es>. |
|
|
|
### Copyright |
|
Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.
|
|
|
### License |
|
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
|
|
|
### Funding |
|
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/). |
|
|
|
### Disclaimer |
|
|
|
<details> |
|
<summary>Click to expand</summary> |
|
|
|
The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0. |
|
|
|
Be aware that the model may have biases and/or any other undesirable distortions. |
|
|
|
When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it) |
|
or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, |
|
in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence. |
|
|
|
In no event shall the owner and creator of the model (Barcelona Supercomputing Center) |
|
be liable for any results arising from the use made by third parties. |
|
|
|
</details> |