---
license: apache-2.0
datasets:
- projecte-aina/CA-ZH_Parallel_Corpus
language:
- ca
- zh
metrics:
- bleu
library_name: fairseq
---

## Projecte Aina’s Catalan-Chinese machine translation model

## Model description

This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Chinese
datasets totalling 6,833,114 sentence pairs. Of these, 174,507 sentence pairs were parallel data collected from the web, while the remaining 6,658,607 sentence pairs
were synthetic parallel data created by translating the Spanish side of Spanish-Chinese datasets into Catalan with the ES-CA translator of [PlanTL](https://huggingface.co/PlanTL-GOB-ES/mt-plantl-es-ca).
The model was evaluated on the Flores and NTREX evaluation datasets.

## Intended uses and limitations

You can use this model for machine translation from Catalan to simplified Chinese.

## How to use

### Usage
Required libraries:

```bash
pip install ctranslate2 pyonmttok huggingface_hub
```

Translate a sentence using Python:
```python
import re

import ctranslate2
import pyonmttok
from huggingface_hub import snapshot_download


def remove_jieba(text):
    """Undo the Jieba word segmentation in the translated Chinese text."""
    # Temporarily mark spaces between two ASCII characters (e.g. English words)
    preserve_spaces = re.sub(r'(?<=[\x00-\x7F])\s(?=[\x00-\x7F])', '@@', text)
    # Remove all remaining whitespace, i.e. the Jieba word boundaries
    quit_jieba = re.sub(r'\s', '', preserve_spaces)
    # Restore the marked spaces
    replace_spaces = re.sub(r'@@', ' ', quit_jieba)
    return replace_spaces


# Download the model files from the Hugging Face Hub
model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-ca-zh", revision="main")

# Tokenize the source sentence with the SentencePiece model shipped with the model
tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path=model_dir + "/spm.model")
tokenized = tokenizer.tokenize("Benvingut al projecte Aina!")

# Translate with beam search and detokenize the best hypothesis
translator = ctranslate2.Translator(model_dir)
translated = translator.translate_batch([tokenized[0]], beam_size=10)
translation = tokenizer.detokenize(translated[0].hypotheses[0])
print(remove_jieba(translation))
```
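
The `remove_jieba` post-processing step can be checked in isolation, without downloading the model. The sketch below repeats the function from the example above and applies it to two hand-written strings that imitate Jieba-segmented output. The sample strings are illustrative assumptions, not actual model output.

```python
import re


def remove_jieba(text):
    # Mark spaces that sit between two ASCII characters so they survive
    preserve_spaces = re.sub(r'(?<=[\x00-\x7F])\s(?=[\x00-\x7F])', '@@', text)
    # Drop all remaining whitespace (the Jieba word boundaries)
    quit_jieba = re.sub(r'\s', '', preserve_spaces)
    # Turn the markers back into spaces
    return re.sub(r'@@', ' ', quit_jieba)


# Hypothetical, hand-written samples imitating segmented output
samples = [
    "欢迎 来到 Aina 项目 !",
    "使用 Hugging Face 模型",
]
cleaned = [remove_jieba(s) for s in samples]
for s in cleaned:
    print(s)
```

Spaces next to Chinese characters are removed, while the space inside "Hugging Face" is kept because both of its neighbours are ASCII. Note that a literal `@@` in the translated text would also be turned into a space, since the function uses it as its temporary marker.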

## Limitations and bias
At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.

## Training

### Training data

The Catalan-Chinese data collected from the web was a combination of the following datasets:

| Dataset       | Sentences before cleaning |
|---------------|---------------------------|
| WikiMatrix    | 90,643                    |
| XLENT         | 535,803                   |
| GNOME         | 78                        |
| OpenSubtitles | 139,300                   |

The 6,658,607 sentence pairs of synthetic parallel data were created from the following Spanish-Chinese datasets:

| Dataset        | Sentences before cleaning |
|----------------|---------------------------|
| UNPC           | 17,599,223                |
| CCMatrix       | 24,051,233                |
| MultiParacrawl | 3,410,087                 |
| **Total**      | **45,060,543**            |

### Training procedure

#### Data preparation

The Chinese side of all datasets is passed through the [fastlangid](https://github.com/currentslab/fastlangid) language detector,
and any sentences not identified as simplified Chinese are discarded.
The datasets are then deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
The similarity is computed on sentence embeddings calculated with [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
The filtered datasets are then concatenated to form a final corpus of 6,833,114 sentence pairs.
The Chinese side of the corpus is tokenized using [Jieba](https://github.com/fxsjy/jieba), and before training the punctuation is normalized
using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
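
The deduplication and similarity filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used: the embeddings are tiny hand-made vectors standing in for LaBSE sentence embeddings, and `cosine_similarity` and `filter_pairs` are hypothetical helper names. Only the 0.75 threshold comes from the description.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(pairs, src_embeddings, tgt_embeddings, threshold=0.75):
    """Drop exact duplicate pairs and pairs whose similarity is below threshold."""
    seen = set()
    kept = []
    for pair, src_vec, tgt_vec in zip(pairs, src_embeddings, tgt_embeddings):
        if pair in seen:
            continue  # exact duplicate of an earlier pair
        seen.add(pair)
        if cosine_similarity(src_vec, tgt_vec) >= threshold:
            kept.append(pair)
    return kept


# Toy data: in practice each vector would be a LaBSE embedding (assumption).
pairs = [
    ("Bon dia", "早上 好"),
    ("Bon dia", "早上 好"),   # duplicate of the first pair
    ("Gràcies", "今天 下雨"),  # mismatched pair
]
src_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
tgt_embeddings = [[0.9, 0.1], [0.9, 0.1], [1.0, 0.2]]

kept = filter_pairs(pairs, src_embeddings, tgt_embeddings)
print(kept)
```

In this toy run only the first pair survives: the second is an exact duplicate and the third falls below the similarity threshold.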

#### Tokenization

All data is tokenized using SentencePiece, with a 50,000-token SentencePiece model learned from the combination of all filtered training data.
This SentencePiece model is included with the model files.

#### Hyperparameters

The model is based on the Transformer-XLarge proposed by [Subramanian et al.](https://aclanthology.org/2021.wmt-1.18.pdf).
The following hyperparameters were set on the Fairseq toolkit:

| Hyperparameter                     | Value                             |
|------------------------------------|-----------------------------------|
| Architecture                       | transformer_vaswani_wmt_en_de_big |
| Embedding size                     | 1024                              |
| Feedforward size                   | 4096                              |
| Number of heads                    | 16                                |
| Encoder layers                     | 24                                |
| Decoder layers                     | 6                                 |
| Normalize before attention         | True                              |
| --share-decoder-input-output-embed | True                              |
| --share-all-embeddings             | True                              |
| Effective batch size               | 48,000                            |
| Optimizer                          | adam                              |
| Adam betas                         | (0.9, 0.980)                      |
| Clip norm                          | 0.0                               |
| Learning rate                      | 5e-4                              |
| LR scheduler                       | inverse sqrt                      |
| Warmup updates                     | 8000                              |
| Dropout                            | 0.1                               |
| Label smoothing                    | 0.1                               |

The model was trained for 17,000 updates.
Weights were saved every 1,000 updates, and the reported results are the average of the last 4 checkpoints.
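
The learning-rate settings in the table above (inverse square-root scheduler, learning rate 5e-4, 8,000 warmup updates) imply a schedule that can be sketched in a few lines. This is an illustration rather than the training code: it assumes a linear warmup starting from 0, which the model card does not state, and `inverse_sqrt_lr` is a hypothetical helper name. The snippet also lists the updates at which the last four checkpoints were saved.

```python
import math

PEAK_LR = 5e-4          # "Learning rate" in the table
WARMUP_UPDATES = 8000   # "Warmup updates" in the table
TOTAL_UPDATES = 17000   # length of the training run
SAVE_INTERVAL = 1000    # checkpoint frequency
WARMUP_INIT_LR = 0.0    # assumption: not stated in the model card


def inverse_sqrt_lr(update):
    """Learning rate at a given update number."""
    if update < WARMUP_UPDATES:
        # Linear warmup from WARMUP_INIT_LR to PEAK_LR
        step = (PEAK_LR - WARMUP_INIT_LR) / WARMUP_UPDATES
        return WARMUP_INIT_LR + update * step
    # Decay proportional to the inverse square root of the update number
    return PEAK_LR * math.sqrt(WARMUP_UPDATES / update)


# Updates at which the last four checkpoints were saved
last_checkpoints = list(range(SAVE_INTERVAL, TOTAL_UPDATES + 1, SAVE_INTERVAL))[-4:]

for update in (4000, 8000, *last_checkpoints):
    print(update, f"{inverse_sqrt_lr(update):.6f}")
```

Under these assumptions the rate peaks at 5e-4 at update 8,000 and has decayed to roughly 3.4e-4 by update 17,000, so the last four checkpoints (updates 14,000 to 17,000) all fall in the decay phase.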

## Evaluation

### Variables and metrics

We use the BLEU score for evaluation on the [Flores-200](https://github.com/facebookresearch/flores/tree/main/flores200) and
[NTREX](https://github.com/MicrosoftTranslator/NTREX) test sets.

### Evaluation results

The table below compares the BLEU scores of this model for Catalan-to-Chinese translation with those of [Google Translate](https://translate.google.com/),
[M2M 1.2B](https://huggingface.co/facebook/m2m100_1.2B) and [NLLB-200's distilled 1.3B variant](https://huggingface.co/facebook/nllb-200-distilled-1.3B):

| Test set       | Google Translate | M2M 1.2B | NLLB 1.3B | aina-translator-ca-zh |
|----------------|------------------|----------|-----------|-----------------------|
| Flores Dev     | **42.6**         | 27.8     | 18.9      | 31.4                  |
| Flores Devtest | **43.7**         | 28.4     | 18.4      | 32.6                  |
| NTREX          | **36.3**         | 24.4     | 14.2      | 26.6                  |
| Average        | **41.0**         | 26.9     | 17.0      | 30.2                  |

## Additional information

### Author
The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <langtech@bsc.es>.

### Copyright
Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).

### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.

Be aware that the model may have biases and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it)
or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and,
in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (Barcelona Supercomputing Center)
be liable for any results arising from the use made by third parties.