---
license: apache-2.0
datasets:
- projecte-aina/CA-ZH_Parallel_Corpus
language:
- zh
- ca
base_model:
- facebook/m2m100_1.2B
---
## Projecte Aina’s Catalan-Chinese machine translation model

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [How to use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training](#training)
- [Evaluation](#evaluation)
- [Additional information](#additional-information)

</details>

## Model description

This machine translation model is built on M2M100 1.2B and fine-tuned specifically for Catalan-Chinese translation.
It was trained on a combination of Catalan-Chinese datasets
totalling 94,187,858 sentence pairs. Of these, 113,305 sentence pairs were parallel data collected from the web, while the remaining 94,074,553 sentence pairs
were synthetic parallel data created using the
[Aina Project's Spanish-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-es-ca) and the [Aina Project's English-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-en-ca).

Following the fine-tuning phase, Contrastive Preference Optimization (CPO) was applied to further refine the model's outputs. CPO training involved pairs of "chosen" and "rejected" translations for a total of 4,006 sentences. These sentences were sourced from the Flores development set (997 sentences), the Flores devtest set (1,012 sentences), and the NTREX set (1,997 sentences).

The model was evaluated on Projecte Aina's Catalan-Chinese evaluation dataset, which contains 1,022 sentences.

## Intended uses and limitations

You can use this model for machine translation from Catalan to Simplified Chinese.

## How to use

### Usage

Translate a sentence using Python:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "projecte-aina/aina-translator-ca-zh-v2"

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

sentence = "Benvingut al projecte Aina!"

# Tokenize the Catalan input and generate the translation with beam search.
input_ids = tokenizer(sentence, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=200, num_beams=5)

generated_translation = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
print(generated_translation)
# 欢迎来到 Aina 项目!
```

## Limitations and bias
At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.

## Training

### Training data

The Catalan-Chinese data collected from the web was a combination of the following datasets:

| Dataset           | Sentences before cleaning |
|-------------------|--------------------------:|
| OpenSubtitles     | 139,300 |
| WikiMatrix        | 90,643 |
| Wikipedia         | 68,623 |
| **Total**         | **298,566** |

94,074,553 sentence pairs of synthetic parallel data were created from the following Spanish-Chinese and English-Chinese datasets:

**Spanish-Chinese:**

| Dataset           | Sentences before cleaning |
|-------------------|--------------------------:|
| NLLB              | 24,051,233 |
| UNPC              | 17,599,223 |
| MultiUN           | 9,847,770 |
| OpenSubtitles     | 9,319,658 |
| MultiParaCrawl    | 3,410,087 |
| MultiCCAligned    | 3,006,694 |
| WikiMatrix        | 1,214,322 |
| News Commentary   | 375,982 |
| Tatoeba           | 9,404 |
| **Total**         | **68,834,373** |

**English-Chinese:**

| Dataset           | Sentences before cleaning |
|-------------------|--------------------------:|
| NLLB              | 71,383,325 |
| CCAligned         | 15,181,415 |
| Paracrawl         | 14,170,869 |
| WikiMatrix        | 2,595,119 |
| **Total**         | **103,330,728** |

### Training procedure

#### Data preparation

**Catalan-Chinese parallel data**

The Chinese side of all datasets was first processed using the [Hanzi Identifier](https://github.com/tsroten/hanzidentifier) to detect Traditional Chinese, which was subsequently converted to Simplified Chinese using [OpenCC](https://github.com/BYVoid/OpenCC).

All data was then filtered according to two specific criteria:

- Alignment: sentence-level alignments were calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) and sentence pairs with a score below 0.75 were discarded.

- Language identification: the probability of being the target language was calculated using [Lingua.py](https://github.com/pemistahl/lingua-py) and sentences with a language probability score below 0.5 were discarded.

Next, Spanish data was translated into Catalan using the Aina Project's [Spanish-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-es-ca), while English data was translated into Catalan using the Aina Project's [English-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-en-ca).

The filtered and translated datasets were then concatenated and deduplicated to form a final corpus of 94,187,858 sentence pairs.

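
The filtering and deduplication steps above can be sketched roughly as follows. This is a minimal illustration rather than the pipeline actually used: `alignment_score` and `language_prob` are hypothetical callables standing in for LaBSE similarity and Lingua.py language probability, the language check is assumed to run on the Chinese side only, and deduplication is assumed to operate on exact sentence pairs.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, Chinese sentence)

ALIGNMENT_THRESHOLD = 0.75  # alignment cutoff stated in the text
LANGUAGE_THRESHOLD = 0.5    # language probability cutoff stated in the text


def filter_pairs(
    pairs: Iterable[Pair],
    alignment_score: Callable[[str, str], float],  # hypothetical LaBSE wrapper
    language_prob: Callable[[str], float],         # hypothetical Lingua.py wrapper
) -> List[Pair]:
    """Keep only pairs that pass both the alignment and language checks."""
    kept = []
    for src, tgt in pairs:
        # Discard pairs whose alignment score is below the cutoff.
        if alignment_score(src, tgt) < ALIGNMENT_THRESHOLD:
            continue
        # Assumption: the language probability is checked on the Chinese side.
        if language_prob(tgt) < LANGUAGE_THRESHOLD:
            continue
        kept.append((src, tgt))
    return kept


def deduplicate(pairs: Iterable[Pair]) -> List[Pair]:
    """Drop exact duplicate pairs while preserving first-seen order."""
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique
```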

**Catalan-Chinese Contrastive Preference Optimization dataset**

The CPO dataset is built by comparing the quality of translations across four distinct sources:

- Reference translation: Chinese sentences from the Flores development set, the Flores devtest set, and the NTREX dataset.
- [aina-translator-ca-zh](https://huggingface.co/projecte-aina/aina-translator-ca-zh): A specialized bilingual model for Catalan-Chinese translations.
- Google Translate: A widely-used general-purpose machine translation system.
- OpenAI GPT-4: A large-scale language model capable of performing a wide range of tasks in conversational settings, including high-quality translation.

To evaluate the quality of translations without relying on human annotations, we employ two reference-free evaluation models:

- [Unbabel/wmt23-cometkiwi-da-xxl](https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xxl)
- [Unbabel/XCOMET-XXL](https://huggingface.co/Unbabel/XCOMET-XXL)

These models provide direct assessment scores for each translation. The scores from both models are averaged to determine the relative quality of each translation. Based on this evaluation, the highest-scoring ("chosen") and lowest-scoring ("rejected") translations are identified for each source sentence, forming contrastive pairs. The CPO dataset comprises a total of 4,006 such pairs of "chosen" and "rejected" translations.
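
As a rough illustration of the pair selection just described, the sketch below averages the two metric scores for each candidate and picks the best and worst translation. It assumes the scores have already been computed; the function name, the system labels, and the `prompt`/`chosen`/`rejected` field names are illustrative assumptions, not taken from the actual dataset.

```python
from statistics import mean
from typing import Dict, List


def build_preference_pair(
    source: str,
    candidates: Dict[str, str],       # system name -> Chinese translation
    scores: Dict[str, List[float]],   # system name -> one score per metric
) -> Dict[str, str]:
    """Average the metric scores per system and select chosen/rejected."""
    averaged = {name: mean(scores[name]) for name in candidates}
    best = max(averaged, key=averaged.get)
    worst = min(averaged, key=averaged.get)
    return {
        "prompt": source,
        "chosen": candidates[best],
        "rejected": candidates[worst],
    }
```

Called once per source sentence with the four candidate translations, this would yield one contrastive pair per sentence.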

#### Training

The training was executed on NVIDIA GPUs using the Hugging Face Transformers framework.
The model was trained for 245,000 updates.

Following the fine-tuning of the M2M100 1.2B model, Contrastive Preference Optimization (CPO) was performed using our CPO dataset and the Hugging Face CPO Trainer. This phase involved 1,500 updates.

## Evaluation

### Variables and metrics

Below are the evaluation results on Projecte Aina's Catalan-Chinese test set, compared to Google Translate for the CA-ZH direction. The evaluation was conducted using [`tower-eval`](https://github.com/deep-spin/tower-eval) following the standard setting (beam search with beam size 5, limiting the translation length to 200 tokens). We report the following metrics:

- BLEU: Sacrebleu implementation, version 2.4.0.
- ChrF: Sacrebleu implementation.
- Comet: Model checkpoint: "Unbabel/wmt22-comet-da".
- Comet-kiwi: Model checkpoint: "Unbabel/wmt22-cometkiwi-da".

### Evaluation results

Below are the evaluation results for machine translation from Catalan to Chinese, compared to [Google Translate](https://translate.google.com/):

#### Projecte Aina's Catalan-Chinese evaluation dataset

|                          | BLEU ↑ | ChrF ↑ | Comet ↑ | Comet-kiwi ↑ |
|:-------------------------|-------:|-------:|--------:|-------------:|
| aina-translator-ca-zh-v2 | **28.55** | **57.64** | **0.87** | **0.82** |
| Google Translate         | 26.84 | 55.7 | 0.86 | **0.82** |

## Additional information

### Author
The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <langtech@bsc.es>.

### Copyright
Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).

### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.

Be aware that the model may have biases and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it)
or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and,
in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (Barcelona Supercomputing Center)
be liable for any results arising from the use made by third parties.

</details>