update model to ct2
- .gitattributes +1 -0
- README.md +13 -10
- README_English.md +31 -18
- ct2-es-gl_12L/config.json +9 -0
- NOS-MT-es-gl.pt → ct2-es-gl_12L/model.bin +2 -2
- ct2-es-gl_12L/source_vocabulary.json +0 -0
- ct2-es-gl_12L/target_vocabulary.json +0 -0
- phrase_table-es-gl.txt +0 -0
- translate.py +18 -0
.gitattributes
CHANGED
@@ -35,3 +35,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 NOS-MT-es-gl filter=lfs diff=lfs merge=lfs -text
 embeddings/es.emb.txt filter=lfs diff=lfs merge=lfs -text
 embeddings/gl.emb.txt filter=lfs diff=lfs merge=lfs -text
+ct2-es-gl_12L/model.bin filter=lfs diff=lfs merge=lfs -text
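An entry like the added line is what `git lfs track` writes, so the converted model is stored as a pointer file rather than a full binary. A minimal sketch of how such an entry is typically produced (the exact command is not part of this commit):

```bash
# Hypothetical reproduction of the .gitattributes change: register the
# converted model with Git LFS, then stage the pointer and the attributes file.
git lfs track "ct2-es-gl_12L/model.bin"
git add .gitattributes ct2-es-gl_12L/model.bin
```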
README.md
CHANGED
@@ -13,26 +13,29 @@ metrics:
 
 **Descrición do Modelo**
 
-Modelo feito con OpenNMT para o par español-galego utilizando unha arquitectura transformer.
+Modelo feito con OpenNMT-py 3.2 para o par español-galego utilizando unha arquitectura transformer. O modelo foi transformado para o formato da ctranslate2.
 
 **Como traducir con este Modelo**
 
-+ Abrir terminal bash
 + Instalar o [Python 3.9](https://www.python.org/downloads/release/python-390/)
-+ Instalar o [
++ Instalar o [ctranslate 3.2](https://github.com/OpenNMT/CTranslate2)
 + Traducir un input_text utilizando o modelo NOS-MT-es-gl co seguinte comando:
-```
+```bash
+perl tokenizer.perl < input.txt > input.tok
 ```
-```
+```bash
+subword_nmt.apply_bpe -c ./bpe/es.bpe < input.tok > input.bpe
 ```
 ```bash
-
+python3 translate.py ./ct2-es-gl_12L input.bpe > output.txt
+```
+```bash
+sed -i 's/@@ //g' output.txt
 ```
-+ O resultado da tradución estará no PATH indicado no flag --output.
 
 **Adestramento**
 
-No adestramento, utilizamos córpora auténticos e sintéticos do [ProxectoNós](https://github.com/proxectonos/corpora). Os primeiros son córpora de traducións feitas directamente por tradutores humanos. Os segundos son córpora de traducións español-portugués, que convertemos en español-galego a través da tradución automática portugués-galego con Opentrad/Apertium e transliteración para palabras fóra de vocabulario.
+No adestramento, utilizamos córpora auténticos e sintéticos do [ProxectoNós](https://github.com/proxectonos/corpora). Os primeiros son córpora de traducións feitas directamente por tradutores humanos. É importante salientar que a pesar destes textos seren feitos por humanos, non están libres de erros lingüísticos. Os segundos son córpora de traducións español-portugués, que convertemos en español-galego a través da tradución automática portugués-galego con Opentrad/Apertium e transliteración para palabras fóra de vocabulario.
 
 **Procedemento de adestramento**
 
@@ -43,7 +46,7 @@ No adestramento, utilizamos córpora auténticos e sintéticos do [ProxectoNós]
 + Utilizando o .yaml deste repositorio pode replicar o proceso de adestramento. É preciso modificar os paths do ficheiro .yaml para a Open NMT saber onde ir buscar os textos. Após facer isto, pode comezar o proceso do seguinte xeito:
 
 ```bash
-onmt_build_vocab -config bpe-es-gl_emb.yaml -n_sample
+onmt_build_vocab -config bpe-es-gl_emb.yaml -n_sample 40000
 onmt_train -config bpe-es-gl_emb.yaml
 ```
 
@@ -57,7 +60,7 @@ A avaliación BLEU dos modelos é feita cunha mistura de tests desenvolvidos int
 
 | GOLD 1 | GOLD 2 | FLORES | TEST-SUITE|
 | ------------- |:-------------:| -------:|----------:|
-| 79.
+| 79.5 | 43.5 | 21.4 | 73.4 |
 
 **Licenzas do Modelo**
 
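Read together, the new README steps form a single four-stage pipeline: tokenize, apply source-side BPE, translate with the CTranslate2 model, and strip the BPE markers. A condensed sketch, assuming `tokenizer.perl` and `./bpe/es.bpe` are present where the README expects them:

```bash
# End-to-end es->gl translation with the converted ct2 model.
perl tokenizer.perl < input.txt > input.tok                     # 1. tokenize the Spanish input
subword_nmt.apply_bpe -c ./bpe/es.bpe < input.tok > input.bpe   # 2. apply source-side BPE
python3 translate.py ./ct2-es-gl_12L input.bpe > output.txt     # 3. translate with CTranslate2
sed -i 's/@@ //g' output.txt                                    # 4. remove BPE markers
```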
README_English.md
CHANGED
@@ -9,51 +9,58 @@ metrics:
 - bleu (Test-suite): 74.3
 ---
 
-**
+**English text [here](https://huggingface.co/proxectonos/NOS-MT-OpenNMT-es-gl/blob/main/README_English.md)**
 
-Model
+**Model Description**
 
-
+Model created with OpenNMT-py 3.2 for the Spanish-Galician pair using a transformer architecture. The model was converted to the ctranslate2 format.
+
+**How to Translate with this Model**
 
-+ Open bash terminal
 + Install [Python 3.9](https://www.python.org/downloads/release/python-390/)
-+ Install [
++ Install [ctranslate 3.2](https://github.com/OpenNMT/CTranslate2)
 + Translate an input_text using the NOS-MT-es-gl model with the following command:
-
+```bash
+perl tokenizer.perl < input.txt > input.tok
+```
+```bash
+subword_nmt.apply_bpe -c ./bpe/es.bpe < input.tok > input.bpe
+```
 ```bash
-
+python3 translate.py ./ct2-es-gl_12L input.bpe > output.txt
+```
+```bash
+sed -i 's/@@ //g' output.txt
 ```
-+ The resulting translation will be in the PATH indicated by the -output flag.
 
 **Training**
 
-
+We used authentic and synthetic corpora from the [ProxectoNós](https://github.com/proxectonos/corpora). The former are translation corpora made directly by human translators. It is important to note that despite these texts being made by humans, they are not free from linguistic errors. The latter are Spanish-Portuguese translation corpora, which we converted into Spanish-Galician through Portuguese-Galician automatic translation with Opentrad/Apertium and transliteration for out-of-vocabulary words.
 
-
+**Training Procedure**
 
-+ All BPE models were generated with the script [learn_bpe.py](https://github.com/OpenNMT/OpenNMT-py/blob/master/tools/learn_bpe.py)
-+ Using the .yaml in this repository, it is possible to replicate the original training process. Before training the model, please verify that the path to each target (tgt) and (src) file is correct. Once this is done, proceed as follows:
++ Tokenization of the datasets was done with the tokenizer (tokenizer.pl) from [linguakit](https://github.com/citiususc/Linguakit), which was modified to avoid line breaks per token in the original file.
+
++ The BPE vocabulary for the models was generated with the [learn_bpe.py](https://github.com/OpenNMT/OpenNMT-py/blob/master/tools/learn_bpe.py) script from OpenNMT.
+
++ Using the .yaml from this repository, you can replicate the training process. It is necessary to modify the paths in the .yaml file for Open NMT to know where to find the texts. After doing this, you can start the process as follows:
 
 ```bash
-onmt_build_vocab -config bpe-es-gl_emb.yaml -n_sample
+onmt_build_vocab -config bpe-es-gl_emb.yaml -n_sample 40000
 onmt_train -config bpe-es-gl_emb.yaml
 ```
 
 **Hyperparameters**
 
-
+The parameters used for the model development can be consulted directly in the .yaml file bpe-es-gl_emb.yaml
 
 **Evaluation**
 
-The BLEU evaluation of the models is done
+The BLEU evaluation of the models is done with a mix of internally developed tests (gold1, gold2, test-suite) and other datasets available in Galician (Flores).
 
 | GOLD 1 | GOLD 2 | FLORES | TEST-SUITE|
 | ------------- |:-------------:| -------:|----------:|
-| 79.
+| 79.5 | 43.5 | 21.4 | 73.4 |
 
 **Licensing information**
 
@@ -85,3 +92,9 @@ This research was funded by the project "Nós: Galician in the society and econo
 
 **Citation Information**
 
+Daniel Bardanca Outeirinho, Pablo Gamallo Otero, Iria de-Dios-Flores, and José Ramom Pichel Campos. 2024.
+Exploring the effects of vocabulary size in neural machine translation: Galician as a target language.
+In Proceedings of the 16th International Conference on Computational Processing of Portuguese, pages 600–604,
+Santiago de Compostela, Galiza. Association for Computational Linguistics.
+
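The evaluation table reports BLEU on two gold sets, FLORES, and a test suite. As an illustration of how such scores are commonly computed (sacreBLEU is an assumption here, not named anywhere in the commit):

```bash
# Hypothetical scoring step: compare the detokenized model output against a
# reference translation. reference.gl.txt is a placeholder file name.
sacrebleu reference.gl.txt -i output.txt -m bleu
```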
ct2-es-gl_12L/config.json
ADDED
@@ -0,0 +1,9 @@
+{
+    "add_source_bos": false,
+    "add_source_eos": false,
+    "bos_token": "<s>",
+    "decoder_start_token": "<s>",
+    "eos_token": "</s>",
+    "layer_norm_epsilon": null,
+    "unk_token": "<unk>"
+}
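This config.json, together with model.bin and the two vocabulary files, is the directory layout CTranslate2's converter emits. A sketch of how the directory was likely produced from the checkpoint this commit renames (the exact invocation is not recorded in the commit):

```bash
# Hypothetical conversion: OpenNMT-py checkpoint -> CTranslate2 directory
# (writes model.bin, config.json, source/target_vocabulary.json).
ct2-opennmt-py-converter --model_path NOS-MT-es-gl.pt --output_dir ct2-es-gl_12L
```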
NOS-MT-es-gl.pt → ct2-es-gl_12L/model.bin
RENAMED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:3e76e5cf8a252ca8d54fcc7f841792809df0271e9b06613f16616c1f8aeae1f5
+size 497179365
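The pointer above means the roughly 497 MB model.bin lives in LFS storage, so a plain clone only fetches the stub. A minimal sketch for retrieving the real weights (the repository URL is taken from the README link):

```bash
# Fetch the LFS-backed model weights after cloning the repository.
git lfs install
git clone https://huggingface.co/proxectonos/NOS-MT-OpenNMT-es-gl
cd NOS-MT-OpenNMT-es-gl && git lfs pull
```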
ct2-es-gl_12L/source_vocabulary.json
ADDED
The diff for this file is too large to render. See raw diff.

ct2-es-gl_12L/target_vocabulary.json
ADDED
The diff for this file is too large to render. See raw diff.

phrase_table-es-gl.txt
DELETED
The diff for this file is too large to render. See raw diff.
translate.py
ADDED
@@ -0,0 +1,18 @@
+import ctranslate2
+import sys
+
+
+model = sys.argv[1]
+file_name = sys.argv[2]
+
+file = open(file_name, 'r')
+
+translator = ctranslate2.Translator(model, device="cuda")
+
+for line in file:
+    line = line.strip()
+    r = translator.translate_batch(
+        [line.split()], replace_unknowns=True, beam_size=5, batch_type='examples'
+    )
+    results = ' '.join(r[0].hypotheses[0])
+    print(results)
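The new script reads one BPE-encoded sentence per line and hard-codes `device="cuda"`, so it will fail on a machine without a GPU. A hedged variant with a CPU fallback (the fallback logic is an addition for illustration, not part of the commit):

```python
# Sketch: same loop as translate.py, falling back to CPU when CUDA is unavailable.
import sys
import ctranslate2

model_dir, file_name = sys.argv[1], sys.argv[2]

try:
    translator = ctranslate2.Translator(model_dir, device="cuda")
except (RuntimeError, ValueError):  # no CUDA device, or a CPU-only build
    translator = ctranslate2.Translator(model_dir, device="cpu")

with open(file_name) as f:
    for line in f:
        result = translator.translate_batch(
            [line.strip().split()], replace_unknowns=True, beam_size=5
        )
        print(" ".join(result[0].hypotheses[0]))
```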