imdbo committed
Commit 436467d
1 Parent(s): c7ac43d

update model to ct2
.gitattributes CHANGED
@@ -35,3 +35,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  NOS-MT-es-gl filter=lfs diff=lfs merge=lfs -text
  embeddings/es.emb.txt filter=lfs diff=lfs merge=lfs -text
  embeddings/gl.emb.txt filter=lfs diff=lfs merge=lfs -text
+ ct2-es-gl_12L/model.bin filter=lfs diff=lfs merge=lfs -text
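
The new pointer rule mirrors what `git lfs track` writes. A minimal sketch of how such an entry is produced, assuming git-lfs is installed:

```bash
# Track the converted model weights with Git LFS; this appends the
# "filter=lfs diff=lfs merge=lfs -text" rule to .gitattributes
git lfs track "ct2-es-gl_12L/model.bin"
git add .gitattributes ct2-es-gl_12L/model.bin
```
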
README.md CHANGED
@@ -13,26 +13,29 @@ metrics:
 
  **Model Description**
 
- Model built with OpenNMT for the Spanish-Galician pair using a transformer architecture.
+ Model built with OpenNMT-py 3.2 for the Spanish-Galician pair using a transformer architecture. The model was converted to the CTranslate2 format.
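
For reference, an OpenNMT-py checkpoint is converted to the CTranslate2 format with the converter bundled with CTranslate2. A minimal sketch, assuming the original NOS-MT-es-gl.pt checkpoint and the output directory name used in this commit; any quantization settings actually used are not recorded here:

```bash
# Convert the OpenNMT-py checkpoint into a CTranslate2 model directory
ct2-opennmt-py-converter --model_path NOS-MT-es-gl.pt --output_dir ct2-es-gl_12L
```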
 
  **How to Translate with this Model**
 
- + Open a bash terminal
  + Install [Python 3.9](https://www.python.org/downloads/release/python-390/)
- + Install [Open NMT toolkit v.2.2](https://github.com/OpenNMT/OpenNMT-py)
+ + Install [CTranslate2 3.2](https://github.com/OpenNMT/CTranslate2)
  + Translate an input_text using the NOS-MT-es-gl model with the following commands:
- ```perl tokenizer.perl < input.txt > input.tok
- ```
- ```subword_nmt.apply_bpe -c ./bpe/gl.bpe < input.tok > input.bpe
- ```
- ```bash
- onmt_translate -src input.bpe -model NOS-MT-es-gl.pt --output ./output_file.txt --replace_unk --phrase_table phrase_table-es-gl.txt -gpu 0
- ```
- + The translation will be written to the path given by the --output flag.
+ ```bash
+ perl tokenizer.perl < input.txt > input.tok
+ ```
+ ```bash
+ subword_nmt.apply_bpe -c ./bpe/es.bpe < input.tok > input.bpe
+ ```
+ ```bash
+ python3 translate.py ./ct2-es-gl_12L input.bpe > output.txt
+ ```
+ ```bash
+ sed -i 's/@@ //g' output.txt
+ ```
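
Taken together, the four commands above form a single pipeline. A minimal end-to-end sketch for one Spanish sentence, assuming tokenizer.perl and the BPE codes sit at the paths shown above:

```bash
# Tokenise, apply source-side BPE, translate, and strip the BPE markers
echo "Esta es una prueba." > input.txt
perl tokenizer.perl < input.txt > input.tok
subword_nmt.apply_bpe -c ./bpe/es.bpe < input.tok > input.bpe
python3 translate.py ./ct2-es-gl_12L input.bpe | sed 's/@@ //g'
```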
 
  **Training**
 
- For training we used authentic and synthetic corpora from [ProxectoNós](https://github.com/proxectonos/corpora). The former are corpora of translations made directly by human translators. The latter are corpora of Spanish-Portuguese translations, which we converted into Spanish-Galician through Portuguese-Galician machine translation with Opentrad/Apertium and transliteration for out-of-vocabulary words.
+ For training we used authentic and synthetic corpora from [ProxectoNós](https://github.com/proxectonos/corpora). The former are corpora of translations made directly by human translators. It is worth noting that, although these texts were made by humans, they are not free of linguistic errors. The latter are corpora of Spanish-Portuguese translations, which we converted into Spanish-Galician through Portuguese-Galician machine translation with Opentrad/Apertium and transliteration for out-of-vocabulary words.
 
  **Training Procedure**
 
@@ -43,7 +46,7 @@ For training we used authentic and synthetic corpora from [ProxectoNós]
  + Using the .yaml in this repository, you can replicate the training process. You will need to modify the paths in the .yaml file so that OpenNMT knows where to find the texts. After doing this, you can start the process as follows:
 
  ```bash
- onmt_build_vocab -config bpe-es-gl_emb.yaml -n_sample 100000
+ onmt_build_vocab -config bpe-es-gl_emb.yaml -n_sample 40000
  onmt_train -config bpe-es-gl_emb.yaml
  ```
 
@@ -57,7 +60,7 @@ The BLEU evaluation of the models is done with a mix of internally developed te
 
  | GOLD 1 | GOLD 2 | FLORES | TEST-SUITE|
  | ------------- |:-------------:| -------:|----------:|
- | 79.6 | 43.3 | 21.8 | 74.3 |
+ | 79.5 | 43.5 | 21.4 | 73.4 |
 
  **Model Licenses**
 
 
README_English.md CHANGED
@@ -9,51 +9,58 @@ metrics:
  - bleu (Test-suite): 74.3
  ---
 
- **Model description**
- 
- Model developed with OpenNMT for the Spanish-Galician pair using the transformer architecture.
- 
- **How to translate**
+ **English text [here](https://huggingface.co/proxectonos/NOS-MT-OpenNMT-es-gl/blob/main/README_English.md)**
+ 
+ **Model Description**
+ 
+ Model created with OpenNMT-py 3.2 for the Spanish-Galician pair using a transformer architecture. The model was converted to the CTranslate2 format.
+ 
+ **How to Translate with this Model**
 
- + Open bash terminal
  + Install [Python 3.9](https://www.python.org/downloads/release/python-390/)
- + Install [Open NMT toolkit v.2.2](https://github.com/OpenNMT/OpenNMT-py)
+ + Install [CTranslate2 3.2](https://github.com/OpenNMT/CTranslate2)
  + Translate an input_text using the NOS-MT-es-gl model with the following commands:
- ```bash
- onmt_translate -src input_text -model NOS-MT-es-gl -output ./output_file.txt -replace_unk -phrase_table phrase_table-es-gl.txt -gpu 0
- ```
- + The resulting translation will be in the PATH indicated by the -output flag.
+ ```bash
+ perl tokenizer.perl < input.txt > input.tok
+ ```
+ ```bash
+ subword_nmt.apply_bpe -c ./bpe/es.bpe < input.tok > input.bpe
+ ```
+ ```bash
+ python3 translate.py ./ct2-es-gl_12L input.bpe > output.txt
+ ```
+ ```bash
+ sed -i 's/@@ //g' output.txt
+ ```
 
  **Training**
 
- To train this model, we have used authentic and synthetic corpora from [ProxectoNós](https://github.com/proxectonos/corpora).
- 
- Authentic corpora are corpora produced by human translators. Synthetic corpora are Spanish-Portuguese translations, which have been converted to Spanish-Galician by means of Portuguese-Galician translation with Opentrad/Apertium and transliteration for out-of-vocabulary words.
+ We used authentic and synthetic corpora from [ProxectoNós](https://github.com/proxectonos/corpora). The former are corpora of translations made directly by human translators. It is worth noting that, although these texts were made by humans, they are not free of linguistic errors. The latter are Spanish-Portuguese translation corpora, which we converted into Spanish-Galician through Portuguese-Galician machine translation with Opentrad/Apertium and transliteration for out-of-vocabulary words.
 
- **Training process**
+ **Training Procedure**
 
- + Tokenisation was performed with a modified version of the [linguakit](https://github.com/citiususc/Linguakit) tokeniser (tokenizer.pl) that does not append a new line after each token.
- + All BPE models were generated with the script [learn_bpe.py](https://github.com/OpenNMT/OpenNMT-py/blob/master/tools/learn_bpe.py)
- + Using the .yaml in this repository, it is possible to replicate the original training process. Before training the model, please verify that the path to each target (tgt) and (src) file is correct. Once this is done, proceed as follows:
+ + Tokenisation of the datasets was done with the [linguakit](https://github.com/citiususc/Linguakit) tokeniser (tokenizer.pl), modified so that it does not append a line break after each token.
+ 
+ + The BPE vocabulary for the models was generated with the [learn_bpe.py](https://github.com/OpenNMT/OpenNMT-py/blob/master/tools/learn_bpe.py) script from OpenNMT; a sketch of this step follows the code block below.
+ 
+ + Using the .yaml from this repository, you can replicate the training process. You will need to modify the paths in the .yaml file so that OpenNMT knows where to find the texts. After doing this, you can start the process as follows:
 
  ```bash
- onmt_build_vocab -config bpe-es-gl_emb.yaml -n_sample 100000
+ onmt_build_vocab -config bpe-es-gl_emb.yaml -n_sample 40000
  onmt_train -config bpe-es-gl_emb.yaml
  ```
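
As referenced in the list above, the BPE codes applied at translation time come out of the learn_bpe.py step. A minimal sketch; the training file name train.es and the number of merge operations are illustrative assumptions, not values taken from this repository:

```bash
# Learn source-side BPE merge operations from the Spanish training data
python3 learn_bpe.py -i train.es -o ./bpe/es.bpe -s 10000
```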
 
  **Hyperparameters**
 
- You may find the parameters used for this model inside the file bpe-es-gl_emb.yaml
+ The parameters used for this model can be consulted directly in the bpe-es-gl_emb.yaml file.
 
  **Evaluation**
 
- The BLEU evaluation of the models is done by mixing internally developed tests (gold1, gold2, test-suite) and other datasets available in Galician (Flores).
+ The BLEU evaluation of the models is done with a mix of internally developed tests (gold1, gold2, test-suite) and other datasets available in Galician (Flores).
 
  | GOLD 1 | GOLD 2 | FLORES | TEST-SUITE|
  | ------------- |:-------------:| -------:|----------:|
- | 79.6 | 43.3 | 21.8 | 74.3 |
+ | 79.5 | 43.5 | 21.4 | 73.4 |
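
The repository does not name the scorer behind these figures; one common way to compute corpus BLEU against a plain-text reference is sacreBLEU, sketched here with hypothetical file names:

```bash
# Score the detokenised output against a reference translation
sacrebleu reference.gl -i output.txt -m bleu
```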
 
  **Licensing information**
 
@@ -85,3 +92,9 @@ This research was funded by the project "Nós: Galician in the society and econo
 
  **Citation Information**
 
+ 
+ Daniel Bardanca Outeirinho, Pablo Gamallo Otero, Iria de-Dios-Flores, and José Ramom Pichel Campos. 2024.
+ Exploring the effects of vocabulary size in neural machine translation: Galician as a target language.
+ In Proceedings of the 16th International Conference on Computational Processing of Portuguese, pages 600–604,
+ Santiago de Compostela, Galiza. Association for Computational Linguistics.
+ 
ct2-es-gl_12L/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "add_source_bos": false,
+   "add_source_eos": false,
+   "bos_token": "<s>",
+   "decoder_start_token": "<s>",
+   "eos_token": "</s>",
+   "layer_norm_epsilon": null,
+   "unk_token": "<unk>"
+ }
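
These token and preprocessing settings are read automatically when the model directory is loaded, so nothing needs to be passed by hand. A minimal check, assuming the ct2-es-gl_12L directory from this commit is available locally:

```python
import ctranslate2

# The Translator picks up config.json (BOS/EOS/UNK handling) from the model directory
translator = ctranslate2.Translator("ct2-es-gl_12L", device="cpu")
print(translator.device)
```
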
NOS-MT-es-gl.pt → ct2-es-gl_12L/model.bin RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:7bae19d962e0901d52ebd7d3ec43246a9946c003751272b713881c3151d4d8b9
- size 1030405931
+ oid sha256:3e76e5cf8a252ca8d54fcc7f841792809df0271e9b06613f16616c1f8aeae1f5
+ size 497179365
ct2-es-gl_12L/source_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
ct2-es-gl_12L/target_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
phrase_table-es-gl.txt DELETED
The diff for this file is too large to render. See raw diff
 
translate.py ADDED
@@ -0,0 +1,21 @@
+ import ctranslate2
+ import sys
+ 
+ # Usage: python3 translate.py <ct2_model_dir> <bpe_input_file>
+ model = sys.argv[1]
+ file_name = sys.argv[2]
+ 
+ # Load the CTranslate2 model onto the GPU
+ translator = ctranslate2.Translator(model, device="cuda")
+ 
+ with open(file_name, 'r') as file:
+     for line in file:
+         line = line.strip()
+         # translate_batch expects a list of token lists; the input is
+         # already BPE-segmented, so splitting on whitespace is enough
+         r = translator.translate_batch(
+             [line.split()], replace_unknowns=True, beam_size=5, batch_type='examples'
+         )
+         # Join the tokens of the best hypothesis back into one line
+         results = ' '.join(r[0].hypotheses[0])
+         print(results)
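
translate.py above sends one sentence per call to translate_batch; CTranslate2 can also batch internally, which is usually faster on whole files. A minimal variant under the same command-line arguments; the max_batch_size value is an arbitrary assumption:

```python
import ctranslate2
import sys

model, file_name = sys.argv[1], sys.argv[2]
translator = ctranslate2.Translator(model, device="cuda")

# Read all BPE-segmented sentences up front and let CTranslate2 batch them
with open(file_name) as f:
    batch = [line.strip().split() for line in f]

results = translator.translate_batch(
    batch, max_batch_size=32, beam_size=5, replace_unknowns=True
)
for r in results:
    print(' '.join(r.hypotheses[0]))
```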