imdbo committed
Commit 436467d
1 Parent(s): c7ac43d

update model to ct2
.gitattributes CHANGED
@@ -35,3 +35,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  NOS-MT-es-gl filter=lfs diff=lfs merge=lfs -text
  embeddings/es.emb.txt filter=lfs diff=lfs merge=lfs -text
  embeddings/gl.emb.txt filter=lfs diff=lfs merge=lfs -text
+ ct2-es-gl_12L/model.bin filter=lfs diff=lfs merge=lfs -text
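
The new pointer rule mirrors what `git lfs track` writes. A minimal sketch of how such an entry is produced, assuming git-lfs is installed:

```bash
# Track the converted model weights with Git LFS; this appends the
# "filter=lfs diff=lfs merge=lfs -text" rule to .gitattributes
git lfs track "ct2-es-gl_12L/model.bin"
git add .gitattributes ct2-es-gl_12L/model.bin
```
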
README.md CHANGED
@@ -13,26 +13,29 @@ metrics:
 
  **Model Description**
 
- Model built with OpenNMT for the Spanish-Galician pair using a transformer architecture.
+ Model built with OpenNMT-py 3.2 for the Spanish-Galician pair using a transformer architecture. The model was converted to the CTranslate2 format.
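
For reference, an OpenNMT-py checkpoint is converted to the CTranslate2 format with the converter bundled with CTranslate2. A minimal sketch, assuming the original NOS-MT-es-gl.pt checkpoint and the output directory name used in this commit; any quantization settings actually used are not recorded here:

```bash
# Convert the OpenNMT-py checkpoint into a CTranslate2 model directory
ct2-opennmt-py-converter --model_path NOS-MT-es-gl.pt --output_dir ct2-es-gl_12L
```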
 
  **How to Translate with this Model**
 
- + Open a bash terminal
  + Install [Python 3.9](https://www.python.org/downloads/release/python-390/)
- + Install [Open NMT toolkit v.2.2](https://github.com/OpenNMT/OpenNMT-py)
+ + Install [CTranslate2 3.2](https://github.com/OpenNMT/CTranslate2)
  + Translate an input_text using the NOS-MT-es-gl model with the following commands:
- ```perl tokenizer.perl < input.txt > input.tok
- ```
- ```subword_nmt.apply_bpe -c ./bpe/gl.bpe < input.tok > input.bpe
- ```
- ```bash
- onmt_translate -src input.bpe -model NOS-MT-es-gl.pt --output ./output_file.txt --replace_unk --phrase_table phrase_table-es-gl.txt -gpu 0
- ```
- + The translation will be written to the path given by the --output flag.
+ ```bash
+ perl tokenizer.perl < input.txt > input.tok
+ ```
+ ```bash
+ subword_nmt.apply_bpe -c ./bpe/es.bpe < input.tok > input.bpe
+ ```
+ ```bash
+ python3 translate.py ./ct2-es-gl_12L input.bpe > output.txt
+ ```
+ ```bash
+ sed -i 's/@@ //g' output.txt
+ ```
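
Taken together, the four commands above form a single pipeline. A minimal end-to-end sketch for one Spanish sentence, assuming tokenizer.perl and the BPE codes sit at the paths shown above:

```bash
# Tokenise, apply source-side BPE, translate, and strip the BPE markers
echo "Esta es una prueba." > input.txt
perl tokenizer.perl < input.txt > input.tok
subword_nmt.apply_bpe -c ./bpe/es.bpe < input.tok > input.bpe
python3 translate.py ./ct2-es-gl_12L input.bpe | sed 's/@@ //g'
```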
 
  **Training**
 
- For training we used authentic and synthetic corpora from [ProxectoNós](https://github.com/proxectonos/corpora). The former are corpora of translations made directly by human translators. The latter are corpora of Spanish-Portuguese translations, which we converted into Spanish-Galician through Portuguese-Galician machine translation with Opentrad/Apertium and transliteration for out-of-vocabulary words.
+ For training we used authentic and synthetic corpora from [ProxectoNós](https://github.com/proxectonos/corpora). The former are corpora of translations made directly by human translators. It is worth noting that, although these texts were made by humans, they are not free of linguistic errors. The latter are corpora of Spanish-Portuguese translations, which we converted into Spanish-Galician through Portuguese-Galician machine translation with Opentrad/Apertium and transliteration for out-of-vocabulary words.
 
  **Training Procedure**
 
@@ -43,7 +46,7 @@ For training we used authentic and synthetic corpora from [ProxectoNós]
  + Using the .yaml in this repository, you can replicate the training process. You will need to modify the paths in the .yaml file so that OpenNMT knows where to find the texts. After doing this, you can start the process as follows:
 
  ```bash
- onmt_build_vocab -config bpe-es-gl_emb.yaml -n_sample 100000
+ onmt_build_vocab -config bpe-es-gl_emb.yaml -n_sample 40000
  onmt_train -config bpe-es-gl_emb.yaml
  ```
 
@@ -57,7 +60,7 @@ The BLEU evaluation of the models is done with a mix of internally developed te
 
  | GOLD 1 | GOLD 2 | FLORES | TEST-SUITE|
  | ------------- |:-------------:| -------:|----------:|
- | 79.6 | 43.3 | 21.8 | 74.3 |
+ | 79.5 | 43.5 | 21.4 | 73.4 |
 
  **Model Licenses**
 
 
README_English.md CHANGED
@@ -9,51 +9,58 @@ metrics:
  - bleu (Test-suite): 74.3
  ---
 
- **Model description**
- 
- Model developed with OpenNMT for the Spanish-Galician pair using the transformer architecture.
- 
- **How to translate**
+ **English text [here](https://huggingface.co/proxectonos/NOS-MT-OpenNMT-es-gl/blob/main/README_English.md)**
+ 
+ **Model Description**
+ 
+ Model created with OpenNMT-py 3.2 for the Spanish-Galician pair using a transformer architecture. The model was converted to the CTranslate2 format.
+ 
+ **How to Translate with this Model**
 
- + Open bash terminal
  + Install [Python 3.9](https://www.python.org/downloads/release/python-390/)
- + Install [Open NMT toolkit v.2.2](https://github.com/OpenNMT/OpenNMT-py)
+ + Install [CTranslate2 3.2](https://github.com/OpenNMT/CTranslate2)
  + Translate an input_text using the NOS-MT-es-gl model with the following commands:
- ```bash
- onmt_translate -src input_text -model NOS-MT-es-gl -output ./output_file.txt -replace_unk -phrase_table phrase_table-es-gl.txt -gpu 0
- ```
- + The resulting translation will be in the PATH indicated by the -output flag.
+ ```bash
+ perl tokenizer.perl < input.txt > input.tok
+ ```
+ ```bash
+ subword_nmt.apply_bpe -c ./bpe/es.bpe < input.tok > input.bpe
+ ```
+ ```bash
+ python3 translate.py ./ct2-es-gl_12L input.bpe > output.txt
+ ```
+ ```bash
+ sed -i 's/@@ //g' output.txt
+ ```
 
  **Training**
 
- To train this model, we have used authentic and synthetic corpora from [ProxectoNós](https://github.com/proxectonos/corpora).
- 
- Authentic corpora are corpora produced by human translators. Synthetic corpora are Spanish-Portuguese translations, which have been converted to Spanish-Galician by means of Portuguese-Galician translation with Opentrad/Apertium and transliteration for out-of-vocabulary words.
+ We used authentic and synthetic corpora from [ProxectoNós](https://github.com/proxectonos/corpora). The former are corpora of translations made directly by human translators. It is worth noting that, although these texts were made by humans, they are not free of linguistic errors. The latter are Spanish-Portuguese translation corpora, which we converted into Spanish-Galician through Portuguese-Galician machine translation with Opentrad/Apertium and transliteration for out-of-vocabulary words.
 
- **Training process**
+ **Training Procedure**
 
- + Tokenisation was performed with a modified version of the [linguakit](https://github.com/citiususc/Linguakit) tokeniser (tokenizer.pl) that does not append a new line after each token.
- + All BPE models were generated with the script [learn_bpe.py](https://github.com/OpenNMT/OpenNMT-py/blob/master/tools/learn_bpe.py)
- + Using the .yaml in this repository, it is possible to replicate the original training process. Before training the model, please verify that the path to each target (tgt) and (src) file is correct. Once this is done, proceed as follows:
+ + Tokenisation of the datasets was done with the [linguakit](https://github.com/citiususc/Linguakit) tokeniser (tokenizer.pl), modified so that it does not append a line break after each token.
+ 
+ + The BPE vocabulary for the models was generated with the [learn_bpe.py](https://github.com/OpenNMT/OpenNMT-py/blob/master/tools/learn_bpe.py) script from OpenNMT; a sketch of this step follows the code block below.
+ 
+ + Using the .yaml from this repository, you can replicate the training process. You will need to modify the paths in the .yaml file so that OpenNMT knows where to find the texts. After doing this, you can start the process as follows:
 
  ```bash
- onmt_build_vocab -config bpe-es-gl_emb.yaml -n_sample 100000
+ onmt_build_vocab -config bpe-es-gl_emb.yaml -n_sample 40000
  onmt_train -config bpe-es-gl_emb.yaml
  ```
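
As referenced in the list above, the BPE codes applied at translation time come out of the learn_bpe.py step. A minimal sketch; the training file name train.es and the number of merge operations are illustrative assumptions, not values taken from this repository:

```bash
# Learn source-side BPE merge operations from the Spanish training data
python3 learn_bpe.py -i train.es -o ./bpe/es.bpe -s 10000
```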
 
  **Hyperparameters**
 
- You may find the parameters used for this model inside the file bpe-es-gl_emb.yaml
+ The parameters used for this model can be consulted directly in the bpe-es-gl_emb.yaml file.
 
  **Evaluation**
 
- The BLEU evaluation of the models is done by mixing internally developed tests (gold1, gold2, test-suite) and other datasets available in Galician (Flores).
+ The BLEU evaluation of the models is done with a mix of internally developed tests (gold1, gold2, test-suite) and other datasets available in Galician (Flores).
 
  | GOLD 1 | GOLD 2 | FLORES | TEST-SUITE|
  | ------------- |:-------------:| -------:|----------:|
- | 79.6 | 43.3 | 21.8 | 74.3 |
+ | 79.5 | 43.5 | 21.4 | 73.4 |
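
The repository does not name the scorer behind these figures; one common way to compute corpus BLEU against a plain-text reference is sacreBLEU, sketched here with hypothetical file names:

```bash
# Score the detokenised output against a reference translation
sacrebleu reference.gl -i output.txt -m bleu
```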
 
  **Licensing information**
 
@@ -85,3 +92,9 @@ This research was funded by the project "Nós: Galician in the society and econo
 
  **Citation Information**
 
+ 
+ Daniel Bardanca Outeirinho, Pablo Gamallo Otero, Iria de-Dios-Flores, and José Ramom Pichel Campos. 2024.
+ Exploring the effects of vocabulary size in neural machine translation: Galician as a target language.
+ In Proceedings of the 16th International Conference on Computational Processing of Portuguese, pages 600–604,
+ Santiago de Compostela, Galiza. Association for Computational Linguistics.
+ 
ct2-es-gl_12L/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "add_source_bos": false,
+   "add_source_eos": false,
+   "bos_token": "<s>",
+   "decoder_start_token": "<s>",
+   "eos_token": "</s>",
+   "layer_norm_epsilon": null,
+   "unk_token": "<unk>"
+ }
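
These token and preprocessing settings are read automatically when the model directory is loaded, so nothing needs to be passed by hand. A minimal check, assuming the ct2-es-gl_12L directory from this commit is available locally:

```python
import ctranslate2

# The Translator picks up config.json (BOS/EOS/UNK handling) from the model directory
translator = ctranslate2.Translator("ct2-es-gl_12L", device="cpu")
print(translator.device)
```
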
NOS-MT-es-gl.pt → ct2-es-gl_12L/model.bin RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:7bae19d962e0901d52ebd7d3ec43246a9946c003751272b713881c3151d4d8b9
- size 1030405931
+ oid sha256:3e76e5cf8a252ca8d54fcc7f841792809df0271e9b06613f16616c1f8aeae1f5
+ size 497179365
ct2-es-gl_12L/source_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
ct2-es-gl_12L/target_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
phrase_table-es-gl.txt DELETED
The diff for this file is too large to render. See raw diff
 
translate.py ADDED
@@ -0,0 +1,21 @@
+ import ctranslate2
+ import sys
+ 
+ # Usage: python3 translate.py <ct2_model_dir> <bpe_input_file>
+ model = sys.argv[1]
+ file_name = sys.argv[2]
+ 
+ # Load the CTranslate2 model onto the GPU
+ translator = ctranslate2.Translator(model, device="cuda")
+ 
+ with open(file_name, 'r') as file:
+     for line in file:
+         line = line.strip()
+         # translate_batch expects a list of token lists; the input is
+         # already BPE-segmented, so splitting on whitespace is enough
+         r = translator.translate_batch(
+             [line.split()], replace_unknowns=True, beam_size=5, batch_type='examples'
+         )
+         # Join the tokens of the best hypothesis back into one line
+         results = ' '.join(r[0].hypotheses[0])
+         print(results)
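
translate.py above sends one sentence per call to translate_batch; CTranslate2 can also batch internally, which is usually faster on whole files. A minimal variant under the same command-line arguments; the max_batch_size value is an arbitrary assumption:

```python
import ctranslate2
import sys

model, file_name = sys.argv[1], sys.argv[2]
translator = ctranslate2.Translator(model, device="cuda")

# Read all BPE-segmented sentences up front and let CTranslate2 batch them
with open(file_name) as f:
    batch = [line.strip().split() for line in f]

results = translator.translate_batch(
    batch, max_batch_size=32, beam_size=5, replace_unknowns=True
)
for r in results:
    print(' '.join(r.hypotheses[0]))
```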