imdbo committed on
Commit 3978465
1 Parent(s): 04749e2

Create README_English.md

Files changed (1)
  1. README_English.md +90 -0
README_English.md ADDED
@@ -0,0 +1,90 @@
+ ---
+ license: mit
+ language:
+ - gl
+ metrics:
+ - bleu (Gold1): 82.6
+ - bleu (Gold2): 49.9
+ - bleu (Flores): 23.8
+ - bleu (Test-suite): 77.2
+ ---
+
+ ---
+ License: MIT
+ ---
+
+ **Model Description**
+
+ OpenNMT model for English-Galician translation, using a Transformer architecture.
+
+ **How to translate**
+
+ + Open a bash terminal
+ + Install [Python 3.9](https://www.python.org/downloads/release/python-390/)
+ + Install the [OpenNMT toolkit v.2.2](https://github.com/OpenNMT/OpenNMT-py)
+ + Translate an input_text using the NOS-MT-gl-es model with the following command:
+
+ ```bash
+ onmt_translate -src input_text -model NOS-MT-gl-es.pt -output ./output_file.txt -replace_unk -gpu 0
+ ```
+ + The translation is written to the path given by the -output flag.
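
The translation step above can be wrapped in a small shell function so the same call is reusable for several input files. This is only a sketch under stated assumptions: `translate` and the `DRY_RUN` variable are hypothetical helpers introduced here, not part of OpenNMT-py, and the only flags used are the ones already shown in the command above.

```bash
# Sketch: wrap the README's onmt_translate call in a reusable function.
# translate() and DRY_RUN are hypothetical helpers, not part of OpenNMT-py.
translate() {
  src=$1      # input text file to translate
  model=$2    # path to the downloaded .pt checkpoint
  output=$3   # file the translations are written to
  gpu=${4:-}  # optional GPU id; when empty, no -gpu flag is passed

  # Same flags as the README command; -gpu is added only when a device id is given.
  set -- onmt_translate -src "$src" -model "$model" -output "$output" -replace_unk
  if [ -n "$gpu" ]; then
    set -- "$@" -gpu "$gpu"
  fi

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"   # only print the command that would run
  else
    "$@"        # run the translation
  fi
}
```

With `DRY_RUN=1` set, calling `translate input_text MODEL.pt ./output_file.txt 0` (where `MODEL.pt` stands for the checkpoint path) prints the assembled command instead of running it, which may help when checking paths before starting a long job.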
+
+ **Training**
+
+ We trained the model on authentic and synthetic corpora from [ProxectoNós](https://github.com/proxectonos/corpora). The authentic corpora are translations produced directly by human translators. The synthetic corpora are English-Portuguese translations that we converted into English-Galician by translating the Portuguese side into Galician with Opentrad/Apertium and transliterating out-of-vocabulary words.
+
+ **Training process**
+
+ + The datasets were tokenized with the [Linguakit](https://github.com/citiususc/Linguakit) tokenizer
+ + The vocabulary for the models was generated with OpenNMT's [learn_bpe.py](https://github.com/OpenNMT/OpenNMT-py/blob/master/tools/learn_bpe.py) script
+ + Using the .yaml file in this repository, you can replicate the training process as follows:
+
+ ```bash
+ onmt_build_vocab -config bpe-gl-es_emb.yaml -n_sample 100000
+ onmt_train -config bpe-gl-es_emb.yaml
+ ```
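
The two commands above can also be chained so that training starts only if the vocabulary step succeeds. The following is a minimal sketch, assuming the config file sits in the current directory; `run`, `replicate_training` and `DRY_RUN` are hypothetical helpers introduced for illustration and are not part of OpenNMT-py.

```bash
# Sketch: run the two README training commands in order, stopping at the first failure.
# run(), replicate_training() and DRY_RUN are hypothetical helpers, not part of OpenNMT-py.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"   # only print the command that would run
  else
    "$@"        # run it
  fi
}

replicate_training() {
  config=$1               # the .yaml file shipped in this repository
  n_sample=${2:-100000}   # sample size used in the README command

  # Fail early on a wrong config path, before any long-running step starts.
  if [ ! -f "$config" ]; then
    echo "config not found: $config" >&2
    return 1
  fi

  # onmt_train runs only if onmt_build_vocab exits successfully.
  run onmt_build_vocab -config "$config" -n_sample "$n_sample" &&
    run onmt_train -config "$config"
}
```

Setting `DRY_RUN=1` before calling `replicate_training` with the config path prints both commands without executing them.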
+
+ **Hyper-parameters**
+
+ The parameters used to develop the model can be consulted directly in the same .yaml file, bpe-en-gl_emb.yaml.
+
+ **Evaluation**
+
+ The models are evaluated with BLEU on a mixture of internally developed test sets (gold1, gold2, test-suite) and other datasets available in Galician (Flores).
+
+ | GOLD 1 | GOLD 2 | FLORES | TEST-SUITE |
+ | ------:| ------:| ------:| ----------:|
+ | 82.6 | 49.9 | 23.8 | 77.2 |
+
+
+ **Licensing information**
+
+ MIT License
+
+ Copyright (c) 2023 Proxecto Nós
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+
+ **Funding**
+
+ This research was funded by the project "Nós: Galician in the society and economy of artificial intelligence", an agreement between the Xunta de Galicia and the University of Santiago de Compostela; by grant ED431G2019/04 from the Galician Ministry of Education, University and Professional Training and the European Regional Development Fund (ERDF/FEDER program); and by the Groups of Reference grant ED431C 2020/21.
+
+ **Citation Information**
+
+ Gamallo, Pablo; Bardanca, Daniel; Pichel, José Ramom; García, Marcos; Rodríguez-Rey, Sandra; de-Dios-Flores, Iria. 2023. NOS-MT-OpenNMT-gl-es. URL: https://huggingface.co/proxectonos/NOS-MT-OpenNMT-gl-es