radinplaid committed (verified)
Commit d00d112 · 1 Parent(s): 14220d8

Upload folder using huggingface_hub
README.md CHANGED
@@ -1,3 +1,96 @@
- ---
- license: cc-by-4.0
- ---
+ ---
+ language:
+ - en
+ - fr
+ tags:
+ - translation
+ license: cc-by-4.0
+ datasets:
+ - quickmt/quickmt-train.fr-en
+ model-index:
+ - name: quickmt-fr-en
+   results:
+   - task:
+       name: Translation fra-eng
+       type: translation
+       args: fra-eng
+     dataset:
+       name: flores101-devtest
+       type: flores_101
+       args: fra_Latn eng_Latn devtest
+     metrics:
+     - name: CHRF
+       type: chrf
+       value: 66.77
+     - name: BLEU
+       type: bleu
+       value: 42.17
+     - name: COMET
+       type: comet
+       value: 58.10
+ ---
+
+
+ # `quickmt-fr-en` Neural Machine Translation Model
+
+ `quickmt-fr-en` is a reasonably fast and reasonably accurate neural machine translation model for translation from `fr` into `en`.
+
+
+ ## Model Information
+
+ * Trained using [`eole`](https://github.com/eole-nlp/eole)
+ * 185M-parameter transformer ("big" variant) with 8 encoder layers and 2 decoder layers
+ * 50k joint SentencePiece vocabulary
+ * Exported to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format for fast inference
+ * Training data: https://huggingface.co/datasets/quickmt/quickmt-train.fr-en/tree/main
+
+ See the `eole` model configuration in this repository for further details.
+
+
+ ## Usage with `quickmt`
+
+ If you want to do GPU inference, install the NVIDIA CUDA toolkit first.
+
+ Next, install the `quickmt` Python library and download the model:
+
+ ```bash
+ git clone https://github.com/quickmt/quickmt.git
+ pip install ./quickmt/
+
+ quickmt-model-download quickmt/quickmt-fr-en ./quickmt-fr-en
+ ```
+
+ Finally, use the model in Python:
+
+ ```python
+ from quickmt import Translator
+
+ # Auto-detects GPU; set device to "cpu" to force CPU inference
+ t = Translator("./quickmt-fr-en/", device="auto")
+
+ # Translate - set beam_size to 5 for higher quality (but slower speed)
+ sample_text = "Résigny est une commune française située dans le département de l'Aisne, en région Hauts-de-France. "
+ t(sample_text, beam_size=1)
+
+ # Get alternative translations by sampling
+ # You can pass any CTranslate2 `translate_batch` arguments
+ t([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
+ ```
+
+ The model is in CTranslate2 format and the tokenizers are SentencePiece, so you can use `ctranslate2` directly instead of going through `quickmt`. It should also be possible to use this model with e.g. [LibreTranslate](https://libretranslate.com/), which likewise uses `ctranslate2` and `sentencepiece`.
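
As a rough illustration of that route, here is a minimal sketch (not part of the upstream quickmt documentation) that drives `ctranslate2` and `sentencepiece` directly. It assumes the model has been downloaded to `./quickmt-fr-en` as above and that `joint.spm.model` sits at the repository root, as in the file listing of this commit:

```python
# Sketch: use the CTranslate2 model and SentencePiece tokenizer without quickmt.
import ctranslate2
import sentencepiece as spm

# Optionally fetch the files straight from the Hub instead of quickmt-model-download:
# from huggingface_hub import snapshot_download
# snapshot_download(repo_id="quickmt/quickmt-fr-en", local_dir="./quickmt-fr-en")

sp = spm.SentencePieceProcessor(model_file="./quickmt-fr-en/joint.spm.model")
translator = ctranslate2.Translator("./quickmt-fr-en/", device="auto")

src = "Résigny est une commune française située dans le département de l'Aisne."
tokens = sp.encode(src, out_type=str)              # subword pieces for one sentence
results = translator.translate_batch([tokens], beam_size=5)
print(sp.decode(results[0].hypotheses[0]))         # detokenized English output
```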
+
+
+ ## Metrics
+
+ `bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores) ("fra_Latn"->"eng_Latn"). `comet22` is calculated with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) with `ctranslate2` on an RTX 4070s GPU with batch size 32 (higher speeds are possible with a larger batch size).
+
+ | Model                            | chrf2 | bleu  | comet22 | Time (s) |
+ | -------------------------------- | ----- | ----- | ------- | -------- |
+ | quickmt/quickmt-fr-en            | 68.22 | 44.28 | 88.86   | 1.1      |
+ | Helsinki-NLP/opus-mt-fr-en       | 66.85 | 41.71 | 88.31   | 3.6      |
+ | facebook/m2m100_418M             | 64.39 | 36.49 | 85.87   | 18.0     |
+ | facebook/m2m100_1.2B             | 66.51 | 41.69 | 88.00   | 34.6     |
+ | facebook/nllb-200-distilled-600M | 67.82 | 44.04 | 88.47   | 21.7     |
+ | facebook/nllb-200-distilled-1.3B | 69.30 | 46.22 | 89.24   | 37.1     |
+
+ `quickmt-fr-en` is the fastest of these models and scores higher than `opus-mt-fr-en`, `m2m100_418M`, `m2m100_1.2B` and `nllb-200-distilled-600M`.
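
For reference, a minimal sketch of the scoring recipe described above, using `sacrebleu` and `comet`; the lists `srcs`, `hyps` and `refs` are placeholders for the flores-devtest sources, system outputs and references, not part of this repository:

```python
# Sketch: score hypotheses against references with sacrebleu (BLEU, chrF2)
# and Unbabel's wmt22-comet-da (COMET22).
import sacrebleu
from comet import download_model, load_from_checkpoint

srcs = ["Bonjour le monde."]        # French source sentences (placeholder)
hyps = ["Hello world."]             # system translations (placeholder)
refs = ["Hello, world."]            # reference translations (placeholder)

bleu = sacrebleu.corpus_bleu(hyps, [refs])
chrf = sacrebleu.corpus_chrf(hyps, [refs])   # chrF2 (beta=2) by default
print(f"BLEU  = {bleu.score:.2f}")
print(f"chrF2 = {chrf.score:.2f}")

comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
comet22 = comet_model.predict(data, batch_size=32).system_score
print(f"COMET22 = {100 * comet22:.2f}")      # the table above reports COMET * 100
```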
config.json ADDED
@@ -0,0 +1,10 @@
+ {
+     "add_source_bos": false,
+     "add_source_eos": false,
+     "bos_token": "<s>",
+     "decoder_start_token": "<s>",
+     "eos_token": "</s>",
+     "layer_norm_epsilon": 1e-06,
+     "multi_query_attention": false,
+     "unk_token": "<unk>"
+ }
eole-config.yaml ADDED
@@ -0,0 +1,101 @@
+ ## IO
+ save_data: fr-en/data_spm
+ overwrite: True
+ seed: 1234
+ report_every: 100
+ valid_metrics: ["BLEU"]
+ tensorboard: true
+ tensorboard_log_dir: tensorboard
+
+ ### Vocab
+ src_vocab: fr-en/joint.eole.vocab
+ tgt_vocab: fr-en/joint.eole.vocab
+ src_vocab_size: 50000
+ tgt_vocab_size: 50000
+ vocab_size_multiple: 8
+ share_vocab: True
+ n_sample: 0
+
+ data:
+     corpus_1:
+         path_src: hf://quickmt/quickmt-train.fr-en/fr
+         path_tgt: hf://quickmt/quickmt-train.fr-en/en
+         path_sco: hf://quickmt/quickmt-train.fr-en/sco
+     valid:
+         path_src: fr-en/dev.src
+         path_tgt: fr-en/dev.tgt
+
+ transforms: [sentencepiece, filtertoolong]
+ transforms_configs:
+     sentencepiece:
+         src_subword_model: "fr-en/joint.spm.model"
+         tgt_subword_model: "fr-en/joint.spm.model"
+     filtertoolong:
+         src_seq_length: 256
+         tgt_seq_length: 256
+
+ training:
+     # Run configuration
+     model_path: fr-en/model
+     keep_checkpoint: 4
+     save_checkpoint_steps: 2000
+     train_steps: 100000
+     valid_steps: 2000
+
+     # Train on a single GPU
+     world_size: 1
+     gpu_ranks: [0]
+
+     # Batching
+     batch_type: "tokens"
+     batch_size: 8192
+     valid_batch_size: 8192
+     batch_size_multiple: 8
+     accum_count: [16]
+     accum_steps: [0]
+
+     # Optimizer & Compute
+     compute_dtype: "bf16"
+     optim: "pagedadamw8bit"
+     #optim: "adamw"
+     learning_rate: 2.0
+     warmup_steps: 10000
+     decay_method: "noam"
+     adam_beta2: 0.998
+
+     # Data loading
+     bucket_size: 128000
+     num_workers: 4
+     prefetch_factor: 100
+
+     # Hyperparams
+     dropout_steps: [0]
+     dropout: [0.1]
+     attention_dropout: [0.1]
+     max_grad_norm: 2
+     label_smoothing: 0.1
+     average_decay: 0.0001
+     param_init_method: xavier_uniform
+     normalization: "tokens"
+
+ model:
+     architecture: "transformer"
+     layer_norm: standard
+     share_embeddings: true
+     share_decoder_embeddings: true
+     add_ffnbias: true
+     mlp_activation_fn: gelu
+     add_estimator: false
+     add_qkvbias: false
+     norm_eps: 1e-6
+     hidden_size: 1024
+     encoder:
+         layers: 8
+     decoder:
+         layers: 2
+     heads: 8
+     transformer_ff: 4096
+     embeddings:
+         word_vec_size: 1024
+         position_encoding_type: "SinusoidalInterleaved"
+
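
As a quick sanity check of this configuration, here is a minimal sketch (not part of the quickmt repository) that loads the YAML with PyYAML and prints the main architecture and training hyperparameters; it assumes only a local copy of `eole-config.yaml` and that `pyyaml` is installed:

```python
# Sketch: print key hyperparameters from eole-config.yaml without installing eole.
import yaml

with open("eole-config.yaml") as f:
    cfg = yaml.safe_load(f)

print("encoder layers:", cfg["model"]["encoder"]["layers"])   # 8
print("decoder layers:", cfg["model"]["decoder"]["layers"])   # 2
print("hidden size:   ", cfg["model"]["hidden_size"])         # 1024
print("train steps:   ", cfg["training"]["train_steps"])      # 100000
```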
eole-model/config.json ADDED
@@ -0,0 +1,150 @@
+ {
+   "seed": 1234,
+   "transforms": [
+     "sentencepiece",
+     "filtertoolong"
+   ],
+   "report_every": 100,
+   "save_data": "fr-en/data_spm",
+   "src_vocab_size": 50000,
+   "share_vocab": true,
+   "overwrite": true,
+   "tgt_vocab": "fr-en/joint.eole.vocab",
+   "valid_metrics": [
+     "BLEU"
+   ],
+   "tensorboard_log_dir_dated": "tensorboard/Feb-17_09-24-56",
+   "src_vocab": "fr-en/joint.eole.vocab",
+   "tensorboard_log_dir": "tensorboard",
+   "tensorboard": true,
+   "n_sample": 0,
+   "tgt_vocab_size": 50000,
+   "vocab_size_multiple": 8,
+   "training": {
+     "adam_beta2": 0.998,
+     "dropout_steps": [
+       0
+     ],
+     "param_init_method": "xavier_uniform",
+     "accum_steps": [
+       0
+     ],
+     "batch_size": 8192,
+     "batch_size_multiple": 8,
+     "gpu_ranks": [
+       0
+     ],
+     "model_path": "fr-en/model3",
+     "learning_rate": 2.0,
+     "bucket_size": 128000,
+     "train_steps": 100000,
+     "label_smoothing": 0.1,
+     "num_workers": 0,
+     "world_size": 1,
+     "compute_dtype": "torch.bfloat16",
+     "save_checkpoint_steps": 2000,
+     "dropout": [
+       0.1
+     ],
+     "decay_method": "noam",
+     "keep_checkpoint": 4,
+     "optim": "pagedadamw8bit",
+     "normalization": "tokens",
+     "valid_batch_size": 8192,
+     "batch_type": "tokens",
+     "warmup_steps": 10000,
+     "average_decay": 0.0001,
+     "prefetch_factor": 100,
+     "valid_steps": 2000,
+     "accum_count": [
+       16
+     ],
+     "attention_dropout": [
+       0.1
+     ],
+     "max_grad_norm": 2.0
+   },
+   "model": {
+     "share_decoder_embeddings": true,
+     "hidden_size": 1024,
+     "mlp_activation_fn": "gelu",
+     "add_estimator": false,
+     "add_ffnbias": true,
+     "share_embeddings": true,
+     "norm_eps": 1e-06,
+     "transformer_ff": 4096,
+     "position_encoding_type": "SinusoidalInterleaved",
+     "layer_norm": "standard",
+     "architecture": "transformer",
+     "add_qkvbias": false,
+     "heads": 8,
+     "encoder": {
+       "layer_norm": "standard",
+       "rope_config": null,
+       "encoder_type": "transformer",
+       "hidden_size": 1024,
+       "add_qkvbias": false,
+       "layers": 8,
+       "src_word_vec_size": 1024,
+       "add_ffnbias": true,
+       "n_positions": null,
+       "norm_eps": 1e-06,
+       "mlp_activation_fn": "gelu",
+       "heads": 8,
+       "transformer_ff": 4096,
+       "position_encoding_type": "SinusoidalInterleaved"
+     },
+     "embeddings": {
+       "word_vec_size": 1024,
+       "position_encoding_type": "SinusoidalInterleaved",
+       "src_word_vec_size": 1024,
+       "tgt_word_vec_size": 1024
+     },
+     "decoder": {
+       "layer_norm": "standard",
+       "decoder_type": "transformer",
+       "rope_config": null,
+       "tgt_word_vec_size": 1024,
+       "hidden_size": 1024,
+       "add_qkvbias": false,
+       "layers": 2,
+       "add_ffnbias": true,
+       "n_positions": null,
+       "norm_eps": 1e-06,
+       "mlp_activation_fn": "gelu",
+       "heads": 8,
+       "transformer_ff": 4096,
+       "position_encoding_type": "SinusoidalInterleaved"
+     }
+   },
+   "transforms_configs": {
+     "sentencepiece": {
+       "tgt_subword_model": "${MODEL_PATH}/joint.spm.model",
+       "src_subword_model": "${MODEL_PATH}/joint.spm.model"
+     },
+     "filtertoolong": {
+       "src_seq_length": 256,
+       "tgt_seq_length": 256
+     }
+   },
+   "data": {
+     "corpus_1": {
+       "transforms": [
+         "sentencepiece",
+         "filtertoolong"
+       ],
+       "path_align": null,
+       "path_src": "fr-en/train.cleaned.src",
+       "path_tgt": "fr-en/train.cleaned.tgt"
+     },
+     "valid": {
+       "transforms": [
+         "sentencepiece",
+         "filtertoolong"
+       ],
+       "path_align": null,
+       "path_src": "fr-en/dev.src",
+       "path_tgt": "fr-en/dev.tgt"
+     }
+   }
+ }
eole-model/joint.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:19bab02bdbc41207bd3fabf86e20e691e978f78725d898c42de586b67cdaed02
+ size 1146015
eole-model/model.00.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7a28ad097a4ed4a2bd3f4b6da5731c5f4d2cf664cc28a54d62d2b150b1f68e0c
+ size 762769904
eole-model/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
joint.eole.vocab ADDED
The diff for this file is too large to render. See raw diff
 
joint.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:19bab02bdbc41207bd3fabf86e20e691e978f78725d898c42de586b67cdaed02
+ size 1146015
joint.spm.vocab ADDED
The diff for this file is too large to render. See raw diff
 
model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7a5c0a467de1c354122644c49dc5fad47b4c38b1eeb6ddfa841f3d6e3b2a699b
+ size 381336824
shared_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff