jordimas committed
Commit 0515c38
1 Parent(s): a01a345

Initial version

Files changed (4)
  1. README.md +67 -0
  2. model.bin +3 -0
  3. shared_vocabulary.txt +0 -0
  4. sp_m.model +3 -0
README.md ADDED
@@ -0,0 +1,67 @@
+ ---
+ language:
+ - en
+
+ tags:
+ - gec
+
+ library_name: opennmt
+ license: mit
+ metrics:
+ - bleu
+
+ inference: false
+ ---
+
+ ### Introduction
+
+ This repository describes how to use OpenNMT for the Grammatical Error Correction (GEC) task. The idea is to approach GEC as a translation task: the model "translates" text containing errors into corrected text.
+
+ ### Usage
+
+ Install the necessary dependencies:
+
+
+ ```bash
+ pip3 install ctranslate2 pyonmttok huggingface_hub
+ ```
+
+
+ Simple tokenization & translation using Python:
+
+
+ ```python
+ import ctranslate2
+ import pyonmttok
+ from huggingface_hub import snapshot_download
+
+ # Download the model files from the Hugging Face Hub (cached locally).
+ model_dir = snapshot_download(repo_id="jordimas/gec-opennmt-english", revision="main")
+
+ # Tokenize the input with the SentencePiece model shipped in the repository.
+ tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path=model_dir + "/sp_m.model")
+ tokenized = tokenizer.tokenize("The water are hot. My friends are going to be late. Today mine mother is in Barcelona.")
+
+ # tokenized[0] holds the tokens; translate them and detokenize the first hypothesis.
+ translator = ctranslate2.Translator(model_dir)
+ translated = translator.translate_batch([tokenized[0]])
+ print(tokenizer.detokenize(translated[0][0]['tokens']))
+ ```
+
+ # Model
+
+ The model was trained on the [clang8](https://github.com/google-research-datasets/clang8) corpus for English.
+
+ Details:
+ * Model: TransformerBase
+ * Tokenizer: SentencePiece
+ * BLEU = 85.50
+
+ # Papers
+
+ Relevant papers:
+
+ * [Approaching Neural Grammatical Error Correction as a Low-Resource Machine Translation Task](https://aclanthology.org/N18-1055.pdf)
+ * [A Simple Recipe for Multilingual Grammatical Error Correction](https://arxiv.org/pdf/2106.03830.pdf)
+
+
+ # Contact
+
+ Email address: Jordi Mas: jmas@softcatala.org
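
The usage snippet in README.md passes three sentences to the model as a single input. As a rough sketch, and assuming (the README does not say this) that correcting one sentence at a time is preferable, the text could be split first and sent as a batch. The helpers `split_sentences`, `correct_text` and `make_corrector` are hypothetical names introduced here for illustration; the tokenizer and translator calls mirror the ones in the README snippet, and the regex-based splitter is deliberately naive.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def correct_text(text: str, correct_batch: Callable[[List[str]], List[str]]) -> str:
    """Split text into sentences, correct them as one batch, and rejoin."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    return " ".join(correct_batch(sentences))


def make_corrector(translator, tokenizer) -> Callable[[List[str]], List[str]]:
    """Hypothetical adapter around the README's translator and tokenizer objects."""

    def correct_batch(sentences: List[str]) -> List[str]:
        # tokenize() returns a tuple whose first element is the token list,
        # as in the README snippet (tokenized[0]).
        batch = [tokenizer.tokenize(s)[0] for s in sentences]
        results = translator.translate_batch(batch)
        # Same result indexing as the README: first hypothesis, 'tokens' field.
        return [tokenizer.detokenize(r[0]["tokens"]) for r in results]

    return correct_batch
```

With the `translator` and `tokenizer` objects from the README snippet, this would be called as `correct_text(some_text, make_corrector(translator, tokenizer))`. Keeping the model behind a plain callable also makes the splitting logic easy to check without downloading the model.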
model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4bb56a2291e653a7ddc3d445af09735488dba4b196baf228cd454eedbadd21f2
+ size 122328622
shared_vocabulary.txt ADDED
The diff for this file is too large to render.
sp_m.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:661943a6befb807ca696fc5d0656a1afae2a18e21dd2c823cb0c3be25d8dd441
+ size 1131052
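
The `model.bin` and `sp_m.model` entries above are Git LFS pointer files: each records a SHA-256 digest (`oid`) and a byte count (`size`) for the real binary. A minimal sketch of how a downloaded file might be checked against such a pointer follows; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, not part of any library, and the code assumes the pointer text has the three-line form shown in this diff.

```python
import hashlib
from pathlib import Path
from typing import Dict


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Parse 'key value' lines of a Git LFS pointer into a dict."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(path, pointer_text: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the size and SHA-256 named in the pointer."""
    fields = parse_lfs_pointer(pointer_text)
    algo, _, expected_digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError("unsupported oid algorithm: " + algo)
    path = Path(path)
    # Cheap check first: a size mismatch rules the file out without hashing it.
    if path.stat().st_size != int(fields["size"]):
        return False
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large model files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_digest
```

For example, `matches_pointer(model_dir + "/model.bin", pointer_text)` would compare a local copy of the model against the pointer text from this commit, where `model_dir` is the directory from the README snippet.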