cointegrated committed on
Commit
6101dd6
1 Parent(s): d9aa21f

Create README.md

Files changed (1)
  1. README.md +67 -0
README.md ADDED
@@ -0,0 +1,67 @@
+ ---
+ language:
+ - ba
+ license: apache-2.0
+ tags:
+ - grammatical error correction
+ ---
+
+ # Canine-c Bashkir Spelling Correction v1
+
+ This model is a version of [google/canine-c](https://huggingface.co/google/canine-c) fine-tuned to fix corrupted texts.
+ It was trained on a mixture of two parallel datasets in the Bashkir language:
+ - sentences recognized by OCR and then post-edited by humans
+ - artificially, randomly corrupted sentences paired with their original versions
+
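The README does not say how the artificial corruption was generated. As a rough illustration only, a parallel pair of that kind could be produced with a simple character-level noise function like the one below; the function name `corrupt`, the noise rate `p`, and the `alphabet` string are all hypothetical and not taken from the training pipeline.

```python
import random

# Hypothetical noise model: the README does not specify the actual procedure.
def corrupt(text, p=0.1, alphabet="абвгдеҙһөүәңҡғ ", seed=None):
    """Return a noisy copy of `text` with random deletions, replacements and insertions."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        r = rng.random()
        if r < p / 3:
            pass                                # delete the character
        elif r < 2 * p / 3:
            out.append(rng.choice(alphabet))    # replace it with a random character
        else:
            out.append(ch)                      # keep it unchanged
        if rng.random() < p / 3:
            out.append(rng.choice(alphabet))    # insert a random character after it
    return "".join(out)

clean = "Уйылдандың йөҙө һөрөмләнде."
# One training pair: (corrupted source, original target).
pair = (corrupt(clean, p=0.2, seed=0), clean)
```

With `p=0` the function returns the text unchanged, which is a quick sanity check for any noise function of this shape.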
+ For each character, the model predicts whether to keep, replace, or delete it, and whether to insert another character after it.
+
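The per-character scheme can be sketched without the model at all. The snippet below applies hand-written labels to a toy string; the label names `KEEP`, `DELETE` and `PASS` mirror the usage code in this README, while the helper name `apply_edits` and the example labels are made up for illustration.

```python
def apply_edits(text, this_labels, next_labels):
    """Apply one THIS label and one NEXT label per character of `text`."""
    result = []
    for ch, this, nxt in zip(text, this_labels, next_labels):
        if this == "KEEP":
            result.append(ch)      # leave the character unchanged
        elif this != "DELETE":
            result.append(this)    # any other label is the replacement character
        if nxt != "PASS":
            result.append(nxt)     # insert a character after this position
    return "".join(result)

# Hand-written labels for illustration; a real run would take them from the model.
noisy = "h3lox"
this_labels = ["KEEP", "e", "KEEP", "KEEP", "DELETE"]
next_labels = ["PASS", "PASS", "l", "PASS", "PASS"]
print(apply_edits(noisy, this_labels, next_labels))  # hello
```

Because each position carries both a replace/delete decision and an insert decision, the output can be shorter or longer than the input even though the model only labels existing characters.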
+ In this way, the model can be used to fix spelling or OCR errors.
+
+ On a held-out set, it reduces the number of required edits by 40%.
+
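The README does not state how "required edits" are counted. One plausible reading, assumed here, is the character-level Levenshtein distance to the reference text, summed over the evaluation set before and after correction. The function names and the toy strings below are illustrative, not the author's evaluation code or data.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def edit_reduction(sources, outputs, references):
    """Fraction of required edits removed by the model (assumed metric)."""
    before = sum(levenshtein(s, r) for s, r in zip(sources, references))
    after = sum(levenshtein(o, r) for o, r in zip(outputs, references))
    return 0.0 if before == 0 else 1 - after / before

# Toy data: the source needs 2 edits, the "corrected" output still needs 1.
print(edit_reduction(["aXcY"], ["abcY"], ["abcd"]))  # 0.5
```

Under this reading, a value of 0.4 would correspond to the 40% figure quoted above.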
+ ## How to use
+
+ You can use the model by passing sentences to the `fix_text` function defined below:
+
+ ```Python
+ import torch
+ from transformers import CanineTokenizer, CanineForTokenClassification
+
+ tokenizer = CanineTokenizer.from_pretrained('slone/canine-c-bashkir-gec-v1')
+ model = CanineForTokenClassification.from_pretrained('slone/canine-c-bashkir-gec-v1')
+ if torch.cuda.is_available():
+     model.cuda()
+
+ # Each label name carries a 5-character prefix ('THIS_' or 'NEXT_') that is stripped here.
+ LABELS_THIS = [c[5:] for c in model.config.id2label.values() if c.startswith('THIS_')]
+ LABELS_NEXT = [c[5:] for c in model.config.id2label.values() if c.startswith('NEXT_')]
+
+ def fix_text(text, boost=0):
+     """Apply the model to edit the text. `boost` is a parameter to control edit aggressiveness."""
+     bx = tokenizer(text, return_tensors='pt', padding=True)
+     # The in-place logit updates stay inside inference mode,
+     # because inference tensors cannot be modified in place outside it.
+     with torch.inference_mode():
+         out = model(**bx.to(model.device))
+         n1, n2 = len(LABELS_THIS), len(LABELS_NEXT)
+         logits1 = out.logits[0, :, :n1].view(-1, n1)
+         logits2 = out.logits[0, :, n1:].view(-1, n2)
+         if boost:
+             # Lower the score of label 0 in each group to make edits more likely.
+             logits1[1:, 0] -= boost
+             logits2[:, 0] -= boost
+         ids1, ids2 = logits1.argmax(-1).tolist(), logits2.argmax(-1).tolist()
+     result = []
+     # The leading ' ' aligns the characters with the extra first token added by the tokenizer.
+     for c, id1, id2 in zip(' ' + text, ids1, ids2):
+         l1, l2 = LABELS_THIS[id1], LABELS_NEXT[id2]
+         if l1 == 'KEEP':
+             result.append(c)
+         elif l1 != 'DELETE':
+             result.append(l1)
+         if l2 != 'PASS':
+             result.append(l2)
+     return ''.join(result)
+
+ text = 'У йыл дан д ың йөҙө һoрөмлэнде.'
+ print(fix_text(text)) # Уйылдандың йөҙө һөрөмләнде.
+ ```
+
+ The parameter `boost` controls the aggressiveness of editing:
+ positive values increase the probability of changing the text, and negative values decrease it.
+
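The effect of `boost` can be seen on a toy example without loading the model. In the usage code, `boost` is subtracted from the logit at index 0 of each label group before the argmax; the sketch below assumes that index 0 is the "no edit" label, and uses made-up scores and a hypothetical helper name `pick_label`.

```python
def pick_label(logits, labels, boost=0.0):
    """Argmax over `logits` after lowering the logit at index 0 by `boost`."""
    adjusted = list(logits)
    adjusted[0] -= boost  # index 0 is assumed to be the no-edit label
    best = max(range(len(adjusted)), key=adjusted.__getitem__)
    return labels[best]

labels = ["KEEP", "ә"]  # toy label set for a single character
logits = [2.0, 1.5]     # made-up scores, not real model output

print(pick_label(logits, labels, boost=0))   # KEEP
print(pick_label(logits, labels, boost=1))   # ә
print(pick_label(logits, labels, boost=-1))  # KEEP
```

A positive `boost` makes a borderline edit win over keeping the character, while a negative value makes the model more conservative.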