|
--- |
|
language: |
|
- ba |
|
license: apache-2.0 |
|
tags: |
|
- grammatical error correction |
|
--- |
|
|
|
# Canine-c Bashkir Spelling Correction v1 |
|
|
|
This model is a version of [google/canine-c](https://huggingface.co/openai/whisper-small) fine-tuned to fix corrupted texts. |
|
It was trained on a mixture of two parallel datasets in the Bashkir language: |
|
- sentences post-edited by humans after OCR |
|
- artificially randomly corrupted sentences along with their original versions |
|
|
|
For each character, the model predicts whether to replace it and whether to insert another character next to it. |
|
|
|
In this way, the model can be used to fix spelling or OCR errors. |
|
|
|
On a held-out set, it reduces the number of required edits by 40%. |
|
|
|
## How to use |
|
|
|
You can use the model by feeding sentences to the following code: |
|
|
|
```Python |
|
import torch |
|
from transformers import CanineTokenizer, CanineForTokenClassification |
|
|
|
tokenizer = CanineTokenizer.from_pretrained('slone/canine-c-bashkir-gec-v1') |
|
model = CanineForTokenClassification.from_pretrained('slone/canine-c-bashkir-gec-v1') |
|
if torch.cuda.is_available(): |
|
model.cuda() |
|
|
|
LABELS_THIS = [c[5:] for c in model.config.id2label.values() if c.startswith('THIS_')] |
|
LABELS_NEXT = [c[5:] for c in model.config.id2label.values() if c.startswith('NEXT_')] |
|
|
|
def fix_text(text, boost=0): |
|
"""Apply the model to edit the text. `boost` is a parameter to control edit aggressiveness.""" |
|
bx = tokenizer(text, return_tensors='pt', padding=True) |
|
with torch.inference_mode(): |
|
out = model(**bx.to(model.device)) |
|
n1, n2 = len(LABELS_THIS), len(LABELS_NEXT) |
|
logits1 = out.logits[0, :, :n1].view(-1, n1) |
|
logits2 = out.logits[0, :, n1:].view(-1, n2) |
|
if boost: |
|
logits1[1:, 0] -= boost |
|
logits2[:, 0] -= boost |
|
ids1, ids2 = logits1.argmax(-1).tolist(), logits2.argmax(-1).tolist() |
|
result = [] |
|
for c, id1, id2 in zip(' ' + text, ids1, ids2): |
|
l1, l2 = LABELS_THIS[id1], LABELS_NEXT[id2] |
|
if l1 == 'KEEP': |
|
result.append(c) |
|
elif l1 != 'DELETE': |
|
result.append(l1) |
|
if l2 != 'PASS': |
|
result.append(l2) |
|
return ''.join(result) |
|
|
|
text = 'У йыл дан д ың йөҙө һoрөмлэнде.' |
|
print(fix_text(text)) # Уйылдандың йөҙө һөрөмләнде. |
|
``` |
|
|
|
The parameter `boost` can be used to control the aggressiveness of editing: |
|
positive values increase the probability of changing the text, negative values decrease it. |
|
|
|
|