Canine-c Bashkir Spelling Correction v1
This model is a version of google/canine-c fine-tuned to fix corrupted texts. It was trained on a mixture of two parallel datasets in the Bashkir language:
- sentences post-edited by humans after OCR
- artificially randomly corrupted sentences along with their original versions
For each character, the model predicts whether to replace it and whether to insert another character next to it.
In this way, the model can be used to fix spelling or OCR errors.
On a held-out set, it reduces the number of required edits by 40%.
How to use
You can use the model by feeding sentences to the following code:
import torch
from transformers import CanineTokenizer, CanineForTokenClassification
tokenizer = CanineTokenizer.from_pretrained('slone/canine-c-bashkir-gec-v1')
model = CanineForTokenClassification.from_pretrained('slone/canine-c-bashkir-gec-v1')
if torch.cuda.is_available():
model.cuda()
LABELS_THIS = [c[5:] for c in model.config.id2label.values() if c.startswith('THIS_')]
LABELS_NEXT = [c[5:] for c in model.config.id2label.values() if c.startswith('NEXT_')]
def fix_text(text, boost=0):
"""Apply the model to edit the text. `boost` is a parameter to control edit aggressiveness."""
bx = tokenizer(text, return_tensors='pt', padding=True)
with torch.inference_mode():
out = model(**bx.to(model.device))
n1, n2 = len(LABELS_THIS), len(LABELS_NEXT)
logits1 = out.logits[0, :, :n1].view(-1, n1)
logits2 = out.logits[0, :, n1:].view(-1, n2)
if boost:
logits1[1:, 0] -= boost
logits2[:, 0] -= boost
ids1, ids2 = logits1.argmax(-1).tolist(), logits2.argmax(-1).tolist()
result = []
for c, id1, id2 in zip(' ' + text, ids1, ids2):
l1, l2 = LABELS_THIS[id1], LABELS_NEXT[id2]
if l1 == 'KEEP':
result.append(c)
elif l1 != 'DELETE':
result.append(l1)
if l2 != 'PASS':
result.append(l2)
return ''.join(result)
text = 'У йыл дан д ың йөҙө һoрөмлэнде.'
print(fix_text(text)) # Уйылдандың йөҙө һөрөмләнде.
The parameter boost
can be used to control the aggressiveness of editing:
positive values increase the probability of changing the text, negative values decrease it.
- Downloads last month
- 8
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social
visibility and check back later, or deploy to Inference Endpoints (dedicated)
instead.