slone/bert-tiny-char-ctc-bak-denoise

This is a tiny BERT model for Bashkir, intended for fixing OCR errors.

Here is the code to run it (it uses a custom tokenizer whose code is downloaded at runtime):

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = 'slone/bert-tiny-char-ctc-bak-denoise'
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True, revision='194109')

def fix_text(text, verbose=False, spaces=2):
    with torch.inference_mode():
        batch = tokenizer(
            text, return_tensors='pt', spaces=spaces, padding=True,
            truncation=True, return_token_type_ids=False,
        ).to(model.device)
        logits = torch.log_softmax(model(**batch).logits, dim=-1)
    # Greedy decoding: take the most likely token at every position of the first sequence.
    return tokenizer.decode(logits[0].argmax(-1), skip_special_tokens=True)

print(fix_text("Э Ҡаратау ҙы белмәйем."))
# Ә Ҡаратауҙы белмәйем.

The model works by:

inserting special characters (spaces) between the input characters,
performing token classification (for most tokens the predicted output equals the input, but some tokens are changed),
and removing the special characters from the output.
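
The three steps above can be illustrated with a toy sketch in plain Python. It does not call the model: the per-position predictions are written by hand, and the names `BLANK`, `pad_with_blanks` and `collapse` are hypothetical helpers, not part of the model's tokenizer. The sketch also assumes standard greedy CTC decoding (merge consecutive repeats, then drop placeholders), which may differ in detail from what the custom tokenizer does.

```python
BLANK = "_"  # hypothetical placeholder symbol; the model card describes spaces being used


def pad_with_blanks(text, spaces=2):
    """Step 1: put `spaces` placeholder slots after every input character."""
    slots = []
    for ch in text:
        slots.append(ch)
        slots.extend([BLANK] * spaces)
    return slots


def collapse(tokens):
    """Step 3: merge consecutive repeats, then drop the placeholders."""
    out = []
    prev = None
    for tok in tokens:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return "".join(out)


# Step 2 is simulated by hand: a classifier would output one symbol per slot.

# Deletion: the spurious space in "ca t" is predicted as a placeholder.
predicted = pad_with_blanks("ca t")
predicted[6] = BLANK  # slot 6 holds the space
fixed_deletion = collapse(predicted)

# Insertion: a placeholder slot after "c" is predicted as "a".
predicted = pad_with_blanks("ct")
predicted[1] = "a"
fixed_insertion = collapse(predicted)

# Substitution: the "o" in "cot" is predicted as "a".
predicted = pad_with_blanks("cot")
predicted[3] = "a"
fixed_substitution = collapse(predicted)

print(fixed_deletion, fixed_insertion, fixed_substitution)
```

The placeholder slots are what make insertions possible: a classifier with one output per input token could otherwise only keep, change or delete characters.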

It was trained with CTC loss on a parallel corpus of corrupted and corrected sentence pairs. On our test dataset, it reduces OCR errors by 41%.
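
As a rough illustration of what a CTC training objective looks like, here is a minimal sketch using PyTorch's built-in `torch.nn.CTCLoss`. It is not the actual training code: the logits are random stand-ins for the model output, and the vocabulary size, blank index and character ids are made-up values chosen for the example.

```python
import torch

torch.manual_seed(0)

vocab_size = 6  # assumed toy vocabulary; index 0 plays the role of the placeholder (CTC blank)
T, N = 12, 1    # 12 padded input positions (e.g. 4 characters with 2 placeholders each), batch of 1

# Random stand-in for the per-position logits a token classifier would produce.
logits = torch.randn(N, T, vocab_size, requires_grad=True)

# CTCLoss expects log-probabilities shaped (T, N, C).
log_probs = torch.log_softmax(logits, dim=-1).transpose(0, 1)

# The corrected sentence as character ids (no placeholders in the target).
targets = torch.tensor([[1, 2, 3, 4]])
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([4])

ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back to the logits

print(float(loss))
```

CTC does not need a character-level alignment between the corrupted and the corrected sentence; it sums over all ways of placing the target characters in the available positions, which is why the padded input must be at least as long as the target.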

Training code: here. Training details: in this post (in Russian).