cointegrated committed on
Commit c56158d · 1 parent: 1941090

Update README.md

  1. README.md +34 -0
README.md CHANGED
@@ -1,3 +1,37 @@
  ---
  license: cc-by-4.0
+ language:
+ - ba
+ tags:
+ - grammatical-error-correction
  ---
+
+ This is a tiny BERT model for Bashkir, intended for fixing OCR errors.
+
+ Here is the code to run it (it uses a custom tokenizer whose code is downloaded at runtime):
+ ```Python
+ import torch
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
+
+ MODEL_NAME = 'slone/bert-tiny-char-ctc-bak-denoise'
+ model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
+ # The tokenizer is custom code stored in the model repo, hence trust_remote_code=True.
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True, revision='194109')
+
+ def fix_text(text, verbose=False, spaces=2):
+     with torch.inference_mode():
+         # `spaces` is forwarded to the custom tokenizer, which inserts the special characters.
+         batch = tokenizer(text, return_tensors='pt', spaces=spaces, padding=True, truncation=True, return_token_type_ids=False).to(model.device)
+         logits = torch.log_softmax(model(**batch).logits, dim=-1)
+     # Greedy decoding: take the most likely token at each position of the first sequence.
+     return tokenizer.decode(logits[0].argmax(-1), skip_special_tokens=True)
+
+ print(fix_text("Э Ҡаратау ҙы белмәйем."))
+ # Ә Ҡаратауҙы белмәйем.
+ ```
+ The model works by:
+ - inserting special characters (`spaces`) between the input characters,
+ - performing token classification (for most tokens the predicted output equals the input, but some tokens are changed),
+ - and removing the special characters from the output.
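The three steps listed above can be illustrated without loading the model. The sketch below is a simplified stand-in, not the model's tokenizer: the placeholder symbol `_`, the helper names `add_spaces` and `strip_spaces`, and the choice to put placeholders after each character are all assumptions made for illustration, and the "prediction" is written by hand rather than produced by the network.

```python
SPACE = "_"  # hypothetical placeholder; the real tokenizer's special character may differ


def add_spaces(text: str, spaces: int = 2) -> str:
    """Insert `spaces` placeholder symbols after every input character."""
    return "".join(ch + SPACE * spaces for ch in text)


def strip_spaces(predicted: str) -> str:
    """Drop the placeholders that remain in the per-position prediction."""
    return predicted.replace(SPACE, "")


padded = add_spaces("helo", spaces=1)  # "h_e_l_o_"

# Hand-written stand-in for the model output: one label per padded position.
# The placeholder slot at index 5 is relabelled as "l", which acts as an insertion.
predicted = "h_e_llo_"
assert len(predicted) == len(padded)

fixed = strip_spaces(predicted)  # "hello"
print(padded, "->", fixed)
```

Because the padded sequence is longer than the input, the output can grow (a placeholder slot becomes a character) or shrink (a character is relabelled as a placeholder). Standard CTC decoding also merges repeated labels; this sketch leaves that step out.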
+
+ It was trained with CTC loss on a parallel corpus of corrupted and corrected sentence pairs.
+ On our test dataset, it reduces OCR errors by 41%.
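The README does not say how the 41% figure was computed. One plausible way to measure such a reduction, shown here purely as an assumption, is to compare the character-level edit distance to the reference before and after correction. The function names and the toy strings below are illustrative and are not taken from the test dataset.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def relative_error_reduction(noisy, fixed, reference):
    """Share of character errors removed by the correction step.

    Assumes the noisy texts contain at least one error (otherwise this divides by zero).
    """
    before = sum(edit_distance(n, r) for n, r in zip(noisy, reference))
    after = sum(edit_distance(f, r) for f, r in zip(fixed, reference))
    return (before - after) / before


# Toy data: two OCR-style substitutions, one of which is repaired.
reference = ["hello world"]
noisy = ["he1lo w0rld"]
fixed = ["hello w0rld"]
reduction = relative_error_reduction(noisy, fixed, reference)
print(f"{reduction:.0%}")
```

Under this definition, a value of 0.41 would mean that 41% of the character edits separating the OCR output from the reference are no longer needed after correction.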
+
+ Training details are in [this post](https://habr.com/ru/articles/744972/) (in Russian).