---
license: mit
language:
- en
tags:
- spellchecking
- NLP
- T5
- pytorch
- natural language generation
---

# T5-large-spell model

### Summary
The model corrects spelling errors and typos by normalizing every word in the text to standard English.
The corrector is fine-tuned from the [T5-large](https://huggingface.co/t5-large) model.
The training corpus is a large dataset with “artificial” errors: texts were collected from English-language Wikipedia and news blogs, and typos and spelling errors were then introduced into them automatically using the [SAGE library](https://github.com/ai-forever/sage).
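
To make the idea of "artificial" errors concrete, here is a toy sketch of character-level noise injection. It is not the SAGE algorithm or its API: the function names `corrupt_word` and `corrupt_sentence`, the three edit operations, and the `error_rate` parameter are all assumptions chosen for illustration.

```python
import random


def corrupt_word(word, rng):
    """Apply one random character-level edit to a word (toy noise model)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["swap", "delete", "duplicate"])
    if op == "swap":
        # Transpose two neighbouring characters.
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete":
        # Drop one character.
        return word[:i] + word[i + 1:]
    # Duplicate one character.
    return word[:i] + word[i] + word[i:]


def corrupt_sentence(sentence, error_rate=0.3, seed=0):
    """Corrupt roughly `error_rate` of the whitespace-separated words."""
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    words = sentence.split()
    noisy = [corrupt_word(w, rng) if rng.random() < error_rate else w for w in words]
    return " ".join(noisy)


clean = "If you bought something gorgeous, you will be very happy."
# A (noisy, clean) pair is one training example for a corrector.
print((corrupt_sentence(clean), clean))
```

Pairing each corrupted sentence with its clean source yields supervised data without manual annotation, which is the general motivation for synthetic error generation.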

### Public references
- [SAGE library announcement](https://youtu.be/yFfkV0Qjuu0), DataFest 2023
- [Paper about synthetic error generation methods](https://www.dialog-21.ru/media/5914/martynovnplusetal056.pdf), Dialogue 2023
- [Paper about SAGE and our best solution](https://arxiv.org/abs/2308.09435), Review EACL 2024

### Examples
| Input | Output |
| --- | --- |
| Th festeivаl was excelzecnt in many ways, and in particular it beinganinternational festjival sss a chаllenging, bet brilli an t ea. | The festival was excellent in many ways, and in particular it beinganinternational festival is a challenging, but brilliant one to see. |
| That 's why I believe in the solution which is the closest to human nature and can help us to avoid boredome. I am sure that eventually we will take off our clothes and in the future we will be undressed and free. There wo n't be any problem with being up - do - date . | That's why I believe in the solution which is the closest to human nature and can help us to avoid boredom. I am sure that eventually we will take off our clothes and in the future we will be undressed and free. There won't be any problem with being up - do - date. |
| If you bought something goregous, you well be very happy. | If you bought something gorgeous, you will be very happy. |

## Metrics
### Quality
The tables below report automatic metrics that measure spell-checker quality.
We compare our solution with open-source automatic spell checkers and with the ChatGPT family of models on two available datasets:
- **BEA60K**: English spelling errors collected from several domains;
- **JFLEG**: 1601 English sentences containing about 2 thousand spelling errors.

**BEA60K**
| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| T5-large-spell | 66.5 | 83.1 | 73.9 |
| ChatGPT gpt-3.5-turbo-0301 | 66.9 | 84.1 | 74.5 |
| ChatGPT gpt-4-0314 | 68.6 | 85.2 | 76.0 |
| ChatGPT text-davinci-003 | 67.8 | 83.9 | 75.0 |
| BERT (https://github.com/neuspell/neuspell) | 65.8 | 79.6 | 72.0 |
| SC-LSTM (https://github.com/neuspell/neuspell) | 62.2 | 80.3 | 72.0 |

**JFLEG**
| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| T5-large-spell | 83.4 | 84.3 | 83.8 |
| ChatGPT gpt-3.5-turbo-0301 | 77.8 | 88.6 | 82.9 |
| ChatGPT gpt-4-0314 | 77.9 | 88.3 | 82.8 |
| ChatGPT text-davinci-003 | 76.8 | 88.5 | 82.2 |
| BERT (https://github.com/neuspell/neuspell) | 78.5 | 85.4 | 81.8 |
| SC-LSTM (https://github.com/neuspell/neuspell) | 80.6 | 86.1 | 83.2 |
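
As a reading aid for the tables above, the sketch below recomputes F1 from precision and recall. It assumes F1 is the standard harmonic mean of the two, which the card does not state explicitly; the helper name `f1_score` is introduced here for illustration only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs for T5-large-spell, copied from the tables above.
reported = {"BEA60K": (66.5, 83.1), "JFLEG": (83.4, 84.3)}
for dataset, (p, r) in reported.items():
    print(dataset, round(f1_score(p, r), 1))

# BEA60K 73.9
# JFLEG 83.8
```

Both values match the F1 column for T5-large-spell, which supports the harmonic-mean assumption for those rows.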

## How to use
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

path_to_model = "ai-forever/T5-large-spell"

# Load the fine-tuned weights and the matching tokenizer from the Hub
model = T5ForConditionalGeneration.from_pretrained(path_to_model)
tokenizer = AutoTokenizer.from_pretrained(path_to_model)
# Task prefix that is prepended to every input sentence
prefix = "grammar: "

sentence = "If you bought something goregous, you well be very happy."
sentence = prefix + sentence

# Tokenize, generate the corrected text, and decode it back to a string
encodings = tokenizer(sentence, return_tensors="pt")
generated_tokens = model.generate(**encodings)
answer = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(answer)

# ["If you bought something gorgeous, you will be very happy."]
```
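
For more than one sentence, the same calls can be wrapped in a small helper. This is a minimal sketch rather than part of the released code: the function name `correct_batch` and the `max_new_tokens=128` budget are assumptions, added because the library's default generation length may cut off longer sentences.

```python
def correct_batch(sentences, model, tokenizer, prefix="grammar: ", max_new_tokens=128):
    """Correct several sentences with a single generate() call."""
    # Prepend the task prefix used in the snippet above to every input.
    prefixed = [prefix + s for s in sentences]
    # Pad to the longest sentence so the batch forms a rectangular tensor.
    encodings = tokenizer(prefixed, return_tensors="pt", padding=True)
    # max_new_tokens is an assumed budget, not a value from the model card.
    generated = model.generate(**encodings, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


# Usage, with `model` and `tokenizer` loaded as in the snippet above:
# corrected = correct_batch(["I hav a dreem.", "Ths is a tset."], model, tokenizer)
```

The helper takes the model and tokenizer as arguments so that they are loaded once and reused across calls.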

## Resources
- [SAGE library](https://github.com/ai-forever/sage), GitHub
- [ruM2M100-1.2B](https://huggingface.co/ai-forever/RuM2M100-1.2B), HuggingFace
- [ruM2M100-418M](https://huggingface.co/ai-forever/RuM2M100-420M), HuggingFace
- [FredT5-large-spell](https://huggingface.co/ai-forever/FRED-T5-large-spell), HuggingFace
- [T5-large-spell](https://huggingface.co/ai-forever/T5-large-spell), HuggingFace

## License
The [T5-large](https://huggingface.co/t5-large) model, on which our solution is based, and its source code are supplied under the Apache-2.0 license.
Our solution is supplied under the MIT license.

## Specifications
- File size: 3 GB
- Framework: PyTorch
- Format: AI Service
- Version: v1.0
- Developer: SberDevices, AGI NLP

## Contacts
nikita.martynov.98@list.ru