File size: 4,341 Bytes
3d59b10
 
 
 
8ecb874
 
 
3d59b10
 
 
 
8ecb874
 
 
fd3823b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8ecb874
 
 
 
 
 
 
 
 
 
590f9b2
8ecb874
 
 
 
 
 
 
fbd6a4a
8ecb874
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
fd3823b
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
---
language: en
widget:
 - text: "He had also stgruggled with addiction during his time in Congress ."
 - text: "The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence ."
 - text: "Letterma also apologized two his staff for the satyation ."
 - text: "Vincent Jay had earlier won France 's first gold in gthe 10km biathlon sprint ."
 - text: "It is left to the directors to figure out hpw to bring the stry across to tye audience ."
 
---

# Typo Detector


## Dataset Information

For this specific task, I used [NeuSpell](https://github.com/neuspell/neuspell) corpus as my raw data.

## Evaluation

The following tables summarize the scores obtained by model overall and per each class.

|       #      | precision |  recall  | f1-score |  support |
|:------------:|:---------:|:--------:|:--------:|:--------:|
|     TYPO     |  0.992332 | 0.985997 | 0.989154 | 416054.0 |
|   micro avg  |  0.992332 | 0.985997 | 0.989154 | 416054.0 |
|   macro avg  |  0.992332 | 0.985997 | 0.989154 | 416054.0 |
| weighted avg |  0.992332 | 0.985997 | 0.989154 | 416054.0 |


## How to use

You use this model with Transformers pipeline for NER (token-classification).

### Installing requirements

```bash
pip install transformers
```

### Prediction using pipeline

```python
import torch
from transformers import AutoConfig, AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline


model_name_or_path = "m3hrdadfi/typo-detector-distilbert-en"
config = AutoConfig.from_pretrained(model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForTokenClassification.from_pretrained(model_name_or_path, config=config)
nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="average")
```

```python
sentences = [
 "He had also stgruggled with addiction during his time in Congress .",
 "The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence .",
 "Letterma also apologized two his staff for the satyation .",
 "Vincent Jay had earlier won France 's first gold in gthe 10km biathlon sprint .",
 "It is left to the directors to figure out hpw to bring the stry across to tye audience .",
]

for sentence in sentences:
    typos = [sentence[r["start"]: r["end"]] for r in nlp(sentence)]

    detected = sentence
    for typo in typos:
        detected = detected.replace(typo, f'<i>{typo}</i>')

    print("   [Input]: ", sentence)
    print("[Detected]: ", detected)
    print("-" * 130)
```

Output:
```text
   [Input]:  He had also stgruggled with addiction during his time in Congress .
[Detected]:  He had also <i>stgruggled</i> with addiction during his time in Congress .
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence .
[Detected]:  The review <i>thoroughla</i> assessed all aspects of JLENS SuR and CPG <i>esign</i> <i>maturit</i> and confidence .
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  Letterma also apologized two his staff for the satyation .
[Detected]:  <i>Letterma</i> also apologized <i>two</i> his staff for the <i>satyation</i> .
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  Vincent Jay had earlier won France 's first gold in gthe 10km biathlon sprint .
[Detected]:  Vincent Jay had earlier won France 's first gold in <i>gthe</i> 10km biathlon sprint .
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  It is left to the directors to figure out hpw to bring the stry across to tye audience .
[Detected]:  It is left to the directors to figure out <i>hpw</i> to bring the <i>stry</i> across to <i>tye</i> audience .
----------------------------------------------------------------------------------------------------------------------------------
```

## Questions?
Post a Github issue on the [TypoDetector Issues](https://github.com/m3hrdadfi/typo-detector/issues) repo.