add readme file
README.md
ADDED
@@ -0,0 +1,60 @@
## 🧐 About <a name = "about"></a>

tunbert_zied is a language model for the Tunisian dialect, created by Zied Sbabti and based on an architecture similar to the RoBERTa model.

The model was trained on over 600,000 phrases written in the Tunisian dialect.

## 🏁 Getting Started <a name = "getting_started"></a>

Load <strong>tunbert_zied</strong> and its sub-word tokenizer.

Don't use the <em>AutoTokenizer.from_pretrained(...)</em> method to load the tokenizer; use the <em>BertTokenizer.from_pretrained(...)</em> method instead. This is because I did not use the built-in tokenizer of the RoBERTa model (the GPT-2 tokenizer); I used BertTokenizer instead.

### Example
```
import transformers as tr

tokenizer = tr.BertTokenizer.from_pretrained("ziedsb19/tunbert_zied")

model = tr.AutoModelForMaskedLM.from_pretrained("ziedsb19/tunbert_zied")

pipeline = tr.pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Test the model by masking a word in a phrase with [MASK]
pipeline("Ahla winek [MASK] lioum ?")

# Results
"""
[{'sequence': 'ahla winek cv lioum?',
  'score': 0.07968682795763016,
  'token': 869,
  'token_str': 'c v'},
 {'sequence': 'ahla winek enty lioum?',
  'score': 0.06116843968629837,
  'token': 448,
  'token_str': 'e n t y'},
 {'sequence': 'ahla winek ch3amla lioum?',
  'score': 0.057379286736249924,
  'token': 7342,
  'token_str': 'c h 3 a m l a'},
 {'sequence': 'ahla winek cha3malt lioum?',
  'score': 0.028112901374697685,
  'token': 4663,
  'token_str': 'c h a 3 m a l t'},
 {'sequence': 'ahla winek enti lioum?',
  'score': 0.025781650096178055,
  'token': 436,
  'token_str': 'e n t i'}]
"""
```
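
The pipeline returns a list of dictionaries, one per candidate word. As a minimal sketch of how those results might be post-processed, the snippet below keeps only the candidates above a score threshold. The helper name `top_predictions` and the `min_score` default are hypothetical and not part of the model or of transformers; the sample data is copied from the results shown above so the snippet runs without downloading the model.

```python
from typing import Dict, List, Tuple, Union

Prediction = Dict[str, Union[str, float, int]]


def top_predictions(results: List[Prediction], min_score: float = 0.05) -> List[Tuple[str, float]]:
    """Return (sequence, score) pairs with score >= min_score, best first.

    `min_score` is an illustrative threshold, not a recommended value.
    """
    kept = [r for r in results if r["score"] >= min_score]
    kept.sort(key=lambda r: r["score"], reverse=True)
    return [(r["sequence"], r["score"]) for r in kept]


# Sample copied from the fill-mask results shown above.
sample = [
    {"sequence": "ahla winek cv lioum?", "score": 0.07968682795763016,
     "token": 869, "token_str": "c v"},
    {"sequence": "ahla winek enty lioum?", "score": 0.06116843968629837,
     "token": 448, "token_str": "e n t y"},
    {"sequence": "ahla winek ch3amla lioum?", "score": 0.057379286736249924,
     "token": 7342, "token_str": "c h 3 a m l a"},
    {"sequence": "ahla winek cha3malt lioum?", "score": 0.028112901374697685,
     "token": 4663, "token_str": "c h a 3 m a l t"},
    {"sequence": "ahla winek enti lioum?", "score": 0.025781650096178055,
     "token": 436, "token_str": "e n t i"},
]

for sequence, score in top_predictions(sample):
    print(f"{score:.4f}  {sequence}")
```

With a live pipeline, the same helper could be applied directly to the return value of `pipeline("Ahla winek [MASK] lioum ?")`, assuming it has the same shape as the sample.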
## ✍️ Authors <a name = "authors"></a>

- [Zied Sbabti](https://www.linkedin.com/in/zied-sbabti-a58a56139) - Idea & Initial work