ziedsb19 committed on
Commit
a8f4668
1 Parent(s): c52c9f3

add readme file

Files changed (1)
  1. README.md +60 -0
README.md ADDED
@@ -0,0 +1,60 @@
+
+
+ ## 🧐 About <a name = "about"></a>
+
+ tunbert_zied is a language model for the Tunisian dialect, based on an architecture similar to RoBERTa and created by zied sbabti.
+
+ The model was trained on over 600,000 phrases written in the Tunisian dialect.
+
+ ## 🏁 Getting Started <a name = "getting_started"></a>
+
+ Load <strong>tunbert_zied</strong> and its sub-word tokenizer.
+
+ Don't use the <em>AutoTokenizer.from_pretrained(...)</em> method to load the tokenizer; use <em>BertTokenizer.from_pretrained(...)</em> instead. (The model was not trained with RoBERTa's built-in tokenizer, which is the GPT-2-style byte-level BPE tokenizer, but with BertTokenizer.)
+
+ ### Example
+
+
+
+ ```
+ import transformers as tr
+
+ # Load the tokenizer with BertTokenizer, not AutoTokenizer
+ tokenizer = tr.BertTokenizer.from_pretrained("ziedsb19/tunbert_zied")
+
+ model = tr.AutoModelForMaskedLM.from_pretrained("ziedsb19/tunbert_zied")
+
+ pipeline = tr.pipeline("fill-mask", model=model, tokenizer=tokenizer)
+
+ # Test the model by masking a word in a phrase with [MASK]
+
+ pipeline("Ahla winek [MASK] lioum ?")
+
+ # Results
+ """
+ [{'sequence': 'ahla winek cv lioum?',
+ 'score': 0.07968682795763016,
+ 'token': 869,
+ 'token_str': 'c v'},
+ {'sequence': 'ahla winek enty lioum?',
+ 'score': 0.06116843968629837,
+ 'token': 448,
+ 'token_str': 'e n t y'},
+ {'sequence': 'ahla winek ch3amla lioum?',
+ 'score': 0.057379286736249924,
+ 'token': 7342,
+ 'token_str': 'c h 3 a m l a'},
+ {'sequence': 'ahla winek cha3malt lioum?',
+ 'score': 0.028112901374697685,
+ 'token': 4663,
+ 'token_str': 'c h a 3 m a l t'},
+ {'sequence': 'ahla winek enti lioum?',
+ 'score': 0.025781650096178055,
+ 'token': 436,
+ 'token_str': 'e n t i'}]
+ """
+ ```
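
In the results above, each `token_str` comes back with a space between every character (for example `'c h 3 a m l a'`). The following is a minimal sketch of a post-processing step, assuming those spaces carry no meaning and can simply be removed. `clean_predictions` is a hypothetical helper name, not part of the transformers library, and the sample entries are copied from the results shown above.

```python
from typing import Dict, List


def clean_predictions(predictions: List[Dict]) -> List[Dict]:
    """Return copies of fill-mask predictions with spaces removed from token_str."""
    cleaned = []
    for pred in predictions:
        item = dict(pred)  # shallow copy so the input list is left untouched
        # Assumption: the spaces only separate characters of a single token
        item["token_str"] = pred["token_str"].replace(" ", "")
        cleaned.append(item)
    return cleaned


# Two entries copied from the example results in this README
sample = [
    {"sequence": "ahla winek cv lioum?", "score": 0.07968682795763016,
     "token": 869, "token_str": "c v"},
    {"sequence": "ahla winek ch3amla lioum?", "score": 0.057379286736249924,
     "token": 7342, "token_str": "c h 3 a m l a"},
]

for pred in clean_predictions(sample):
    # Print the candidate word next to its score, rounded to three decimals
    print(f"{pred['token_str']:<10} {pred['score']:.3f}")
```

With a loaded pipeline, the same helper could be applied directly to the list it returns, for example `clean_predictions(pipeline("Ahla winek [MASK] lioum ?"))`.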
+
+ ## ✍️ Authors <a name = "authors"></a>
+
+ - [zied sbabti](https://www.linkedin.com/in/zied-sbabti-a58a56139) - Idea & Initial work
+