---
language:
- he
pipeline_tag: token-classification
tags:
- Transformers
- PyTorch
---

<!-- Provide a quick summary of what the model is/does. -->

## MenakBERT

MenakBERT is a Hebrew diacritizer that predicts diacritical marks in a seq2seq fashion.
It is built on a BERT-style character-level backbone that was pre-trained by masking spans of characters, similarly to SpanBERT (Joshi et al., 2020).

### Model Description

This model takes tau/tavbert-he and adds a three-headed classification head that outputs three sequences, one for each of three types of Hebrew Niqqud (diacritics).
It was fine-tuned on a dataset generously provided by Elazar Gershuni of Nakdimon.
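
The three-headed design can be sketched roughly as follows. This is a minimal illustration, not the actual MenakBERT code: the class name, head names, and label counts are assumptions, and the real model uses the tavbert-he encoder rather than the dummy embedding shown here.

```python
import torch
import torch.nn as nn


class ThreeHeadDiacritizer(nn.Module):
    """Hypothetical sketch: a char-level encoder with three classification
    heads, one per diacritic type (shin/sin dot, dagesh, remaining niqqud)."""

    def __init__(self, encoder, hidden_size, n_shin, n_dagesh, n_niqqud):
        super().__init__()
        self.encoder = encoder  # stand-in for the tavbert-he backbone
        self.shin_head = nn.Linear(hidden_size, n_shin)
        self.dagesh_head = nn.Linear(hidden_size, n_dagesh)
        self.niqqud_head = nn.Linear(hidden_size, n_niqqud)

    def forward(self, char_ids):
        h = self.encoder(char_ids)  # (batch, seq_len, hidden_size)
        # One label sequence per head, aligned with the input characters.
        return self.shin_head(h), self.dagesh_head(h), self.niqqud_head(h)


# Toy usage with a dummy embedding standing in for the real backbone:
dummy_encoder = nn.Embedding(100, 32)
model = ThreeHeadDiacritizer(dummy_encoder, hidden_size=32,
                             n_shin=3, n_dagesh=3, n_niqqud=16)
logits = model(torch.zeros(1, 10, dtype=torch.long))
```

Each head emits one label per input character, which is what makes the three output sequences the same length as the input.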

- **Developed by:** Jacob Gidron, Ido Cohen and Idan Pinto
- **Model type:** BERT
- **Language:** Hebrew
- **Fine-tuned from model:** tau/tavbert-he

<!-- ### Model Sources [optional] -->

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/jacobgidron/MenakBert
<!-- - **Paper [optional]:** [More Information Needed] -->
<!-- - **Demo [optional]:** [More Information Needed] -->

## Use

The model expects undotted Hebrew text, which may contain numbers and punctuation.

The output is three sequences of diacritical marks, corresponding to:
1. The dot distinguishing the letters Shin and Sin.
2. The dot in the center of a letter (dagesh), which in some cases changes the pronunciation of a letter and in other cases marks emphasis or gemination.
3. All the remaining marks, used mostly for vocalization.

Each sequence has the same length as the input, with each mark corresponding to the character at the same position in the input.

The provided script weaves the sequences together.
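
The weaving step can be sketched as follows. This is a simplified illustration of the idea, not the repository's actual script; the specific marks assigned in the toy example are illustrative assumptions, not model output.

```python
# Hypothetical sketch of the weaving step: combine the input characters
# with the three predicted mark sequences into dotted text.
# An empty string stands for "no mark at this position".
def weave(text, shin_marks, dagesh_marks, niqqud_marks):
    assert len(text) == len(shin_marks) == len(dagesh_marks) == len(niqqud_marks)
    pieces = []
    for ch, shin, dagesh, niqqud in zip(text, shin_marks, dagesh_marks, niqqud_marks):
        # Diacritics are combining Unicode characters placed after the base letter.
        pieces.append(ch + shin + dagesh + niqqud)
    return "".join(pieces)


# Toy example on the word "שלום" (the marks here are hand-picked, not predicted):
word = "שלום"
shin = ["\u05c1", "", "", ""]    # shin dot (U+05C1) on the first letter
dagesh = ["", "", "", ""]        # no dagesh in this example
niqqud = ["\u05b8", "", "\u05b9", ""]  # qamats (U+05B8) and holam (U+05B9)
dotted = weave(word, shin, dagesh, niqqud)  # "שָׁלוֹם"
```

Because the marks are combining characters, simply concatenating each character with its three (possibly empty) marks yields correctly rendered dotted text.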

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

### Training Data

The backbone, tau/tavbert-he, was trained on the Hebrew section of OSCAR (Ortiz, 2019): 10 GB of text, about 20 million sentences.
Fine-tuning was done on the Nakdimon dataset, which can be found at https://github.com/elazarg/hebrew_diacritized and contains 274,436 dotted Hebrew tokens across 413 documents.
For more information see https://arxiv.org/abs/2105.05209

<!-- #### Metrics -->

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

<!-- [More Information Needed] -->

<!-- ### Results -->

<!-- [More Information Needed] -->

## Model Card Contact

Ido Cohen - its.ido@gmail.com