Kamel committed on
Commit
55fc5d7
1 Parent(s): 6de7b49

Create README.md

Files changed (1)
  1. README.md +24 -0
README.md ADDED
@@ -0,0 +1,24 @@
+ **DBERT** is the first BERT model for the Moroccan Arabic dialect known as “Darija”. It uses the same architecture as BERT-base but was trained without the Next Sentence Prediction (NSP) objective. The training corpus contains ~3 million Darija sequences, amounting to 691 MB of text or ~100M tokens.
+
+ The model was trained on a dataset drawn from three sources:
+ * Stories written in Darija, scraped from a dedicated website
+ * YouTube comments from 40 different Moroccan channels
+ * Tweets crawled using a list of Darija keywords
+
+ More details about DarijaBert are available in the dedicated GitHub repository.
+
+ **Loading the model**
+
+ The model can be loaded directly with the Hugging Face Transformers library:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ DBERT_tokenizer = AutoTokenizer.from_pretrained("Kamel/DBERT")
+ DBERT_Bert_model = AutoModel.from_pretrained("Kamel/DBERT")
+ ```
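
Once loaded, the encoder can be used to turn a Darija sentence into a fixed-size vector. The sketch below is one possible way to do that, not a method described by the authors: it assumes mean pooling over the last hidden states, and the helper names `masked_mean` and `embed` are hypothetical. Only the repository id `Kamel/DBERT` comes from the snippet above. Calling `embed` requires `torch` and `transformers` to be installed and downloads the model on first use.

```python
from typing import List, Sequence

MODEL_ID = "Kamel/DBERT"  # repository id taken from the loading snippet


def masked_mean(vectors: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Average token vectors, skipping positions whose mask value is 0 (padding)."""
    kept = [v for v, m in zip(vectors, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]


def embed(sentence: str) -> List[float]:
    """Return one mean-pooled vector for `sentence` (assumed pooling strategy)."""
    # Imported lazily so masked_mean stays usable without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID)
    model.eval()  # disable dropout for deterministic outputs

    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)

    hidden = outputs.last_hidden_state[0].tolist()  # shape: (tokens, hidden_size)
    mask = inputs["attention_mask"][0].tolist()
    return masked_mean(hidden, mask)
```

For repeated calls, it would be cheaper to load the tokenizer and model once and pass them in, rather than reloading them inside `embed` each time.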
+
+ **Acknowledgments**
+
+ We gratefully acknowledge Google’s TensorFlow Research Cloud (TRC) program for providing us with free Cloud TPUs.
+
+