elmadany committed
Commit ef5bf8d
1 Parent(s): 14e7195

Update README.md

Files changed (1)
  1. README.md +1 -2
README.md CHANGED
@@ -1,4 +1,3 @@
- <img src="https://raw.githubusercontent.com/UBC-NLP/marbert/main/ARBERT_MARBERT.jpg" alt="drawing" width="10%" height="10%" align="right"/>
  ---
  language:
  - ar
@@ -10,7 +9,7 @@ tags:
  widget:
  - text: "اللغة العربية هي لغة [MASK]."
  ---
-
+ <img src="https://raw.githubusercontent.com/UBC-NLP/marbert/main/ARBERT_MARBERT.jpg" alt="drawing" width="200" height="200" align="right"/>
  **MARBERT** is one of three models described in our **ACL 2021 paper** **["ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic"](https://aclanthology.org/2021.acl-long.551.pdf)**. MARBERT is a large-scale pre-trained masked language model focused on both Dialectal Arabic (DA) and Modern Standard Arabic (MSA), since Arabic has multiple varieties. To train MARBERT, we randomly sample 1B Arabic tweets from a large in-house dataset of about 6B tweets. We only include tweets with at least 3 Arabic words, based on character string matching, regardless of whether a tweet also contains non-Arabic strings. That is, we do not remove non-Arabic content so long as the tweet meets the 3-Arabic-word criterion. The dataset makes up **128GB of text** (**15.6B tokens**). We use the same network architecture as ARBERT (BERT-base), but without the next sentence prediction (NSP) objective since tweets are short. See our [repo](https://github.com/UBC-NLP/LMBERT) for modifying BERT code to remove NSP. For more information about MARBERT, please visit our GitHub [repo](https://github.com/UBC-NLP/marbert).
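For context on how the model described in this README is typically queried, here is a minimal sketch using the Hugging Face `transformers` fill-mask pipeline with the same prompt as the widget defined in the front matter; the model ID `UBC-NLP/MARBERT` is an assumption inferred from the linked GitHub repo, not something stated in this commit.

```python
# Minimal sketch (assumed model ID "UBC-NLP/MARBERT"): run the README's
# fill-mask widget prompt through the transformers pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="UBC-NLP/MARBERT")

# Prompt from the widget: "اللغة العربية هي لغة [MASK]." ("Arabic is a [MASK] language.")
for pred in fill_mask("اللغة العربية هي لغة [MASK]."):
    print(pred["token_str"], round(pred["score"], 4))
```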