---
language:
- he
tags:
- language model
license: apache-2.0
datasets:
- oscar
- wikipedia
---
# AlephBERT

## Hebrew Language Model

State-of-the-art language model for Hebrew, based on the BERT architecture.

#### How to use
```python
from transformers import BertModel, BertTokenizerFast

alephbert_tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
alephbert = BertModel.from_pretrained('onlplab/alephbert-base')

# if not finetuning - disable dropout
alephbert.eval()
```
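
A minimal usage sketch, continuing from the snippet above (the Hebrew sentence and the `torch.no_grad()` wrapper are illustrative, not part of the original card), for extracting contextual embeddings:

```python
import torch

# encode a Hebrew sentence and run it through the model
inputs = alephbert_tokenizer('עברית היא שפה יפה', return_tensors='pt')
with torch.no_grad():
    outputs = alephbert(**inputs)

# last_hidden_state: (batch_size, sequence_length, hidden_size) contextual token embeddings
print(outputs.last_hidden_state.shape)
```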
## Training data

- OSCAR (10 GB of text, 20M sentences)
- Wikipedia dump (0.6 GB of text, 3M sentences)
- Tweets (7 GB of text, 70M sentences)
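
The public Hebrew portions of these corpora are available through the `datasets` library; the sketch below loads the deduplicated Hebrew OSCAR split as an illustration (this is the public release, not necessarily the exact snapshot used for training):

```python
from datasets import load_dataset

# deduplicated Hebrew split of OSCAR; the snapshot used for AlephBERT may differ
oscar_he = load_dataset('oscar', 'unshuffled_deduplicated_he', split='train')

print(oscar_he)                    # dataset size and fields ('id', 'text')
print(oscar_he[0]['text'][:200])   # first 200 characters of the first document
```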
## Training procedure

Trained on a DGX machine (8 V100 GPUs) using the standard Hugging Face training procedure.

To optimize training time, we split the data into 4 sections based on the maximum number of tokens per sentence (a bucketing sketch is given below):

1. num tokens < 32 (70M sentences)
2. 32 <= num tokens < 64 (12M sentences)
3. 64 <= num tokens < 128 (10M sentences)
4. 128 <= num tokens < 512 (70M sentences)

Each section was trained for 5 epochs with an initial learning rate of 1e-4.
Total training time was 5 days.
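
A minimal sketch of this length-based bucketing, assuming sentences are assigned to a section by their tokenized length (the helper below is illustrative, not the actual training code):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')

# (lower, upper) token-count boundaries of the four sections listed above
BUCKETS = [(0, 32), (32, 64), (64, 128), (128, 512)]

def bucket_index(sentence: str):
    """Return the index of the section a sentence falls into, or None if it exceeds 512 tokens."""
    num_tokens = len(tokenizer.encode(sentence))
    for i, (low, high) in enumerate(BUCKETS):
        if low <= num_tokens < high:
            return i
    return None

print(bucket_index('עברית היא שפה יפה'))  # short sentence -> bucket 0
```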
## Eval