---
language:
- he
tags:
- language model
license: apache-2.0
datasets:
- oscar
- wikipedia
- twitter
---

# AlephBERT

## Hebrew Language Model

State-of-the-art language model for Hebrew, based on BERT.

#### How to use

```python
from transformers import BertModel, BertTokenizerFast

alephbert_tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
alephbert = BertModel.from_pretrained('onlplab/alephbert-base')

# if not fine-tuning, disable dropout
alephbert.eval()
```

## Training data

- OSCAR (10GB of text, 20M sentences)
- Wikipedia dump (0.6GB of text, 3M sentences)
- Tweets (7GB of text, 70M sentences)

## Training procedure

Trained on a DGX machine (8 V100 GPUs) using the standard Hugging Face training procedure.

To optimize training time, we split the data into 4 sections based on the maximum number of tokens:

1. num tokens < 32 (70M sentences)
2. 32 <= num tokens < 64 (12M sentences)
3. 64 <= num tokens < 128 (10M sentences)
4. 128 <= num tokens < 512 (1.5M sentences)

Each section was trained for 5 epochs with an initial learning rate of 1e-4. Total training time was 5 days.

## Eval
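
As a quick, informal sanity check of the pretraining objective (a minimal sketch, not the benchmark evaluation; it assumes the uploaded `onlplab/alephbert-base` checkpoint includes the masked-LM head trained during pretraining), one can ask the model to fill in a masked Hebrew word:

```python
from transformers import pipeline

# Informal sanity check only, not the benchmark evaluation.
# Assumes the 'onlplab/alephbert-base' checkpoint ships with its
# masked-LM head so the fill-mask pipeline can load it directly.
fill_mask = pipeline('fill-mask', model='onlplab/alephbert-base')

# "Jerusalem is a [MASK] city." -- print the top predictions with their scores
for prediction in fill_mask('ירושלים היא עיר [MASK].'):
    print(prediction['token_str'], round(prediction['score'], 3))
```

For downstream tasks, the usual route is to fine-tune the `BertModel` loaded above with a task-specific head.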