---
tags:
- generated_from_trainer
datasets:
- Graphcore/wikipedia-bert-128
- Graphcore/wikipedia-bert-512
model-index:
- name: Graphcore/bert-base-uncased
  results: []
---

# Graphcore/bert-base-uncased

This model is a pre-trained BERT-Base model trained in two phases on the [Graphcore/wikipedia-bert-128](https://huggingface.co/datasets/Graphcore/wikipedia-bert-128) and [Graphcore/wikipedia-bert-512](https://huggingface.co/datasets/Graphcore/wikipedia-bert-512) datasets.

## Model description

Pre-trained BERT-Base model trained on Wikipedia data.

## Intended uses & limitations

More information needed.

## Training and evaluation data

Trained on the following Wikipedia datasets:
- [Graphcore/wikipedia-bert-128](https://huggingface.co/datasets/Graphcore/wikipedia-bert-128)
- [Graphcore/wikipedia-bert-512](https://huggingface.co/datasets/Graphcore/wikipedia-bert-512)

## Training procedure

Trained with the MLM and NSP pre-training objectives, using the large-batch LAMB optimization recipe from [Large Batch Optimization for Deep Learning: Training BERT in 76 minutes](https://arxiv.org/abs/1904.00962).

Trained on 16 Graphcore Mk2 IPUs.

### Training hyperparameters

The following hyperparameters were used during phase 1 training:
- learning_rate: 0.006
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- distributed_type: IPU
- gradient_accumulation_steps: 512
- total_train_batch_size: 65536
- total_eval_batch_size: 128
- optimizer: LAMB
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.28
- training_steps: 10500
- training precision: Mixed Precision

The following hyperparameters were used during phase 2 training:
- learning_rate: 0.002828
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: IPU
- gradient_accumulation_steps: 512
- total_train_batch_size: 16384
- total_eval_batch_size: 128
- optimizer: LAMB
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.128
- training_steps: 2038
- training precision: Mixed Precision

### Framework versions

- Transformers 4.17.0.dev0
- Pytorch 1.10.0+cpu
- Datasets 1.18.3.dev0
- Tokenizers 0.10.3
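
## How to use

A minimal sketch of loading this checkpoint for masked-language-model inference, assuming the repository ships a standard `bert-base-uncased` tokenizer and that the checkpoint runs through the stock `transformers` fill-mask pipeline on CPU/GPU (IPU-specific execution via `optimum-graphcore` is not shown here):

```python
# Minimal usage sketch (assumption: the checkpoint and tokenizer load with the
# stock transformers API; IPU-specific execution is not covered).
from transformers import pipeline

# Load the pre-trained checkpoint into a fill-mask pipeline (uses the MLM head).
fill_mask = pipeline("fill-mask", model="Graphcore/bert-base-uncased")

# Predict the masked token; BERT's mask token is [MASK].
for prediction in fill_mask("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 4))
```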