File size: 5,931 Bytes
387585b 4fc3d82 387585b 46a535e 387585b a9c367b 8d00106 a9c367b 8d00106 387585b 46a535e 387585b |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 |
---
license: apache-2.0
tags:
- generated_from_trainer
datasets:
- Graphcore/wikipedia-bert-128
- Graphcore/wikipedia-bert-512
model-index:
- name: Graphcore/bert-large-uncased
results: []
---
# Graphcore/bert-large-uncased
Optimum Graphcore is a new open-source library and toolkit that enables developers to access IPU-optimized models certified by Hugging Face. It is an extension of Transformers, providing a set of performance optimization tools enabling maximum efficiency to train and run models on Graphcore’s IPUs - a completely new kind of massively parallel processor to accelerate machine intelligence. Learn more about how to take train Transformer models faster with IPUs at [hf.co/hardware/graphcore](https://huggingface.co/hardware/graphcore).
Through HuggingFace Optimum, Graphcore released ready-to-use IPU-trained model checkpoints and IPU configuration files to make it easy to train models with maximum efficiency in the IPU. Optimum shortens the development lifecycle of your AI models by letting you plug-and-play any public dataset and allows a seamless integration to our State-of-the-art hardware giving you a quicker time-to-value for your AI project.
## Model description
BERT (Bidirectional Encoder Representations from Transformers) is a transformers model which is designed to pretrain bidirectional representations from unlabelled texts. It enables easy and fast fine-tuning for different downstream tasks such as Sequence Classification, Named Entity Recognition, Question Answering, Multiple Choice and MaskedLM.
It was trained with two objectives in pretraining : Masked language modelling (MLM) and Next sentence prediction(NSP). First, MLM is different from traditional LM which sees the words one after another while BERT allows the model to learn a bidirectional representation. In addition to MLM, NSP is used for jointly pertaining text-pair representations.
It reduces the need of many engineering efforts for building task specific architectures through pre-trained representation. And achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks.
## Intended uses & limitations
This model is a pre-trained BERT-Large trained in two phases on the [Graphcore/wikipedia-bert-128](https://huggingface.co/datasets/Graphcore/wikipedia-bert-128) and [Graphcore/wikipedia-bert-512](https://huggingface.co/datasets/Graphcore/wikipedia-bert-512) datasets.
## Training and evaluation data
Trained on wikipedia datasets:
- [Graphcore/wikipedia-bert-128](https://huggingface.co/datasets/Graphcore/wikipedia-bert-128)
- [Graphcore/wikipedia-bert-512](https://huggingface.co/datasets/Graphcore/wikipedia-bert-512)
## Training procedure
Trained MLM and NSP pre-training scheme from [Large Batch Optimization for Deep Learning: Training BERT in 76 minutes](https://arxiv.org/abs/1904.00962).
Trained on 64 Graphcore Mk2 IPUs using [`optimum-graphcore`](https://github.com/huggingface/optimum-graphcore)
Command lines:
Phase 1:
```
python examples/language-modeling/run_pretraining.py \
--config_name bert-large-uncased \
--tokenizer_name bert-large-uncased \
--ipu_config_name Graphcore/bert-large-ipu \
--dataset_name Graphcore/wikipedia-bert-128 \
--do_train \
--logging_steps 5 \
--max_seq_length 128 \
--max_steps 10550 \
--is_already_preprocessed \
--dataloader_num_workers 64 \
--dataloader_mode async_rebatched \
--lamb \
--lamb_no_bias_correction \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 512 \
--pod_type pod64 \
--learning_rate 0.006 \
--lr_scheduler_type linear \
--loss_scaling 32768 \
--weight_decay 0.01 \
--warmup_ratio 0.28 \
--config_overrides "layer_norm_eps=0.001" \
--ipu_config_overrides "matmul_proportion=[0.14 0.19 0.19 0.19]" \
--output_dir output-pretrain-bert-large-phase1
```
Phase 2:
```
python examples/language-modeling/run_pretraining.py \
--config_name bert-large-uncased \
--tokenizer_name bert-large-uncased \
--model_name_or_path ./output-pretrain-bert-large-phase1 \
--ipu_config_name Graphcore/bert-large-ipu \
--dataset_name Graphcore/wikipedia-bert-512 \
--do_train \
--logging_steps 5 \
--max_seq_length 512 \
--max_steps 2038 \
--is_already_preprocessed \
--dataloader_num_workers 96 \
--dataloader_mode async_rebatched \
--lamb \
--lamb_no_bias_correction \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 512 \
--pod_type pod64 \
--learning_rate 0.002828 \
--lr_scheduler_type linear \
--loss_scaling 16384 \
--weight_decay 0.01 \
--warmup_ratio 0.128 \
--config_overrides "layer_norm_eps=0.001" \
--ipu_config_overrides "matmul_proportion=[0.14 0.19 0.19 0.19]" \
--output_dir output-pretrain-bert-large-phase2
```
### Training hyperparameters
The following hyperparameters were used during phase 1 training:
- learning_rate: 0.006
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: IPU
- gradient_accumulation_steps: 512
- total_train_batch_size: 65536
- total_eval_batch_size: 512
- optimizer: LAMB
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.28
- training_steps: 10550
- training precision: Mixed Precision
The following hyperparameters were used during phase 2 training:
- learning_rate: 0.002828
- train_batch_size: 2
- eval_batch_size: 8
- seed: 42
- distributed_type: IPU
- gradient_accumulation_steps: 512
- total_train_batch_size: 16384
- total_eval_batch_size: 512
- optimizer: LAMB
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.128
- training_steps: 2038
- training precision: Mixed Precision
### Training results
```
train/epoch: 2.04
train/global_step: 2038
train/loss: 1.2002
train/train_runtime: 12022.3897
train/train_steps_per_second: 0.17
train/train_samples_per_second: 2777.367
```
### Framework versions
- Transformers 4.17.0
- Pytorch 1.10.0+cpu
- Datasets 2.0.0
- Tokenizers 0.11.6
|