Pretrained models¶
Here is the full list of the currently provided pretrained models together with a short presentation of each model.
Architecture |
Shortcut name |
Details of the model |
---|---|---|
BERT |
|
12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on lower-cased English text.
|
|
24-layer, 1024-hidden, 16-heads, 340M parameters.
Trained on lower-cased English text.
|
|
|
12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased English text.
|
|
|
24-layer, 1024-hidden, 16-heads, 340M parameters.
Trained on cased English text.
|
|
|
(Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on lower-cased text in the top 102 languages with the largest Wikipedias
(see details). |
|
|
(New, recommended) 12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased text in the top 104 languages with the largest Wikipedias
(see details). |
|
|
12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased Chinese Simplified and Traditional text.
|
|
|
12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased German text by Deepset.ai
|
|
|
24-layer, 1024-hidden, 16-heads, 340M parameters.
Trained on lower-cased English text using Whole-Word-Masking
(see details). |
|
|
24-layer, 1024-hidden, 16-heads, 340M parameters.
Trained on cased English text using Whole-Word-Masking
(see details). |
|
|
24-layer, 1024-hidden, 16-heads, 340M parameters.
The
bert-large-uncased-whole-word-masking model fine-tuned on SQuAD(see details of fine-tuning in the example section). |
|
|
24-layer, 1024-hidden, 16-heads, 340M parameters
The
bert-large-cased-whole-word-masking model fine-tuned on SQuAD |
|
|
12-layer, 768-hidden, 12-heads, 110M parameters.
The
bert-base-cased model fine-tuned on MRPC |
|
GPT |
|
12-layer, 768-hidden, 12-heads, 110M parameters.
OpenAI GPT English model
|
GPT-2 |
|
12-layer, 768-hidden, 12-heads, 117M parameters.
OpenAI GPT-2 English model
|
|
24-layer, 1024-hidden, 16-heads, 345M parameters.
OpenAI’s Medium-sized GPT-2 English model
|
|
|
36-layer, 1280-hidden, 20-heads, 774M parameters.
OpenAI’s Large-sized GPT-2 English model
|
|
Transformer-XL |
|
18-layer, 1024-hidden, 16-heads, 257M parameters.
English model trained on wikitext-103
|
XLNet |
|
12-layer, 768-hidden, 12-heads, 110M parameters.
XLNet English model
|
|
24-layer, 1024-hidden, 16-heads, 340M parameters.
XLNet Large English model
|
|
XLM |
|
12-layer, 2048-hidden, 16-heads
XLM English model
|
|
6-layer, 1024-hidden, 8-heads
XLM English-German model trained on the concatenation of English and German wikipedia
|
|
|
6-layer, 1024-hidden, 8-heads
XLM English-French model trained on the concatenation of English and French wikipedia
|
|
|
6-layer, 1024-hidden, 8-heads
XLM English-Romanian Multi-language model
|
|
|
12-layer, 1024-hidden, 8-heads
XLM Model pre-trained with MLM on the 15 XNLI languages.
|
|
|
12-layer, 1024-hidden, 8-heads
XLM Model pre-trained with MLM + TLM on the 15 XNLI languages.
|
|
|
6-layer, 1024-hidden, 8-heads
XLM English-French model trained with CLM (Causal Language Modeling) on the concatenation of English and French wikipedia
|
|
|
6-layer, 1024-hidden, 8-heads
XLM English-German model trained with CLM (Causal Language Modeling) on the concatenation of English and German wikipedia
|
|
RoBERTa |
|
12-layer, 768-hidden, 12-heads, 125M parameters
RoBERTa using the BERT-base architecture
(see details) |
|
24-layer, 1024-hidden, 16-heads, 355M parameters
RoBERTa using the BERT-large architecture
(see details) |
|
|
24-layer, 1024-hidden, 16-heads, 355M parameters
roberta-large fine-tuned on MNLI.(see details) |
|
DistilBERT |
|
6-layer, 768-hidden, 12-heads, 66M parameters
The DistilBERT model distilled from the BERT model bert-base-uncased checkpoint
(see details) |
|
6-layer, 768-hidden, 12-heads, 66M parameters
The DistilBERT model distilled from the BERT model bert-base-uncased checkpoint, with an additional linear layer.
(see details) |