Pretrained models
Here is the full list of the currently provided pretrained models, together with a short description of each.
| Architecture | Shortcut name | Details of the model |
|---|---|---|
| BERT | `bert-base-uncased` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased English text. |
| | `bert-large-uncased` | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on lower-cased English text. |
| | `bert-base-cased` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased English text. |
| | `bert-large-cased` | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on cased English text. |
| | `bert-base-multilingual-uncased` | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased text in the top 102 languages with the largest Wikipedias (see details). |
| | `bert-base-multilingual-cased` | (New, recommended) 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased text in the top 104 languages with the largest Wikipedias (see details). |
| | `bert-base-chinese` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased Chinese Simplified and Traditional text. |
| | `bert-base-german-cased` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased German text by Deepset.ai. |
| | `bert-large-uncased-whole-word-masking` | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on lower-cased English text using Whole-Word-Masking (see details). |
| | `bert-large-cased-whole-word-masking` | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on cased English text using Whole-Word-Masking (see details). |
| | `bert-large-uncased-whole-word-masking-finetuned-squad` | 24-layer, 1024-hidden, 16-heads, 340M parameters. The `bert-large-uncased-whole-word-masking` model fine-tuned on SQuAD (see details of fine-tuning in the example section). |
| | `bert-large-cased-whole-word-masking-finetuned-squad` | 24-layer, 1024-hidden, 16-heads, 340M parameters. The `bert-large-cased-whole-word-masking` model fine-tuned on SQuAD. |
| | `bert-base-cased-finetuned-mrpc` | 12-layer, 768-hidden, 12-heads, 110M parameters. The `bert-base-cased` model fine-tuned on MRPC. |
| | `bert-base-german-dbmdz-cased` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased German text by DBMDZ (see details on the dbmdz repository). |
| | `bert-base-german-dbmdz-uncased` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on uncased German text by DBMDZ (see details on the dbmdz repository). |
| | `bert-base-japanese` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on Japanese text. Text is tokenized with MeCab and WordPiece; MeCab is required for tokenization. |
| | `bert-base-japanese-whole-word-masking` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on Japanese text using Whole-Word-Masking. Text is tokenized with MeCab and WordPiece; MeCab is required for tokenization. |
| | `bert-base-japanese-char` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on Japanese text. Text is tokenized into characters. |
| | `bert-base-japanese-char-whole-word-masking` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on Japanese text using Whole-Word-Masking. Text is tokenized into characters. |
| | `bert-base-finnish-cased-v1` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased Finnish text (see details on turkunlp.org). |
| | `bert-base-finnish-uncased-v1` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on uncased Finnish text (see details on turkunlp.org). |
| GPT | `openai-gpt` | 12-layer, 768-hidden, 12-heads, 110M parameters. OpenAI GPT English model. |
| GPT-2 | `gpt2` | 12-layer, 768-hidden, 12-heads, 117M parameters. OpenAI GPT-2 English model. |
| | `gpt2-medium` | 24-layer, 1024-hidden, 16-heads, 345M parameters. OpenAI's Medium-sized GPT-2 English model. |
| | `gpt2-large` | 36-layer, 1280-hidden, 20-heads, 774M parameters. OpenAI's Large-sized GPT-2 English model. |
| | `gpt2-xl` | 48-layer, 1600-hidden, 25-heads, 1558M parameters. OpenAI's XL-sized GPT-2 English model. |
| Transformer-XL | `transfo-xl-wt103` | 18-layer, 1024-hidden, 16-heads, 257M parameters. English model trained on wikitext-103. |
| XLNet | `xlnet-base-cased` | 12-layer, 768-hidden, 12-heads, 110M parameters. XLNet English model. |
| | `xlnet-large-cased` | 24-layer, 1024-hidden, 16-heads, 340M parameters. XLNet Large English model. |
| XLM | `xlm-mlm-en-2048` | 12-layer, 2048-hidden, 16-heads. XLM English model. |
| | `xlm-mlm-ende-1024` | 6-layer, 1024-hidden, 8-heads. XLM English-German model trained on the concatenation of English and German wikipedia. |
| | `xlm-mlm-enfr-1024` | 6-layer, 1024-hidden, 8-heads. XLM English-French model trained on the concatenation of English and French wikipedia. |
| | `xlm-mlm-enro-1024` | 6-layer, 1024-hidden, 8-heads. XLM English-Romanian Multi-language model. |
| | `xlm-mlm-xnli15-1024` | 12-layer, 1024-hidden, 8-heads. XLM model pre-trained with MLM on the 15 XNLI languages. |
| | `xlm-mlm-tlm-xnli15-1024` | 12-layer, 1024-hidden, 8-heads. XLM model pre-trained with MLM + TLM on the 15 XNLI languages. |
| | `xlm-clm-enfr-1024` | 6-layer, 1024-hidden, 8-heads. XLM English-French model trained with CLM (Causal Language Modeling) on the concatenation of English and French wikipedia. |
| | `xlm-clm-ende-1024` | 6-layer, 1024-hidden, 8-heads. XLM English-German model trained with CLM (Causal Language Modeling) on the concatenation of English and German wikipedia. |
| | `xlm-mlm-17-1280` | 16-layer, 1280-hidden, 16-heads. XLM model trained with MLM (Masked Language Modeling) on 17 languages. |
| | `xlm-mlm-100-1280` | 16-layer, 1280-hidden, 16-heads. XLM model trained with MLM (Masked Language Modeling) on 100 languages. |
| RoBERTa | `roberta-base` | 12-layer, 768-hidden, 12-heads, 125M parameters. RoBERTa using the BERT-base architecture (see details). |
| | `roberta-large` | 24-layer, 1024-hidden, 16-heads, 355M parameters. RoBERTa using the BERT-large architecture (see details). |
| | `roberta-large-mnli` | 24-layer, 1024-hidden, 16-heads, 355M parameters. `roberta-large` fine-tuned on MNLI (see details). |
| | `roberta-base-openai-detector` | 12-layer, 768-hidden, 12-heads, 125M parameters. `roberta-base` fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model (see details). |
| | `roberta-large-openai-detector` | 24-layer, 1024-hidden, 16-heads, 355M parameters. `roberta-large` fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model (see details). |
| DistilBERT | `distilbert-base-uncased` | 6-layer, 768-hidden, 12-heads, 66M parameters. The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint (see details). |
| | `distilbert-base-uncased-distilled-squad` | 6-layer, 768-hidden, 12-heads, 66M parameters. The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint, with an additional linear layer (see details). |
| | `distilgpt2` | 6-layer, 768-hidden, 12-heads, 82M parameters. The DistilGPT2 model distilled from the GPT2 model `gpt2` checkpoint (see details). |
| | `distilroberta-base` | 6-layer, 768-hidden, 12-heads, 82M parameters. The DistilRoBERTa model distilled from the RoBERTa model `roberta-base` checkpoint (see details). |
| | `distilbert-base-german-cased` | 6-layer, 768-hidden, 12-heads, 66M parameters. The German DistilBERT model distilled from the German DBMDZ BERT model `bert-base-german-dbmdz-cased` checkpoint (see details). |
| | `distilbert-base-multilingual-cased` | 6-layer, 768-hidden, 12-heads, 134M parameters. The multilingual DistilBERT model distilled from the Multilingual BERT model `bert-base-multilingual-cased` checkpoint (see details). |
| CTRL | `ctrl` | 48-layer, 1280-hidden, 16-heads, 1.6B parameters. Salesforce's Large-sized CTRL English model. |
| CamemBERT | `camembert-base` | 12-layer, 768-hidden, 12-heads, 110M parameters. CamemBERT using the BERT-base architecture (see details). |
| ALBERT | `albert-base-v1` | 12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters. ALBERT base model (see details). |
| | `albert-large-v1` | 24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters. ALBERT large model (see details). |
| | `albert-xlarge-v1` | 24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters. ALBERT xlarge model (see details). |
| | `albert-xxlarge-v1` | 12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters. ALBERT xxlarge model (see details). |
| | `albert-base-v2` | 12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters. ALBERT base model with no dropout, additional training data and longer training (see details). |
| | `albert-large-v2` | 24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters. ALBERT large model with no dropout, additional training data and longer training (see details). |
| | `albert-xlarge-v2` | 24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters. ALBERT xlarge model with no dropout, additional training data and longer training (see details). |
| | `albert-xxlarge-v2` | 12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters. ALBERT xxlarge model with no dropout, additional training data and longer training (see details). |
| T5 | `t5-small` | ~60M parameters with 6-layers, 512-hidden-state, 2048 feed-forward hidden-state, 8-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4). |
| | `t5-base` | ~220M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 12-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4). |
| | `t5-large` | ~770M parameters with 24-layers, 1024-hidden-state, 4096 feed-forward hidden-state, 16-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4). |
| | `t5-3b` | ~2.8B parameters with 24-layers, 1024-hidden-state, 16384 feed-forward hidden-state, 32-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4). |
| | `t5-11b` | ~11B parameters with 24-layers, 1024-hidden-state, 65536 feed-forward hidden-state, 128-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4). |
| XLM-RoBERTa | `xlm-roberta-base` | ~125M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 8-heads. Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages. |
| | `xlm-roberta-large` | ~355M parameters with 24-layers, 1024-hidden-state, 4096 feed-forward hidden-state, 16-heads. Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages. |