Pretrained models
Here is the full list of the currently provided pretrained models together with a short presentation of each model.
For a list that includes community-uploaded models, refer to https://huggingface.co/models.
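Any shortcut name in the table below can be passed directly to `from_pretrained`. A minimal sketch, assuming the `transformers` library (with a PyTorch backend) is installed; the checkpoint is downloaded and cached locally on first use:

```python
# Minimal sketch: load a pretrained checkpoint by its shortcut name.
# Assumes `transformers` and `torch` are installed and the weights can
# be downloaded (they are cached locally after the first call).
from transformers import AutoModel, AutoTokenizer

shortcut_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(shortcut_name)
model = AutoModel.from_pretrained(shortcut_name)

# Encode a sentence and run it through the model.
input_ids = tokenizer.encode("Hello, world!", return_tensors="pt")
outputs = model(input_ids)
```

The same call works for community-uploaded models from https://huggingface.co/models; only the identifier string changes.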
| Architecture | Shortcut name | Details of the model |
|---|---|---|
| BERT | `bert-base-uncased` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased English text. |
| | `bert-large-uncased` | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on lower-cased English text. |
| | `bert-base-cased` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased English text. |
| | `bert-large-cased` | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on cased English text. |
| | `bert-base-multilingual-uncased` | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased text in the top 102 languages with the largest Wikipedias (see details). |
| | `bert-base-multilingual-cased` | (New, recommended) 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased text in the top 104 languages with the largest Wikipedias (see details). |
| | `bert-base-chinese` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased Chinese Simplified and Traditional text. |
| | `bert-base-german-cased` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased German text by Deepset.ai. |
| | `bert-large-uncased-whole-word-masking` | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on lower-cased English text using Whole-Word-Masking (see details). |
| | `bert-large-cased-whole-word-masking` | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on cased English text using Whole-Word-Masking (see details). |
| | `bert-large-uncased-whole-word-masking-finetuned-squad` | 24-layer, 1024-hidden, 16-heads, 340M parameters. The bert-large-uncased-whole-word-masking model fine-tuned on SQuAD (see details of fine-tuning in the example section). |
| | `bert-large-cased-whole-word-masking-finetuned-squad` | 24-layer, 1024-hidden, 16-heads, 340M parameters. The bert-large-cased-whole-word-masking model fine-tuned on SQuAD. |
| | `bert-base-cased-finetuned-mrpc` | 12-layer, 768-hidden, 12-heads, 110M parameters. The bert-base-cased model fine-tuned on MRPC. |
| | `bert-base-german-dbmdz-cased` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased German text by DBMDZ (see details on the dbmdz repository). |
| | `bert-base-german-dbmdz-uncased` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on uncased German text by DBMDZ (see details on the dbmdz repository). |
| | `bert-base-japanese` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on Japanese text. Text is tokenized with MeCab and WordPiece; MeCab is required for tokenization. |
| | `bert-base-japanese-whole-word-masking` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on Japanese text using Whole-Word-Masking. Text is tokenized with MeCab and WordPiece; MeCab is required for tokenization. |
| | `bert-base-japanese-char` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on Japanese text. Text is tokenized into characters. |
| | `bert-base-japanese-char-whole-word-masking` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on Japanese text using Whole-Word-Masking. Text is tokenized into characters. |
| | `bert-base-finnish-cased-v1` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased Finnish text (see details on turkunlp.org). |
| | `bert-base-finnish-uncased-v1` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on uncased Finnish text (see details on turkunlp.org). |
| | `bert-base-dutch-cased` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased Dutch text. |
| GPT | `openai-gpt` | 12-layer, 768-hidden, 12-heads, 110M parameters. OpenAI GPT English model. |
| GPT-2 | `gpt2` | 12-layer, 768-hidden, 12-heads, 117M parameters. OpenAI GPT-2 English model. |
| | `gpt2-medium` | 24-layer, 1024-hidden, 16-heads, 345M parameters. OpenAI's Medium-sized GPT-2 English model. |
| | `gpt2-large` | 36-layer, 1280-hidden, 20-heads, 774M parameters. OpenAI's Large-sized GPT-2 English model. |
| | `gpt2-xl` | 48-layer, 1600-hidden, 25-heads, 1558M parameters. OpenAI's XL-sized GPT-2 English model. |
| Transformer-XL | `transfo-xl-wt103` | 18-layer, 1024-hidden, 16-heads, 257M parameters. English model trained on wikitext-103. |
| XLNet | `xlnet-base-cased` | 12-layer, 768-hidden, 12-heads, 110M parameters. XLNet English model. |
| | `xlnet-large-cased` | 24-layer, 1024-hidden, 16-heads, 340M parameters. XLNet Large English model. |
| XLM | `xlm-mlm-en-2048` | 12-layer, 2048-hidden, 16-heads. XLM English model. |
| | `xlm-mlm-ende-1024` | 6-layer, 1024-hidden, 8-heads. XLM English-German model trained on the concatenation of English and German Wikipedia. |
| | `xlm-mlm-enfr-1024` | 6-layer, 1024-hidden, 8-heads. XLM English-French model trained on the concatenation of English and French Wikipedia. |
| | `xlm-mlm-enro-1024` | 6-layer, 1024-hidden, 8-heads. XLM English-Romanian multi-language model. |
| | `xlm-mlm-xnli15-1024` | 12-layer, 1024-hidden, 8-heads. XLM model pre-trained with MLM on the 15 XNLI languages. |
| | `xlm-mlm-tlm-xnli15-1024` | 12-layer, 1024-hidden, 8-heads. XLM model pre-trained with MLM + TLM on the 15 XNLI languages. |
| | `xlm-clm-enfr-1024` | 6-layer, 1024-hidden, 8-heads. XLM English-French model trained with CLM (Causal Language Modeling) on the concatenation of English and French Wikipedia. |
| | `xlm-clm-ende-1024` | 6-layer, 1024-hidden, 8-heads. XLM English-German model trained with CLM (Causal Language Modeling) on the concatenation of English and German Wikipedia. |
| | `xlm-mlm-17-1280` | 16-layer, 1280-hidden, 16-heads. XLM model trained with MLM (Masked Language Modeling) on 17 languages. |
| | `xlm-mlm-100-1280` | 16-layer, 1280-hidden, 16-heads. XLM model trained with MLM (Masked Language Modeling) on 100 languages. |
| RoBERTa | `roberta-base` | 12-layer, 768-hidden, 12-heads, 125M parameters. RoBERTa using the BERT-base architecture (see details). |
| | `roberta-large` | 24-layer, 1024-hidden, 16-heads, 355M parameters. RoBERTa using the BERT-large architecture (see details). |
| | `roberta-large-mnli` | 24-layer, 1024-hidden, 16-heads, 355M parameters. roberta-large fine-tuned on MNLI (see details). |
| | `distilroberta-base` | 6-layer, 768-hidden, 12-heads, 82M parameters. The DistilRoBERTa model distilled from the RoBERTa model roberta-base checkpoint (see details). |
| | `roberta-base-openai-detector` | 12-layer, 768-hidden, 12-heads, 125M parameters. roberta-base fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model (see details). |
| | `roberta-large-openai-detector` | 24-layer, 1024-hidden, 16-heads, 355M parameters. roberta-large fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model (see details). |
| DistilBERT | `distilbert-base-uncased` | 6-layer, 768-hidden, 12-heads, 66M parameters. The DistilBERT model distilled from the BERT model bert-base-uncased checkpoint (see details). |
| | `distilbert-base-uncased-distilled-squad` | 6-layer, 768-hidden, 12-heads, 66M parameters. The DistilBERT model distilled from the BERT model bert-base-uncased checkpoint, with an additional linear layer (see details). |
| | `distilbert-base-cased` | 6-layer, 768-hidden, 12-heads, 65M parameters. The DistilBERT model distilled from the BERT model bert-base-cased checkpoint (see details). |
| | `distilbert-base-cased-distilled-squad` | 6-layer, 768-hidden, 12-heads, 65M parameters. The DistilBERT model distilled from the BERT model bert-base-cased checkpoint, with an additional question answering layer (see details). |
| | `distilgpt2` | 6-layer, 768-hidden, 12-heads, 82M parameters. The DistilGPT2 model distilled from the GPT2 model gpt2 checkpoint (see details). |
| | `distilbert-base-german-cased` | 6-layer, 768-hidden, 12-heads, 66M parameters. The German DistilBERT model distilled from the German DBMDZ BERT model bert-base-german-dbmdz-cased checkpoint (see details). |
| | `distilbert-base-multilingual-cased` | 6-layer, 768-hidden, 12-heads, 134M parameters. The multilingual DistilBERT model distilled from the Multilingual BERT model bert-base-multilingual-cased checkpoint (see details). |
| CTRL | `ctrl` | 48-layer, 1280-hidden, 16-heads, 1.6B parameters. Salesforce's Large-sized CTRL English model. |
| CamemBERT | `camembert-base` | 12-layer, 768-hidden, 12-heads, 110M parameters. CamemBERT using the BERT-base architecture (see details). |
| ALBERT | `albert-base-v1` | 12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters. ALBERT base model (see details). |
| | `albert-large-v1` | 24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters. ALBERT large model (see details). |
| | `albert-xlarge-v1` | 24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters. ALBERT xlarge model (see details). |
| | `albert-xxlarge-v1` | 12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters. ALBERT xxlarge model (see details). |
| | `albert-base-v2` | 12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters. ALBERT base model with no dropout, additional training data and longer training (see details). |
| | `albert-large-v2` | 24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters. ALBERT large model with no dropout, additional training data and longer training (see details). |
| | `albert-xlarge-v2` | 24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters. ALBERT xlarge model with no dropout, additional training data and longer training (see details). |
| | `albert-xxlarge-v2` | 12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters. ALBERT xxlarge model with no dropout, additional training data and longer training (see details). |
| T5 | `t5-small` | ~60M parameters with 6-layers, 512-hidden-state, 2048 feed-forward hidden-state, 8-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4). |
| | `t5-base` | ~220M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 12-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4). |
| | `t5-large` | ~770M parameters with 24-layers, 1024-hidden-state, 4096 feed-forward hidden-state, 16-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4). |
| | `t5-3b` | ~2.8B parameters with 24-layers, 1024-hidden-state, 16384 feed-forward hidden-state, 32-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4). |
| | `t5-11b` | ~11B parameters with 24-layers, 1024-hidden-state, 65536 feed-forward hidden-state, 128-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4). |
| XLM-RoBERTa | `xlm-roberta-base` | ~125M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 12-heads. Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages. |
| | `xlm-roberta-large` | ~355M parameters with 24-layers, 1024-hidden-state, 4096 feed-forward hidden-state, 16-heads. Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages. |
| FlauBERT | `flaubert/flaubert_small_cased` | 6-layer, 512-hidden, 8-heads, 54M parameters. FlauBERT small architecture (see details). |
| | `flaubert/flaubert_base_uncased` | 12-layer, 768-hidden, 12-heads, 137M parameters. FlauBERT base architecture with uncased vocabulary (see details). |
| | `flaubert/flaubert_base_cased` | 12-layer, 768-hidden, 12-heads, 138M parameters. FlauBERT base architecture with cased vocabulary (see details). |
| | `flaubert/flaubert_large_cased` | 24-layer, 1024-hidden, 16-heads, 373M parameters. FlauBERT large architecture (see details). |
| Bart | `facebook/bart-large` | 24-layer, 1024-hidden, 16-heads, 406M parameters (see details). |
| | `facebook/bart-base` | 12-layer, 768-hidden, 16-heads, 139M parameters. |
| | `facebook/bart-large-mnli` | Adds a 2-layer classification head with 1 million parameters. bart-large architecture with a classification head, finetuned on MNLI. |
| | `facebook/bart-large-cnn` | 24-layer, 1024-hidden, 16-heads, 406M parameters (same as bart-large). bart-large architecture finetuned on the CNN summarization task. |
| | `facebook/mbart-large-en-ro` | 12-layer, 1024-hidden, 16-heads, 880M parameters. bart-large architecture pretrained on cc25 multilingual data, finetuned on WMT English-Romanian translation. |
| DialoGPT | `microsoft/DialoGPT-small` | 12-layer, 768-hidden, 12-heads, 124M parameters. Trained on English text: 147M conversation-like exchanges extracted from Reddit. |
| | `microsoft/DialoGPT-medium` | 24-layer, 1024-hidden, 16-heads, 355M parameters. Trained on English text: 147M conversation-like exchanges extracted from Reddit. |
| | `microsoft/DialoGPT-large` | 36-layer, 1280-hidden, 20-heads, 774M parameters. Trained on English text: 147M conversation-like exchanges extracted from Reddit. |
| Reformer | `google/reformer-enwik8` | 12-layer, 1024-hidden, 8-heads, 149M parameters. Trained on English Wikipedia data, enwik8. |
| | `google/reformer-crime-and-punishment` | 6-layer, 256-hidden, 2-heads, 3M parameters. Trained on English text: the Crime and Punishment novel by Fyodor Dostoyevsky. |
| MarianMT | `Helsinki-NLP/opus-mt-{src}-{tgt}` | 12-layer, 512-hidden, 8-heads, ~74M-parameter machine translation models. Parameter counts vary depending on vocab size (see model list). |
| Longformer | `allenai/longformer-base-4096` | 12-layer, 768-hidden, 12-heads, ~149M parameters. Starting from the RoBERTa-base checkpoint, trained on documents of max length 4,096. |
| | `allenai/longformer-large-4096` | 24-layer, 1024-hidden, 16-heads, ~435M parameters. Starting from the RoBERTa-large checkpoint, trained on documents of max length 4,096. |