Pretrained models
Here is the full list of the currently provided pretrained models together with a short presentation of each model.
For a list that includes all community-uploaded models, refer to https://huggingface.co/models.
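The parameter counts in the table below follow directly from each model's hyper-parameters. As a rough sketch, here is where the "110M parameters" figure for a BERT-base-shaped encoder (12-layer, 768-hidden, 12-heads) comes from; the vocabulary and position sizes are assumptions taken from BERT's published config, not from this table:

```python
# Rough parameter count for a BERT-base-shaped encoder.
# VOCAB/MAX_POS/TYPES are assumed from the published BERT config.
VOCAB, MAX_POS, TYPES = 30522, 512, 2
HIDDEN, LAYERS, FFN = 768, 12, 3072

embeddings = (VOCAB + MAX_POS + TYPES) * HIDDEN + 2 * HIDDEN  # incl. LayerNorm
per_layer = (
    4 * (HIDDEN * HIDDEN + HIDDEN)       # Q, K, V and output projections
    + 2 * (HIDDEN * FFN) + FFN + HIDDEN  # feed-forward in/out weights + biases
    + 2 * 2 * HIDDEN                     # two LayerNorms per layer
)
pooler = HIDDEN * HIDDEN + HIDDEN
total = embeddings + LAYERS * per_layer + pooler
print(f"{total:,}")  # 109,482,240 -- reported as "110M parameters" below
```

The same arithmetic, with each model's layer count and hidden size, reproduces the other BERT-family figures in the table to within rounding.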
| Architecture | Model id | Details of the model |
|---|---|---|
| BERT | `bert-base-uncased` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased English text. |
| | `bert-large-uncased` | 24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased English text. |
| | `bert-base-cased` | 12-layer, 768-hidden, 12-heads, 109M parameters. Trained on cased English text. |
| | `bert-large-cased` | 24-layer, 1024-hidden, 16-heads, 335M parameters. Trained on cased English text. |
| | `bert-base-multilingual-uncased` | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 168M parameters. Trained on lower-cased text in the top 102 languages with the largest Wikipedias (see details). |
| | `bert-base-multilingual-cased` | (New, recommended) 12-layer, 768-hidden, 12-heads, 179M parameters. Trained on cased text in the top 104 languages with the largest Wikipedias (see details). |
| | `bert-base-chinese` | 12-layer, 768-hidden, 12-heads, 103M parameters. Trained on cased Chinese Simplified and Traditional text. |
| | `bert-base-german-cased` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased German text by Deepset.ai. |
| | `bert-large-uncased-whole-word-masking` | 24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased English text using Whole-Word-Masking (see details). |
| | `bert-large-cased-whole-word-masking` | 24-layer, 1024-hidden, 16-heads, 335M parameters. Trained on cased English text using Whole-Word-Masking (see details). |
| | `bert-large-uncased-whole-word-masking-finetuned-squad` | 24-layer, 1024-hidden, 16-heads, 336M parameters. The `bert-large-uncased-whole-word-masking` model fine-tuned on SQuAD (see details of fine-tuning in the example section). |
| | `bert-large-cased-whole-word-masking-finetuned-squad` | 24-layer, 1024-hidden, 16-heads, 335M parameters. The `bert-large-cased-whole-word-masking` model fine-tuned on SQuAD. |
| | `bert-base-cased-finetuned-mrpc` | 12-layer, 768-hidden, 12-heads, 110M parameters. The `bert-base-cased` model fine-tuned on MRPC. |
| | `bert-base-german-dbmdz-cased` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased German text by DBMDZ (see details on the dbmdz repository). |
| | `bert-base-german-dbmdz-uncased` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on uncased German text by DBMDZ (see details on the dbmdz repository). |
| | `cl-tohoku/bert-base-japanese` | 12-layer, 768-hidden, 12-heads, 111M parameters. Trained on Japanese text. Text is tokenized with MeCab and WordPiece, which requires some extra dependencies; use `pip install transformers["ja"]` (or `pip install -e .["ja"]` if you install from source) to install them. |
| | `cl-tohoku/bert-base-japanese-whole-word-masking` | 12-layer, 768-hidden, 12-heads, 111M parameters. Trained on Japanese text using Whole-Word-Masking. Text is tokenized with MeCab and WordPiece, which requires some extra dependencies; use `pip install transformers["ja"]` (or `pip install -e .["ja"]` if you install from source) to install them. |
| | `cl-tohoku/bert-base-japanese-char` | 12-layer, 768-hidden, 12-heads, 90M parameters. Trained on Japanese text. Text is tokenized into characters. |
| | `cl-tohoku/bert-base-japanese-char-whole-word-masking` | 12-layer, 768-hidden, 12-heads, 90M parameters. Trained on Japanese text using Whole-Word-Masking. Text is tokenized into characters. |
| | `TurkuNLP/bert-base-finnish-cased-v1` | 12-layer, 768-hidden, 12-heads, 125M parameters. Trained on cased Finnish text (see details on turkunlp.org). |
| | `TurkuNLP/bert-base-finnish-uncased-v1` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on uncased Finnish text (see details on turkunlp.org). |
| | `wietsedv/bert-base-dutch-cased` | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased Dutch text. |
| GPT | `openai-gpt` | 12-layer, 768-hidden, 12-heads, 110M parameters. OpenAI GPT English model. |
| GPT-2 | `gpt2` | 12-layer, 768-hidden, 12-heads, 117M parameters. OpenAI GPT-2 English model. |
| | `gpt2-medium` | 24-layer, 1024-hidden, 16-heads, 345M parameters. OpenAI's medium-sized GPT-2 English model. |
| | `gpt2-large` | 36-layer, 1280-hidden, 20-heads, 774M parameters. OpenAI's large-sized GPT-2 English model. |
| | `gpt2-xl` | 48-layer, 1600-hidden, 25-heads, 1558M parameters. OpenAI's XL-sized GPT-2 English model. |
| Transformer-XL | `transfo-xl-wt103` | 18-layer, 1024-hidden, 16-heads, 257M parameters. English model trained on wikitext-103. |
| XLNet | `xlnet-base-cased` | 12-layer, 768-hidden, 12-heads, 110M parameters. XLNet English model. |
| | `xlnet-large-cased` | 24-layer, 1024-hidden, 16-heads, 340M parameters. XLNet large English model. |
| XLM | `xlm-mlm-en-2048` | 12-layer, 2048-hidden, 16-heads. XLM English model. |
| | `xlm-mlm-ende-1024` | 6-layer, 1024-hidden, 8-heads. XLM English-German model trained on the concatenation of English and German Wikipedia. |
| | `xlm-mlm-enfr-1024` | 6-layer, 1024-hidden, 8-heads. XLM English-French model trained on the concatenation of English and French Wikipedia. |
| | `xlm-mlm-enro-1024` | 6-layer, 1024-hidden, 8-heads. XLM English-Romanian multi-language model. |
| | `xlm-mlm-xnli15-1024` | 12-layer, 1024-hidden, 8-heads. XLM model pre-trained with MLM on the 15 XNLI languages. |
| | `xlm-mlm-tlm-xnli15-1024` | 12-layer, 1024-hidden, 8-heads. XLM model pre-trained with MLM + TLM on the 15 XNLI languages. |
| | `xlm-clm-enfr-1024` | 6-layer, 1024-hidden, 8-heads. XLM English-French model trained with CLM (Causal Language Modeling) on the concatenation of English and French Wikipedia. |
| | `xlm-clm-ende-1024` | 6-layer, 1024-hidden, 8-heads. XLM English-German model trained with CLM (Causal Language Modeling) on the concatenation of English and German Wikipedia. |
| | `xlm-mlm-17-1280` | 16-layer, 1280-hidden, 16-heads. XLM model trained with MLM (Masked Language Modeling) on 17 languages. |
| | `xlm-mlm-100-1280` | 16-layer, 1280-hidden, 16-heads. XLM model trained with MLM (Masked Language Modeling) on 100 languages. |
| RoBERTa | `roberta-base` | 12-layer, 768-hidden, 12-heads, 125M parameters. RoBERTa using the BERT-base architecture (see details). |
| | `roberta-large` | 24-layer, 1024-hidden, 16-heads, 355M parameters. RoBERTa using the BERT-large architecture (see details). |
| | `roberta-large-mnli` | 24-layer, 1024-hidden, 16-heads, 355M parameters. `roberta-large` fine-tuned on MNLI (see details). |
| | `distilroberta-base` | 6-layer, 768-hidden, 12-heads, 82M parameters. The DistilRoBERTa model distilled from the RoBERTa `roberta-base` checkpoint (see details). |
| | `roberta-base-openai-detector` | 12-layer, 768-hidden, 12-heads, 125M parameters. `roberta-base` fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model (see details). |
| | `roberta-large-openai-detector` | 24-layer, 1024-hidden, 16-heads, 355M parameters. `roberta-large` fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model (see details). |
| DistilBERT | `distilbert-base-uncased` | 6-layer, 768-hidden, 12-heads, 66M parameters. The DistilBERT model distilled from the BERT `bert-base-uncased` checkpoint (see details). |
| | `distilbert-base-uncased-distilled-squad` | 6-layer, 768-hidden, 12-heads, 66M parameters. The DistilBERT model distilled from the BERT `bert-base-uncased` checkpoint, with an additional linear layer (see details). |
| | `distilbert-base-cased` | 6-layer, 768-hidden, 12-heads, 65M parameters. The DistilBERT model distilled from the BERT `bert-base-cased` checkpoint (see details). |
| | `distilbert-base-cased-distilled-squad` | 6-layer, 768-hidden, 12-heads, 65M parameters. The DistilBERT model distilled from the BERT `bert-base-cased` checkpoint, with an additional question answering layer (see details). |
| | `distilgpt2` | 6-layer, 768-hidden, 12-heads, 82M parameters. The DistilGPT2 model distilled from the GPT-2 `gpt2` checkpoint (see details). |
| | `distilbert-base-german-cased` | 6-layer, 768-hidden, 12-heads, 66M parameters. The German DistilBERT model distilled from the German DBMDZ BERT `bert-base-german-dbmdz-cased` checkpoint (see details). |
| | `distilbert-base-multilingual-cased` | 6-layer, 768-hidden, 12-heads, 134M parameters. The multilingual DistilBERT model distilled from the multilingual BERT `bert-base-multilingual-cased` checkpoint (see details). |
| CTRL | `ctrl` | 48-layer, 1280-hidden, 16-heads, 1.6B parameters. Salesforce's large-sized CTRL English model. |
| CamemBERT | `camembert-base` | 12-layer, 768-hidden, 12-heads, 110M parameters. CamemBERT using the BERT-base architecture (see details). |
| ALBERT | `albert-base-v1` | 12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters. ALBERT base model (see details). |
| | `albert-large-v1` | 24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters. ALBERT large model (see details). |
| | `albert-xlarge-v1` | 24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters. ALBERT xlarge model (see details). |
| | `albert-xxlarge-v1` | 12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters. ALBERT xxlarge model (see details). |
| | `albert-base-v2` | 12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters. ALBERT base model with no dropout, additional training data and longer training (see details). |
| | `albert-large-v2` | 24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters. ALBERT large model with no dropout, additional training data and longer training (see details). |
| | `albert-xlarge-v2` | 24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters. ALBERT xlarge model with no dropout, additional training data and longer training (see details). |
| | `albert-xxlarge-v2` | 12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters. ALBERT xxlarge model with no dropout, additional training data and longer training (see details). |
| T5 | `t5-small` | ~60M parameters with 6-layers, 512-hidden-state, 2048 feed-forward hidden-state, 8-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4). |
| | `t5-base` | ~220M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 12-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4). |
| | `t5-large` | ~770M parameters with 24-layers, 1024-hidden-state, 4096 feed-forward hidden-state, 16-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4). |
| | `t5-3b` | ~2.8B parameters with 24-layers, 1024-hidden-state, 16384 feed-forward hidden-state, 32-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4). |
| | `t5-11b` | ~11B parameters with 24-layers, 1024-hidden-state, 65536 feed-forward hidden-state, 128-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4). |
| XLM-RoBERTa | `xlm-roberta-base` | ~270M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 8-heads. Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages. |
| | `xlm-roberta-large` | ~550M parameters with 24-layers, 1024-hidden-state, 4096 feed-forward hidden-state, 16-heads. Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages. |
| FlauBERT | `flaubert/flaubert_small_cased` | 6-layer, 512-hidden, 8-heads, 54M parameters. FlauBERT small architecture (see details). |
| | `flaubert/flaubert_base_uncased` | 12-layer, 768-hidden, 12-heads, 137M parameters. FlauBERT base architecture with uncased vocabulary (see details). |
| | `flaubert/flaubert_base_cased` | 12-layer, 768-hidden, 12-heads, 138M parameters. FlauBERT base architecture with cased vocabulary (see details). |
| | `flaubert/flaubert_large_cased` | 24-layer, 1024-hidden, 16-heads, 373M parameters. FlauBERT large architecture (see details). |
| Bart | `facebook/bart-large` | 24-layer, 1024-hidden, 16-heads, 406M parameters (see details). |
| | `facebook/bart-base` | 12-layer, 768-hidden, 16-heads, 139M parameters. |
| | `facebook/bart-large-mnli` | Adds a 2-layer classification head with 1 million parameters. `bart-large` base architecture with a classification head, fine-tuned on MNLI. |
| | `facebook/bart-large-cnn` | 24-layer, 1024-hidden, 16-heads, 406M parameters (same as large). `bart-large` base architecture fine-tuned on the CNN summarization task. |
| DialoGPT | `microsoft/DialoGPT-small` | 12-layer, 768-hidden, 12-heads, 124M parameters. Trained on English text: 147M conversation-like exchanges extracted from Reddit. |
| | `microsoft/DialoGPT-medium` | 24-layer, 1024-hidden, 16-heads, 355M parameters. Trained on English text: 147M conversation-like exchanges extracted from Reddit. |
| | `microsoft/DialoGPT-large` | 36-layer, 1280-hidden, 20-heads, 774M parameters. Trained on English text: 147M conversation-like exchanges extracted from Reddit. |
| Reformer | `google/reformer-enwik8` | 12-layer, 1024-hidden, 8-heads, 149M parameters. Trained on English Wikipedia data (enwik8). |
| | `google/reformer-crime-and-punishment` | 6-layer, 256-hidden, 2-heads, 3M parameters. Trained on English text: the novel Crime and Punishment by Fyodor Dostoyevsky. |
| MarianMT | `Helsinki-NLP/opus-mt-{src}-{tgt}` | 12-layer, 512-hidden, 8-heads, ~74M parameters. Machine translation models; parameter counts vary depending on vocab size (see model list). |
| Pegasus | `google/pegasus-{dataset}` | 16-layer, 1024-hidden, 16-heads, ~568M parameters, 2.2 GB for summarization (see model list). |
| Longformer | `allenai/longformer-base-4096` | 12-layer, 768-hidden, 12-heads, ~149M parameters. Starting from the RoBERTa-base checkpoint, trained on documents of max length 4,096. |
| | `allenai/longformer-large-4096` | 24-layer, 1024-hidden, 16-heads, ~435M parameters. Starting from the RoBERTa-large checkpoint, trained on documents of max length 4,096. |
| MBart | `facebook/mbart-large-cc25` | 24-layer, 1024-hidden, 16-heads, 610M parameters. mBART (bart-large architecture) model trained on the monolingual corpora of 25 languages. |
| | `facebook/mbart-large-en-ro` | 24-layer, 1024-hidden, 16-heads, 610M parameters. `mbart-large-cc25` model fine-tuned on WMT English-Romanian translation. |
| Lxmert | `unc-nlp/lxmert-base-uncased` | 9 language layers, 9 relationship layers, and 12 cross-modality layers; 768-hidden, 12-heads (for each layer); ~228M parameters. Starting from the lxmert-base checkpoint, trained on over 9 million image-text couplets from COCO, VisualGenome, GQA, VQA. |
| Funnel Transformer | `funnel-transformer/small` | 14 layers: 3 blocks of 4 layers then 2 decoder layers; 768-hidden, 12-heads, 130M parameters (see details). |
| | `funnel-transformer/small-base` | 12 layers: 3 blocks of 4 layers (no decoder); 768-hidden, 12-heads, 115M parameters (see details). |
| | `funnel-transformer/medium` | 14 layers: 3 blocks of 6, 3x2 and 3x2 layers then 2 decoder layers; 768-hidden, 12-heads, 130M parameters (see details). |
| | `funnel-transformer/medium-base` | 12 layers: 3 blocks of 6, 3x2 and 3x2 layers (no decoder); 768-hidden, 12-heads, 115M parameters (see details). |
| | `funnel-transformer/intermediate` | 20 layers: 3 blocks of 6 layers then 2 decoder layers; 768-hidden, 12-heads, 177M parameters (see details). |
| | `funnel-transformer/intermediate-base` | 18 layers: 3 blocks of 6 layers (no decoder); 768-hidden, 12-heads, 161M parameters (see details). |
| | `funnel-transformer/large` | 26 layers: 3 blocks of 8 layers then 2 decoder layers; 1024-hidden, 12-heads, 386M parameters (see details). |
| | `funnel-transformer/large-base` | 24 layers: 3 blocks of 8 layers (no decoder); 1024-hidden, 12-heads, 358M parameters (see details). |
| | `funnel-transformer/xlarge` | 32 layers: 3 blocks of 10 layers then 2 decoder layers; 1024-hidden, 12-heads, 468M parameters (see details). |
| | `funnel-transformer/xlarge-base` | 30 layers: 3 blocks of 10 layers (no decoder); 1024-hidden, 12-heads, 440M parameters (see details). |
| LayoutLM | `microsoft/layoutlm-base-uncased` | 12-layer, 768-hidden, 12-heads, 113M parameters (see details). |
| | `microsoft/layoutlm-large-uncased` | 24-layer, 1024-hidden, 16-heads, 343M parameters (see details). |
| DeBERTa | `microsoft/deberta-base` | 12-layer, 768-hidden, 12-heads, ~125M parameters. DeBERTa using the BERT-base architecture (see details). |
| | `microsoft/deberta-large` | 24-layer, 1024-hidden, 16-heads, ~390M parameters. DeBERTa using the BERT-large architecture (see details). |
| SqueezeBERT | `squeezebert/squeezebert-uncased` | 12-layer, 768-hidden, 12-heads, 51M parameters; 4.3x faster than bert-base-uncased on a smartphone. SqueezeBERT architecture pretrained from scratch on the masked language model (MLM) and sentence order prediction (SOP) tasks. |
| | `squeezebert/squeezebert-mnli` | 12-layer, 768-hidden, 12-heads, 51M parameters; 4.3x faster than bert-base-uncased on a smartphone. The `squeezebert-uncased` model fine-tuned on the MNLI sentence pair classification task with distillation from electra-base. |
| | `squeezebert/squeezebert-mnli-headless` | 12-layer, 768-hidden, 12-heads, 51M parameters; 4.3x faster than bert-base-uncased on a smartphone. The `squeezebert-uncased` model fine-tuned on the MNLI sentence pair classification task with distillation from electra-base. The final classification layer is removed, so when you fine-tune, the final layer will be reinitialized. |
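The gap between ALBERT's ~11M-parameter base model and BERT-base's ~110M in the table above comes from two design choices listed in its row: the 12 layers are *repeating* (one set of weights shared across all layers) and the token embedding is factorized (128 embedding vs. 768 hidden). A rough sketch, assuming ALBERT's published 30,000-token SentencePiece vocabulary (not stated in the table):

```python
# Why "12 repeating layers, 128 embedding, 768-hidden" yields only ~11.7M
# parameters: all 12 layers share ONE set of weights, and embeddings are
# factorized (vocab -> 128 -> 768). Vocab/position sizes are assumptions
# taken from the published ALBERT config.
VOCAB, MAX_POS, TYPES = 30000, 512, 2
EMB, HIDDEN, FFN = 128, 768, 3072

embeddings = (VOCAB + MAX_POS + TYPES) * EMB + 2 * EMB  # incl. LayerNorm
projection = EMB * HIDDEN + HIDDEN                      # 128 -> 768 up-projection
shared_layer = (                                        # one layer, reused 12x
    4 * (HIDDEN * HIDDEN + HIDDEN)                      # Q, K, V, output
    + 2 * (HIDDEN * FFN) + FFN + HIDDEN                 # feed-forward
    + 2 * 2 * HIDDEN                                    # two LayerNorms
)
pooler = HIDDEN * HIDDEN + HIDDEN
total = embeddings + projection + shared_layer + pooler
print(f"{total:,}")  # 11,683,584 -- in line with the "11M" figure above
```

The same encoder-layer term appears only once here, whereas in an unshared 12-layer model it is multiplied by 12, which is essentially the entire difference between the ALBERT and BERT rows.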