Pretrained models

The table below is a partial list of the available pretrained models, with a short description of each.

For the full list, refer to https://huggingface.co/models.
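
Each value in the "Model id" column is the identifier passed to `from_pretrained`. A minimal sketch, assuming the `transformers` library (with a backend such as PyTorch) is installed and the model hub is reachable; the helper name `load_pretrained` is illustrative and not part of the library:

```python
def load_pretrained(model_id="bert-base-uncased"):
    """Download (or load from the local cache) the tokenizer and weights for `model_id`."""
    # Imported inside the function so this sketch can be imported
    # even where `transformers` is not installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model
```

Calling `load_pretrained("gpt2")` should return the tokenizer and base model for that id. Ids of the form `organization/name`, such as `EleutherAI/gpt-neo-1.3B`, are passed the same way.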

Architecture	Model id	Details of the model
BERT	`bert-base-uncased`	12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased English text.
	`bert-large-uncased`	24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased English text.
	`bert-base-cased`	12-layer, 768-hidden, 12-heads, 109M parameters. Trained on cased English text.
	`bert-large-cased`	24-layer, 1024-hidden, 16-heads, 335M parameters. Trained on cased English text.
	`bert-base-multilingual-uncased`	(Original, not recommended) 12-layer, 768-hidden, 12-heads, 168M parameters. Trained on lower-cased text in the top 102 languages with the largest Wikipedias (see details).
	`bert-base-multilingual-cased`	(New, recommended) 12-layer, 768-hidden, 12-heads, 179M parameters. Trained on cased text in the top 104 languages with the largest Wikipedias (see details).
	`bert-base-chinese`	12-layer, 768-hidden, 12-heads, 103M parameters. Trained on cased Chinese Simplified and Traditional text.
	`bert-base-german-cased`	12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased German text by Deepset.ai (see details on deepset.ai website).
	`bert-large-uncased-whole-word-masking`	24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased English text using Whole-Word-Masking (see details).
	`bert-large-cased-whole-word-masking`	24-layer, 1024-hidden, 16-heads, 335M parameters. Trained on cased English text using Whole-Word-Masking (see details).
	`bert-large-uncased-whole-word-masking-finetuned-squad`	24-layer, 1024-hidden, 16-heads, 336M parameters. The `bert-large-uncased-whole-word-masking` model fine-tuned on SQuAD (see details of fine-tuning in the example section).
	`bert-large-cased-whole-word-masking-finetuned-squad`	24-layer, 1024-hidden, 16-heads, 335M parameters. The `bert-large-cased-whole-word-masking` model fine-tuned on SQuAD (see details of fine-tuning in the example section).
	`bert-base-cased-finetuned-mrpc`	12-layer, 768-hidden, 12-heads, 110M parameters. The `bert-base-cased` model fine-tuned on MRPC (see details of fine-tuning in the example section).
	`bert-base-german-dbmdz-cased`	12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased German text by DBMDZ (see details on dbmdz repository).
	`bert-base-german-dbmdz-uncased`	12-layer, 768-hidden, 12-heads, 110M parameters. Trained on uncased German text by DBMDZ (see details on dbmdz repository).
	`cl-tohoku/bert-base-japanese`	12-layer, 768-hidden, 12-heads, 111M parameters. Trained on Japanese text. Text is tokenized with MeCab and WordPiece, which requires extra dependencies (fugashi, a wrapper around MeCab). Use `pip install transformers["ja"]` (or `pip install -e .["ja"]` if you install from source) to install them (see details on cl-tohoku repository).
	`cl-tohoku/bert-base-japanese-whole-word-masking`	12-layer, 768-hidden, 12-heads, 111M parameters. Trained on Japanese text. Text is tokenized with MeCab and WordPiece, which requires extra dependencies (fugashi, a wrapper around MeCab). Use `pip install transformers["ja"]` (or `pip install -e .["ja"]` if you install from source) to install them (see details on cl-tohoku repository).
	`cl-tohoku/bert-base-japanese-char`	12-layer, 768-hidden, 12-heads, 90M parameters. Trained on Japanese text. Text is tokenized into characters. (see details on cl-tohoku repository).
	`cl-tohoku/bert-base-japanese-char-whole-word-masking`	12-layer, 768-hidden, 12-heads, 90M parameters. Trained on Japanese text using Whole-Word-Masking. Text is tokenized into characters. (see details on cl-tohoku repository).
	`TurkuNLP/bert-base-finnish-cased-v1`	12-layer, 768-hidden, 12-heads, 125M parameters. Trained on cased Finnish text. (see details on turkunlp.org).
	`TurkuNLP/bert-base-finnish-uncased-v1`	12-layer, 768-hidden, 12-heads, 110M parameters. Trained on uncased Finnish text. (see details on turkunlp.org).
	`wietsedv/bert-base-dutch-cased`	12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased Dutch text. (see details on wietsedv repository).
GPT	`openai-gpt`	12-layer, 768-hidden, 12-heads, 110M parameters. OpenAI GPT English model
GPT-2	`gpt2`	12-layer, 768-hidden, 12-heads, 117M parameters. OpenAI GPT-2 English model
	`gpt2-medium`	24-layer, 1024-hidden, 16-heads, 345M parameters. OpenAI’s Medium-sized GPT-2 English model
	`gpt2-large`	36-layer, 1280-hidden, 20-heads, 774M parameters. OpenAI’s Large-sized GPT-2 English model
	`gpt2-xl`	48-layer, 1600-hidden, 25-heads, 1558M parameters. OpenAI’s XL-sized GPT-2 English model
GPTNeo	`EleutherAI/gpt-neo-1.3B`	24-layer, 2048-hidden, 16-heads, 1.3B parameters. EleutherAI’s GPT-3-like language model.
GPTNeo	`EleutherAI/gpt-neo-2.7B`	32-layer, 2560-hidden, 20-heads, 2.7B parameters. EleutherAI’s GPT-3-like language model.
Transformer-XL	`transfo-xl-wt103`	18-layer, 1024-hidden, 16-heads, 257M parameters. English model trained on wikitext-103
XLNet	`xlnet-base-cased`	12-layer, 768-hidden, 12-heads, 110M parameters. XLNet English model
XLNet	`xlnet-large-cased`	24-layer, 1024-hidden, 16-heads, 340M parameters. XLNet Large English model
XLM	`xlm-mlm-en-2048`	12-layer, 2048-hidden, 16-heads. XLM English model.
	`xlm-mlm-ende-1024`	6-layer, 1024-hidden, 8-heads. XLM English-German model trained on the concatenation of English and German Wikipedia.
	`xlm-mlm-enfr-1024`	6-layer, 1024-hidden, 8-heads. XLM English-French model trained on the concatenation of English and French Wikipedia.
	`xlm-mlm-enro-1024`	6-layer, 1024-hidden, 8-heads. XLM English-Romanian multi-language model.
	`xlm-mlm-xnli15-1024`	12-layer, 1024-hidden, 8-heads. XLM model pre-trained with MLM on the 15 XNLI languages.
	`xlm-mlm-tlm-xnli15-1024`	12-layer, 1024-hidden, 8-heads. XLM model pre-trained with MLM + TLM on the 15 XNLI languages.
	`xlm-clm-enfr-1024`	6-layer, 1024-hidden, 8-heads. XLM English-French model trained with CLM (Causal Language Modeling) on the concatenation of English and French Wikipedia.
	`xlm-clm-ende-1024`	6-layer, 1024-hidden, 8-heads. XLM English-German model trained with CLM (Causal Language Modeling) on the concatenation of English and German Wikipedia.
	`xlm-mlm-17-1280`	16-layer, 1280-hidden, 16-heads. XLM model trained with MLM (Masked Language Modeling) on 17 languages.
	`xlm-mlm-100-1280`	16-layer, 1280-hidden, 16-heads. XLM model trained with MLM (Masked Language Modeling) on 100 languages.
RoBERTa	`roberta-base`	12-layer, 768-hidden, 12-heads, 125M parameters. RoBERTa using the BERT-base architecture (see details).
	`roberta-large`	24-layer, 1024-hidden, 16-heads, 355M parameters. RoBERTa using the BERT-large architecture (see details).
	`roberta-large-mnli`	24-layer, 1024-hidden, 16-heads, 355M parameters. `roberta-large` fine-tuned on MNLI (see details).
	`distilroberta-base`	6-layer, 768-hidden, 12-heads, 82M parameters. The DistilRoBERTa model distilled from the RoBERTa model `roberta-base` checkpoint (see details).
	`roberta-base-openai-detector`	12-layer, 768-hidden, 12-heads, 125M parameters. `roberta-base` fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model (see details).
	`roberta-large-openai-detector`	24-layer, 1024-hidden, 16-heads, 355M parameters. `roberta-large` fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model (see details).
DistilBERT	`distilbert-base-uncased`	6-layer, 768-hidden, 12-heads, 66M parameters. The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint (see details).
	`distilbert-base-uncased-distilled-squad`	6-layer, 768-hidden, 12-heads, 66M parameters. The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint, with an additional linear layer (see details).
	`distilbert-base-cased`	6-layer, 768-hidden, 12-heads, 65M parameters. The DistilBERT model distilled from the BERT model `bert-base-cased` checkpoint (see details).
	`distilbert-base-cased-distilled-squad`	6-layer, 768-hidden, 12-heads, 65M parameters. The DistilBERT model distilled from the BERT model `bert-base-cased` checkpoint, with an additional question answering layer (see details).
	`distilgpt2`	6-layer, 768-hidden, 12-heads, 82M parameters. The DistilGPT2 model distilled from the GPT2 model `gpt2` checkpoint (see details).
	`distilbert-base-german-cased`	6-layer, 768-hidden, 12-heads, 66M parameters. The German DistilBERT model distilled from the German DBMDZ BERT model `bert-base-german-dbmdz-cased` checkpoint (see details).
	`distilbert-base-multilingual-cased`	6-layer, 768-hidden, 12-heads, 134M parameters. The multilingual DistilBERT model distilled from the Multilingual BERT model `bert-base-multilingual-cased` checkpoint (see details).
CTRL	`ctrl`	48-layer, 1280-hidden, 16-heads, 1.6B parameters. Salesforce’s Large-sized CTRL English model.
CamemBERT	`camembert-base`	12-layer, 768-hidden, 12-heads, 110M parameters. CamemBERT using the BERT-base architecture (see details).
ALBERT	`albert-base-v1`	12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters. ALBERT base model (see details).
	`albert-large-v1`	24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters. ALBERT large model (see details).
	`albert-xlarge-v1`	24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters. ALBERT xlarge model (see details).
	`albert-xxlarge-v1`	12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters. ALBERT xxlarge model (see details).
	`albert-base-v2`	12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters. ALBERT base model with no dropout, additional training data and longer training (see details).
	`albert-large-v2`	24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters. ALBERT large model with no dropout, additional training data and longer training (see details).
	`albert-xlarge-v2`	24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters. ALBERT xlarge model with no dropout, additional training data and longer training (see details).
	`albert-xxlarge-v2`	12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters. ALBERT xxlarge model with no dropout, additional training data and longer training (see details).
T5	`t5-small`	~60M parameters with 6-layers, 512-hidden-state, 2048 feed-forward hidden-state, 8-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4).
	`t5-base`	~220M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 12-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4).
	`t5-large`	~770M parameters with 24-layers, 1024-hidden-state, 4096 feed-forward hidden-state, 16-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4).
	`t5-3B`	~2.8B parameters with 24-layers, 1024-hidden-state, 16384 feed-forward hidden-state, 32-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4).
	`t5-11B`	~11B parameters with 24-layers, 1024-hidden-state, 65536 feed-forward hidden-state, 128-heads. Trained on English text: the Colossal Clean Crawled Corpus (C4).
XLM-RoBERTa	`xlm-roberta-base`	~270M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 8-heads. Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages.
XLM-RoBERTa	`xlm-roberta-large`	~550M parameters with 24-layers, 1024-hidden-state, 4096 feed-forward hidden-state, 16-heads. Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages.
FlauBERT	`flaubert/flaubert_small_cased`	6-layer, 512-hidden, 8-heads, 54M parameters. FlauBERT small architecture (see details).
	`flaubert/flaubert_base_uncased`	12-layer, 768-hidden, 12-heads, 137M parameters. FlauBERT base architecture with uncased vocabulary (see details).
	`flaubert/flaubert_base_cased`	12-layer, 768-hidden, 12-heads, 138M parameters. FlauBERT base architecture with cased vocabulary (see details).
	`flaubert/flaubert_large_cased`	24-layer, 1024-hidden, 16-heads, 373M parameters. FlauBERT large architecture (see details).
Bart	`facebook/bart-large`	24-layer, 1024-hidden, 16-heads, 406M parameters (see details)
	`facebook/bart-base`	12-layer, 768-hidden, 16-heads, 139M parameters
	`facebook/bart-large-mnli`	Adds a 2-layer classification head with 1 million parameters. bart-large base architecture with a classification head, fine-tuned on MNLI.
	`facebook/bart-large-cnn`	24-layer, 1024-hidden, 16-heads, 406M parameters (same as large). bart-large base architecture fine-tuned on the CNN summarization task.
BARThez	`moussaKam/barthez`	12-layer, 768-hidden, 12-heads, 216M parameters (see details)
BARThez	`moussaKam/mbarthez`	24-layer, 1024-hidden, 16-heads, 561M parameters
DialoGPT	`DialoGPT-small`	12-layer, 768-hidden, 12-heads, 124M parameters. Trained on English text: 147M conversation-like exchanges extracted from Reddit.
	`DialoGPT-medium`	24-layer, 1024-hidden, 16-heads, 355M parameters. Trained on English text: 147M conversation-like exchanges extracted from Reddit.
	`DialoGPT-large`	36-layer, 1280-hidden, 20-heads, 774M parameters. Trained on English text: 147M conversation-like exchanges extracted from Reddit.
Reformer	`reformer-enwik8`	12-layer, 1024-hidden, 8-heads, 149M parameters. Trained on English Wikipedia data (enwik8).
Reformer	`reformer-crime-and-punishment`	6-layer, 256-hidden, 2-heads, 3M parameters. Trained on English text: Crime and Punishment novel by Fyodor Dostoyevsky.
M2M100	`facebook/m2m100_418M`	24-layer, 1024-hidden, 16-heads, 418M parameters. Multilingual machine translation model for 100 languages.
M2M100	`facebook/m2m100_1.2B`	48-layer, 1024-hidden, 16-heads, 1.2B parameters. Multilingual machine translation model for 100 languages.
MarianMT	`Helsinki-NLP/opus-mt-{src}-{tgt}`	12-layer, 512-hidden, 8-heads, ~74M parameters. Machine translation models. Parameter counts vary depending on vocab size (see model list).
Pegasus	`google/pegasus-{dataset}`	16-layer, 1024-hidden, 16-heads, ~568M parameters, 2.2 GB for summary (see model list).
Longformer	`allenai/longformer-base-4096`	12-layer, 768-hidden, 12-heads, ~149M parameters. Starting from RoBERTa-base checkpoint, trained on documents of max length 4,096.
Longformer	`allenai/longformer-large-4096`	24-layer, 1024-hidden, 16-heads, ~435M parameters. Starting from RoBERTa-large checkpoint, trained on documents of max length 4,096.
MBart	`facebook/mbart-large-cc25`	24-layer, 1024-hidden, 16-heads, 610M parameters. mBART (bart-large architecture) model trained on 25 languages’ monolingual corpus.
	`facebook/mbart-large-en-ro`	24-layer, 1024-hidden, 16-heads, 610M parameters. mbart-large-cc25 model fine-tuned on WMT English-Romanian translation.
	`facebook/mbart-large-50`	24-layer, 1024-hidden, 16-heads. mBART model trained on 50 languages’ monolingual corpus.
	`facebook/mbart-large-50-one-to-many-mmt`	24-layer, 1024-hidden, 16-heads. mbart-large-50 model fine-tuned for one (English) to many multilingual machine translation covering 50 languages.
	`facebook/mbart-large-50-many-to-many-mmt`	24-layer, 1024-hidden, 16-heads. mbart-large-50 model fine-tuned for many to many multilingual machine translation covering 50 languages.
Lxmert	`lxmert-base-uncased`	9 language layers, 9 relationship layers, and 12 cross-modality layers, 768-hidden, 12-heads (for each layer), ~228M parameters. Starting from lxmert-base checkpoint, trained on over 9 million image-text couplets from COCO, VisualGenome, GQA, VQA.
Funnel Transformer	`funnel-transformer/small`	14 layers: 3 blocks of 4 layers then 2 layers decoder, 768-hidden, 12-heads, 130M parameters (see details)
	`funnel-transformer/small-base`	12 layers: 3 blocks of 4 layers (no decoder), 768-hidden, 12-heads, 115M parameters (see details)
	`funnel-transformer/medium`	14 layers: 3 blocks 6, 3x2, 3x2 layers then 2 layers decoder, 768-hidden, 12-heads, 130M parameters (see details)
	`funnel-transformer/medium-base`	12 layers: 3 blocks 6, 3x2, 3x2 layers (no decoder), 768-hidden, 12-heads, 115M parameters (see details)
	`funnel-transformer/intermediate`	20 layers: 3 blocks of 6 layers then 2 layers decoder, 768-hidden, 12-heads, 177M parameters (see details)
	`funnel-transformer/intermediate-base`	18 layers: 3 blocks of 6 layers (no decoder), 768-hidden, 12-heads, 161M parameters (see details)
	`funnel-transformer/large`	26 layers: 3 blocks of 8 layers then 2 layers decoder, 1024-hidden, 12-heads, 386M parameters (see details)
	`funnel-transformer/large-base`	24 layers: 3 blocks of 8 layers (no decoder), 1024-hidden, 12-heads, 358M parameters (see details)
	`funnel-transformer/xlarge`	32 layers: 3 blocks of 10 layers then 2 layers decoder, 1024-hidden, 12-heads, 468M parameters (see details)
	`funnel-transformer/xlarge-base`	30 layers: 3 blocks of 10 layers (no decoder), 1024-hidden, 12-heads, 440M parameters (see details)
LayoutLM	`microsoft/layoutlm-base-uncased`	12 layers, 768-hidden, 12-heads, 113M parameters (see details)
LayoutLM	`microsoft/layoutlm-large-uncased`	24 layers, 1024-hidden, 16-heads, 343M parameters (see details)
DeBERTa	`microsoft/deberta-base`	12-layer, 768-hidden, 12-heads, ~140M parameters. DeBERTa using the BERT-base architecture (see details).
	`microsoft/deberta-large`	24-layer, 1024-hidden, 16-heads, ~400M parameters. DeBERTa using the BERT-large architecture (see details).
	`microsoft/deberta-xlarge`	48-layer, 1024-hidden, 16-heads, ~750M parameters. DeBERTa XLarge with similar BERT architecture (see details).
	`microsoft/deberta-xlarge-v2`	24-layer, 1536-hidden, 24-heads, ~900M parameters. DeBERTa XLarge V2 with similar BERT architecture (see details).
	`microsoft/deberta-xxlarge-v2`	48-layer, 1536-hidden, 24-heads, ~1.5B parameters. DeBERTa XXLarge V2 with similar BERT architecture (see details).
SqueezeBERT	`squeezebert/squeezebert-uncased`	12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone. SqueezeBERT architecture pretrained from scratch on masked language model (MLM) and sentence order prediction (SOP) tasks.
	`squeezebert/squeezebert-mnli`	12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone. This is the squeezebert-uncased model finetuned on MNLI sentence pair classification task with distillation from electra-base.
	`squeezebert/squeezebert-mnli-headless`	12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone. This is the squeezebert-uncased model finetuned on MNLI sentence pair classification task with distillation from electra-base. The final classification layer is removed, so when you finetune, the final layer will be reinitialized.
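
The parameter counts in the table follow largely from the layer count and hidden size. As a rough illustration, the sketch below counts the weights of a BERT-style encoder. It assumes values that are not listed in the table: a 30,522-token WordPiece vocabulary, 512 positions, 2 segment types, a feed-forward size of four times the hidden size, and a pooler layer with no task-specific head. The function name `bert_encoder_params` is hypothetical.

```python
def bert_encoder_params(layers, hidden, vocab=30522, max_pos=512, type_vocab=2):
    """Count the weights of a BERT-style encoder (embeddings, layers, pooler)."""
    ffn = 4 * hidden          # assumed feed-forward (intermediate) size
    layer_norm = 2 * hidden   # scale and bias vectors
    # Word, position and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> ffn -> hidden), plus LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    # Dense layer applied to the first token's hidden state.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


for name, layers, hidden in [("base", 12, 768), ("large", 24, 1024)]:
    millions = bert_encoder_params(layers, hidden) / 1e6
    print(f"BERT-{name}: {millions:.1f}M parameters")
```

Under these assumptions the sketch gives about 109.5M parameters for the base shape and 335.1M for the large shape, close to the figures listed for the BERT checkpoints. The number of heads does not enter the count, because the heads split the hidden dimension rather than add to it.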