Pretrained models

Here is a partial list of the available pretrained models, together with a short description of each model.

For the full list, refer to https://huggingface.co/models.
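
All of these checkpoints are loaded the same way: pass the model id to from_pretrained. Below is a minimal sketch using the Auto classes, with bert-base-uncased as an arbitrary example checkpoint; any id from this list (or the Hub) can be substituted.

    from transformers import AutoModel, AutoTokenizer

    # Any model id from the list below (or the Hugging Face Hub) works here.
    model_id = "bert-base-uncased"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)

    inputs = tokenizer("Hello, world!", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)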

The list below is organized by architecture; each entry gives the model id followed by the details of the model.

BERT

bert-base-uncased

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on lower-cased English text.

bert-large-uncased

24-layer, 1024-hidden, 16-heads, 336M parameters.
Trained on lower-cased English text.

bert-base-cased

12-layer, 768-hidden, 12-heads, 109M parameters.
Trained on cased English text.

bert-large-cased

24-layer, 1024-hidden, 16-heads, 335M parameters.
Trained on cased English text.

bert-base-multilingual-uncased

(Original, not recommended) 12-layer, 768-hidden, 12-heads, 168M parameters.
Trained on lower-cased text in the top 102 languages with the largest Wikipedias

(see details).

bert-base-multilingual-cased

(New, recommended) 12-layer, 768-hidden, 12-heads, 179M parameters.
Trained on cased text in the top 104 languages with the largest Wikipedias

(see details).

bert-base-chinese

12-layer, 768-hidden, 12-heads, 103M parameters.
Trained on cased Simplified and Traditional Chinese text.

bert-base-german-cased

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased German text by Deepset.ai

(see details on deepset.ai website).

bert-large-uncased-whole-word-masking

24-layer, 1024-hidden, 16-heads, 336M parameters.
Trained on lower-cased English text using Whole-Word-Masking

(see details).

bert-large-cased-whole-word-masking

24-layer, 1024-hidden, 16-heads, 335M parameters.
Trained on cased English text using Whole-Word-Masking

(see details).

bert-large-uncased-whole-word-masking-finetuned-squad

24-layer, 1024-hidden, 16-heads, 336M parameters.
The bert-large-uncased-whole-word-masking model fine-tuned on SQuAD

(see details of fine-tuning in the example section).
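
As a rough sketch, a SQuAD-finetuned checkpoint like this can be dropped into the question-answering pipeline; the question and context below are placeholders for illustration only.

    from transformers import pipeline

    qa = pipeline(
        "question-answering",
        model="bert-large-uncased-whole-word-masking-finetuned-squad",
    )
    # Placeholder question/context, purely for illustration.
    result = qa(
        question="Who wrote the novel?",
        context="Crime and Punishment is a novel by Fyodor Dostoyevsky.",
    )
    print(result["answer"], result["score"])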

bert-large-cased-whole-word-masking-finetuned-squad

24-layer, 1024-hidden, 16-heads, 335M parameters.
The bert-large-cased-whole-word-masking model fine-tuned on SQuAD

(see details of fine-tuning in the example section).

bert-base-cased-finetuned-mrpc

12-layer, 768-hidden, 12-heads, 110M parameters.
The bert-base-cased model fine-tuned on MRPC

(see details of fine-tuning in the example section).

bert-base-german-dbmdz-cased

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased German text by DBMDZ

(see details on dbmdz repository).

bert-base-german-dbmdz-uncased

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on uncased German text by DBMDZ

(see details on dbmdz repository).

cl-tohoku/bert-base-japanese

12-layer, 768-hidden, 12-heads, 111M parameters.
Trained on Japanese text. Text is tokenized with MeCab and WordPiece; this requires some extra dependencies,
notably fugashi, which is a wrapper around MeCab.
Use pip install transformers["ja"] (or pip install -e .["ja"] if you install from source) to install them.

(see details on cl-tohoku repository).
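
A minimal loading sketch for this checkpoint, assuming the extra Japanese dependencies mentioned above are already installed:

    # Requires the "ja" extra (fugashi, a wrapper around MeCab) to be installed first.
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
    model = AutoModelForMaskedLM.from_pretrained("cl-tohoku/bert-base-japanese")

    inputs = tokenizer("こんにちは、世界。", return_tensors="pt")  # "Hello, world."
    outputs = model(**inputs)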

cl-tohoku/bert-base-japanese-whole-word-masking

12-layer, 768-hidden, 12-heads, 111M parameters.
Trained on Japanese text. Text is tokenized with MeCab and WordPiece; this requires some extra dependencies,
notably fugashi, which is a wrapper around MeCab.
Use pip install transformers["ja"] (or pip install -e .["ja"] if you install from source) to install them.

(see details on cl-tohoku repository).

cl-tohoku/bert-base-japanese-char

12-layer, 768-hidden, 12-heads, 90M parameters.
Trained on Japanese text. Text is tokenized into characters.

(see details on cl-tohoku repository).

cl-tohoku/bert-base-japanese-char-whole-word-masking

12-layer, 768-hidden, 12-heads, 90M parameters.
Trained on Japanese text using Whole-Word-Masking. Text is tokenized into characters.

(see details on cl-tohoku repository).

TurkuNLP/bert-base-finnish-cased-v1

12-layer, 768-hidden, 12-heads, 125M parameters.
Trained on cased Finnish text.

(see details on turkunlp.org).

TurkuNLP/bert-base-finnish-uncased-v1

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on uncased Finnish text.

(see details on turkunlp.org).

wietsedv/bert-base-dutch-cased

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased Dutch text.

(see details on wietsedv repository).

GPT

openai-gpt

12-layer, 768-hidden, 12-heads, 110M parameters.
OpenAI GPT English model

GPT-2

gpt2

12-layer, 768-hidden, 12-heads, 117M parameters.
OpenAI GPT-2 English model

gpt2-medium

24-layer, 1024-hidden, 16-heads, 345M parameters.
OpenAI’s Medium-sized GPT-2 English model

gpt2-large

36-layer, 1280-hidden, 20-heads, 774M parameters.
OpenAI’s Large-sized GPT-2 English model

gpt2-xl

48-layer, 1600-hidden, 25-heads, 1558M parameters.
OpenAI’s XL-sized GPT-2 English model

GPTNeo

EleutherAI/gpt-neo-1.3B

24-layer, 2048-hidden, 16-heads, 1.3B parameters.
EleutherAI’s GPT-3 like language model.

EleutherAI/gpt-neo-2.7B

32-layer, 2560-hidden, 20-heads, 2.7B parameters.
EleutherAI’s GPT-3 like language model.

Transformer-XL

transfo-xl-wt103

18-layer, 1024-hidden, 16-heads, 257M parameters.
English model trained on wikitext-103

XLNet

xlnet-base-cased

12-layer, 768-hidden, 12-heads, 110M parameters.
XLNet English model

xlnet-large-cased

24-layer, 1024-hidden, 16-heads, 340M parameters.
XLNet Large English model

XLM

xlm-mlm-en-2048

12-layer, 2048-hidden, 16-heads
XLM English model

xlm-mlm-ende-1024

6-layer, 1024-hidden, 8-heads
XLM English-German model trained on the concatenation of English and German Wikipedia

xlm-mlm-enfr-1024

6-layer, 1024-hidden, 8-heads
XLM English-French model trained on the concatenation of English and French Wikipedia

xlm-mlm-enro-1024

6-layer, 1024-hidden, 8-heads
XLM English-Romanian multi-language model

xlm-mlm-xnli15-1024

12-layer, 1024-hidden, 8-heads
XLM Model pre-trained with MLM on the 15 XNLI languages.

xlm-mlm-tlm-xnli15-1024

12-layer, 1024-hidden, 8-heads
XLM Model pre-trained with MLM + TLM on the 15 XNLI languages.

xlm-clm-enfr-1024

6-layer, 1024-hidden, 8-heads
XLM English-French model trained with CLM (Causal Language Modeling) on the concatenation of English and French Wikipedia

xlm-clm-ende-1024

6-layer, 1024-hidden, 8-heads
XLM English-German model trained with CLM (Causal Language Modeling) on the concatenation of English and German Wikipedia

xlm-mlm-17-1280

16-layer, 1280-hidden, 16-heads
XLM model trained with MLM (Masked Language Modeling) on 17 languages.

xlm-mlm-100-1280

16-layer, 1280-hidden, 16-heads
XLM model trained with MLM (Masked Language Modeling) on 100 languages.

RoBERTa

roberta-base

12-layer, 768-hidden, 12-heads, 125M parameters
RoBERTa using the BERT-base architecture

(see details)

roberta-large

24-layer, 1024-hidden, 16-heads, 355M parameters
RoBERTa using the BERT-large architecture

(see details)

roberta-large-mnli

24-layer, 1024-hidden, 16-heads, 355M parameters
roberta-large fine-tuned on MNLI.

(see details)

distilroberta-base

6-layer, 768-hidden, 12-heads, 82M parameters
The DistilRoBERTa model distilled from the RoBERTa model roberta-base checkpoint.

(see details)

roberta-base-openai-detector

12-layer, 768-hidden, 12-heads, 125M parameters
roberta-base fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model.

(see details)

roberta-large-openai-detector

24-layer, 1024-hidden, 16-heads, 355M parameters
roberta-large fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model.

(see details)

DistilBERT

distilbert-base-uncased

6-layer, 768-hidden, 12-heads, 66M parameters
The DistilBERT model distilled from the BERT model bert-base-uncased checkpoint

(see details)

distilbert-base-uncased-distilled-squad

6-layer, 768-hidden, 12-heads, 66M parameters
The DistilBERT model distilled from the BERT model bert-base-uncased checkpoint, with an additional linear layer.

(see details)

distilbert-base-cased

6-layer, 768-hidden, 12-heads, 65M parameters
The DistilBERT model distilled from the BERT model bert-base-cased checkpoint

(see details)

distilbert-base-cased-distilled-squad

6-layer, 768-hidden, 12-heads, 65M parameters
The DistilBERT model distilled from the BERT model bert-base-cased checkpoint, with an additional question answering layer.

(see details)

distilgpt2

6-layer, 768-hidden, 12-heads, 82M parameters
The DistilGPT2 model distilled from the GPT2 model gpt2 checkpoint.

(see details)

distilbert-base-german-cased

6-layer, 768-hidden, 12-heads, 66M parameters
The German DistilBERT model distilled from the German DBMDZ BERT model bert-base-german-dbmdz-cased checkpoint.

(see details)

distilbert-base-multilingual-cased

6-layer, 768-hidden, 12-heads, 134M parameters
The multilingual DistilBERT model distilled from the Multilingual BERT model bert-base-multilingual-cased checkpoint.

(see details)

CTRL

ctrl

48-layer, 1280-hidden, 16-heads, 1.6B parameters
Salesforce’s Large-sized CTRL English model

CamemBERT

camembert-base

12-layer, 768-hidden, 12-heads, 110M parameters
CamemBERT using the BERT-base architecture

(see details)

ALBERT

albert-base-v1

12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters
ALBERT base model

(see details)

albert-large-v1

24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters
ALBERT large model

(see details)

albert-xlarge-v1

24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters
ALBERT xlarge model

(see details)

albert-xxlarge-v1

12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters
ALBERT xxlarge model

(see details)

albert-base-v2

12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters
ALBERT base model with no dropout, additional training data and longer training

(see details)

albert-large-v2

24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters
ALBERT large model with no dropout, additional training data and longer training

(see details)

albert-xlarge-v2

24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters
ALBERT xlarge model with no dropout, additional training data and longer training

(see details)

albert-xxlarge-v2

12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters
ALBERT xxlarge model with no dropout, additional training data and longer training

(see details)

T5

t5-small

~60M parameters with 6-layers, 512-hidden-state, 2048 feed-forward hidden-state, 8-heads.
Trained on English text: the Colossal Clean Crawled Corpus (C4)

t5-base

~220M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 12-heads.
Trained on English text: the Colossal Clean Crawled Corpus (C4)

t5-large

~770M parameters with 24-layers, 1024-hidden-state, 4096 feed-forward hidden-state, 16-heads.
Trained on English text: the Colossal Clean Crawled Corpus (C4)

t5-3B

~2.8B parameters with 24-layers, 1024-hidden-state, 16384 feed-forward hidden-state, 32-heads.
Trained on English text: the Colossal Clean Crawled Corpus (C4)

t5-11B

~11B parameters with 24-layers, 1024-hidden-state, 65536 feed-forward hidden-state, 128-heads.
Trained on English text: the Colossal Clean Crawled Corpus (C4)

XLM-RoBERTa

xlm-roberta-base

~270M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 12-heads.
Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages

xlm-roberta-large

~550M parameters with 24-layers, 1024-hidden-state, 4096 feed-forward hidden-state, 16-heads.
Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages

FlauBERT

flaubert/flaubert_small_cased

6-layer, 512-hidden, 8-heads, 54M parameters
FlauBERT small architecture

(see details)

flaubert/flaubert_base_uncased

12-layer, 768-hidden, 12-heads, 137M parameters
FlauBERT base architecture with uncased vocabulary

(see details)

flaubert/flaubert_base_cased

12-layer, 768-hidden, 12-heads, 138M parameters
FlauBERT base architecture with cased vocabulary

(see details)

flaubert/flaubert_large_cased

24-layer, 1024-hidden, 16-heads, 373M parameters
FlauBERT large architecture

(see details)

Bart

facebook/bart-large

24-layer, 1024-hidden, 16-heads, 406M parameters

(see details)

facebook/bart-base

12-layer, 768-hidden, 12-heads, 139M parameters

facebook/bart-large-mnli

Adds a 2-layer classification head with 1 million parameters.
bart-large architecture with a classification head, finetuned on MNLI

facebook/bart-large-cnn

24-layer, 1024-hidden, 16-heads, 406M parameters (same as large)
bart-large architecture finetuned on the CNN/Daily Mail summarization task

BARThez

moussaKam/barthez

12-layer, 768-hidden, 12-heads, 216M parameters

(see details)

moussaKam/mbarthez

24-layer, 1024-hidden, 16-heads, 561M parameters

DialoGPT

DialoGPT-small

12-layer, 768-hidden, 12-heads, 124M parameters
Trained on English text: 147M conversation-like exchanges extracted from Reddit.

DialoGPT-medium

24-layer, 1024-hidden, 16-heads, 355M parameters
Trained on English text: 147M conversation-like exchanges extracted from Reddit.

DialoGPT-large

36-layer, 1280-hidden, 20-heads, 774M parameters
Trained on English text: 147M conversation-like exchanges extracted from Reddit.

Reformer

reformer-enwik8

12-layer, 1024-hidden, 8-heads, 149M parameters
Trained on English Wikipedia data - enwik8.

reformer-crime-and-punishment

6-layer, 256-hidden, 2-heads, 3M parameters
Trained on English text: Crime and Punishment novel by Fyodor Dostoyevsky.

M2M100

facebook/m2m100_418M

24-layer, 1024-hidden, 16-heads, 418M parameters
Multilingual machine translation model for 100 languages

facebook/m2m100_1.2B

48-layer, 1024-hidden, 16-heads, 1.2B parameters
Multilingual machine translation model for 100 languages
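
A short translation sketch with the M2M100 classes; the English-to-French direction and the input sentence are illustrative choices only.

    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
    tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

    tokenizer.src_lang = "en"  # source language code
    encoded = tokenizer("Hello, how are you?", return_tensors="pt")
    # Force the decoder to start generating in the target language.
    generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("fr"))
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))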

MarianMT

Helsinki-NLP/opus-mt-{src}-{tgt}

12-layer, 512-hidden, 8-heads, ~74M parameters. Machine translation models; parameter counts vary depending on vocab size.
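
As a sketch of how the {src}-{tgt} template is filled in, Helsinki-NLP/opus-mt-en-de (English to German) is used below as one concrete example; other language-pair checkpoints follow the same pattern.

    from transformers import MarianMTModel, MarianTokenizer

    # One concrete instantiation of Helsinki-NLP/opus-mt-{src}-{tgt}.
    model_name = "Helsinki-NLP/opus-mt-en-de"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    batch = tokenizer(["Hello, how are you?"], return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))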

Pegasus

google/pegasus-{dataset}

16-layer, 1024-hidden, 16-heads, ~568M parameters, 2.2 GB for summarization. See the model list for the available {dataset} checkpoints.
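
A rough usage sketch; google/pegasus-xsum is used as one concrete value of {dataset}, and the input text is just a placeholder.

    from transformers import pipeline

    # google/pegasus-xsum is one concrete instantiation of google/pegasus-{dataset}.
    summarizer = pipeline("summarization", model="google/pegasus-xsum")

    article = (
        "PEGASUS is a sequence-to-sequence model pretrained with a gap-sentence "
        "generation objective and then finetuned on downstream summarization datasets."
    )
    print(summarizer(article, max_length=40, min_length=5))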

Longformer

allenai/longformer-base-4096

12-layer, 768-hidden, 12-heads, ~149M parameters
Starting from RoBERTa-base checkpoint, trained on documents of max length 4,096

allenai/longformer-large-4096

24-layer, 1024-hidden, 16-heads, ~435M parameters
Starting from RoBERTa-large checkpoint, trained on documents of max length 4,096

MBart

facebook/mbart-large-cc25

24-layer, 1024-hidden, 16-heads, 610M parameters
mBART (bart-large architecture) model trained on 25 languages’ monolingual corpus

facebook/mbart-large-en-ro

24-layer, 1024-hidden, 16-heads, 610M parameters
mbart-large-cc25 model finetuned on WMT English-Romanian translation.

facebook/mbart-large-50

24-layer, 1024-hidden, 16-heads,
mBART model trained on 50 languages’ monolingual corpus.

facebook/mbart-large-50-one-to-many-mmt

24-layer, 1024-hidden, 16-heads,
mbart-large-50 model finetuned for one-to-many multilingual machine translation (from English) covering 50 languages.

facebook/mbart-large-50-many-to-many-mmt

24-layer, 1024-hidden, 16-heads,
mbart-large-50 model finetuned for many-to-many multilingual machine translation covering 50 languages.
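
A minimal many-to-many translation sketch with the MBart-50 classes; English-to-French is an illustrative direction only.

    from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

    model_name = "facebook/mbart-large-50-many-to-many-mmt"
    model = MBartForConditionalGeneration.from_pretrained(model_name)
    tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")

    encoded = tokenizer("Hello, how are you?", return_tensors="pt")
    # Force the first generated token to be the target language code.
    generated = model.generate(
        **encoded, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"]
    )
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))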

Lxmert

lxmert-base-uncased

9 language layers, 9 relationship layers, and 12 cross-modality layers
768-hidden, 12-heads (for each layer), ~228M parameters
Starting from the lxmert-base checkpoint, trained on over 9 million image-text pairs from COCO, VisualGenome, GQA, VQA

Funnel Transformer

funnel-transformer/small

14 layers: 3 blocks of 4 layers, then 2 decoder layers, 768-hidden, 12-heads, 130M parameters

(see details)

funnel-transformer/small-base

12 layers: 3 blocks of 4 layers (no decoder), 768-hidden, 12-heads, 115M parameters

(see details)

funnel-transformer/medium

14 layers: 3 blocks of 6, 3x2, 3x2 layers, then 2 decoder layers, 768-hidden, 12-heads, 130M parameters

(see details)

funnel-transformer/medium-base

12 layers: 3 blocks of 6, 3x2, 3x2 layers (no decoder), 768-hidden, 12-heads, 115M parameters

(see details)

funnel-transformer/intermediate

20 layers: 3 blocks of 6 layers, then 2 decoder layers, 768-hidden, 12-heads, 177M parameters

(see details)

funnel-transformer/intermediate-base

18 layers: 3 blocks of 6 layers (no decoder), 768-hidden, 12-heads, 161M parameters

(see details)

funnel-transformer/large

26 layers: 3 blocks of 8 layers, then 2 decoder layers, 1024-hidden, 12-heads, 386M parameters

(see details)

funnel-transformer/large-base

24 layers: 3 blocks of 8 layers (no decoder), 1024-hidden, 12-heads, 358M parameters

(see details)

funnel-transformer/xlarge

32 layers: 3 blocks of 10 layers, then 2 decoder layers, 1024-hidden, 12-heads, 468M parameters

(see details)

funnel-transformer/xlarge-base

30 layers: 3 blocks of 10 layers (no decoder), 1024-hidden, 12-heads, 440M parameters

(see details)

LayoutLM

microsoft/layoutlm-base-uncased

12 layers, 768-hidden, 12-heads, 113M parameters

(see details)

microsoft/layoutlm-large-uncased

24 layers, 1024-hidden, 16-heads, 343M parameters

(see details)

DeBERTa

microsoft/deberta-base

12-layer, 768-hidden, 12-heads, ~140M parameters
DeBERTa using the BERT-base architecture

(see details)

microsoft/deberta-large

24-layer, 1024-hidden, 16-heads, ~400M parameters
DeBERTa using the BERT-large architecture

(see details)

microsoft/deberta-xlarge

48-layer, 1024-hidden, 16-heads, ~750M parameters
DeBERTa XLarge with a BERT-like architecture

(see details)

microsoft/deberta-xlarge-v2

24-layer, 1536-hidden, 24-heads, ~900M parameters
DeBERTa XLarge V2 with a BERT-like architecture

(see details)

microsoft/deberta-xxlarge-v2

48-layer, 1536-hidden, 24-heads, ~1.5B parameters
DeBERTa XXLarge V2 with a BERT-like architecture

(see details)

SqueezeBERT

squeezebert/squeezebert-uncased

12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone.
SqueezeBERT architecture pretrained from scratch on masked language model (MLM) and sentence order prediction (SOP) tasks.

squeezebert/squeezebert-mnli

12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone.
This is the squeezebert-uncased model finetuned on the MNLI sentence pair classification task with distillation from electra-base.

squeezebert/squeezebert-mnli-headless

12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone.
This is the squeezebert-uncased model finetuned on the MNLI sentence pair classification task with distillation from electra-base.
The final classification layer is removed, so when you finetune, the final layer will be reinitialized.