Pretrained models

Here is the full list of the currently provided pretrained models together with a short presentation of each model.

For a list that includes community-uploaded models, refer to https://huggingface.co/models.
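Each shortcut name in the list is the string passed to a `from_pretrained` call. The following is a minimal sketch, assuming the `transformers` package is installed and the weights can be downloaded. The helper name `load_pretrained` is hypothetical and introduced only for illustration.

```python
def load_pretrained(shortcut_name):
    """Return (tokenizer, model) for a shortcut name such as "bert-base-uncased".

    Hypothetical helper: it only wraps the library's Auto* classes.
    """
    # Imported lazily so that defining this helper does not require the library.
    from transformers import AutoModel, AutoTokenizer

    # Both calls fetch the files for the given name and cache them locally.
    tokenizer = AutoTokenizer.from_pretrained(shortcut_name)
    model = AutoModel.from_pretrained(shortcut_name)
    return tokenizer, model
```

Calling `load_pretrained("bert-base-uncased")` should return the tokenizer and the bare encoder for that checkpoint. Task-specific heads would need a different model class.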

Architecture

Shortcut name

Details of the model

BERT

bert-base-uncased

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on lower-cased English text.

bert-large-uncased

24-layer, 1024-hidden, 16-heads, 340M parameters.
Trained on lower-cased English text.

bert-base-cased

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased English text.

bert-large-cased

24-layer, 1024-hidden, 16-heads, 340M parameters.
Trained on cased English text.

bert-base-multilingual-uncased

(Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on lower-cased text in the top 102 languages with the largest Wikipedias

(see details).

bert-base-multilingual-cased

(New, recommended) 12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased text in the top 104 languages with the largest Wikipedias

(see details).

bert-base-chinese

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased Chinese Simplified and Traditional text.

bert-base-german-cased

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased German text by Deepset.ai

(see details on deepset.ai website).

bert-large-uncased-whole-word-masking

24-layer, 1024-hidden, 16-heads, 340M parameters.
Trained on lower-cased English text using Whole-Word-Masking

(see details).

bert-large-cased-whole-word-masking

24-layer, 1024-hidden, 16-heads, 340M parameters.
Trained on cased English text using Whole-Word-Masking

(see details).

bert-large-uncased-whole-word-masking-finetuned-squad

24-layer, 1024-hidden, 16-heads, 340M parameters.
The bert-large-uncased-whole-word-masking model fine-tuned on SQuAD

(see details of fine-tuning in the example section).

bert-large-cased-whole-word-masking-finetuned-squad

24-layer, 1024-hidden, 16-heads, 340M parameters
The bert-large-cased-whole-word-masking model fine-tuned on SQuAD

(see details of fine-tuning in the example section)

bert-base-cased-finetuned-mrpc

12-layer, 768-hidden, 12-heads, 110M parameters.
The bert-base-cased model fine-tuned on MRPC

(see details of fine-tuning in the example section)

bert-base-german-dbmdz-cased

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased German text by DBMDZ

(see details on dbmdz repository).

bert-base-german-dbmdz-uncased

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on uncased German text by DBMDZ

(see details on dbmdz repository).

bert-base-japanese

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on Japanese text. Text is tokenized with MeCab and WordPiece.
MeCab is required for tokenization.

(see details on cl-tohoku repository).

bert-base-japanese-whole-word-masking

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on Japanese text using Whole-Word-Masking. Text is tokenized with MeCab and WordPiece.
MeCab is required for tokenization.

(see details on cl-tohoku repository).

bert-base-japanese-char

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on Japanese text. Text is tokenized into characters.

(see details on cl-tohoku repository).

bert-base-japanese-char-whole-word-masking

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on Japanese text using Whole-Word-Masking. Text is tokenized into characters.

(see details on cl-tohoku repository).

bert-base-finnish-cased-v1

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased Finnish text.

(see details on turkunlp.org).

bert-base-finnish-uncased-v1

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on uncased Finnish text.

(see details on turkunlp.org).

bert-base-dutch-cased

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased Dutch text.

(see details on wietsedv repository).
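The parameter counts quoted for the BERT checkpoints can be roughly reproduced from the layer count and hidden size. The sketch below assumes a vocabulary of 30,522 tokens (the size used by the uncased English models), 512 positions, 2 segment types, a feed-forward width of 4 times the hidden size, and a pooler layer. None of these values appear in the list above, and models with other vocabularies will give different totals. The number of heads does not change the count.

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Count weights and biases of a BERT-style encoder with a pooler."""
    ffn = 4 * hidden  # assumption: feed-forward width is 4x the hidden size
    # Token, position and segment embeddings, plus one LayerNorm (scale + bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Query, key, value and output projections, each hidden x hidden with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers (hidden -> ffn -> hidden), each with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two LayerNorms per layer, each with a scale and a bias vector.
    layer_norms = 2 * (2 * hidden)
    per_layer = attention + feed_forward + layer_norms
    # Pooler: one dense layer applied to the first token.
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"12-layer, 768-hidden:  {base / 1e6:.1f}M")   # 109.5M
print(f"24-layer, 1024-hidden: {large / 1e6:.1f}M")  # 335.1M
```

Under these assumptions the totals land close to the 110M and 340M figures listed for the base and large models.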

GPT

openai-gpt

12-layer, 768-hidden, 12-heads, 110M parameters.
OpenAI GPT English model

GPT-2

gpt2

12-layer, 768-hidden, 12-heads, 117M parameters.
OpenAI GPT-2 English model

gpt2-medium

24-layer, 1024-hidden, 16-heads, 345M parameters.
OpenAI’s Medium-sized GPT-2 English model

gpt2-large

36-layer, 1280-hidden, 20-heads, 774M parameters.
OpenAI’s Large-sized GPT-2 English model

gpt2-xl

48-layer, 1600-hidden, 25-heads, 1558M parameters.
OpenAI’s XL-sized GPT-2 English model

Transformer-XL

transfo-xl-wt103

18-layer, 1024-hidden, 16-heads, 257M parameters.
English model trained on wikitext-103

XLNet

xlnet-base-cased

12-layer, 768-hidden, 12-heads, 110M parameters.
XLNet English model

xlnet-large-cased

24-layer, 1024-hidden, 16-heads, 340M parameters.
XLNet Large English model

XLM

xlm-mlm-en-2048

12-layer, 2048-hidden, 16-heads
XLM English model

xlm-mlm-ende-1024

6-layer, 1024-hidden, 8-heads
XLM English-German model trained on the concatenation of English and German Wikipedia

xlm-mlm-enfr-1024

6-layer, 1024-hidden, 8-heads
XLM English-French model trained on the concatenation of English and French Wikipedia

xlm-mlm-enro-1024

6-layer, 1024-hidden, 8-heads
XLM English-Romanian Multi-language model

xlm-mlm-xnli15-1024

12-layer, 1024-hidden, 8-heads
XLM Model pre-trained with MLM on the 15 XNLI languages.

xlm-mlm-tlm-xnli15-1024

12-layer, 1024-hidden, 8-heads
XLM Model pre-trained with MLM + TLM on the 15 XNLI languages.

xlm-clm-enfr-1024

6-layer, 1024-hidden, 8-heads
XLM English-French model trained with CLM (Causal Language Modeling) on the concatenation of English and French Wikipedia

xlm-clm-ende-1024

6-layer, 1024-hidden, 8-heads
XLM English-German model trained with CLM (Causal Language Modeling) on the concatenation of English and German Wikipedia

xlm-mlm-17-1280

16-layer, 1280-hidden, 16-heads
XLM model trained with MLM (Masked Language Modeling) on 17 languages.

xlm-mlm-100-1280

16-layer, 1280-hidden, 16-heads
XLM model trained with MLM (Masked Language Modeling) on 100 languages.

RoBERTa

roberta-base

12-layer, 768-hidden, 12-heads, 125M parameters
RoBERTa using the BERT-base architecture

(see details)

roberta-large

24-layer, 1024-hidden, 16-heads, 355M parameters
RoBERTa using the BERT-large architecture

(see details)

roberta-large-mnli

24-layer, 1024-hidden, 16-heads, 355M parameters
roberta-large fine-tuned on MNLI.

(see details)

distilroberta-base

6-layer, 768-hidden, 12-heads, 82M parameters
The DistilRoBERTa model distilled from the RoBERTa model roberta-base checkpoint.

(see details)

roberta-base-openai-detector

12-layer, 768-hidden, 12-heads, 125M parameters
roberta-base fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model.

(see details)

roberta-large-openai-detector

24-layer, 1024-hidden, 16-heads, 355M parameters
roberta-large fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model.

(see details)

DistilBERT

distilbert-base-uncased

6-layer, 768-hidden, 12-heads, 66M parameters
The DistilBERT model distilled from the BERT model bert-base-uncased checkpoint

(see details)

distilbert-base-uncased-distilled-squad

6-layer, 768-hidden, 12-heads, 66M parameters
The DistilBERT model distilled from the BERT model bert-base-uncased checkpoint, with an additional linear layer.

(see details)

distilbert-base-cased

6-layer, 768-hidden, 12-heads, 65M parameters
The DistilBERT model distilled from the BERT model bert-base-cased checkpoint

(see details)

distilbert-base-cased-distilled-squad

6-layer, 768-hidden, 12-heads, 65M parameters
The DistilBERT model distilled from the BERT model bert-base-cased checkpoint, with an additional question answering layer.

(see details)

distilgpt2

6-layer, 768-hidden, 12-heads, 82M parameters
The DistilGPT2 model distilled from the GPT2 model gpt2 checkpoint.

(see details)

distilbert-base-german-cased

6-layer, 768-hidden, 12-heads, 66M parameters
The German DistilBERT model distilled from the German DBMDZ BERT model bert-base-german-dbmdz-cased checkpoint.

(see details)

distilbert-base-multilingual-cased

6-layer, 768-hidden, 12-heads, 134M parameters
The multilingual DistilBERT model distilled from the Multilingual BERT model bert-base-multilingual-cased checkpoint.

(see details)

CTRL

ctrl

48-layer, 1280-hidden, 16-heads, 1.6B parameters
Salesforce’s Large-sized CTRL English model

CamemBERT

camembert-base

12-layer, 768-hidden, 12-heads, 110M parameters
CamemBERT using the BERT-base architecture

(see details)

ALBERT

albert-base-v1

12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters
ALBERT base model

(see details)

albert-large-v1

24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters
ALBERT large model

(see details)

albert-xlarge-v1

24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters
ALBERT xlarge model

(see details)

albert-xxlarge-v1

12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters
ALBERT xxlarge model

(see details)

albert-base-v2

12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters
ALBERT base model with no dropout, additional training data and longer training

(see details)

albert-large-v2

24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters
ALBERT large model with no dropout, additional training data and longer training

(see details)

albert-xlarge-v2

24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters
ALBERT xlarge model with no dropout, additional training data and longer training

(see details)

albert-xxlarge-v2

12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters
ALBERT xxlarge model with no dropout, additional training data and longer training

(see details)
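The small ALBERT parameter counts follow from two design choices reflected in the entries above: the "repeating layers" reuse one set of layer weights, and the 128-dimensional embedding is projected up to the hidden size. The sketch below is a rough count under assumptions not stated in the list: a 30,000-token vocabulary, 512 positions, 2 segment types, a feed-forward width of 4 times the hidden size, and a pooler layer. The `shared` flag is a hypothetical switch added only to show what the count would be without weight sharing.

```python
def albert_param_count(hidden, embedding=128, layers=12, shared=True,
                       vocab_size=30000, max_positions=512, type_vocab=2):
    """Rough parameter count for an ALBERT-style encoder."""
    ffn = 4 * hidden  # assumption: feed-forward width is 4x the hidden size
    # Embeddings live in the small `embedding` dimension, plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * embedding + 2 * embedding
    # Dense projection from the embedding size up to the hidden size.
    projection = embedding * hidden + hidden
    # One transformer layer: attention, feed-forward and two LayerNorms.
    attention = 4 * (hidden * hidden + hidden)
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * (2 * hidden)
    # With sharing, the same layer weights are reused at every depth.
    distinct_layers = 1 if shared else layers
    pooler = hidden * hidden + hidden
    return embeddings + projection + distinct_layers * per_layer + pooler


shared_base = albert_param_count(hidden=768)
unshared_base = albert_param_count(hidden=768, shared=False)
xxlarge = albert_param_count(hidden=4096)
print(f"base, shared layers:   {shared_base / 1e6:.1f}M")
print(f"base, unshared layers: {unshared_base / 1e6:.1f}M")
print(f"xxlarge, shared:       {xxlarge / 1e6:.1f}M")
```

Under these assumptions the shared base model comes out between 11M and 12M parameters and the xxlarge model near 223M, in line with the figures listed, while the unshared variant would be several times larger.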

T5

t5-small

~60M parameters with 6-layers, 512-hidden-state, 2048 feed-forward hidden-state, 8-heads.
Trained on English text: the Colossal Clean Crawled Corpus (C4)

t5-base

~220M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 12-heads.
Trained on English text: the Colossal Clean Crawled Corpus (C4)

t5-large

~770M parameters with 24-layers, 1024-hidden-state, 4096 feed-forward hidden-state, 16-heads.
Trained on English text: the Colossal Clean Crawled Corpus (C4)

t5-3B

~2.8B parameters with 24-layers, 1024-hidden-state, 16384 feed-forward hidden-state, 32-heads.
Trained on English text: the Colossal Clean Crawled Corpus (C4)

t5-11B

~11B parameters with 24-layers, 1024-hidden-state, 65536 feed-forward hidden-state, 128-heads.
Trained on English text: the Colossal Clean Crawled Corpus (C4)

XLM-RoBERTa

xlm-roberta-base

~270M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 12-heads.
Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages

xlm-roberta-large

~550M parameters with 24-layers, 1024-hidden-state, 4096 feed-forward hidden-state, 16-heads.
Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages

FlauBERT

flaubert-small-cased

6-layer, 512-hidden, 8-heads, 54M parameters
FlauBERT small architecture

(see details)

flaubert-base-uncased

12-layer, 768-hidden, 12-heads, 137M parameters
FlauBERT base architecture with uncased vocabulary

(see details)

flaubert-base-cased

12-layer, 768-hidden, 12-heads, 138M parameters
FlauBERT base architecture with cased vocabulary

(see details)

flaubert-large-cased

24-layer, 1024-hidden, 16-heads, 373M parameters
FlauBERT large architecture

(see details)

Bart

bart-large

12-layer, 1024-hidden, 16-heads, 406M parameters

(see details)

bart-large-mnli

Adds a 2 layer classification head with 1 million parameters
bart-large base architecture with a classification head, finetuned on MNLI

bart-large-cnn

12-layer, 1024-hidden, 16-heads, 406M parameters (same as bart-large)
bart-large base architecture fine-tuned on the CNN/DailyMail summarization task