Pretrained models

Here is the full list of the currently provided pretrained models, together with a short description of each.

| Architecture | Shortcut name | Details of the model |
| --- | --- | --- |
| BERT | `bert-base-uncased` | 12-layer, 768-hidden, 12-heads, 110M parameters.<br>Trained on lower-cased English text. |
| | `bert-large-uncased` | 24-layer, 1024-hidden, 16-heads, 340M parameters.<br>Trained on lower-cased English text. |
| | `bert-base-cased` | 12-layer, 768-hidden, 12-heads, 110M parameters.<br>Trained on cased English text. |
| | `bert-large-cased` | 24-layer, 1024-hidden, 16-heads, 340M parameters.<br>Trained on cased English text. |
| | `bert-base-multilingual-uncased` | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters.<br>Trained on lower-cased text in the top 102 languages with the largest Wikipedias (see details). |
| | `bert-base-multilingual-cased` | (New, recommended) 12-layer, 768-hidden, 12-heads, 110M parameters.<br>Trained on cased text in the top 104 languages with the largest Wikipedias (see details). |
| | `bert-base-chinese` | 12-layer, 768-hidden, 12-heads, 110M parameters.<br>Trained on cased Simplified and Traditional Chinese text. |
| | `bert-base-german-cased` | 12-layer, 768-hidden, 12-heads, 110M parameters.<br>Trained on cased German text by Deepset.ai (see details on the deepset.ai website). |
| | `bert-large-uncased-whole-word-masking` | 24-layer, 1024-hidden, 16-heads, 340M parameters.<br>Trained on lower-cased English text using Whole-Word-Masking (see details). |
| | `bert-large-cased-whole-word-masking` | 24-layer, 1024-hidden, 16-heads, 340M parameters.<br>Trained on cased English text using Whole-Word-Masking (see details). |
| | `bert-large-uncased-whole-word-masking-finetuned-squad` | 24-layer, 1024-hidden, 16-heads, 340M parameters.<br>The `bert-large-uncased-whole-word-masking` model fine-tuned on SQuAD (see details of fine-tuning in the example section). |
| | `bert-large-cased-whole-word-masking-finetuned-squad` | 24-layer, 1024-hidden, 16-heads, 340M parameters.<br>The `bert-large-cased-whole-word-masking` model fine-tuned on SQuAD (see details of fine-tuning in the example section). |
| | `bert-base-cased-finetuned-mrpc` | 12-layer, 768-hidden, 12-heads, 110M parameters.<br>The `bert-base-cased` model fine-tuned on MRPC (see details of fine-tuning in the example section). |
| | `bert-base-german-dbmdz-cased` | 12-layer, 768-hidden, 12-heads, 110M parameters.<br>Trained on cased German text by DBMDZ (see details on the dbmdz repository). |
| | `bert-base-german-dbmdz-uncased` | 12-layer, 768-hidden, 12-heads, 110M parameters.<br>Trained on uncased German text by DBMDZ (see details on the dbmdz repository). |
| GPT | `openai-gpt` | 12-layer, 768-hidden, 12-heads, 110M parameters.<br>OpenAI GPT English model. |
| GPT-2 | `gpt2` | 12-layer, 768-hidden, 12-heads, 117M parameters.<br>OpenAI GPT-2 English model. |
| | `gpt2-medium` | 24-layer, 1024-hidden, 16-heads, 345M parameters.<br>OpenAI's Medium-sized GPT-2 English model. |
| | `gpt2-large` | 36-layer, 1280-hidden, 20-heads, 774M parameters.<br>OpenAI's Large-sized GPT-2 English model. |
| | `gpt2-xl` | 48-layer, 1600-hidden, 25-heads, 1558M parameters.<br>OpenAI's XL-sized GPT-2 English model. |
| Transformer-XL | `transfo-xl-wt103` | 18-layer, 1024-hidden, 16-heads, 257M parameters.<br>English model trained on wikitext-103. |
| XLNet | `xlnet-base-cased` | 12-layer, 768-hidden, 12-heads, 110M parameters.<br>XLNet English model. |
| | `xlnet-large-cased` | 24-layer, 1024-hidden, 16-heads, 340M parameters.<br>XLNet large English model. |
| XLM | `xlm-mlm-en-2048` | 12-layer, 2048-hidden, 16-heads.<br>XLM English model. |
| | `xlm-mlm-ende-1024` | 6-layer, 1024-hidden, 8-heads.<br>XLM English-German model trained on the concatenation of English and German Wikipedia. |
| | `xlm-mlm-enfr-1024` | 6-layer, 1024-hidden, 8-heads.<br>XLM English-French model trained on the concatenation of English and French Wikipedia. |
| | `xlm-mlm-enro-1024` | 6-layer, 1024-hidden, 8-heads.<br>XLM English-Romanian multi-language model. |
| | `xlm-mlm-xnli15-1024` | 12-layer, 1024-hidden, 8-heads.<br>XLM model pre-trained with MLM (Masked Language Modeling) on the 15 XNLI languages. |
| | `xlm-mlm-tlm-xnli15-1024` | 12-layer, 1024-hidden, 8-heads.<br>XLM model pre-trained with MLM + TLM (Translation Language Modeling) on the 15 XNLI languages. |
| | `xlm-clm-enfr-1024` | 6-layer, 1024-hidden, 8-heads.<br>XLM English-French model trained with CLM (Causal Language Modeling) on the concatenation of English and French Wikipedia. |
| | `xlm-clm-ende-1024` | 6-layer, 1024-hidden, 8-heads.<br>XLM English-German model trained with CLM (Causal Language Modeling) on the concatenation of English and German Wikipedia. |
| | `xlm-mlm-17-1280` | 16-layer, 1280-hidden, 16-heads.<br>XLM model trained with MLM (Masked Language Modeling) on 17 languages. |
| | `xlm-mlm-100-1280` | 16-layer, 1280-hidden, 16-heads.<br>XLM model trained with MLM (Masked Language Modeling) on 100 languages. |
| RoBERTa | `roberta-base` | 12-layer, 768-hidden, 12-heads, 125M parameters.<br>RoBERTa using the BERT-base architecture (see details). |
| | `roberta-large` | 24-layer, 1024-hidden, 16-heads, 355M parameters.<br>RoBERTa using the BERT-large architecture (see details). |
| | `roberta-large-mnli` | 24-layer, 1024-hidden, 16-heads, 355M parameters.<br>`roberta-large` fine-tuned on MNLI (see details). |
| | `roberta-base-openai-detector` | 12-layer, 768-hidden, 12-heads, 125M parameters.<br>`roberta-base` fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model (see details). |
| | `roberta-large-openai-detector` | 24-layer, 1024-hidden, 16-heads, 355M parameters.<br>`roberta-large` fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model (see details). |
| DistilBERT | `distilbert-base-uncased` | 6-layer, 768-hidden, 12-heads, 66M parameters.<br>The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint (see details). |
| | `distilbert-base-uncased-distilled-squad` | 6-layer, 768-hidden, 12-heads, 66M parameters.<br>The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint, with an additional linear layer (see details). |
| | `distilgpt2` | 6-layer, 768-hidden, 12-heads, 82M parameters.<br>The DistilGPT2 model distilled from the GPT-2 model `gpt2` checkpoint (see details). |
| | `distilroberta-base` | 6-layer, 768-hidden, 12-heads, 82M parameters.<br>The DistilRoBERTa model distilled from the RoBERTa model `roberta-base` checkpoint (see details). |
| CTRL | `ctrl` | 48-layer, 1280-hidden, 16-heads, 1.6B parameters.<br>Salesforce's Large-sized CTRL English model. |
| CamemBERT | `camembert-base` | 12-layer, 768-hidden, 12-heads, 110M parameters.<br>CamemBERT using the BERT-base architecture (see details). |
| ALBERT | `albert-base-v1` | 12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters.<br>ALBERT base model (see details). |
| | `albert-large-v1` | 24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters.<br>ALBERT large model (see details). |
| | `albert-xlarge-v1` | 24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters.<br>ALBERT xlarge model (see details). |
| | `albert-xxlarge-v1` | 12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters.<br>ALBERT xxlarge model (see details). |
| | `albert-base-v2` | 12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters.<br>ALBERT base model with no dropout, additional training data and longer training (see details). |
| | `albert-large-v2` | 24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters.<br>ALBERT large model with no dropout, additional training data and longer training (see details). |
| | `albert-xlarge-v2` | 24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters.<br>ALBERT xlarge model with no dropout, additional training data and longer training (see details). |
| | `albert-xxlarge-v2` | 12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters.<br>ALBERT xxlarge model with no dropout, additional training data and longer training (see details). |
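The parameter counts above follow directly from the listed dimensions. As a rough sanity check, here is a sketch that reproduces the BERT counts from the layer count and hidden size; the vocabulary size (30,522), maximum position (512), and 4x feed-forward ratio are assumptions taken from the standard BERT configuration, not from the table itself.

```python
# Estimate BERT parameter count from the table's "N-layer, H-hidden" figures.
# Assumptions (not in the table): vocab=30522, max positions=512, FFN = 4*hidden.

def bert_param_count(layers, hidden, vocab=30522, max_pos=512, type_vocab=2):
    ff = 4 * hidden                                  # feed-forward inner size
    emb = (vocab + max_pos + type_vocab) * hidden    # token + position + segment embeddings
    emb += 2 * hidden                                # embedding LayerNorm (weight + bias)
    attn = 4 * (hidden * hidden + hidden)            # Q, K, V, output projections + biases
    ffn = 2 * (hidden * ff) + ff + hidden            # two linear layers + biases
    norms = 2 * (2 * hidden)                         # two LayerNorms per layer
    pooler = hidden * hidden + hidden                # final pooler layer
    return emb + layers * (attn + ffn + norms) + pooler

print(f"{bert_param_count(12, 768) / 1e6:.1f}M")     # ~109.5M, matching the table's 110M
print(f"{bert_param_count(24, 1024) / 1e6:.1f}M")    # ~335M, close to the table's rounded 340M
```

The same arithmetic explains why the cased and uncased variants of each size report identical counts: they differ only in vocabulary handling, not in architecture dimensions.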
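The ALBERT rows list far fewer parameters than same-sized BERT models (11M vs. 110M for the base size) because the "repeating layers" share a single set of weights and the embedding is factorized through the small 128-dimensional space shown in the table. A rough sketch of that accounting, assuming ALBERT's vocabulary size of 30,000 (an assumption from the standard ALBERT configuration, not stated in the table):

```python
# Why albert-base-v1 has ~11M parameters despite 12 layers:
# one shared transformer layer + a factorized embedding (vocab -> 128 -> hidden).
# Assumed (not in the table): vocab=30000, max positions=512.

def albert_param_count(hidden, emb_dim=128, vocab=30000, max_pos=512, type_vocab=2):
    embeddings = (vocab + max_pos + type_vocab) * emb_dim + 2 * emb_dim  # + LayerNorm
    projection = emb_dim * hidden + hidden           # factorized embedding projection
    ff = 4 * hidden
    attn = 4 * (hidden * hidden + hidden)            # Q, K, V, output projections + biases
    ffn = 2 * (hidden * ff) + ff + hidden
    norms = 2 * (2 * hidden)
    shared_layer = attn + ffn + norms                # one copy, reused by every layer
    pooler = hidden * hidden + hidden
    return embeddings + projection + shared_layer + pooler

print(f"{albert_param_count(768) / 1e6:.1f}M")       # ~11.7M, matching the table's 11M
print(f"{albert_param_count(1024) / 1e6:.1f}M")      # ~17.7M, matching the table's 17M
```

Note that the layer count drops out of the total entirely, which is why `albert-base-v1` and `albert-xxlarge-v1` can have 12 repeating layers each yet differ in size only through their hidden dimensions.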