This model was pretrained on the BookCorpus dataset using knowledge distillation.

It shares the same architecture as BERT but has a hidden size of 256 (BERT base uses 768). Since it has 4 attention heads, the head size is 64, just as for the BERT base model.

The knowledge distillation was performed using multiple loss functions.

The weights of the model were initialized from scratch.

PS: the tokenizer is the same as the one used by bert-base-uncased.

To load the model & tokenizer:

````python
from transformers import AutoModelForMaskedLM, BertTokenizer

model_name = "eli4s/Bert-L12-h256-A4"
model = AutoModelForMaskedLM.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)
````

To use it as a masked language model:

````python
import torch

sentence = "Let's have a [MASK]."

model.eval()
inputs = tokenizer([sentence], padding='longest', return_tensors='pt')

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    output = model(inputs['input_ids'], attention_mask=inputs['attention_mask'])

# Locate the [MASK] token via the tokenizer instead of hard-coding its id (103)
mask_index = inputs['input_ids'].tolist()[0].index(tokenizer.mask_token_id)
masked_token = output['logits'][0][mask_index].argmax(dim=-1).item()
predicted_token = tokenizer.decode(masked_token)

print(predicted_token)
````

To get the n most likely predictions instead:

````python
top_n = 5
# Indices of the top_n highest-scoring vocabulary entries at the masked position
top_tokens = torch.topk(output['logits'][0][mask_index], k=top_n).indices.tolist()
# One token string per prediction, rather than a single joined string
print(tokenizer.convert_ids_to_tokens(top_tokens))
````
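
The top-n snippet above ranks tokens by raw logits. If probabilities are wanted alongside the ranking, one option is to apply a softmax first. The following is a minimal sketch in plain Python; the helper name `top_n_with_probs` and the toy logits are hypothetical and only stand in for real model output.

```python
import math


def top_n_with_probs(logits, top_n=5):
    """Return (index, probability) pairs for the top_n largest logits.

    Hypothetical helper for illustration; not part of transformers.
    """
    # Subtract the max logit for numerical stability before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:top_n]]


# Toy logits (assumption): in practice these would be the logits
# at the masked position, converted to a Python list.
toy_logits = [0.1, 2.0, -1.0, 3.5, 0.7]
for index, prob in top_n_with_probs(toy_logits, top_n=3):
    print(index, round(prob, 3))
```

With the model loaded, the list `output['logits'][0][mask_index].tolist()` could be passed in place of `toy_logits`, and the returned indices mapped back to tokens with `tokenizer.convert_ids_to_tokens`.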