This model was pretrained on the BookCorpus dataset using knowledge distillation. Although it shares the same architecture as BERT, it has a hidden size of 384 (half the hidden size of BERT-base) and 6 attention heads (half the number of heads, hence the same head size as BERT: 64). The weights of the model were initialized by pruning the weights of bert-base-uncased, and knowledge distillation was then performed with multiple loss functions to fine-tune the model.

PS: the tokenizer is the same as the one used by bert-base-uncased.

## **PS2: this model still needs a little more fine-tuning; I will keep updating it regularly.**

To load the model & tokenizer:

````python
from transformers import AutoModelForMaskedLM, BertTokenizer

model_name = "eli4s/Bert-L12-h384-A6-pruned"
model = AutoModelForMaskedLM.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)
````

To use it on a sentence:

````python
import torch

sentence = "Let's have a [MASK]."

# Tokenize the sentence and build the model inputs.
encoded_inputs = tokenizer([sentence], padding='longest')
input_ids = torch.tensor(encoded_inputs['input_ids'])
attention_mask = torch.tensor(encoded_inputs['attention_mask'])

output = model(input_ids, attention_mask=attention_mask)

# Locate the [MASK] token and take the highest-scoring prediction for it.
mask_index = input_ids.tolist()[0].index(tokenizer.mask_token_id)
masked_token = output['logits'][0][mask_index].argmax(dim=-1)
predicted_token = tokenizer.decode(masked_token)

print(predicted_token)
````

We can also retrieve the n most relevant predictions:

````python
top_n = 5

vocab_size = model.config.vocab_size
logits = output['logits'][0][mask_index].tolist()

# Rank the vocabulary ids by their logit and keep the top_n best candidates.
top_tokens = sorted(range(vocab_size), key=lambda i: logits[i], reverse=True)[:top_n]

print(tokenizer.decode(top_tokens))
````
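
For readers curious about the pruning-based initialization mentioned above, the sketch below illustrates the general idea on a single linear layer: keeping 384 of bert-base-uncased's 768 hidden dimensions. Taking the first 384 dimensions is only an assumption for illustration; this card does not describe the actual selection criterion used for pruning.

````python
# Minimal sketch, not the exact procedure used for this model.
import torch
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")

small_hidden = 384                   # half of BERT-base's 768
assert 384 // 6 == 768 // 12 == 64   # the per-head size stays at 64

# Example: shrink the query projection of the first attention layer.
full_query = bert.encoder.layer[0].attention.self.query  # nn.Linear(768, 768)
pruned_query = torch.nn.Linear(small_hidden, small_hidden)
with torch.no_grad():
    # Keep the first 384 output and input dimensions (illustrative choice only).
    pruned_query.weight.copy_(full_query.weight[:small_hidden, :small_hidden])
    pruned_query.bias.copy_(full_query.bias[:small_hidden])
````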
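
The card also states that distillation combined multiple loss functions without naming them. As an illustration only, here is a common combination: a temperature-softened KL term against the teacher's logits plus the standard masked-language-modelling cross-entropy. The function name and the weighting hyperparameters `alpha` and `T` are hypothetical.

````python
# Illustrative sketch only; the actual losses used for this model are not specified.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    # Soft-target term: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: standard MLM cross-entropy
    # (positions without a label are set to -100 and ignored).
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * soft + (1.0 - alpha) * hard
````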