# microsoft /Multilingual-MiniLM-L12-H384

## MiniLM: Small and Fast Pre-trained Models for Language Understanding and Generation

MiniLM is a distilled model from the paper "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers".

Please find the information about preprocessing, training and full details of the MiniLM in the original MiniLM repository.

Please note: This checkpoint uses BertModel with XLMRobertaTokenizer so AutoTokenizer won't work with this checkpoint!

### Multilingual Pretrained Model

• Multilingual-MiniLMv1-L12-H384: 12-layer, 384-hidden, 12-heads, 21M Transformer parameters, 96M embedding parameters

Multilingual MiniLM uses the same tokenizer as XLM-R. But the Transformer architecture of our model is the same as BERT. We provide the fine-tuning code on XNLI based on huggingface/transformers. Please replace run_xnli.py in transformers with ours to fine-tune multilingual MiniLM.

We evaluate the multilingual MiniLM on cross-lingual natural language inference benchmark (XNLI) and cross-lingual question answering benchmark (MLQA).

#### Cross-Lingual Natural Language Inference - XNLI

We evaluate our model on cross-lingual transfer from English to other languages. Following Conneau et al. (2019), we select the best single model on the joint dev set of all the languages.

Model #Layers #Hidden #Transformer Parameters Average en fr es de el bg ru tr ar vi th zh hi sw ur
mBERT 12 768 85M 66.3 82.1 73.8 74.3 71.1 66.4 68.9 69.0 61.6 64.9 69.5 55.8 69.3 60.0 50.4 58.0
XLM-100 16 1280 315M 70.7 83.2 76.7 77.7 74.0 72.7 74.1 72.7 68.7 68.6 72.9 68.9 72.5 65.6 58.2 62.4
XLM-R Base 12 768 85M 74.5 84.6 78.4 78.9 76.8 75.9 77.3 75.4 73.2 71.5 75.4 72.5 74.9 71.1 65.2 66.5
mMiniLM-L12xH384 12 384 21M 71.1 81.5 74.8 75.7 72.9 73.0 74.5 71.3 69.7 68.8 72.1 67.8 70.0 66.2 63.3 64.2

This example code fine-tunes 12-layer multilingual MiniLM on XNLI.

# run fine-tuning on XNLI
DATA_DIR=/{path_of_data}/
OUTPUT_DIR=/{path_of_fine-tuned_model}/
MODEL_PATH=/{path_of_pre-trained_model}/

python ./examples/run_xnli.py --model_type minilm \
--output_dir ${OUTPUT_DIR} --data_dir${DATA_DIR} \
--model_name_or_path microsoft/Multilingual-MiniLM-L12-H384 \
--tokenizer_name xlm-roberta-base \
--config_name \${MODEL_PATH}/multilingual-minilm-l12-h384-config.json \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_gpu_train_batch_size 128 \
--learning_rate 5e-5 \
--num_train_epochs 5 \
--per_gpu_eval_batch_size 32 \
--weight_decay 0.001 \
--warmup_steps 500 \
--save_steps 1500 \
--logging_steps 1500 \
--eval_all_checkpoints \
--language en \
--fp16 \
--fp16_opt_level O2

#### Cross-Lingual Question Answering - MLQA

Following Lewis et al. (2019b), we adopt SQuAD 1.1 as training data and use MLQA English development data for early stopping.

Model F1 Score #Layers #Hidden #Transformer Parameters Average en es de ar hi vi zh
mBERT 12 768 85M 57.7 77.7 64.3 57.9 45.7 43.8 57.1 57.5
XLM-15 12 1024 151M 61.6 74.9 68.0 62.2 54.8 48.8 61.4 61.1
XLM-R Base (Reported) 12 768 85M 62.9 77.8 67.2 60.8 53.0 57.9 63.1 60.2
XLM-R Base (Our fine-tuned) 12 768 85M 64.9 80.3 67.0 62.7 55.0 60.4 66.5 62.3
mMiniLM-L12xH384 12 384 21M 63.2 79.4 66.1 61.2 54.9 58.5 63.1 59.0

### Citation

If you find MiniLM useful in your research, please cite the following paper:

@misc{wang2020minilm,
title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
year={2020},
eprint={2002.10957},
archivePrefix={arXiv},
primaryClass={cs.CL}
}