# TinyBERT_L-4_H-312_v2 English Sentence Encoder
This model is distilled from the `bert-base-nli-stsb-mean-tokens` pre-trained model from [Sentence-Transformers](https://sbert.net/).
The embedding vector is obtained by mean pooling of the last layer's hidden states.
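As a rough illustration, a minimal mean-pooling encoder using the standard Hugging Face `transformers` API might look like the sketch below; the repository id passed to `from_pretrained` is a placeholder, not necessarily the published name of this checkpoint.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# NOTE: placeholder repo id; substitute wherever this checkpoint is published.
MODEL_ID = "ceshine/TinyBERT_L-4_H-312_v2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def encode(sentences):
    """Embed sentences by mean pooling the last layer's hidden states."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)
    # Use the attention mask so padding tokens do not dilute the average.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

embeddings = encode(["A sentence.", "Another sentence."])
```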
Update 20210325: added the attention-matrix imitation objective from the TinyBERT paper, and changed the distillation target from `distilbert-base-nli-stsb-mean-tokens` to `bert-base-nli-stsb-mean-tokens` (the two have almost identical STSb performance); a rough sketch of the objective is given below.
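In the TinyBERT paper, this term is an MSE loss between the student's attention matrices and those of mapped teacher layers. A minimal sketch, assuming a uniform layer mapping and matching head counts (both are assumptions, not taken from this card):

```python
import torch.nn.functional as F

def attention_imitation_loss(student_attns, teacher_attns):
    """MSE between student attention matrices and mapped teacher layers.

    Each element is a (batch, heads, seq_len, seq_len) tensor, as returned
    by a BERT forward pass with output_attentions=True. The uniform mapping
    below (e.g. 4 student layers onto 12 teacher layers) is illustrative.
    """
    stride = len(teacher_attns) // len(student_attns)
    loss = 0.0
    for i, s_attn in enumerate(student_attns):
        # Map student layer i to the last teacher layer in its stride block.
        t_attn = teacher_attns[(i + 1) * stride - 1]
        loss = loss + F.mse_loss(s_attn, t_attn)
    return loss / len(student_attns)
```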
## Model Comparison
We compute the cosine similarity of the embeddings of each sentence pair and report the Spearman correlation on the STS benchmark (higher is better); a sketch of this evaluation is given after the table:
| Model                                | Dev   | Test  |
| ------------------------------------ | ----- | ----- |
| bert-base-nli-stsb-mean-tokens | .8704 | .8505 |
| distilbert-base-nli-stsb-mean-tokens | .8667 | .8516 |
| TinyBERT_L-4_H-312_v2-distill-AllNLI | .8587 | .8283 |
| TinyBERT_L-4_H-312_v2 (20210325)     | .8551 | .8341 |
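For concreteness, a minimal sketch of this scoring, assuming the `encode` function from the earlier snippet and STSb data already loaded as `(sentence1, sentence2, gold_score)` triples (the loading itself is omitted):

```python
import torch.nn.functional as F
from scipy.stats import spearmanr

def sts_spearman(pairs):
    """Spearman correlation between cosine similarities and gold STSb scores."""
    gold, pred = [], []
    for sent1, sent2, score in pairs:
        emb = encode([sent1, sent2])
        # Cosine similarity between the two sentence embeddings.
        pred.append(F.cosine_similarity(emb[0:1], emb[1:2]).item())
        gold.append(score)
    return spearmanr(pred, gold).correlation
```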