---
language:
- ru
---
# distilrubert-base-cased-conversational
Conversational DistilRuBERT \(Russian, cased, 6-layer, 768-hidden, 12-head, 135.4M parameters\) was trained on OpenSubtitles\[1\], [Dirty](https://d3.ru/), [Pikabu](https://pikabu.ru/), and a Social Media segment of the Taiga corpus\[2\], the same data as [Conversational RuBERT](https://huggingface.co/DeepPavlov/rubert-base-cased-conversational).

Our DistilRuBERT was largely inspired by \[3\] and \[4\]. Specifically, we used:
* KL loss (between teacher and student output logits)
* MLM loss (between token labels and student output logits)
* Cosine embedding loss (between the mean of two consecutive hidden states of the teacher and one hidden state of the student)

The model was trained for about 100 hours on 8 NVIDIA Tesla P100-SXM2.0 16 GB GPUs.
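
The three loss terms listed above can be illustrated with a minimal, dependency-free sketch for a single token position. This is not the training code used for this model: the function names, the equal `weights`, and the `temperature` parameter are assumptions made for illustration, and a real training loop would rely on tensor operations (for example PyTorch's `KLDivLoss`, `CrossEntropyLoss`, and `CosineEmbeddingLoss`) over whole batches.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mlm_loss(student_logits, label):
    """Cross-entropy between the student's prediction and the true token id."""
    return -math.log(softmax(student_logits)[label])


def cosine_loss(teacher_hidden_a, teacher_hidden_b, student_hidden):
    """1 - cosine similarity between the mean of two consecutive teacher
    hidden states and one student hidden state."""
    target = [(a + b) / 2 for a, b in zip(teacher_hidden_a, teacher_hidden_b)]
    dot = sum(t * s for t, s in zip(target, student_hidden))
    norm_t = math.sqrt(sum(t * t for t in target))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    return 1.0 - dot / (norm_t * norm_s)


def distillation_loss(teacher_logits, student_logits, label,
                      teacher_hidden_a, teacher_hidden_b, student_hidden,
                      weights=(1.0, 1.0, 1.0), temperature=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    w_kl, w_mlm, w_cos = weights
    return (w_kl * kl_loss(teacher_logits, student_logits, temperature)
            + w_mlm * mlm_loss(student_logits, label)
            + w_cos * cosine_loss(teacher_hidden_a, teacher_hidden_b,
                                  student_hidden))
```

Averaging two consecutive teacher states gives each of the student's 6 layers a single target derived from a pair of the teacher's layers, which is how the third bullet is read here.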

To evaluate the inference speedup, we ran the teacher and student models on random sequences with `seq_len=512`, using `batch_size=16` for throughput and `batch_size=1` for latency.
All tests were performed on an Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz and an NVIDIA Tesla P100-SXM2.0 16 GB GPU.
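
A measurement of this kind could be set up roughly as in the sketch below. It is not the benchmarking script behind the reported numbers: `benchmark`, `dummy_forward`, and the warm-up and run counts are hypothetical, and `dummy_forward` merely stands in for a real model forward pass on random token ids.

```python
import random
import time


def benchmark(run_batch, batch_size, n_runs=10, n_warmup=2,
              timer=time.perf_counter):
    """Return (mean seconds per batch, samples per second) for run_batch."""
    for _ in range(n_warmup):
        run_batch(batch_size)  # warm-up calls are not timed
    start = timer()
    for _ in range(n_runs):
        run_batch(batch_size)
    elapsed = timer() - start
    return elapsed / n_runs, batch_size * n_runs / elapsed


def dummy_forward(batch_size, seq_len=512):
    """Hypothetical stand-in for a forward pass on random token ids."""
    batch = [[random.randrange(30000) for _ in range(seq_len)]
             for _ in range(batch_size)]
    return sum(sum(row) for row in batch)


if __name__ == "__main__":
    # Latency: time per call with a single sequence.
    latency, _ = benchmark(dummy_forward, batch_size=1)
    # Throughput: sequences processed per second with a batch of 16.
    _, throughput = benchmark(dummy_forward, batch_size=16)
    print(f"latency: {latency:.4f} s, throughput: {throughput:.1f} samples/s")
```

When timing a real model on a GPU, the device usually has to be synchronized before each clock reading, since kernels are launched asynchronously; otherwise the measured time can understate the actual work.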

| Model                                            | Size, MB | CPU latency, s | GPU latency, s | CPU throughput, samples/s | GPU throughput, samples/s |
|--------------------------------------------------|----------|----------------|----------------|---------------------------|---------------------------|
| Teacher (RuBERT-base-cased-conversational)       | 679      | 0.655          | 0.031          | 0.3754                    | 36.4902                   |
| Student (DistilRuBERT-base-cased-conversational) | 517      | 0.3285         | 0.0212         | 0.5803                    | 52.2495                   |

\[1\]: P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation \(LREC 2016\).

\[2\]: T. Shavrina and O. Shapovalova, 2017, To the Methodology of Corpus Construction for Machine Learning: «Taiga» Syntax Tree Corpus and Parser. In Proceedings of the "CORPORA2017" International Conference, Saint Petersburg.

\[3\]: V. Sanh, L. Debut, J. Chaumond, and T. Wolf, 2019, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

\[4\]: <https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation>