rohitsroch committed on
Commit • d10496b • 1 Parent(s): a587a3e
Push SEAD-L-6_H-256_A-8-rte model weights
- README.md +80 -0
- config.json +35 -0
- eval_results.json +8 -0
- flax_model.msgpack +3 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- tf_model.h5 +3 -0
- tokenizer_config.json +1 -0
- training_args.bin +3 -0
- vocab.txt +0 -0
README.md
ADDED
@@ -0,0 +1,80 @@
---
language:
- en
license: apache-2.0
tags:
- SEAD
datasets:
- glue
- rte
---

## Paper

## [SEAD: SIMPLE ENSEMBLE AND KNOWLEDGE DISTILLATION FRAMEWORK FOR NATURAL LANGUAGE UNDERSTANDING](https://www.adasci.org/journals/lattice-35309407/?volumes=true&open=621a3b18edc4364e8a96cb63)
Authors: *Moyan Mei*, *Rohit Sroch*

## Abstract

With the widespread use of pre-trained language models (PLMs), there has been increased research on how to make them applicable, especially in limited-resource or low-latency, high-throughput scenarios. One of the dominant approaches is knowledge distillation (KD), where a smaller model is trained by receiving guidance from a large PLM. While there are many successful designs for learning knowledge from teachers, it remains unclear how students can learn better. Inspired by real university teaching processes, in this work we further explore knowledge distillation and propose a very simple yet effective framework, SEAD, to further improve task-specific generalization by utilizing multiple teachers. Our experiments show that SEAD leads to better performance compared to other popular KD methods [[1](https://arxiv.org/abs/1910.01108)] [[2](https://arxiv.org/abs/1909.10351)] [[3](https://arxiv.org/abs/2002.10957)] and achieves comparable or superior performance to its teacher model, such as BERT [[4](https://arxiv.org/abs/1810.04805)], on a total of 13 tasks for the GLUE [[5](https://arxiv.org/abs/1804.07461)] and SuperGLUE [[6](https://arxiv.org/abs/1905.00537)] benchmarks.

*Moyan Mei and Rohit Sroch. 2022. [SEAD: Simple ensemble and knowledge distillation framework for natural language understanding](https://www.adasci.org/journals/lattice-35309407/?volumes=true&open=621a3b18edc4364e8a96cb63).
Lattice, THE MACHINE LEARNING JOURNAL by Association of Data Scientists, 3(1).*

## SEAD-L-6_H-256_A-8-rte

This is a student model distilled from [**BERT base**](https://huggingface.co/bert-base-uncased) as the teacher, using the SEAD framework on the **rte** task. For weight initialization, we used [microsoft/xtremedistil-l6-h256-uncased](https://huggingface.co/microsoft/xtremedistil-l6-h256-uncased).
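
A minimal inference sketch for this checkpoint follows. It assumes the weights are published on the Hugging Face Hub under the id `course5i/SEAD-L-6_H-256_A-8-rte` (an assumption; substitute the actual repository id or a local directory), and that class ids follow the GLUE RTE convention (0 = entailment, 1 = not_entailment), since `config.json` only maps ids to the integers 0 and 1. The helper name `predict_rte` is illustrative.

```python
# Assumed Hub id; replace with the real repository id or a local directory.
MODEL_ID = "course5i/SEAD-L-6_H-256_A-8-rte"

# GLUE RTE label convention (assumed to match this checkpoint's class ids).
RTE_LABELS = {0: "entailment", 1: "not_entailment"}


def predict_rte(premise, hypothesis, model_id=MODEL_ID):
    """Return the predicted RTE label for a premise/hypothesis pair."""
    # Imported lazily so this module loads without the heavy dependencies.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    # Passing two strings encodes them as one sentence pair.
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return RTE_LABELS[int(logits.argmax(dim=-1))]
```

Calling `predict_rte("A man is playing a guitar.", "A person is making music.")` would download the checkpoint on first use and return one of the two label strings.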

## All SEAD Checkpoints

Other Community Checkpoints: [here](https://huggingface.co/models?search=SEAD)

## Intended uses & limitations

More information needed

### Training hyperparameters

Please take a look at the `training_args.bin` file.

```python
import torch

# training_args.bin is typically a pickled TrainingArguments object, so transformers
# must be installed; on PyTorch >= 2.6 you may need to pass weights_only=False.
hyperparameters = torch.load("training_args.bin")
```

### Evaluation results

| eval_accuracy | eval_runtime | eval_samples_per_second | eval_steps_per_second | eval_loss | eval_samples |
|:-------------:|:------------:|:-----------------------:|:---------------------:|:---------:|:------------:|
| 0.7906 | 1.5528 | 178.391 | 5.796 | 0.6934 | 277 |
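
As a quick consistency check, the reported figures can be related to one another. The sketch below uses only the numbers shown in this commit; the variable names are illustrative, and the remark about batch size afterwards is an inference from rounded values rather than something stated in the source.

```python
eval_accuracy = 0.7906137184115524  # unrounded value from eval_results.json
eval_samples = 277
eval_runtime = 1.5528  # seconds
eval_steps_per_second = 5.796

# Accuracy times sample count recovers the number of correct predictions.
correct = round(eval_accuracy * eval_samples)

# Throughput is samples divided by wall-clock time; the runtime is rounded,
# so this only approximately reproduces the reported 178.391.
samples_per_second = eval_samples / eval_runtime

# Steps per second times runtime gives the number of evaluation batches.
eval_steps = round(eval_steps_per_second * eval_runtime)

print(correct, round(samples_per_second, 1), eval_steps)  # 219 178.4 9
```

Nine steps for 277 samples is consistent with an evaluation batch size between 31 and 34 (for example 32), though the batch size itself is not recorded here.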

### Framework versions

- Transformers >=4.8.0
- PyTorch >=1.6.0
- TensorFlow >=2.5.0
- Flax >=0.3.5
- Datasets >=1.10.2
- Tokenizers >=0.11.6

If you use these models, please cite the following paper:

```
@article{article,
  author={Mei, Moyan and Sroch, Rohit},
  title={SEAD: Simple Ensemble and Knowledge Distillation Framework for Natural Language Understanding},
  volume={3},
  number={1},
  journal={Lattice, The Machine Learning Journal by Association of Data Scientists},
  day={26},
  year={2022},
  month={Feb},
  url={https://www.adasci.org/journals/lattice-35309407/?volumes=true&open=621a3b18edc4364e8a96cb63}
}
```

config.json
ADDED
@@ -0,0 +1,35 @@
{
  "_name_or_path": "../artifacts/best_models/rte/L-6_H-256_A-8/student-ckpt",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "finetuning_task": "rte",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "id2label": {
    "0": 0,
    "1": 1
  },
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "label2id": {
    "0": 0,
    "1": 1
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 8,
  "num_hidden_layers": 6,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.18.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

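The sizes in this config are enough to estimate the parameter count of the checkpoint. The sketch below assumes the standard BERT layout (embeddings with LayerNorm, per-layer self-attention and feed-forward blocks with biases, a pooler, and a two-class classifier head matching the two entries in `id2label`); the function name is illustrative.

```python
def bert_classifier_param_count(vocab_size, hidden, layers, intermediate,
                                max_positions, type_vocab, num_labels):
    """Count parameters of a standard BERT encoder with a classification head."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm  # Q, K, V, output
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden + layer_norm)
    pooler = hidden * hidden + hidden
    classifier = hidden * num_labels + num_labels
    return embeddings + layers * (attention + feed_forward) + pooler + classifier


params = bert_classifier_param_count(
    vocab_size=30522, hidden=256, layers=6, intermediate=1024,
    max_positions=512, type_vocab=2, num_labels=2,
)
print(params)      # 12750594
print(params * 4)  # 51002376 bytes at 32-bit precision
```

At four bytes per float32 parameter this comes to about 51.0 MB, which is in line with the roughly 51 MB sizes recorded in the Git LFS pointers for the weight files in this commit.
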
eval_results.json
ADDED
@@ -0,0 +1,8 @@
{
  "eval_accuracy": 0.7906137184115524,
  "eval_loss": 0.6933708323372735,
  "eval_runtime": 1.5528,
  "eval_samples": 277,
  "eval_samples_per_second": 178.391,
  "eval_steps_per_second": 5.796
}

flax_model.msgpack
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:03cdb76fff760f58f98ac2a850b6ea4190480bc5809fbd807902827fe216752c
size 51006182

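The entry for `flax_model.msgpack` is a Git LFS pointer rather than the weights themselves: `oid` is the SHA-256 of the real file and `size` is its length in bytes. A downloaded file can be checked against such a pointer as sketched below; the helper names `parse_lfs_pointer` and `matches_pointer` are illustrative and not part of any Git LFS tooling.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {"algorithm": algorithm, "digest": digest, "size": int(fields["size"])}


def matches_pointer(data, pointer):
    """Check raw file bytes against the size and hash recorded in a pointer."""
    if len(data) != pointer["size"]:
        return False
    return hashlib.new(pointer["algorithm"], data).hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:03cdb76fff760f58f98ac2a850b6ea4190480bc5809fbd807902827fe216752c\n"
    "size 51006182\n"
)
print(pointer["algorithm"], pointer["size"])  # sha256 51006182
```

With the real file on disk, `matches_pointer(open("flax_model.msgpack", "rb").read(), pointer)` should return `True` for an intact download; for large files, hashing in chunks would avoid loading everything into memory.
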
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:570499e721efa4920d0c064f8d8d25158dc1ffe3547872a298b6a6943d8c214f
size 51032629

special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}

tf_model.h5
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:96fdb44a900de74b73e700486b60ad8fa70aafaa8fbc08c2a9f2eb02f4d86894
size 51150416

tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "tokenizer_file": null, "name_or_path": "microsoft/xtremedistil-l6-h256-uncased", "tokenizer_class": "BertTokenizer"}

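The tokenizer settings in this commit describe an uncased BERT tokenizer, which packs a sentence pair such as an RTE premise and hypothesis into a single sequence delimited by the special tokens listed in `special_tokens_map.json`. The sketch below illustrates only that layout: it splits on whitespace as a stand-in for WordPiece, so the tokens are not what `BertTokenizer` would actually produce, and `pack_pair` is a hypothetical helper.

```python
def pack_pair(premise, hypothesis, do_lower_case=True):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with segment ids."""
    if do_lower_case:
        premise, hypothesis = premise.lower(), hypothesis.lower()
    # Whitespace split is a placeholder for real WordPiece tokenization.
    tokens_a, tokens_b = premise.split(), hypothesis.split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], A and the first [SEP]; segment 1 covers B and the last [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = pack_pair("A dog barks", "An animal makes noise")
print(tokens)
print(token_type_ids)
```

The two segment ids correspond to `type_vocab_size: 2` in `config.json`.
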
training_args.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:65db38c18dc5513718a7c844a8c361eda2226eab423ee5fe5f0f3d6fd104db2d
size 2713

vocab.txt
ADDED
The diff for this file is too large to render.