---
license: apache-2.0
base_model: distilbert-base-uncased
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: distilbert-base-uncased-finetuned-clinc
  results: []
datasets:
- clinc_oos
library_name: transformers
pipeline_tag: text-classification
---

# Transformer Efficiency and Knowledge Distillation

This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on the CLINC150 (`clinc_oos`) dataset.
It achieves the following results on the evaluation set:
- Loss: 0.7872
- Accuracy: 0.9206

## Model description

This model was produced by benchmarking a fine-tuned BERT teacher (`transformersbook/bert-base-uncased-finetuned-clinc`) and applying knowledge distillation to train a smaller DistilBERT student. The teacher is an intent-classification model whose efficiency is evaluated in terms of accuracy, model size, and latency; the DistilBERT student is trained to match the teacher's accuracy while being smaller and faster.
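
A minimal usage sketch (the repo id below is a placeholder; substitute the actual Hub id of this model):

```python
from transformers import pipeline

# Placeholder repo id; replace with the actual Hub id of this model.
intent_classifier = pipeline(
    "text-classification",
    model="your-username/distilbert-base-uncased-finetuned-clinc",
)

query = "transfer $100 from my checking to my savings account"
print(intent_classifier(query))
# -> a list with the predicted intent label and its score
```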

## Intended uses & limitations

### Intended uses

- Evaluating the performance and efficiency of transformer models (accuracy, model size, latency).
- Applying knowledge distillation to create smaller, faster models for text classification.

### Limitations

- The benchmark results are specific to the CLINC150 dataset and may not generalize to other datasets.
- The quality of the distilled student depends on the quality and performance of the teacher model.

## Training and evaluation data

The BERT model is fine-tuned on the CLINC150 dataset, which consists of labeled examples for intent classification. The dataset includes training, validation, and test splits.
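
For reference, the dataset can be loaded from the Hub as shown below (the `plus` configuration is an assumption; `clinc_oos` also ships `small` and `imbalanced` configurations):

```python
from datasets import load_dataset

# CLINC150 intent-classification data; the "plus" configuration is assumed here.
clinc = load_dataset("clinc_oos", "plus")
print(clinc["test"][0])
# {'text': '...', 'intent': <integer class id>}
```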

## Training procedure

### Performance Benchmark

The performance of the BERT teacher is evaluated with a custom `PerformanceBenchmark` class, which measures accuracy, model size, and latency; its methods are shown in the sections below.
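
The implementation of `PerformanceBenchmark` is not reproduced in full in this card; a minimal skeleton consistent with the method snippets below (the constructor signature and the `run_benchmark` helper are assumptions) could look like this:

```python
class PerformanceBenchmark:
    def __init__(self, pipeline, dataset):
        # `pipeline` is a transformers text-classification pipeline and
        # `dataset` is the CLINC150 test split (constructor signature assumed).
        self.pipeline = pipeline
        self.dataset = dataset

    # compute_accuracy, compute_size and time_pipeline are defined in the
    # sections below; each returns a dictionary of metrics.

    def run_benchmark(self):
        # Assumed convenience method that merges all metric dictionaries.
        metrics = {}
        metrics.update(self.compute_size())
        metrics.update(self.time_pipeline())
        metrics.update(self.compute_accuracy())
        return metrics
```

The teacher could then be benchmarked along these lines (`clinc` is the dataset loaded above; the checkpoint is the teacher named in the distillation section):

```python
from transformers import pipeline

bert_pipe = pipeline(
    "text-classification",
    model="transformersbook/bert-base-uncased-finetuned-clinc",
)
pb = PerformanceBenchmark(bert_pipe, clinc["test"])
print(pb.run_benchmark())
```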

### Accuracy

The model's accuracy is computed on the test set of the CLINC150 dataset.
```python
# The accuracy metric comes from the datasets library (load_metric is
# available in the Datasets version listed under "Framework versions").
from datasets import load_metric

accuracy_score = load_metric("accuracy")
```
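
How the metric is applied is not shown in the card; a plausible `compute_accuracy` method, assuming the CLINC150 `text`/`intent` columns and that the pipeline's label names match the dataset's intent names, might be:

```python
# Method of PerformanceBenchmark (sketch; column names and label mapping assumed)
def compute_accuracy(self):
    intents = self.dataset.features["intent"]  # ClassLabel feature
    preds, labels = [], []
    for example in self.dataset:
        # Map the predicted label string back to its integer class id
        pred = self.pipeline(example["text"])[0]["label"]
        preds.append(intents.str2int(pred))
        labels.append(example["intent"])
    return accuracy_score.compute(predictions=preds, references=labels)
```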

### Model Size

The size of the model is computed by saving its state dictionary to disk and measuring the file size in megabytes.

```python
from pathlib import Path

import torch

# Method of PerformanceBenchmark
def compute_size(self):
    state_dict = self.pipeline.model.state_dict()
    tmp_path = Path("model.pt")
    # Serialize the weights to a temporary file and measure its size
    torch.save(state_dict, tmp_path)
    size_mb = tmp_path.stat().st_size / (1024 * 1024)
    tmp_path.unlink()
    return {"size_mb": size_mb}
```
### Latency

The average latency per query is measured over a sample of 100 queries.

```python
from time import perf_counter

import numpy as np

# Method of PerformanceBenchmark; assumes a datasets.Dataset with a "text" column
def time_pipeline(self):
    latencies = []
    for example in self.dataset.select(range(100)):
        start_time = perf_counter()
        _ = self.pipeline(example["text"])
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Report mean and standard deviation in milliseconds
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return {"time_avg_ms": time_avg_ms, "time_std_ms": time_std_ms}
```
### Knowledge Distillation

Knowledge distillation is used to train a smaller DistilBERT model using the predictions of the fine-tuned BERT model as soft labels.

### Distillation Process

- Teacher model: `transformersbook/bert-base-uncased-finetuned-clinc`
- Student model: `distilbert-base-uncased`
The distillation process involves computing a weighted average of the cross-entropy loss with the ground truth labels and the Kullback-Leibler divergence between the teacher and student model predictions.
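
In symbols, with student logits `z_s`, teacher logits `z_t`, temperature `T` (`self.args.temperature`), and weight `alpha` (`self.args.alpha`), the loss implemented by the trainer below is

$$
L = \alpha \, L_{\mathrm{CE}} + (1-\alpha)\, T^{2}\, D_{\mathrm{KL}}\big(\mathrm{softmax}(z_t/T)\,\|\,\mathrm{softmax}(z_s/T)\big)
$$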

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Trainer


class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        # Fine-tuned BERT teacher (constructor signature assumed; the original
        # snippet only references self.teacher)
        self.teacher = teacher_model

    def compute_loss(self, model, inputs, return_outputs=False):
        # Student forward pass: cross-entropy against the ground-truth labels
        outputs_stu = model(**inputs)
        loss_ce = outputs_stu.loss
        logits_stu = outputs_stu.logits
        # Teacher forward pass (no gradient tracking)
        with torch.no_grad():
            outputs_tea = self.teacher(**inputs)
            logits_tea = outputs_tea.logits
        # KL divergence between the softened distributions, scaled by T^2
        loss_fct = nn.KLDivLoss(reduction="batchmean")
        loss_kd = self.args.temperature ** 2 * loss_fct(
            F.log_softmax(logits_stu / self.args.temperature, dim=-1),
            F.softmax(logits_tea / self.args.temperature, dim=-1),
        )
        # Weighted combination of the two losses
        loss = self.args.alpha * loss_ce + (1.0 - self.args.alpha) * loss_kd
        return (loss, outputs_stu) if return_outputs else loss
```
### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 48
- eval_batch_size: 48
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
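
The distillation-specific hyperparameters `alpha` and `temperature` (read from `self.args` in the trainer above) are not listed in this card. A sketch of how they could be attached to `TrainingArguments`, with illustrative default values rather than the values actually used:

```python
from transformers import TrainingArguments


class DistillationTrainingArguments(TrainingArguments):
    def __init__(self, *args, alpha=0.5, temperature=2.0, **kwargs):
        super().__init__(*args, **kwargs)
        # alpha weights the cross-entropy vs. distillation loss; temperature
        # softens the logits. Defaults here are illustrative only.
        self.alpha = alpha
        self.temperature = temperature


training_args = DistillationTrainingArguments(
    output_dir="distilbert-base-uncased-finetuned-clinc",
    learning_rate=2e-5,
    per_device_train_batch_size=48,
    per_device_eval_batch_size=48,
    num_train_epochs=5,
    seed=42,
    evaluation_strategy="epoch",
)
```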

### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| No log        | 1.0   | 318  | 3.2931          | 0.7255   |
| 3.8009        | 2.0   | 636  | 1.8849          | 0.8526   |
| 3.8009        | 3.0   | 954  | 1.1702          | 0.8897   |
| 1.7128        | 4.0   | 1272 | 0.8717          | 0.9145   |
| 0.9206        | 5.0   | 1590 | 0.7872          | 0.9206   |


### Framework versions

- Transformers 4.41.1
- PyTorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1