Adriana213 committed
Commit: 7b2d637
Parent(s): b4cf683

Update ModelCard

Files changed (1): README.md (+78 -7)
README.md CHANGED
@@ -10,10 +10,7 @@ model-index:
  results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # distilbert-base-uncased-finetuned-clinc
+ # Transformer Efficiency and Knowledge Distillation

  This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on an unknown dataset.
  It achieves the following results on the evaluation set:
@@ -22,18 +19,92 @@ It achieves the following results on the evaluation set:

  ## Model description

- More information needed
+ This setup involves benchmarking the performance of a fine-tuned BERT model (transformersbook/bert-base-uncased-finetuned-clinc) and applying knowledge distillation to train a smaller DistilBERT model. The BERT model is used for text classification tasks, and its efficiency is evaluated in terms of accuracy, model size, and latency. The DistilBERT model is trained to mimic the BERT model's performance while being more efficient.
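
As a point of reference, the benchmarking setup can be reproduced with the standard 🤗 Transformers `pipeline` API; the card itself does not show this step, so the following is only a minimal sketch:

```python
from transformers import pipeline

# Assumed setup (not shown in this card): load the fine-tuned teacher
# as a text-classification pipeline for benchmarking.
bert_ckpt = "transformersbook/bert-base-uncased-finetuned-clinc"
pipe = pipeline("text-classification", model=bert_ckpt)

# Illustrative query only
print(pipe("transfer $100 from my checking to my savings account"))
```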

  ## Intended uses & limitations

- More information needed
+ ### Intended uses
+
+ - Evaluating the performance efficiency of transformer models.
+ - Applying knowledge distillation to create smaller and faster models for text classification.
+
+ ### Limitations
+
+ - The benchmark results are specific to the dataset used (CLINC150) and may not generalize to other datasets.
+ - Knowledge distillation relies on the quality and performance of the teacher model.

  ## Training and evaluation data

- More information needed
+ The BERT model is fine-tuned on the CLINC150 dataset, which consists of labeled examples for intent classification. The dataset includes training, validation, and test splits.
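
The loading code is not included in the card; a plausible sketch, assuming the CLINC150 release hosted on the Hugging Face Hub as `clinc_oos` (the `plus` configuration adds out-of-scope examples):

```python
from datasets import load_dataset

# Assumption: CLINC150 as published on the Hub under "clinc_oos"
clinc = load_dataset("clinc_oos", "plus")
print(clinc)              # DatasetDict with train / validation / test splits
print(clinc["test"][0])   # e.g. {"text": "...", "intent": ...}
```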

  ## Training procedure

+ ### Performance Benchmark
+
+ The performance of the BERT model is evaluated using the PerformanceBenchmark class, which measures accuracy, model size, and latency.
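
Only individual methods of this class are shown below; as a sketch of how they might fit together (the constructor signature and `run_benchmark` are assumptions, not code from this card):

```python
class PerformanceBenchmark:
    def __init__(self, pipeline, dataset, optim_type="BERT baseline"):
        self.pipeline = pipeline
        self.dataset = dataset
        self.optim_type = optim_type

    # compute_accuracy, compute_size and time_pipeline are shown below.

    def run_benchmark(self):
        # Aggregate accuracy, model size and latency under a single label
        metrics = {self.optim_type: self.compute_accuracy()}
        metrics[self.optim_type].update(self.compute_size())
        metrics[self.optim_type].update(self.time_pipeline())
        return metrics
```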
+
+ ### Accuracy
+
+ The model's accuracy is computed on the test set of the CLINC150 dataset:
+
+ ```python
+ from datasets import load_metric
+
+ # Accuracy metric used by the benchmark
+ accuracy_score = load_metric("accuracy")
+ ```
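
The card only shows the metric being loaded; one way the accuracy computation might look, assuming the dataset's `text` and `intent` columns and an `intent` ClassLabel feature (this method is an illustration, not code from the card):

```python
def compute_accuracy(self):
    # Assumption: "intent" is a ClassLabel feature and the pipeline
    # returns intent names as labels.
    intents = self.dataset.features["intent"]
    preds, labels = [], []
    for example in self.dataset:
        pred = self.pipeline(example["text"])[0]["label"]
        preds.append(intents.str2int(pred))
        labels.append(example["intent"])
    # Returns a dict such as {"accuracy": ...}
    return accuracy_score.compute(predictions=preds, references=labels)
```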
+
+ ### Model Size
+
+ The size of the model is computed by saving its state dictionary to disk and measuring the file size in megabytes:
+
+ ```python
+ from pathlib import Path
+ import torch
+
+ def compute_size(self):
+     # Serialize the weights to a temporary file and report its size in megabytes
+     state_dict = self.pipeline.model.state_dict()
+     tmp_path = Path("model.pt")
+     torch.save(state_dict, tmp_path)
+     size_mb = tmp_path.stat().st_size / (1024 * 1024)
+     tmp_path.unlink()
+     return {"size_mb": size_mb}
+ ```
+
+ ### Latency
+
+ The average latency per query is measured over a sample of 100 queries:
+
+ ```python
+ from time import perf_counter
+ import numpy as np
+
+ def time_pipeline(self):
+     latencies = []
+     # Assumption: the dataset exposes the raw queries in a "text" column
+     for query in self.dataset[:100]["text"]:
+         start_time = perf_counter()
+         _ = self.pipeline(query)
+         latency = perf_counter() - start_time
+         latencies.append(latency)
+     time_avg_ms = 1000 * np.mean(latencies)
+     time_std_ms = 1000 * np.std(latencies)
+     return {"time_avg_ms": time_avg_ms, "time_std_ms": time_std_ms}
+ ```
+
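
Putting the pieces together, a benchmark run could look like the following, reusing the hypothetical `pipe` and `clinc` objects from the sketches above:

```python
pb = PerformanceBenchmark(pipe, clinc["test"])
perf_metrics = pb.run_benchmark()
print(perf_metrics)   # accuracy, size_mb, time_avg_ms, time_std_ms
```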
+ ### Knowledge Distillation
+
+ Knowledge distillation is used to train a smaller DistilBERT model using the predictions of the fine-tuned BERT model as soft labels.
+
+ ### Distillation Process
+
+ - Teacher model: `transformersbook/bert-base-uncased-finetuned-clinc`
+ - Student model: `distilbert-base-uncased`
+
+ The distillation process involves computing a weighted average of the cross-entropy loss with the ground truth labels and the Kullback-Leibler divergence between the teacher and student model predictions.
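
In equation form, with student logits $z_s$, teacher logits $z_t$, temperature $T$, and weight $\alpha$ (the convention below matches the code that follows; the $T^2$ factor keeps the gradient magnitude of the soft-label term comparable across temperatures):

$$
\mathcal{L} = \alpha \, \mathcal{L}_{\mathrm{CE}} + (1 - \alpha) \, T^{2} \, D_{\mathrm{KL}}\bigl(\mathrm{softmax}(z_t / T) \,\|\, \mathrm{softmax}(z_s / T)\bigr)
$$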
+
+ ```python
+ import torch
+ from torch import nn
+ import torch.nn.functional as F
+ from transformers import Trainer
+
+ class DistillationTrainer(Trainer):
+     def compute_loss(self, model, inputs, return_outputs=False):
+         # Student forward pass: cross-entropy loss against the ground-truth labels
+         outputs_stu = model(**inputs)
+         loss_ce = outputs_stu.loss
+         logits_stu = outputs_stu.logits
+         # Teacher forward pass (no gradients): its logits act as soft labels
+         with torch.no_grad():
+             outputs_tea = self.teacher(**inputs)
+             logits_tea = outputs_tea.logits
+         # KL divergence between temperature-softened teacher and student distributions
+         loss_fct = nn.KLDivLoss(reduction="batchmean")
+         loss_kd = self.args.temperature ** 2 * loss_fct(
+             F.log_softmax(logits_stu / self.args.temperature, dim=-1),
+             F.softmax(logits_tea / self.args.temperature, dim=-1))
+         # Weighted average of hard-label and distillation losses
+         loss = self.args.alpha * loss_ce + (1. - self.args.alpha) * loss_kd
+         return (loss, outputs_stu) if return_outputs else loss
+ ```
+
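
The `compute_loss` method above reads `alpha` and `temperature` from `self.args`, which implies a small `TrainingArguments` subclass that is not shown in the card; a minimal sketch (class name and default values are assumptions):

```python
from transformers import TrainingArguments

class DistillationTrainingArguments(TrainingArguments):
    def __init__(self, *args, alpha=0.5, temperature=2.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.alpha = alpha              # weight of the hard-label CE loss
        self.temperature = temperature  # softmax temperature for soft labels
```

The trainer also needs a reference to the teacher model (used as `self.teacher` in `compute_loss`), which would typically be passed to the `DistillationTrainer` constructor; that constructor is not shown in this card.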

  ### Training hyperparameters

  The following hyperparameters were used during training: