AliakbarKamali committed on
Commit
0096df5
1 Parent(s): 0035b39

Initial commit

Files changed (4)
  1. .gitignore +1 -0
  2. .vscode/settings.json +5 -0
  3. app.py +415 -0
  4. requirements.txt +4 -0
.gitignore ADDED
@@ -0,0 +1 @@
+ venv
.vscode/settings.json ADDED
@@ -0,0 +1,5 @@
+ {
+     "python-envs.defaultEnvManager": "ms-python.python:conda",
+     "python-envs.defaultPackageManager": "ms-python.python:conda",
+     "python-envs.pythonProjects": []
+ }
app.py ADDED
@@ -0,0 +1,415 @@
+ #!/usr/bin/env python
+ # coding: utf-8
+
+ #
+ # # Exercise 3: Fine-Tuning a Pretrained Transformer on a Text Classification Task
+ #
+ # In this lab, you will apply the concepts learned by fine-tuning a pre-trained Transformer model on a text classification task using the Hugging Face `transformers` library.
+ #
+ # ## Objectives:
+ # - Learn to load a pre-trained model from Hugging Face.
+ # - Fine-tune the model on a text classification dataset.
+ # - Evaluate and save the fine-tuned model.
+ #
+
+ # ## Installing Necessary Libraries
+ #
+ # In this lab, we will use the following Python libraries:
+ #
+ # - **Transformers**: Hugging Face's `transformers` library provides pre-trained models for various Natural Language Processing (NLP) tasks. We will use it to load and fine-tune a pre-trained transformer model.
+ #
+ # - **Datasets**: Hugging Face's `datasets` library allows easy access to a wide range of datasets and provides tools for efficient data processing.
+ #
+ # - **Scikit-learn**: A widely used machine learning library that includes tools for metrics, evaluation, and preprocessing. In this lab, we'll use it to calculate evaluation metrics such as accuracy, precision, recall, and F1 score.
+ #
+
+ # In[1]:
+
+
+ # Install necessary libraries
+
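+ # The install step is left empty in this commit; a minimal sketch of the assumed
+ # commands (run once in a notebook cell or shell) would be:
+ # !pip install transformers datasets scikit-learn torch gradio
+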
+ # # Setup and Import Libraries
+ #
+ # In this section, we import all the necessary libraries for our AG News text classification lab.
+ #
+ # - **os, random, numpy, torch**: For system operations, reproducibility, and PyTorch support.
+ # - **datasets**: To load the AG News dataset from Hugging Face.
+ # - **transformers**: Includes the tokenizer, model, training utilities, data collator, and callbacks for early stopping.
+ #
+
+ # In[2]:
+
+
+ # Set random seeds for reproducibility
+ import os
+ import random
+ import numpy as np
+ import torch
+
+ SEED = 42
+ random.seed(SEED)
+ np.random.seed(SEED)
+ torch.manual_seed(SEED)
+ if torch.cuda.is_available():
+     torch.cuda.manual_seed_all(SEED)
+
+ # Import dataset and transformer utilities
+ from datasets import load_dataset
+ from transformers import (
+     AutoTokenizer,
+     AutoModelForSequenceClassification,
+     DataCollatorWithPadding,
+     TrainingArguments,
+     Trainer,
+     EarlyStoppingCallback
+ )
+
+
+ # Select the device: Apple MPS if available, otherwise CPU
+ USE_MPS = torch.backends.mps.is_available()
+ device = torch.device("mps" if USE_MPS else "cpu")
+ print("Using device:", device)
+
+
+ # # Load AG News Dataset
+ #
+ # We load the AG News topic classification dataset (`SetFit/ag_news`) from Hugging Face:
+ #
+ # - The dataset contains **120,000 training examples** and **7,600 test examples**.
+ # - Each example has a `text` field (the news article) and a `label` field (`0` = World, `1` = Sports, `2` = Business, `3` = Sci/Tech).
+ # - We print the dataset to verify its structure.
+ #
+
+ # In[3]:
+
+
+ # Load the ag_news dataset
+ raw = load_dataset("SetFit/ag_news")
+ print(raw)
+
+
+ # # Tokenization and Dataset Preparation
+ #
+ # In this section, we:
+ #
+ # 1. Load the pre-trained BERT tokenizer (`bert-base-uncased`).
+ # 2. Tokenize the AG News text data with truncation to a maximum sequence length of 128 tokens.
+ # 3. Remove the original text column so that the Trainer only works with tensor inputs.
+ # 4. Set the dataset format to PyTorch tensors for compatibility with the Trainer.
+ # 5. Split the training set to create a validation set (5,000 examples) for monitoring validation loss during fine-tuning.
+ #
+
+ # In[4]:
+
+
+ # Load BERT tokenizer
+ MODEL_NAME = "bert-base-uncased"
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
+
+ # Tokenization function
+ def tokenize_fn(examples):
+     return tokenizer(examples["text"], truncation=True, max_length=128)
+
+ # Keep only the label column; every other original column is dropped during mapping
+ cols_to_remove = [c for c in raw["train"].column_names if c not in ("label",)]
+
+ # Apply tokenization to the dataset
+ tokenized = raw.map(tokenize_fn, batched=True, remove_columns=cols_to_remove)
+
+
+ # Remove original text column to avoid issues during batching
+ if "text" in tokenized["train"].column_names:
+     tokenized = tokenized.remove_columns(["text"])
+
+ # Set dataset format to PyTorch tensors
+ tokenized.set_format("torch")
+
+ # Shuffle and split the training dataset to create a validation set
+ train_dataset = tokenized["train"].shuffle(seed=SEED)
+ val_split = train_dataset.train_test_split(test_size=5000, seed=SEED)
+ train_dataset = val_split["train"]
+ eval_dataset = val_split["test"]
+
+
+ # After tokenization, column removal, and setting the dataset format to PyTorch tensors, the training dataset looks like this:
+ #
+ # - **Number of examples:** 115,000 (after splitting off 5,000 examples for validation)
+ # - **Features:**
+ #   - `label` – the target topic (0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech)
+ #   - `input_ids` – numerical token IDs representing the words/subwords of each text
+ #   - `token_type_ids` – segment IDs used by BERT to distinguish sentences (for single-sentence tasks, usually all zeros)
+ #   - `attention_mask` – indicates which tokens are real (1) and which are padding (0), allowing the model to ignore padding during self-attention
+ #
+
+ # In[5]:
+
+
+ print(train_dataset)
+
+
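+ # As an illustrative check (not part of the original notebook), one tokenized
+ # example can be decoded back to text to verify the preprocessing:
+ print("Example label:", int(train_dataset[0]["label"]))
+ print("Decoded text:", tokenizer.decode(train_dataset[0]["input_ids"], skip_special_tokens=True))
+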
+ # # Model Initialization and Data Collator
+ #
+ # In this section, we:
+ #
+ # 1. Load a pre-trained BERT model (`bert-base-uncased`) for sequence classification with 4 output labels (the AG News topics: World, Sports, Business, Sci/Tech).
+ # 2. Set up a `DataCollatorWithPadding` to dynamically pad batches during training and evaluation.
+ #
+
+ # In[6]:
+
+
+ # Load pre-trained BERT model for sequence classification
+ model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=4)
+
+ # Create a data collator that dynamically pads input sequences in each batch
+ data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
+
+
+ # # Metrics Calculation Using Scikit-Learn
+ #
+ # In this section, we define a function to compute evaluation metrics for the model using **scikit-learn**:
+ #
+ # - **Accuracy**: Fraction of correctly predicted examples.
+ # - **Macro F1 Score**: The F1 score (harmonic mean of precision and recall) averaged over the four classes.
+ #
+ # Using scikit-learn keeps the metric computation simple and reliable.
+ #
+
+ # In[7]:
+
+
+ from sklearn.metrics import accuracy_score, f1_score
+ import numpy as np
+
+ # Define a metrics computation function using scikit-learn
+ def compute_metrics(eval_pred):
+     logits, labels = eval_pred
+     # Convert logits to predicted class indices
+     preds = np.argmax(logits, axis=-1)
+
+     # Compute accuracy and macro F1 score using scikit-learn
+     acc = accuracy_score(labels, preds)
+     f1 = f1_score(labels, preds, average='macro')
+
+     return {"accuracy": acc, "f1_macro": f1}
+
+
+ # # Training Setup with Trainer and Early Stopping
+ #
+ # In this section, we configure the training process:
+ #
+ # 1. **TrainingArguments**:
+ #    - Sets the output directory, batch sizes, number of epochs, learning rate, weight decay, warmup steps, and other training hyperparameters.
+ #    - Enables evaluation and checkpoint saving at the end of each epoch.
+ #    - Loads the best model at the end based on evaluation loss.
+ #    - Uses mixed precision (fp16) if a CUDA GPU is available.
+ #
+ # 2. **Trainer**:
+ #    - Combines the model, datasets, tokenizer, data collator, and metrics function.
+ #    - Includes `EarlyStoppingCallback` to stop training if the validation loss does not improve for 2 evaluation steps (epochs in this case).
+ #
+ # This setup makes fine-tuning BERT efficient and helps prevent overfitting.
+ #
+
+ # In[8]:
+
+
+ from transformers import TrainingArguments, Trainer, EarlyStoppingCallback
+
+ # Define training arguments
+ training_args = TrainingArguments(
+     output_dir="./results",
+     eval_strategy="epoch",
+     save_strategy="epoch",
+     logging_strategy="epoch",
+     # report_to=[],  # <- disable all integrations (no wandb, no tensorboard)
+     per_device_train_batch_size=8,
+     per_device_eval_batch_size=8,
+     num_train_epochs=3,
+     learning_rate=2e-5,
+     weight_decay=0.1,
+     warmup_steps=100,
+     load_best_model_at_end=True,
+     metric_for_best_model="eval_loss",
+     greater_is_better=False,
+     save_total_limit=3,
+     fp16=torch.cuda.is_available(),
+     dataloader_drop_last=False,
+     gradient_accumulation_steps=1,
+     seed=SEED,
+ )
+
+ # Create Trainer instance with early stopping
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=train_dataset,
+     eval_dataset=eval_dataset,
+     tokenizer=tokenizer,
+     data_collator=data_collator,
+     compute_metrics=compute_metrics,
+     callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
+ )
+
+
+ # # Start Training
+ #
+ # Now we begin fine-tuning the BERT model on the AG News dataset using the Hugging Face Trainer.
+ #
+ # - The training process will:
+ #   1. Iterate over the training dataset for the specified number of epochs.
+ #   2. Evaluate on the validation set at the end of each epoch.
+ #   3. Save checkpoints for the best model based on validation loss.
+ #   4. Stop early if the validation loss does not improve for 2 consecutive evaluation steps (early stopping).
+ # - Training progress, loss, accuracy, and F1 score will be displayed in real time.
+ #
+
+ # In[9]:
+
+
+ # Start model training
+ trainer.train()
+
+
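+ # As an optional follow-up (not part of the original commit), the metrics of the
+ # best checkpoint on the validation set can be reported explicitly:
+ val_metrics = trainer.evaluate()
+ print(val_metrics)
+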
+ # # Save the Fine-Tuned Model and Tokenizer
+ #
+ # After training, we save the fine-tuned BERT model and its tokenizer so that they can be easily reloaded later for inference or further fine-tuning.
+ #
+ # - The model weights and configuration are saved in the folder `my-fine-tuned-bert`.
+ # - The tokenizer files are saved in the same directory.
+ #
+
+ # In[10]:
+
+
+ # Save the fine-tuned model
+ trainer.save_model('my-fine-tuned-bert')
+
+ # Save the tokenizer
+ tokenizer.save_pretrained('my-fine-tuned-bert')
+
+
+ # # Inference with the Fine-Tuned Model
+ #
+ # In this section, we demonstrate how to load the fine-tuned BERT model and tokenizer, and perform topic prediction on new text.
+ #
+ # - We use Hugging Face's `TextClassificationPipeline` for convenient text classification.
+ # - The model predicts one of `LABEL_0` to `LABEL_3`, corresponding to the four AG News topics.
+ # - We map these labels to meaningful topic names for clarity.
+ # - Finally, we test the model on a sample sentence.
+ #
+
+ # In[11]:
+
+
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer, TextClassificationPipeline
+
+ # Load the fine-tuned model and tokenizer
+ new_model = AutoModelForSequenceClassification.from_pretrained('my-fine-tuned-bert')
+ new_tokenizer = AutoTokenizer.from_pretrained('my-fine-tuned-bert')
+
+ # Create a text classification pipeline
+ classifier = TextClassificationPipeline(
+     model=new_model,
+     tokenizer=new_tokenizer,
+ )
+
+ # Define label mapping
+ label_mapping = {
+     0: 'World',
+     1: 'Sports',
+     2: 'Business',
+     3: 'Sci/Tech'
+ }
+
+ # Test the classifier on a sample sentence
+ sample_text = "This movie was good"
+ result = classifier(sample_text)
+
+ # Map the predicted label to a meaningful topic name
+ mapped_result = {
+     'label': label_mapping[int(result[0]['label'].split('_')[1])],
+     'score': result[0]['score']
+ }
+
+ print(mapped_result)
+
+
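+ # A possible refinement (not part of the original commit): storing the label names
+ # in the model config lets the pipeline return readable labels directly, so the
+ # manual `label_mapping` lookup above is no longer needed.
+ new_model.config.id2label = dict(label_mapping)
+ new_model.config.label2id = {name: i for i, name in label_mapping.items()}
+ named_classifier = TextClassificationPipeline(model=new_model, tokenizer=new_tokenizer)
+ print(named_classifier(sample_text))
+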
+ # In[12]:
+
+
+ # Evaluate the fine-tuned model on the AG News test split
+ from datasets import load_dataset
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer
+ import numpy as np
+ from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
+
+ MODEL_DIR = "my-fine-tuned-bert"
+ MODEL_NAME = "bert-base-uncased"
+
+ raw = load_dataset("SetFit/ag_news")
+ tok = AutoTokenizer.from_pretrained(MODEL_NAME)
+
+ def tokenize_fn(ex):
+     return tok(ex["text"], truncation=True, max_length=128)
+
+ test_tok = raw["test"].map(tokenize_fn, batched=True)
+ test_tok = test_tok.rename_column("label", "labels")
+ cols = ["input_ids", "attention_mask", "labels"]
+ if "token_type_ids" in test_tok.column_names:
+     cols.append("token_type_ids")
+ test_tok.set_format(type="torch", columns=cols)
+
+ model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
+
+ trainer = Trainer(model=model, tokenizer=tok)
+ pred_out = trainer.predict(test_tok)
+
+ y_prob = pred_out.predictions
+ y_pred = np.argmax(y_prob, axis=1)
+ y_true = pred_out.label_ids
+
+ acc = accuracy_score(y_true, y_pred)
+ f1m = f1_score(y_true, y_pred, average="macro")
+ print({"test_accuracy": acc, "test_f1_macro": f1m})
+ print(classification_report(y_true, y_pred, digits=4))
+
+ # Plot the confusion matrix for the test predictions
+ from sklearn.metrics import ConfusionMatrixDisplay
+ import matplotlib.pyplot as plt
+ ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
+ plt.title("AG News – Confusion Matrix (Test)")
+ plt.show()
+
+
+ # In[13]:
+
+
+ # Build a small Gradio demo for interactive topic prediction
+ import torch, gradio as gr
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
+
+ MODEL_ID = "my-fine-tuned-bert"
+
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+ model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
+
+ label_names = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
+
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ clf = pipeline("text-classification", model=model, tokenizer=tokenizer, top_k=None, device=device)
+
+ def predict(text):
+     text = text.strip()
+     out = clf(text, truncation=True)
+     # top_k=None returns a list of dicts; unwrap a nested list if present
+     out = out[0] if isinstance(out[0], list) else out
+     results = {}
+     for o in sorted(out, key=lambda x: -x["score"]):
+         idx = int(o["label"].split("_")[1])
+         results[label_names[idx]] = o["score"]
+     return results
+
+ demo = gr.Interface(
+     fn=predict,
+     inputs=gr.Textbox(lines=3, label="Enter news headline"),
+     outputs=gr.Label(num_top_classes=4, label="Predicted topic"),
+     title="AG News Topic Classifier (BERT-base)"
+ )
+
+ demo.launch(share=True)
+
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ numpy
+ torch
+ transformers
+ gradio
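+ # Note: app.py also imports datasets, scikit-learn, and matplotlib, which are not
+ # listed here; a fuller requirements file would presumably also include:
+ # datasets
+ # scikit-learn
+ # matplotlib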