## Load dataset

In [1]:
from datasets import load_dataset

In [2]:
shoe_dataset = load_dataset("mazed/amazon_shoe_review")

In [3]:
shoe_dataset

DatasetDict({
    train: Dataset({
        features: ['labels', 'text'],
        num_rows: 90000
    })
    test: Dataset({
        features: ['labels', 'text'],
        num_rows: 10000
    })
})

In [4]:
shoe_dataset["train"][0]

{'labels': 1,
 'text': "Material looks cheaper than what I expected. Doesn't seem like real quality leather."}

## Train

In [5]:
import transformers

In [6]:
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

In [7]:
train_dataset = shoe_dataset['train']
valid_dataset = shoe_dataset['test']

In [8]:
train_dataset

Dataset({
    features: ['labels', 'text'],
    num_rows: 90000
})

In [9]:
valid_dataset

Dataset({
    features: ['labels', 'text'],
    num_rows: 10000
})

Define a function that computes accuracy and support-weighted precision, recall, and F1 on the evaluation set.

In [10]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

In [11]:
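def compute_metrics(pred):
    # pred.label_ids holds the true labels, pred.predictions the model logits
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)  # most likely class per example
    # average='weighted' weights each per-class score by that class's support
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}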
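
To make `average='weighted'` concrete, here is a minimal pure-Python sketch of the same computation. The helper name `weighted_scores` and the toy labels are made up for illustration; this is not how scikit-learn implements it internally, just the arithmetic it should agree with.

```python
from collections import Counter

def weighted_scores(labels, preds):
    """Support-weighted precision, recall and F1, computed by hand."""
    support = Counter(labels)  # number of true examples per class
    n = len(labels)
    precision = recall = f1 = 0.0
    for c in sorted(support):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        predicted = sum(1 for p in preds if p == c)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support[c]
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        w = support[c] / n  # weight = share of true examples in class c
        precision += w * p_c
        recall += w * r_c
        f1 += w * f_c
    return precision, recall, f1

# Toy labels, invented for this example only.
labels = [0, 0, 1, 1, 2, 2, 2, 2]
preds = [0, 1, 1, 1, 2, 2, 0, 2]

precision, recall, f1 = weighted_scores(labels, preds)
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
print(accuracy, precision, recall, f1)
```

One consequence of the weighting: support-weighted recall always equals accuracy, because the per-class supports cancel out. So the recall reported by `compute_metrics` carries no information beyond the accuracy.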

Let's download the base model and its tokenizer from the Hugging Face Hub.

In [12]:
base_model_id = "distilbert-base-uncased"
num_labels = 5

In [13]:
model = AutoModelForSequenceClassification.from_pretrained(base_model_id, num_labels=num_labels)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
sample_text = "This is a sample text."

In [15]:
encoded_sample_text = tokenizer(sample_text)

In [16]:
encoded_sample_text

{'input_ids': [101, 2023, 2003, 1037, 7099, 3793, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

Now, we define a function to tokenize the datasets.

In [17]:
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)
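
To see what `padding='max_length'` and `truncation=True` do to a single sequence, here is a rough pure-Python imitation. The helper `pad_or_truncate` is hypothetical and only mimics the behaviour for one right-truncated, right-padded sequence; the special-token ids (101 for [CLS], 102 for [SEP], 0 for [PAD]) are those of the BERT uncased vocabulary. The short `max_length` values are chosen for readability; the real call pads to the tokenizer's maximum length, presumably 512 for this checkpoint.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in the BERT uncased vocab

def pad_or_truncate(input_ids, max_length):
    """Imitate padding='max_length' with truncation=True for one sequence.

    `input_ids` is expected to start with [CLS] and end with [SEP].
    """
    if len(input_ids) > max_length:
        # Drop tokens from the right, then put [SEP] back at the end.
        input_ids = input_ids[: max_length - 1] + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 = real token
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 = padding, ignored by attention
    }

# The ids the tokenizer produced for "This is a sample text."
sample = [101, 2023, 2003, 1037, 7099, 3793, 1012, 102]

padded = pad_or_truncate(sample, max_length=12)
truncated = pad_or_truncate(sample, max_length=5)
print(padded)
print(truncated)
```

Padding every review to the maximum length is simple but wasteful for short texts, since most positions end up as padding.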


In [18]:
# # Define the tokenize function
# def tokenize(batch):
#     try:
#         # Tokenize the text and return the tokenized output
#         tokenized = tokenizer(batch['text'], padding='max_length', truncation=True)
#         return tokenized
#     except Exception as e:
#         print(f"Error tokenizing batch: {e}")
#         print(batch['text'])
#         return {}

In [19]:
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
valid_dataset = valid_dataset.map(tokenize, batched=True, batch_size=len(valid_dataset))

Map:   0%|          | 0/90000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Define the TrainingArguments for our training job: hyperparameters, where to save the model, etc.

In [20]:
epochs = 3
learning_rate = 5e-5
train_batch_size = 32
eval_batch_size = 32
save_strategy = 'epoch'

In [21]:
training_args = TrainingArguments(
    output_dir="/kaggle/working/",
    # Name the run after the model and dataset actually used here
    run_name="distilbert-base-uncased-finetune-amazon-shoe-review",
    num_train_epochs=epochs,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=eval_batch_size,
    save_strategy=save_strategy,
    eval_strategy='epoch',
    learning_rate=learning_rate,
)
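
The `learning_rate` above is only the starting value. Since these arguments don't set a scheduler, the Trainer should fall back to its defaults: linear decay to zero with no warmup. The sketch below reproduces that schedule by hand. `linear_lr` is a hypothetical helper, and `total_steps=1000` is an arbitrary illustrative value, not the step count of this run.

```python
def linear_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr during warmup (skipped when warmup_steps=0).
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 1000  # illustrative only
for step in (0, 250, 500, 750, 1000):
    print(step, linear_lr(step, total_steps))
```

So by the last epoch the model is training with a much smaller step size than the nominal 5e-5.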

Now we use the Trainer class to put all the pieces together.

In [22]:
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
)

Ready to train.

In [23]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 1.0092 | 0.94827 | 0.5769 | 0.57194 | 0.572744 | 0.5769 |
| 2 | 0.8831 | 0.937286 | 0.5826 | 0.581743 | 0.584837 | 0.5826 |
| 3 | 0.7738 | 0.975922 | 0.582 | 0.580953 | 0.581424 | 0.582 |
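
Training loss keeps falling across all three epochs, but validation loss rises again in epoch 3, which usually hints at the start of overfitting. A small sketch that picks the best epoch from the logged numbers (the `history` list is copied by hand from the results above, not read from the Trainer):

```python
# Per-epoch metrics copied from the training log.
history = [
    {"epoch": 1, "train_loss": 1.0092, "eval_loss": 0.94827, "accuracy": 0.5769},
    {"epoch": 2, "train_loss": 0.8831, "eval_loss": 0.937286, "accuracy": 0.5826},
    {"epoch": 3, "train_loss": 0.7738, "eval_loss": 0.975922, "accuracy": 0.582},
]

best_by_loss = min(history, key=lambda row: row["eval_loss"])
best_by_acc = max(history, key=lambda row: row["accuracy"])

print("lowest validation loss at epoch", best_by_loss["epoch"])
print("highest accuracy at epoch", best_by_acc["epoch"])
```

Both criteria point at epoch 2. Since a checkpoint is saved every epoch, that one could be reloaded instead of the final model; `TrainingArguments` also offers `load_best_model_at_end=True` for this, which is not used in this run.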




TrainOutput(global_step=4221, training_loss=0.8959762511787694, metrics={'train_runtime': 6804.0045, 'train_samples_per_second': 39.683, 'train_steps_per_second': 0.62, 'total_flos': 3.57681111552e+16, 'train_loss': 0.8959762511787694, 'epoch': 3.0})
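
The reported `global_step=4221` is worth a quick sanity check. The sketch below assumes one optimizer step per batch (no gradient accumulation) and a partial last batch that still counts as a step. The helper `total_optimizer_steps` is hypothetical, and the device count of 2 is an inference from the numbers, not something the notebook states.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch, num_devices, epochs):
    """Optimizer steps for a run, assuming no gradient accumulation."""
    effective_batch = per_device_batch * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs

one_device = total_optimizer_steps(90000, 32, num_devices=1, epochs=3)
two_devices = total_optimizer_steps(90000, 32, num_devices=2, epochs=3)
print(one_device, two_devices)

# Throughput check against the reported train_samples_per_second.
samples_per_second = 90000 * 3 / 6804.0045
print(round(samples_per_second, 3))
```

Only the two-device case reproduces 4221 steps, so the effective batch size was most likely 64 rather than 32. The throughput also matches the log: 270,000 samples in about 6,804 seconds is roughly 39.68 samples per second.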