
Introduction

This model classifies the sentiment of a text as Positive, Neutral, or Negative. It is a fine-tuned version of UBC-NLP/MARBERTv2, trained on labr.

Data

The data used is labr, an Arabic book reviews dataset. The sentiment label is derived from the number of stars given in each review.

Number of stars Sentiment
1-2 Negative
3 Neutral
4-5 Positive
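
The mapping in the table above can be sketched as a small function. This is an illustrative helper only; the name stars_to_sentiment is hypothetical and does not appear in the training code, which applies the same mapping through a dictionary keyed on zero-based label indices (0 to 4).

```python
def stars_to_sentiment(stars: int) -> str:
    """Map a 1-5 star rating to a sentiment label, following the table above."""
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected a rating from 1 to 5, got {stars!r}")
    if stars <= 2:
        return "Negative"
    if stars == 3:
        return "Neutral"
    return "Positive"


# Show the label assigned to each possible rating.
print([stars_to_sentiment(s) for s in range(1, 6)])
# ['Negative', 'Negative', 'Neutral', 'Positive', 'Positive']
```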

Training

Using the Arabic pre-trained MARBERTv2 as a base, we fine-tuned the model for a classification task. Training ran for 3 epochs using the Hugging Face Trainer on Google Colab. This is a proof-of-concept (POC) experiment, so the training hyperparameters were not optimized.

Evaluation

The model was evaluated on the labr test set, using the same preprocessing steps as in training. Note that the following results are macro averages over the three classes.

Metric Score
Precision 0.663
Recall 0.662
F1 0.66
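
For readers unfamiliar with macro averaging: each metric is computed per class and the per-class values are then averaged with equal weight, regardless of how many examples each class has. The following is a minimal pure-Python sketch of that calculation on made-up toy labels. It is not the evaluation script used for the scores above, and macro_scores is a hypothetical helper name.

```python
def macro_scores(y_true, y_pred, labels):
    """Return macro-averaged precision, recall, and F1 over the given labels."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Guard against division by zero when a class is never predicted or never present.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy data for illustration only; these are not labr predictions.
y_true = ["Positive", "Positive", "Neutral", "Negative"]
y_pred = ["Positive", "Neutral", "Neutral", "Negative"]
scores = macro_scores(y_true, y_pred, ["Positive", "Neutral", "Negative"])
print(scores)
```

Because every class counts equally, a weak score on a small class (for example Neutral) lowers the macro average as much as a weak score on a large class would.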

Using the model

To use the model in your code, follow the Hugging Face instructions, or use the pipeline API:

from transformers import pipeline

pipe = pipeline("text-classification", model="AbdallahNasir/book-review-sentiment-classification")
result = pipe("من أفضل الكتب التي قرأتها في هذا العام")
print(result)
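
A text-classification pipeline also accepts a list of strings and returns one dictionary with "label" and "score" keys per input. The sketch below pairs each review with its prediction. The helper classify_reviews and the stand-in fake_pipe are hypothetical and exist only so the example runs without downloading the model; the label strings the published model actually returns are not confirmed here.

```python
from typing import Callable, List, Sequence, Tuple


def classify_reviews(pipe: Callable, texts: Sequence[str]) -> List[Tuple[str, str, float]]:
    """Run a text-classification pipeline on several texts and pair inputs with outputs."""
    predictions = pipe(list(texts))
    return [(text, pred["label"], pred["score"]) for text, pred in zip(texts, predictions)]


def fake_pipe(texts):
    """Stand-in for the real pipeline (assumption: same output shape, made-up values)."""
    return [{"label": "Positive", "score": 0.9} for _ in texts]


reviews = ["من أفضل الكتب التي قرأتها في هذا العام", "كتاب ممل"]
rows = classify_reviews(fake_pipe, reviews)
for text, label, score in rows:
    print(f"{label} ({score:.2f}): {text}")
```

To run it against the real model, pass the pipe object created above in place of fake_pipe.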

Training code

Running the following code should reproduce the results reported above. You can run it in Google Colab; use a GPU runtime so that training finishes quickly.

# Notebook only:
!pip install transformers[torch] datasets

# Download and load the data
import datasets
dataset = datasets.load_dataset("labr")

# Transform the ratings into Sentiment
POSITIVE = "Positive"
NEUTRAL = "Neutral"
NEGATIVE = "Negative"
rate_to_sentiment = {0: NEGATIVE, 1: NEGATIVE, 2: NEUTRAL, 3: POSITIVE, 4: POSITIVE}
dataset = dataset.map(lambda example: {"sentiment": rate_to_sentiment[example["label"]]}, remove_columns=["label"])
dataset = dataset.rename_column("sentiment", "label")
class_names = [POSITIVE, NEUTRAL, NEGATIVE]  
num_classes = len(class_names)
dataset = dataset.cast_column('label', datasets.ClassLabel(num_classes=num_classes, names=class_names))

# Download and load the pre-trained model and tokenizer
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
model = AutoModelForSequenceClassification.from_pretrained("UBC-NLP/MARBERTv2", num_labels=3)

# Tokenize data for training
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, return_length=True, return_attention_mask=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=False, num_proc=16)

# Define data collator, useful for training and batching.
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

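# Define training arguments (num_train_epochs defaults to 3)
from transformers import TrainingArguments, Trainer
# Note: recent transformers releases rename evaluation_strategy to eval_strategy
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")

# Build the trainer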
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Train and save
trainer.train()
trainer.save_model("final_output")
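
The training code above loads the base model with num_labels=3 only. One optional refinement, sketched here as an assumption rather than as part of the original experiment, is to build id2label and label2id mappings from class_names so that a saved model can report readable label names. The helper build_label_maps is hypothetical.

```python
def build_label_maps(class_names):
    """Build index-to-name and name-to-index mappings in ClassLabel order."""
    id2label = {index: name for index, name in enumerate(class_names)}
    label2id = {name: index for index, name in id2label.items()}
    return id2label, label2id


# Same order as class_names in the training code, so the ids match the ClassLabel feature.
id2label, label2id = build_label_maps(["Positive", "Neutral", "Negative"])
print(id2label)
print(label2id)

# Assumed usage when loading the base model (not executed here):
# model = AutoModelForSequenceClassification.from_pretrained(
#     "UBC-NLP/MARBERTv2", num_labels=3, id2label=id2label, label2id=label2id
# )
```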

Keywords

  • sentiment analysis
  • arabic
  • book reviews