---
license: apache-2.0
datasets:
- hadyelsahar/ar_res_reviews
language:
- ar
metrics:
- accuracy
- precision
- recall
- f1
base_model:
- aubmindlab/bert-base-arabertv02
pipeline_tag: text-classification
---

# 🍽️ Arabic Restaurant Review Sentiment Analysis 🚀

## 📌 Overview
This project fine-tunes a **transformer-based model** to analyze sentiment in **Arabic restaurant reviews**. We used **Hugging Face's model training pipeline** and deployed the final model as an **interactive Gradio web app** (a local inference example appears at the end of this card).

## 📥 Data Collection
The dataset used for fine-tuning was sourced from **Hugging Face Datasets**:
[📂 Arabic Restaurant Reviews Dataset](https://huggingface.co/datasets/hadyelsahar/ar_res_reviews)

It contains **restaurant reviews in Arabic** labeled with sentiment polarity.

## 🔄 Data Preparation
- **Cleaning & Normalization**:
  - Removed non-Arabic text, special characters, and extra spaces.
  - Normalized Arabic characters (e.g., `إ, أ, آ → ا`, `ة → ه`).
  - Downsampled positive reviews to balance the dataset.
- **Tokenization**:
  - Used the **AraBERT tokenizer** for efficient text processing.
- **Train-Test Split**:
  - **80% training** | **20% testing**.

(Hedged sketches of the cleaning and tokenization steps appear at the end of this card.)

## 🏋️ Fine-Tuning & Results
The model was fine-tuned with **Hugging Face Transformers** on the dataset described above.

### **📊 Evaluation Metrics**
| Metric         | Score    |
|----------------|----------|
| **Train Loss** | `0.470`  |
| **Eval Loss**  | `0.373`  |
| **Accuracy**   | `86.41%` |
| **Precision**  | `87.01%` |
| **Recall**     | `86.49%` |
| **F1-score**   | `86.75%` |

## ⚙️ Training Parameters
```python
import torch
from transformers import AutoModelForSequenceClassification, TrainingArguments

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Matches the base_model declared in the card metadata
model_name = "aubmindlab/bert-base-arabertv02"
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,            # binary sentiment: negative / positive
    classifier_dropout=0.5,  # regularize the classification head
).to(device)

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=4,
    weight_decay=1,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    fp16=True,
    report_to="none",
    save_total_limit=2,
    gradient_accumulation_steps=2,
    load_best_model_at_end=True,
    max_grad_norm=1.0,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```
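
## 🧹 Preprocessing Sketch
The cleaning code itself is not included in this card, so the block below is a minimal sketch of the normalization rules listed under **Data Preparation**. The `normalize_arabic` function name and the exact character ranges are illustrative assumptions, not the project's actual implementation.

```python
import re

def normalize_arabic(text: str) -> str:
    """Sketch of the cleaning rules described above (assumed, not the original code)."""
    # Normalize alef variants and ta marbuta: إ, أ, آ → ا and ة → ه
    text = re.sub(r"[إأآ]", "ا", text)
    text = text.replace("ة", "ه")
    # Drop anything outside the basic Arabic block (removes Latin letters/digits and ASCII punctuation)
    text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

print(normalize_arabic("الأكل رائع!!! 10/10"))  # → "الاكل رائع"
```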
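
## ✂️ Tokenization & Split Sketch
Similarly, the block below sketches loading the dataset, tokenizing with the **AraBERT tokenizer**, and producing the 80/20 split. The `text` column name, `max_length=128`, and `seed=42` are assumptions to verify against the actual dataset schema and training script.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")
dataset = load_dataset("hadyelsahar/ar_res_reviews", split="train")

def tokenize(batch):
    # "text" is the assumed review column; max_length is an assumed value
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

# 80% training | 20% testing, as described under Data Preparation
splits = tokenized.train_test_split(test_size=0.2, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
```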
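
## 🔍 Inference Example
For completeness, a hedged example of querying the fine-tuned model with the `pipeline` API, the same path a Gradio app would typically wrap. The repo id below is a placeholder; substitute this model's actual id.

```python
from transformers import pipeline

# Placeholder repo id; replace with this model's actual Hugging Face id
classifier = pipeline("text-classification", model="<username>/arabic-restaurant-sentiment")

# "The food is delicious and the service is excellent" → expected positive
print(classifier("الأكل لذيذ والخدمة ممتازة"))
```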