shegga committed
Commit 4fc5703 · Parent: 4d616d3
Files changed (1): README.md (+202 -7)

README.md CHANGED
@@ -44,13 +44,98 @@ A Vietnamese sentiment analysis web interface built with Gradio and transformer
 
 ## 📊 Model Details
 
- - **Model**: 5CD-AI/Vietnamese-Sentiment-visobert
+ - **Base Model**: 5CD-AI/Vietnamese-Sentiment-visobert (see the inference sketch below)
 - **Architecture**: Transformer-based (XLM-RoBERTa)
 - **Language**: Vietnamese
- - **Labels**: Negative, Neutral, Positive
+ - **Labels**: Negative (0), Neutral (1), Positive (2)
 - **Max Sequence Length**: 512 tokens
 - **Device**: Automatic CUDA/CPU detection
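+
+ A minimal inference sketch for this model (assuming the standard `transformers` pipeline API; the exact label strings come from the model config):
+
+ ```python
+ import torch
+ from transformers import pipeline
+
+ # Mirror the app's automatic CUDA/CPU detection.
+ device = 0 if torch.cuda.is_available() else -1
+
+ classifier = pipeline(
+     "text-classification",
+     model="5CD-AI/Vietnamese-Sentiment-visobert",
+     device=device,
+ )
+
+ # "The lecturer teaches engagingly and with dedication."
+ print(classifier("Giảng viên dạy rất hay và tâm huyết."))
+ # -> [{'label': ..., 'score': ...}]
+ ```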
 
+ ## 🎯 Fine-Tuning Configuration
+
+ ### Training Parameters
+ - **Learning Rate**: 2e-5
+ - **Batch Size**: 16 (train/eval)
+ - **Training Epochs**: 3
+ - **Weight Decay**: 0.01
+ - **Seed**: 42 (for reproducibility)
+ - **Optimizer**: AdamW (the `Trainer` default)
+
+ ### Training Strategy
+ - **Evaluation Strategy**: Evaluate at the end of each epoch
+ - **Save Strategy**: Save a checkpoint at each epoch
+ - **Best Model Selection**: The checkpoint with the highest F1 score
+ - **Best-Model Loading**: The best checkpoint is reloaded at the end of training
+ - **Logging**: Every 10 steps
+ - **Checkpoint Limit**: Keep only the last 2 checkpoints (see the `TrainingArguments` sketch below)
+
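+ The configuration above, expressed as `TrainingArguments` (a sketch that mirrors the listed values; the exact arguments used in `py/fine_tune_sentiment.py` may differ):
+
+ ```python
+ from transformers import TrainingArguments
+
+ training_args = TrainingArguments(
+     output_dir="./vietnamese_sentiment_finetuned",
+     learning_rate=2e-5,
+     per_device_train_batch_size=16,
+     per_device_eval_batch_size=16,
+     num_train_epochs=3,
+     weight_decay=0.01,
+     seed=42,
+     evaluation_strategy="epoch",   # evaluate once per epoch
+     save_strategy="epoch",         # checkpoint once per epoch
+     load_best_model_at_end=True,   # reload the best checkpoint after training
+     metric_for_best_model="f1",    # select the best checkpoint by F1
+     logging_steps=10,
+     save_total_limit=2,            # keep only the last 2 checkpoints
+ )
+ ```
+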
+ ### Data Processing
+ - **Tokenization**: AutoTokenizer with truncation and padding
+ - **Max Length**: 512 tokens
+ - **Data Collator**: DataCollatorWithPadding for dynamic padding
+ - **Text Columns**: Auto-detected (sentence, text, comment, feedback)
+ - **Label Columns**: Auto-detected (sentiment, label, labels)
+
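+ A sketch of that tokenization step (assuming a `datasets` `Dataset` named `raw_dataset` whose text column has already been detected as `"text"`):
+
+ ```python
+ from transformers import AutoTokenizer, DataCollatorWithPadding
+
+ tokenizer = AutoTokenizer.from_pretrained("5CD-AI/Vietnamese-Sentiment-visobert")
+
+ def tokenize(batch):
+     # Truncate to the 512-token maximum; padding is deferred to the collator.
+     return tokenizer(batch["text"], truncation=True, max_length=512)
+
+ tokenized = raw_dataset.map(tokenize, batched=True)
+
+ # Pads each batch dynamically to its longest sequence.
+ data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
+ ```
+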
+ ## 📚 Dataset Information
+
+ ### Primary Dataset
+ - **Name**: uitnlp/vietnamese_students_feedback
+ - **Type**: Student feedback sentiment analysis
+ - **Language**: Vietnamese
+ - **Labels**: 3-way classification (Negative, Neutral, Positive)
+
+ ### Alternative Dataset (Fallback)
+ - **Name**: linhtranvi/5cdAI-Vietnamese-sentiment
+ - **Type**: General Vietnamese sentiment
+ - **Purpose**: Backup dataset used if the primary dataset fails to load
+
+ ### Sample Dataset (Built-in)
+ If both external datasets fail to load, the system creates a small built-in sample dataset (see the loading sketch after the examples below):
+ - **Total Samples**: 20 Vietnamese texts
+ - **Distribution**:
+   - Positive: 8 samples
+   - Negative: 6 samples
+   - Neutral: 6 samples
+ - **Split**: 60% train, 20% validation, 20% test
+ - **Content**: Educational feedback and reviews
+
+ ### Sample Data Examples
+ ```python
+ # Positive examples
+ "Giảng viên dạy rất hay và tâm huyết, tôi học được nhiều kiến thức bổ ích."
+ # ("The lecturer teaches engagingly and with dedication; I learned a lot of useful knowledge.")
+ "Môn học này rất thú vị và practical, giúp tôi áp dụng được vào thực tế."
+ # ("This course is very interesting and practical, helping me apply it in real life.")
+
+ # Negative examples
+ "Môn học quá khó và nhàm chán, không có gì để học cả."
+ # ("The course is too hard and boring; there is nothing to learn at all.")
+ "Giảng viên dạy không rõ ràng, tốc độ quá nhanh, không theo kịp."
+ # ("The lecturer's explanations are unclear and the pace is too fast to keep up with.")
+
+ # Neutral examples
+ "Môn học ổn định, không có gì đặc biệt để nhận xét."
+ # ("The course is okay; nothing special to comment on.")
+ "Nội dung cơ bản, phù hợp với chương trình đề ra."
+ # ("Basic content, in line with the planned curriculum.")
+ ```
+
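+ A sketch of that fallback chain (assuming the `datasets` library; `create_sample_dataset` is a hypothetical stand-in for the built-in sample builder in `py/fine_tune_sentiment.py`):
+
+ ```python
+ from datasets import load_dataset
+
+ def load_training_data():
+     # Try the primary dataset first, then the alternative.
+     for name in ("uitnlp/vietnamese_students_feedback",
+                  "linhtranvi/5cdAI-Vietnamese-sentiment"):
+         try:
+             return load_dataset(name)
+         except Exception as err:
+             print(f"Could not load {name}: {err}")
+     # Hypothetical helper: builds the 20-sample set with a 60/20/20 split.
+     return create_sample_dataset()
+ ```
+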
+ ## 📈 Model Performance & Evaluation
+
+ ### Metrics Tracked
+ - **Accuracy**: Overall prediction accuracy
+ - **F1 Score**: Weighted F1 score (the primary metric; see the sketch after this list)
+ - **Precision**: Weighted precision
+ - **Recall**: Weighted recall
+ - **Training Loss**: Loss progression over epochs
+ - **Evaluation Loss**: Validation loss per epoch
+
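+ A plausible `compute_metrics` callback producing these metrics (assuming scikit-learn and the standard `Trainer` interface; the repository's own implementation may differ in detail):
+
+ ```python
+ import numpy as np
+ from sklearn.metrics import accuracy_score, precision_recall_fscore_support
+
+ def compute_metrics(eval_pred):
+     logits, labels = eval_pred
+     preds = np.argmax(logits, axis=-1)
+     precision, recall, f1, _ = precision_recall_fscore_support(
+         labels, preds, average="weighted"
+     )
+     return {
+         "accuracy": accuracy_score(labels, preds),
+         "f1": f1,
+         "precision": precision,
+         "recall": recall,
+     }
+ ```
+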
+ ### Evaluation Output
+ - **Classification Report**: Detailed per-class metrics
+ - **Confusion Matrix**: Saved as a PNG visualization
+ - **Training History**: Loss and F1 curves saved as PNG
+ - **Best Model**: The checkpoint with the highest F1 score is kept
+
+ ### Expected Performance
+ - **Target F1 Score**: >0.80 on the validation set
+ - **Target Accuracy**: >0.80 on the validation set
+ - **Training Time**: ~15-30 minutes, depending on hardware
+ - **Memory Usage**: ~2-4GB during training
+
 ## 💡 Example Usage
 
 Try these example Vietnamese texts:
 
@@ -74,6 +159,96 @@ Try these example Vietnamese texts:
 - Efficient batch processing
 - Memory limit: 8GB (Hugging Face Spaces)
 
+ ## 📁 Project Structure
+
+ ```
+ SentimentAnalysis/
+ ├── app.py                           # Main Hugging Face Spaces app
+ ├── train.py                         # Training entry point
+ ├── test.py                          # Testing entry point
+ ├── demo.py                          # Demo entry point
+ ├── web.py                           # Web interface entry point
+ ├── main.py                          # Main program entry point
+ ├── requirements.txt                 # Python dependencies
+ ├── requirements_spaces.txt          # Hugging Face Spaces dependencies
+ ├── .space.yaml                      # Hugging Face Spaces configuration
+ ├── .gitignore                       # Git ignore rules
+ ├── README.md                        # This file
+ ├── py/                              # Core Python modules
+ │   ├── fine_tune_sentiment.py       # Fine-tuning implementation
+ │   ├── test_model.py                # Model testing utilities
+ │   └── demo.py                      # Demo implementation
+ ├── pdf/                             # Documentation
+ │   └── paper.tex                    # LaTeX paper (only tracked file)
+ ├── vietnamese_sentiment_finetuned/  # Fine-tuned model output (if trained)
+ ├── training_history.png             # Training history plot
+ ├── confusion_matrix.png             # Confusion matrix visualization
+ └── deploy_package/                  # Deployment artifacts
+ ```
+
+ ## 🔬 Model Training & Fine-Tuning
+
+ ### How to Fine-Tune the Model
+
+ 1. **Using the training script**:
+ ```bash
+ python train.py
+ ```
+
+ 2. **Direct fine-tuning**:
+ ```python
+ from py.fine_tune_sentiment import SentimentFineTuner
+
+ # Initialize the fine-tuner
+ fine_tuner = SentimentFineTuner()
+
+ # Run the complete fine-tuning pipeline
+ fine_tuner.run_fine_tuning(
+     output_dir="./vietnamese_sentiment_finetuned",
+     learning_rate=2e-5,
+     batch_size=16,
+     num_epochs=3
+ )
+ ```
+
+ 3. **Custom configuration**:
+ ```python
+ # Load model and tokenizer
+ fine_tuner.load_model_and_tokenizer()
+
+ # Load and prepare the dataset
+ fine_tuner.load_and_prepare_dataset()
+
+ # Tokenize the datasets
+ fine_tuner.tokenize_datasets()
+
+ # Set up training with custom hyperparameters
+ fine_tuner.setup_trainer(
+     output_dir="./custom_model",
+     learning_rate=5e-5,  # custom learning rate
+     batch_size=8,        # custom batch size
+     num_epochs=5         # custom number of epochs
+ )
+
+ # Train and evaluate
+ fine_tuner.train_model()
+ eval_results, y_pred, y_true = fine_tuner.evaluate_model()
+ ```
+
+ ### Training Outputs
+ - **Model Files**: Saved to the specified output directory
+ - **Tokenizer**: Saved alongside the model configuration
+ - **Training History**: `training_history.png`
+ - **Confusion Matrix**: `confusion_matrix.png`
+ - **Logs**: Training logs in `{output_dir}/logs/`
+
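+ A short sketch for reloading those artifacts later (standard `transformers` auto classes; the directory name matches the default output above):
+
+ ```python
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ # The model weights and tokenizer are both written to the output directory.
+ model_dir = "./vietnamese_sentiment_finetuned"
+ model = AutoModelForSequenceClassification.from_pretrained(model_dir)
+ tokenizer = AutoTokenizer.from_pretrained(model_dir)
+ ```
+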
+ ### Fine-Tuning Features
+ - **Automatic Dataset Loading**: Supports multiple Vietnamese datasets
+ - **Flexible Column Detection**: Auto-detects text and label columns
+ - **Fallback Sample Dataset**: Built-in dataset used if external loading fails
+ - **Comprehensive Evaluation**: Multiple metrics and visualizations
+ - **Memory Efficient**: Optimized for limited resources
+
 ## 📋 Model Performance
 
 The model provides:
 
@@ -93,11 +268,31 @@ This Space is configured for Hugging Face Spaces with:
 ## 📄 Requirements
 
 See `requirements.txt` for the complete dependency list:
- - torch>=2.0.0
- - transformers>=4.21.0
- - gradio>=4.44.0
- - pandas, numpy, scikit-learn
- - psutil for memory monitoring
+
+ ### Core Dependencies
+ - **torch>=2.0.0**: PyTorch for deep learning
+ - **transformers>=4.21.0**: Hugging Face Transformers
+ - **gradio>=4.44.0**: Web interface framework
+ - **psutil**: System and process memory monitoring (see the sketch below)
+
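+ A minimal sketch of the kind of memory check psutil enables (the psutil calls are real; how `app.py` actually uses them is not shown in this README):
+
+ ```python
+ import psutil
+
+ # Resident memory of the current process, in MB.
+ rss_mb = psutil.Process().memory_info().rss / 1024**2
+ print(f"Process memory: {rss_mb:.0f} MB (Spaces limit: 8192 MB)")
+ ```
+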
+ ### Fine-Tuning Dependencies
+ - **datasets**: Hugging Face datasets library for loading training data
+ - **scikit-learn**: Machine-learning metrics and evaluation
+ - **pandas**: Data manipulation and analysis
+ - **numpy**: Numerical computing
+ - **matplotlib**: Plotting and visualization
+ - **seaborn**: Statistical data visualization
+ - **tqdm**: Progress bars during training
+
+ ### Installation
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ To install the fine-tuning stack directly:
+ ```bash
+ pip install torch transformers datasets scikit-learn pandas numpy matplotlib seaborn tqdm psutil gradio
+ ```
 
 ## 🎯 Use Cases