update md
README.md
CHANGED
|
@@ -44,13 +44,98 @@ A Vietnamese sentiment analysis web interface built with Gradio and transformer
|
|
| 44 |
|
| 45 |
## 📊 Model Details
|
| 46 |
|
| 47 |
-
- **Model**: 5CD-AI/Vietnamese-Sentiment-visobert
|
| 48 |
- **Architecture**: Transformer-based (XLM-RoBERTa)
|
| 49 |
- **Language**: Vietnamese
|
| 50 |
-
- **Labels**: Negative, Neutral, Positive
|
| 51 |
- **Max Sequence Length**: 512 tokens
|
| 52 |
- **Device**: Automatic CUDA/CPU detection
|
| 53 |
|
| 54 |
## 💡 Example Usage
|
| 55 |
|
| 56 |
Try these example Vietnamese texts:
|
|
@@ -74,6 +159,96 @@ Try these example Vietnamese texts:
|
|
| 74 |
- Efficient batch processing
|
| 75 |
- Memory limit: 8GB (Hugging Face Spaces)
|
| 76 |
|
| 77 |
## 📋 Model Performance
|
| 78 |
|
| 79 |
The model provides:
|
|
@@ -93,11 +268,31 @@ This Space is configured for Hugging Face Spaces with:
|
|
| 93 |
## 📄 Requirements
|
| 94 |
|
| 95 |
See `requirements.txt` for complete dependency list:
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
-
|
| 99 |
-
-
|
| 100 |
-
-
|
| 101 |
|
| 102 |
## 🎯 Use Cases
|
| 103 |
|
|
|
|
| 44 |
|
| 45 |
## 📊 Model Details
|
| 46 |
|
| 47 |
+
- **Base Model**: 5CD-AI/Vietnamese-Sentiment-visobert
|
| 48 |
- **Architecture**: Transformer-based (XLM-RoBERTa)
|
| 49 |
- **Language**: Vietnamese
|
| 50 |
+
- **Labels**: Negative (0), Neutral (1), Positive (2)
|
| 51 |
- **Max Sequence Length**: 512 tokens
|
| 52 |
- **Device**: Automatic CUDA/CPU detection
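The label ids listed above can be turned into readable predictions roughly as follows. This is a minimal, dependency-free sketch: `ID2LABEL`, `softmax`, and `decode` are hypothetical names, and the logits are made-up numbers rather than real model output.

```python
import math

# Label ids as listed in Model Details: Negative (0), Neutral (1), Positive (2).
ID2LABEL = {0: "Negative", 1: "Neutral", 2: "Positive"}


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def decode(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Illustrative logits only; a real classifier head would produce these.
label, confidence = decode([0.1, 0.2, 3.0])
print(label, round(confidence, 3))
```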
|
| 53 |
|
| 54 |
+
## 🎯 Fine-Tuning Configuration
|
| 55 |
+
|
| 56 |
+
### Training Parameters
|
| 57 |
+
- **Learning Rate**: 2e-5
|
| 58 |
+
- **Batch Size**: 16 (train/eval)
|
| 59 |
+
- **Training Epochs**: 3
|
| 60 |
+
- **Weight Decay**: 0.01
|
| 61 |
+
- **Seed**: 42 (for reproducibility)
|
| 62 |
+
- **Optimizer**: AdamW (default in Trainer)
|
| 63 |
+
|
| 64 |
+
### Training Strategy
|
| 65 |
+
- **Evaluation Strategy**: Epoch-based evaluation
|
| 66 |
+
- **Save Strategy**: Save model at each epoch
|
| 67 |
+
- **Best Model Selection**: Based on F1 score
|
| 68 |
+
- **Best Checkpoint Restore**: Load best model at end (this restores the best checkpoint; it does not stop training early)
|
| 69 |
+
- **Logging**: Every 10 steps
|
| 70 |
+
- **Checkpoint Limit**: Save last 2 checkpoints
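The parameters and strategy above could be collected into keyword arguments along these lines. This is a sketch, not the project's actual code: the key names follow `transformers.TrainingArguments`, but how `train.py` builds its arguments is an assumption, as is the metric key `"f1"`. `total_optimizer_steps` is a hypothetical helper that assumes a single device and no gradient accumulation.

```python
import math

# Keyword arguments mirroring the settings listed above (assumed mapping).
training_kwargs = {
    "output_dir": "./vietnamese_sentiment_finetuned",
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
    "seed": 42,
    # Older transformers releases call this `evaluation_strategy`.
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1",  # assumes compute_metrics returns an "f1" key
    "greater_is_better": True,
    "logging_steps": 10,
    "save_total_limit": 2,
}


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a run: ceil(examples / batch) per epoch."""
    return math.ceil(num_examples / batch_size) * num_epochs


# Example: 1,000 training examples (an illustrative figure, not the real dataset size).
steps = total_optimizer_steps(
    1000,
    training_kwargs["per_device_train_batch_size"],
    training_kwargs["num_train_epochs"],
)
print(steps, "steps,", steps // training_kwargs["logging_steps"], "log entries")
```

With a recent `transformers` release installed, the dictionary could be passed as `TrainingArguments(**training_kwargs)`.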
|
| 71 |
+
|
| 72 |
+
### Data Processing
|
| 73 |
+
- **Tokenization**: AutoTokenizer with truncation and padding
|
| 74 |
+
- **Max Length**: 512 tokens
|
| 75 |
+
- **Data Collator**: DataCollatorWithPadding for dynamic padding
|
| 76 |
+
- **Text Columns**: Auto-detection (sentence, text, comment, feedback)
|
| 77 |
+
- **Label Columns**: Auto-detection (sentiment, label, labels)
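Column auto-detection of the kind described above might look like the sketch below. The candidate names come from the list; the function name `detect_column`, the case-insensitive matching, and the first-match priority order are assumptions about how the project resolves ties.

```python
# Candidate column names, in the order given in the README.
TEXT_CANDIDATES = ("sentence", "text", "comment", "feedback")
LABEL_CANDIDATES = ("sentiment", "label", "labels")


def detect_column(column_names, candidates):
    """Return the first candidate found in column_names (case-insensitive)."""
    lowered = {name.lower(): name for name in column_names}
    for candidate in candidates:
        if candidate in lowered:
            return lowered[candidate]
    raise ValueError(f"none of {candidates} found in {list(column_names)}")


# Illustrative column layout for a feedback dataset.
columns = ["sentence", "sentiment", "topic"]
print(detect_column(columns, TEXT_CANDIDATES))
print(detect_column(columns, LABEL_CANDIDATES))
```

Raising an error when nothing matches keeps a misnamed column from silently selecting the wrong field.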
|
| 78 |
+
|
| 79 |
+
## 📚 Dataset Information
|
| 80 |
+
|
| 81 |
+
### Primary Dataset
|
| 82 |
+
- **Name**: uitnlp/vietnamese_students_feedback
|
| 83 |
+
- **Type**: Student feedback sentiment analysis
|
| 84 |
+
- **Language**: Vietnamese
|
| 85 |
+
- **Labels**: 3-way classification (Negative, Neutral, Positive)
|
| 86 |
+
|
| 87 |
+
### Alternative Datasets (Fallback)
|
| 88 |
+
- **Name**: linhtranvi/5cdAI-Vietnamese-sentiment
|
| 89 |
+
- **Type**: General Vietnamese sentiment
|
| 90 |
+
- **Purpose**: Backup dataset if the primary dataset fails to load
|
| 91 |
+
|
| 92 |
+
### Sample Dataset (Built-in)
|
| 93 |
+
If the external datasets fail to load, the system creates a sample dataset with:
|
| 94 |
+
- **Total Samples**: 20 Vietnamese texts
|
| 95 |
+
- **Distribution**:
|
| 96 |
+
- Positive: 8 samples
|
| 97 |
+
- Negative: 6 samples
|
| 98 |
+
- Neutral: 6 samples
|
| 99 |
+
- **Split**: 60% train, 20% validation, 20% test
|
| 100 |
+
- **Content**: Educational feedback and reviews
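For 20 samples, a 60/20/20 split works out to 12 training, 4 validation, and 4 test examples. A minimal sketch of such a split follows; the function name `split_dataset` and the shuffle-then-slice approach are assumptions, and the placeholder strings stand in for the real texts.

```python
import random


def split_dataset(samples, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle with a fixed seed, then slice into train/validation/test."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains goes to the test split
    return train, val, test


# Placeholder samples: 8 positive, 6 negative, 6 neutral, as in the distribution above.
samples = (
    [(f"pos-{i}", "Positive") for i in range(8)]
    + [(f"neg-{i}", "Negative") for i in range(6)]
    + [(f"neu-{i}", "Neutral") for i in range(6)]
)
train, val, test = split_dataset(samples)
print(len(train), len(val), len(test))
```

A plain shuffle does not guarantee that every class appears in each split; with only four validation examples, a stratified split may be worth considering.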
|
| 101 |
+
|
| 102 |
+
### Sample Data Examples
|
| 103 |
+
```python
|
| 104 |
+
# Positive examples
|
| 105 |
+
"Giảng viên dạy rất hay và tâm huyết, tôi học được nhiều kiến thức bổ ích."
|
| 106 |
+
"Môn học này rất thú vị và practical, giúp tôi áp dụng được vào thực tế."
|
| 107 |
+
|
| 108 |
+
# Negative examples
|
| 109 |
+
"Môn học quá khó và nhàm chán, không có gì để học cả."
|
| 110 |
+
"Giảng viên dạy không rõ ràng, tốc độ quá nhanh, không theo kịp."
|
| 111 |
+
|
| 112 |
+
# Neutral examples
|
| 113 |
+
"Môn học ổn định, không có gì đặc biệt để nhận xét."
|
| 114 |
+
"Nội dung cơ bản, phù hợp với chương trình đề ra."
|
| 115 |
+
```
|
| 116 |
+
|
| 117 |
+
## 📈 Model Performance & Evaluation
|
| 118 |
+
|
| 119 |
+
### Metrics Tracked
|
| 120 |
+
- **Accuracy**: Overall prediction accuracy
|
| 121 |
+
- **F1 Score**: Weighted F1 score (primary metric)
|
| 122 |
+
- **Precision**: Weighted precision
|
| 123 |
+
- **Recall**: Weighted recall
|
| 124 |
+
- **Training Loss**: Loss progression over epochs
|
| 125 |
+
- **Evaluation Loss**: Validation loss per epoch
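"Weighted" here means each class's score is averaged with a weight equal to that class's share of the true labels. The dependency-free sketch below illustrates the calculation; `weighted_scores` is a hypothetical helper, and the project itself may well rely on scikit-learn for this instead.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return accuracy plus support-weighted precision, recall, and F1."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true) | set(y_pred)):
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        predicted = sum(1 for p in y_pred if p == label)
        actual = support[label]
        p_label = true_pos / predicted if predicted else 0.0
        r_label = true_pos / actual if actual else 0.0
        denom = p_label + r_label
        f_label = 2 * p_label * r_label / denom if denom else 0.0
        weight = actual / total  # class share of the true labels
        precision += weight * p_label
        recall += weight * r_label
        f1 += weight * f_label
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (0 = Negative, 1 = Neutral, 2 = Positive); not real model output.
scores = weighted_scores([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print({name: round(value, 4) for name, value in scores.items()})
```

Weighted recall always equals accuracy, which is one reason weighted F1 is often reported alongside it rather than instead of it.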
|
| 126 |
+
|
| 127 |
+
### Evaluation Output
|
| 128 |
+
- **Classification Report**: Detailed per-class metrics
|
| 129 |
+
- **Confusion Matrix**: Visual confusion matrix saved as PNG
|
| 130 |
+
- **Training History**: Loss and F1 plots saved as PNG
|
| 131 |
+
- **Best Model**: Saved based on highest F1 score
|
| 132 |
+
|
| 133 |
+
### Expected Performance
|
| 134 |
+
- **Target F1 Score**: >0.80 on validation set
|
| 135 |
+
- **Target Accuracy**: >0.80 on validation set
|
| 136 |
+
- **Training Time**: ~15-30 minutes (depending on hardware)
|
| 137 |
+
- **Memory Usage**: ~2-4GB during training
|
| 138 |
+
|
| 139 |
## 💡 Example Usage
|
| 140 |
|
| 141 |
Try these example Vietnamese texts:
|
|
|
|
| 159 |
- Efficient batch processing
|
| 160 |
- Memory limit: 8GB (Hugging Face Spaces)
|
| 161 |
|
| 162 |
+
## 📁 Project Structure
|
| 163 |
+
|
| 164 |
+
```
|
| 165 |
+
SentimentAnalysis/
|
| 166 |
+
├── app.py # Main Hugging Face Spaces app
|
| 167 |
+
├── train.py # Training entry point
|
| 168 |
+
├── test.py # Testing entry point
|
| 169 |
+
├── demo.py # Demo entry point
|
| 170 |
+
├── web.py # Web interface entry point
|
| 171 |
+
├── main.py # Main program entry point
|
| 172 |
+
├── requirements.txt # Python dependencies
|
| 173 |
+
├── requirements_spaces.txt # Hugging Face Spaces dependencies
|
| 174 |
+
├── .space.yaml # Hugging Face Spaces configuration
|
| 175 |
+
├── .gitignore # Git ignore rules
|
| 176 |
+
├── README.md # This file
|
| 177 |
+
├── py/ # Core Python modules
|
| 178 |
+
│ ├── fine_tune_sentiment.py # Fine-tuning implementation
|
| 179 |
+
│ ├── test_model.py # Model testing utilities
|
| 180 |
+
│ └── demo.py # Demo implementation
|
| 181 |
+
├── pdf/ # Documentation (paper.tex only)
|
| 182 |
+
│ └── paper.tex # LaTeX paper (only tracked file)
|
| 183 |
+
├── vietnamese_sentiment_finetuned/ # Fine-tuned model output (if trained)
|
| 184 |
+
├── training_history.png # Training history plot
|
| 185 |
+
├── confusion_matrix.png # Confusion matrix visualization
|
| 186 |
+
└── deploy_package/ # Deployment artifacts
|
| 187 |
+
```
|
| 188 |
+
|
| 189 |
+
## 🔬 Model Training & Fine-Tuning
|
| 190 |
+
|
| 191 |
+
### How to Fine-Tune the Model
|
| 192 |
+
|
| 193 |
+
1. **Using the training script**:
|
| 194 |
+
```bash
|
| 195 |
+
python train.py
|
| 196 |
+
```
|
| 197 |
+
|
| 198 |
+
2. **Direct fine-tuning**:
|
| 199 |
+
```python
|
| 200 |
+
from py.fine_tune_sentiment import SentimentFineTuner
|
| 201 |
+
|
| 202 |
+
# Initialize fine-tuner
|
| 203 |
+
fine_tuner = SentimentFineTuner()
|
| 204 |
+
|
| 205 |
+
# Run complete fine-tuning pipeline
|
| 206 |
+
fine_tuner.run_fine_tuning(
|
| 207 |
+
output_dir="./vietnamese_sentiment_finetuned",
|
| 208 |
+
learning_rate=2e-5,
|
| 209 |
+
batch_size=16,
|
| 210 |
+
num_epochs=3
|
| 211 |
+
)
|
| 212 |
+
```
|
| 213 |
+
|
| 214 |
+
3. **Custom configuration**:
|
| 215 |
+
```python
|
| 216 |
+
# Load model and tokenizer (assumes `fine_tuner = SentimentFineTuner()` as in the previous example)
|
| 217 |
+
fine_tuner.load_model_and_tokenizer()
|
| 218 |
+
|
| 219 |
+
# Load and prepare dataset
|
| 220 |
+
fine_tuner.load_and_prepare_dataset()
|
| 221 |
+
|
| 222 |
+
# Tokenize datasets
|
| 223 |
+
fine_tuner.tokenize_datasets()
|
| 224 |
+
|
| 225 |
+
# Setup custom training
|
| 226 |
+
fine_tuner.setup_trainer(
|
| 227 |
+
output_dir="./custom_model",
|
| 228 |
+
learning_rate=5e-5, # Custom learning rate
|
| 229 |
+
batch_size=8, # Custom batch size
|
| 230 |
+
num_epochs=5 # Custom epochs
|
| 231 |
+
)
|
| 232 |
+
|
| 233 |
+
# Train and evaluate
|
| 234 |
+
fine_tuner.train_model()
|
| 235 |
+
eval_results, y_pred, y_true = fine_tuner.evaluate_model()
|
| 236 |
+
```
|
| 237 |
+
|
| 238 |
+
### Training Outputs
|
| 239 |
+
- **Model Files**: Saved to specified output directory
|
| 240 |
+
- **Tokenizer**: Saved with model configuration
|
| 241 |
+
- **Training History**: `training_history.png`
|
| 242 |
+
- **Confusion Matrix**: `confusion_matrix.png`
|
| 243 |
+
- **Logs**: Training logs in `{output_dir}/logs/`
|
| 244 |
+
|
| 245 |
+
### Fine-Tuning Features
|
| 246 |
+
- **Automatic Dataset Loading**: Supports multiple Vietnamese datasets
|
| 247 |
+
- **Flexible Column Detection**: Auto-detects text and label columns
|
| 248 |
+
- **Fallback Sample Dataset**: Built-in dataset used if the external datasets fail to load
|
| 249 |
+
- **Comprehensive Evaluation**: Multiple metrics and visualizations
|
| 250 |
+
- **Memory Efficient**: Optimized for limited resources
|
| 251 |
+
|
| 252 |
## 📋 Model Performance
|
| 253 |
|
| 254 |
The model provides:
|
|
|
|
| 268 |
## 📄 Requirements
|
| 269 |
|
| 270 |
See `requirements.txt` for complete dependency list:
|
| 271 |
+
|
| 272 |
+
### Core Dependencies
|
| 273 |
+
- **torch>=2.0.0**: PyTorch for deep learning
|
| 274 |
+
- **transformers>=4.21.0**: Hugging Face transformers
|
| 275 |
+
- **gradio>=4.44.0**: Web interface framework
|
| 276 |
+
- **psutil**: System and process monitoring
|
| 277 |
+
|
| 278 |
+
### Fine-Tuning Dependencies
|
| 279 |
+
- **datasets**: Hugging Face datasets for loading training data
|
| 280 |
+
- **scikit-learn**: Machine learning metrics and evaluation
|
| 281 |
+
- **pandas**: Data manipulation and analysis
|
| 282 |
+
- **numpy**: Numerical computing
|
| 283 |
+
- **matplotlib**: Plotting and visualization
|
| 284 |
+
- **seaborn**: Statistical data visualization
|
| 285 |
+
- **tqdm**: Progress bars for training
|
| 286 |
+
|
| 287 |
+
### Installation
|
| 288 |
+
```bash
|
| 289 |
+
pip install -r requirements.txt
|
| 290 |
+
```
|
| 291 |
+
|
| 292 |
+
For fine-tuning specifically:
|
| 293 |
+
```bash
|
| 294 |
+
pip install torch transformers datasets scikit-learn pandas numpy matplotlib seaborn tqdm psutil gradio
|
| 295 |
+
```
|
| 296 |
|
| 297 |
## 🎯 Use Cases
|
| 298 |
|