shegga committed
Commit 4fc5703 · Parent: 4d616d3
Files changed (1): README.md (+202 -7)

README.md CHANGED
@@ -44,13 +44,98 @@ A Vietnamese sentiment analysis web interface built with Gradio and transformer
 
 ## 📊 Model Details
 
- - **Model**: 5CD-AI/Vietnamese-Sentiment-visobert
+ - **Base Model**: 5CD-AI/Vietnamese-Sentiment-visobert (see the inference sketch below)
 - **Architecture**: Transformer-based (XLM-RoBERTa)
 - **Language**: Vietnamese
- - **Labels**: Negative, Neutral, Positive
+ - **Labels**: Negative (0), Neutral (1), Positive (2)
 - **Max Sequence Length**: 512 tokens
 - **Device**: Automatic CUDA/CPU detection
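+
+ A minimal inference sketch for this model (assuming the standard `transformers` pipeline API; the exact label strings come from the model config):
+
+ ```python
+ import torch
+ from transformers import pipeline
+
+ # Mirror the app's automatic CUDA/CPU detection.
+ device = 0 if torch.cuda.is_available() else -1
+
+ classifier = pipeline(
+     "text-classification",
+     model="5CD-AI/Vietnamese-Sentiment-visobert",
+     device=device,
+ )
+
+ # "The lecturer teaches engagingly and with dedication."
+ print(classifier("Giảng viên dạy rất hay và tâm huyết."))
+ # -> [{'label': ..., 'score': ...}]
+ ```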
 
+ ## 🎯 Fine-Tuning Configuration
+
+ ### Training Parameters
+ - **Learning Rate**: 2e-5
+ - **Batch Size**: 16 (train/eval)
+ - **Training Epochs**: 3
+ - **Weight Decay**: 0.01
+ - **Seed**: 42 (for reproducibility)
+ - **Optimizer**: AdamW (the `Trainer` default)
+
+ ### Training Strategy
+ - **Evaluation Strategy**: Evaluate at the end of each epoch
+ - **Save Strategy**: Save a checkpoint at each epoch
+ - **Best Model Selection**: The checkpoint with the highest F1 score
+ - **Best-Model Loading**: The best checkpoint is reloaded at the end of training
+ - **Logging**: Every 10 steps
+ - **Checkpoint Limit**: Keep only the last 2 checkpoints (see the `TrainingArguments` sketch below)
+
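+ The configuration above, expressed as `TrainingArguments` (a sketch that mirrors the listed values; the exact arguments used in `py/fine_tune_sentiment.py` may differ):
+
+ ```python
+ from transformers import TrainingArguments
+
+ training_args = TrainingArguments(
+     output_dir="./vietnamese_sentiment_finetuned",
+     learning_rate=2e-5,
+     per_device_train_batch_size=16,
+     per_device_eval_batch_size=16,
+     num_train_epochs=3,
+     weight_decay=0.01,
+     seed=42,
+     evaluation_strategy="epoch",   # evaluate once per epoch
+     save_strategy="epoch",         # checkpoint once per epoch
+     load_best_model_at_end=True,   # reload the best checkpoint after training
+     metric_for_best_model="f1",    # select the best checkpoint by F1
+     logging_steps=10,
+     save_total_limit=2,            # keep only the last 2 checkpoints
+ )
+ ```
+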
+ ### Data Processing
+ - **Tokenization**: AutoTokenizer with truncation and padding
+ - **Max Length**: 512 tokens
+ - **Data Collator**: DataCollatorWithPadding for dynamic padding
+ - **Text Columns**: Auto-detected (sentence, text, comment, feedback)
+ - **Label Columns**: Auto-detected (sentiment, label, labels)
+
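+ A sketch of that tokenization step (assuming a `datasets` `Dataset` named `raw_dataset` whose text column has already been detected as `"text"`):
+
+ ```python
+ from transformers import AutoTokenizer, DataCollatorWithPadding
+
+ tokenizer = AutoTokenizer.from_pretrained("5CD-AI/Vietnamese-Sentiment-visobert")
+
+ def tokenize(batch):
+     # Truncate to the 512-token maximum; padding is deferred to the collator.
+     return tokenizer(batch["text"], truncation=True, max_length=512)
+
+ tokenized = raw_dataset.map(tokenize, batched=True)
+
+ # Pads each batch dynamically to its longest sequence.
+ data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
+ ```
+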
+ ## 📚 Dataset Information
+
+ ### Primary Dataset
+ - **Name**: uitnlp/vietnamese_students_feedback
+ - **Type**: Student feedback sentiment analysis
+ - **Language**: Vietnamese
+ - **Labels**: 3-way classification (Negative, Neutral, Positive)
+
+ ### Alternative Dataset (Fallback)
+ - **Name**: linhtranvi/5cdAI-Vietnamese-sentiment
+ - **Type**: General Vietnamese sentiment
+ - **Purpose**: Backup dataset used if the primary dataset fails to load
+
+ ### Sample Dataset (Built-in)
+ If both external datasets fail to load, the system creates a small built-in sample dataset (see the loading sketch after the examples below):
+ - **Total Samples**: 20 Vietnamese texts
+ - **Distribution**:
+   - Positive: 8 samples
+   - Negative: 6 samples
+   - Neutral: 6 samples
+ - **Split**: 60% train, 20% validation, 20% test
+ - **Content**: Educational feedback and reviews
+
+ ### Sample Data Examples
+ ```python
+ # Positive examples
+ "Giảng viên dạy rất hay và tâm huyết, tôi học được nhiều kiến thức bổ ích."
+ # ("The lecturer teaches engagingly and with dedication; I learned a lot of useful knowledge.")
+ "Môn học này rất thú vị và practical, giúp tôi áp dụng được vào thực tế."
+ # ("This course is very interesting and practical, helping me apply it in real life.")
+
+ # Negative examples
+ "Môn học quá khó và nhàm chán, không có gì để học cả."
+ # ("The course is too hard and boring; there is nothing to learn at all.")
+ "Giảng viên dạy không rõ ràng, tốc độ quá nhanh, không theo kịp."
+ # ("The lecturer's explanations are unclear and the pace is too fast to keep up with.")
+
+ # Neutral examples
+ "Môn học ổn định, không có gì đặc biệt để nhận xét."
+ # ("The course is okay; nothing special to comment on.")
+ "Nội dung cơ bản, phù hợp với chương trình đề ra."
+ # ("Basic content, in line with the planned curriculum.")
+ ```
+
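+ A sketch of that fallback chain (assuming the `datasets` library; `create_sample_dataset` is a hypothetical stand-in for the built-in sample builder in `py/fine_tune_sentiment.py`):
+
+ ```python
+ from datasets import load_dataset
+
+ def load_training_data():
+     # Try the primary dataset first, then the alternative.
+     for name in ("uitnlp/vietnamese_students_feedback",
+                  "linhtranvi/5cdAI-Vietnamese-sentiment"):
+         try:
+             return load_dataset(name)
+         except Exception as err:
+             print(f"Could not load {name}: {err}")
+     # Hypothetical helper: builds the 20-sample set with a 60/20/20 split.
+     return create_sample_dataset()
+ ```
+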
+ ## 📈 Model Performance & Evaluation
+
+ ### Metrics Tracked
+ - **Accuracy**: Overall prediction accuracy
+ - **F1 Score**: Weighted F1 score (the primary metric; see the sketch after this list)
+ - **Precision**: Weighted precision
+ - **Recall**: Weighted recall
+ - **Training Loss**: Loss progression over epochs
+ - **Evaluation Loss**: Validation loss per epoch
+
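+ A plausible `compute_metrics` callback producing these metrics (assuming scikit-learn and the standard `Trainer` interface; the repository's own implementation may differ in detail):
+
+ ```python
+ import numpy as np
+ from sklearn.metrics import accuracy_score, precision_recall_fscore_support
+
+ def compute_metrics(eval_pred):
+     logits, labels = eval_pred
+     preds = np.argmax(logits, axis=-1)
+     precision, recall, f1, _ = precision_recall_fscore_support(
+         labels, preds, average="weighted"
+     )
+     return {
+         "accuracy": accuracy_score(labels, preds),
+         "f1": f1,
+         "precision": precision,
+         "recall": recall,
+     }
+ ```
+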
+ ### Evaluation Output
+ - **Classification Report**: Detailed per-class metrics
+ - **Confusion Matrix**: Saved as a PNG visualization
+ - **Training History**: Loss and F1 curves saved as PNG
+ - **Best Model**: The checkpoint with the highest F1 score is kept
+
+ ### Expected Performance
+ - **Target F1 Score**: >0.80 on the validation set
+ - **Target Accuracy**: >0.80 on the validation set
+ - **Training Time**: ~15-30 minutes, depending on hardware
+ - **Memory Usage**: ~2-4GB during training
+
 ## 💡 Example Usage
 
 Try these example Vietnamese texts:
 
@@ -74,6 +159,96 @@ Try these example Vietnamese texts:
 - Efficient batch processing
 - Memory limit: 8GB (Hugging Face Spaces)
 
+ ## 📁 Project Structure
+
+ ```
+ SentimentAnalysis/
+ ├── app.py                           # Main Hugging Face Spaces app
+ ├── train.py                         # Training entry point
+ ├── test.py                          # Testing entry point
+ ├── demo.py                          # Demo entry point
+ ├── web.py                           # Web interface entry point
+ ├── main.py                          # Main program entry point
+ ├── requirements.txt                 # Python dependencies
+ ├── requirements_spaces.txt          # Hugging Face Spaces dependencies
+ ├── .space.yaml                      # Hugging Face Spaces configuration
+ ├── .gitignore                       # Git ignore rules
+ ├── README.md                        # This file
+ ├── py/                              # Core Python modules
+ │   ├── fine_tune_sentiment.py       # Fine-tuning implementation
+ │   ├── test_model.py                # Model testing utilities
+ │   └── demo.py                      # Demo implementation
+ ├── pdf/                             # Documentation
+ │   └── paper.tex                    # LaTeX paper (only tracked file)
+ ├── vietnamese_sentiment_finetuned/  # Fine-tuned model output (if trained)
+ ├── training_history.png             # Training history plot
+ ├── confusion_matrix.png             # Confusion matrix visualization
+ └── deploy_package/                  # Deployment artifacts
+ ```
+
+ ## 🔬 Model Training & Fine-Tuning
+
+ ### How to Fine-Tune the Model
+
+ 1. **Using the training script**:
+ ```bash
+ python train.py
+ ```
+
+ 2. **Direct fine-tuning**:
+ ```python
+ from py.fine_tune_sentiment import SentimentFineTuner
+
+ # Initialize the fine-tuner
+ fine_tuner = SentimentFineTuner()
+
+ # Run the complete fine-tuning pipeline
+ fine_tuner.run_fine_tuning(
+     output_dir="./vietnamese_sentiment_finetuned",
+     learning_rate=2e-5,
+     batch_size=16,
+     num_epochs=3
+ )
+ ```
+
+ 3. **Custom configuration**:
+ ```python
+ # Load model and tokenizer
+ fine_tuner.load_model_and_tokenizer()
+
+ # Load and prepare the dataset
+ fine_tuner.load_and_prepare_dataset()
+
+ # Tokenize the datasets
+ fine_tuner.tokenize_datasets()
+
+ # Set up training with custom hyperparameters
+ fine_tuner.setup_trainer(
+     output_dir="./custom_model",
+     learning_rate=5e-5,  # custom learning rate
+     batch_size=8,        # custom batch size
+     num_epochs=5         # custom number of epochs
+ )
+
+ # Train and evaluate
+ fine_tuner.train_model()
+ eval_results, y_pred, y_true = fine_tuner.evaluate_model()
+ ```
+
+ ### Training Outputs
+ - **Model Files**: Saved to the specified output directory
+ - **Tokenizer**: Saved alongside the model configuration
+ - **Training History**: `training_history.png`
+ - **Confusion Matrix**: `confusion_matrix.png`
+ - **Logs**: Training logs in `{output_dir}/logs/`
+
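+ A short sketch for reloading those artifacts later (standard `transformers` auto classes; the directory name matches the default output above):
+
+ ```python
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ # The model weights and tokenizer are both written to the output directory.
+ model_dir = "./vietnamese_sentiment_finetuned"
+ model = AutoModelForSequenceClassification.from_pretrained(model_dir)
+ tokenizer = AutoTokenizer.from_pretrained(model_dir)
+ ```
+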
+ ### Fine-Tuning Features
+ - **Automatic Dataset Loading**: Supports multiple Vietnamese datasets
+ - **Flexible Column Detection**: Auto-detects text and label columns
+ - **Fallback Sample Dataset**: Built-in dataset used if external loading fails
+ - **Comprehensive Evaluation**: Multiple metrics and visualizations
+ - **Memory Efficient**: Optimized for limited resources
+
 ## 📋 Model Performance
 
 The model provides:
 
@@ -93,11 +268,31 @@ This Space is configured for Hugging Face Spaces with:
 ## 📄 Requirements
 
 See `requirements.txt` for the complete dependency list:
- - torch>=2.0.0
- - transformers>=4.21.0
- - gradio>=4.44.0
- - pandas, numpy, scikit-learn
- - psutil for memory monitoring
+
+ ### Core Dependencies
+ - **torch>=2.0.0**: PyTorch for deep learning
+ - **transformers>=4.21.0**: Hugging Face Transformers
+ - **gradio>=4.44.0**: Web interface framework
+ - **psutil**: System and process memory monitoring (see the sketch below)
+
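+ A minimal sketch of the kind of memory check psutil enables (the psutil calls are real; how `app.py` actually uses them is not shown in this README):
+
+ ```python
+ import psutil
+
+ # Resident memory of the current process, in MB.
+ rss_mb = psutil.Process().memory_info().rss / 1024**2
+ print(f"Process memory: {rss_mb:.0f} MB (Spaces limit: 8192 MB)")
+ ```
+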
+ ### Fine-Tuning Dependencies
+ - **datasets**: Hugging Face datasets library for loading training data
+ - **scikit-learn**: Machine-learning metrics and evaluation
+ - **pandas**: Data manipulation and analysis
+ - **numpy**: Numerical computing
+ - **matplotlib**: Plotting and visualization
+ - **seaborn**: Statistical data visualization
+ - **tqdm**: Progress bars during training
+
+ ### Installation
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ To install the fine-tuning stack directly:
+ ```bash
+ pip install torch transformers datasets scikit-learn pandas numpy matplotlib seaborn tqdm psutil gradio
+ ```
 
 ## 🎯 Use Cases