Sentiment Span Extraction in Vietnamese using BiLSTM-CRF Model
This repository contains the implementation of a BiLSTM-CRF model for the task of Sentiment Span Extraction in Vietnamese. The model is built using PyTorch and trained on the UIT-ViSD4SA dataset.
Task Description
Sentiment span extraction identifies the spans of text in a document that express sentiment. It gives a more granular view of how sentiment is expressed than document- or sentence-level sentiment classification, which assigns a single label to the whole text.
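As a rough illustration of what "extracting spans" means for a sequence tagger, the sketch below groups BIO tags into labeled spans. The function name `bio_to_spans`, the example tokens, and the example tags are made up for illustration; this is not necessarily how the repository post-processes its predictions.

```python
from typing import List, Tuple

def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group BIO tags into (start, end_exclusive, label) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An "I-" tag continues the open span only if its label matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the currently open span, if there is one.
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        # "B-" opens a new span; a stray "I-" is treated as a span start too.
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tags), label))
    return spans

# Illustrative input only (not taken from the dataset).
tokens = ["pin", "trâu", ",", "camera", "hơi", "mờ"]
tags = ["B-BATTERY#POSITIVE", "I-BATTERY#POSITIVE", "O",
        "B-CAMERA#NEGATIVE", "I-CAMERA#NEGATIVE", "I-CAMERA#NEGATIVE"]
for start, end, label in bio_to_spans(tags):
    print(label, "->", " ".join(tokens[start:end]))
```

Treating a stray "I-" tag as the start of a span is one lenient convention; a stricter decoder could discard such tags instead.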
Dataset
The model is trained on the UIT-ViSD4SA dataset, which is curated for sentiment analysis of Vietnamese text. It consists of annotated text samples in which sentiment spans are marked along with their labels. In the tag set used by this model, each label combines an aspect category and a polarity (for example, CAMERA#NEGATIVE).
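The tag names listed in the Usage configuration (for example B-CAMERA#NEGATIVE) appear to follow a PREFIX-ASPECT#POLARITY pattern. Assuming that pattern holds for every tag except "O", a tag can be split into its parts as sketched below. `ParsedTag` and `parse_tag` are hypothetical helper names, not part of the repository.

```python
from typing import NamedTuple, Optional

class ParsedTag(NamedTuple):
    prefix: str               # "B", "I", or "O"
    aspect: Optional[str]     # e.g. "CAMERA"; None for "O"
    polarity: Optional[str]   # e.g. "NEGATIVE"; None for "O"

def parse_tag(tag: str) -> ParsedTag:
    """Split a tag such as 'B-SER&ACC#NEGATIVE' into prefix, aspect, polarity."""
    if tag == "O":
        return ParsedTag("O", None, None)
    prefix, label = tag.split("-", 1)       # "B", "SER&ACC#NEGATIVE"
    aspect, polarity = label.split("#", 1)  # "SER&ACC", "NEGATIVE"
    return ParsedTag(prefix, aspect, polarity)

print(parse_tag("B-SER&ACC#NEGATIVE"))
print(parse_tag("O"))
```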
Model Architecture
The model architecture is a Bidirectional Long Short-Term Memory (BiLSTM) network followed by a Conditional Random Field (CRF) layer. The BiLSTM reads each sentence in both directions and captures sequential dependencies in the text. The CRF layer models dependencies between adjacent labels, which typically improves sequence labeling tasks such as sentiment span extraction.
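To make the role of label dependencies concrete, here is a minimal pure-Python Viterbi sketch. It combines per-token emission scores (what the BiLSTM and linear layer would produce) with tag-to-tag transition scores (what a CRF learns). All numbers and the three-tag set are made up for illustration, and the sketch omits the start and end transition scores that a full CRF implementation such as pytorch-crf also uses.

```python
from typing import List

def viterbi(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring tag index path.

    emissions[t][k]   : score of tag k at position t
    transitions[p][c] : score of moving from tag p to tag c
    """
    num_tags = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for cur in range(num_tags):
            # Best previous tag for ending in `cur` at this position.
            best_prev = max(range(num_tags),
                            key=lambda p: score[p] + transitions[p][cur])
            new_score.append(score[best_prev] + transitions[best_prev][cur] + emit[cur])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag.
    best = max(range(num_tags), key=lambda t: score[t])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    return path[::-1]

TAGS = ["O", "B", "I"]
# Toy transition scores: moving from "O" directly to "I" is heavily penalized.
transitions = [
    [0.0, 0.0, -10.0],  # from O
    [0.0, 0.0, 0.0],    # from B
    [0.0, 0.0, 0.0],    # from I
]
# Toy emission scores for a two-token sentence.
emissions = [
    [1.0, 0.8, 0.0],
    [0.0, 0.0, 2.0],
]

greedy = [max(range(len(TAGS)), key=lambda k: row[k]) for row in emissions]
print([TAGS[i] for i in greedy])                           # ['O', 'I']
print([TAGS[i] for i in viterbi(emissions, transitions)])  # ['B', 'I']
```

Picking the best tag per token independently yields the invalid sequence O, I, while the transition penalty steers Viterbi decoding to B, I.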
Usage
NOTE: Running on a CUDA-capable GPU is recommended.
!pip install pytorch-crf huggingface_hub
from huggingface_hub import hf_hub_download, PyTorchModelHubMixin
import torch
import torch.nn as nn
from torchcrf import CRF
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
REPO_ID = "hoangduy0610/uit-cs112-sentiment-span-extraction-vietnamese"
FILENAME = "bilstm_crf_state.pth"
CONFIG = {
    'vocab_size': 18182,
    'tag_to_idx': {
        'I-CAMERA#NEUTRAL': 0,
        'B-SER&ACC#NEGATIVE': 1,
        'B-SER&ACC#NEUTRAL': 2,
        'B-CAMERA#NEUTRAL': 3,
        'B-STORAGE#NEUTRAL': 4,
        'I-STORAGE#NEUTRAL': 5,
        'I-BATTERY#NEUTRAL': 6,
        'B-STORAGE#NEGATIVE': 7,
        'I-STORAGE#NEGATIVE': 8,
        'B-GENERAL#POSITIVE': 9,
        'I-CAMERA#NEGATIVE': 10,
        'B-GENERAL#NEUTRAL': 11,
        'I-DESIGN#NEUTRAL': 12,
        'I-PERFORMANCE#POSITIVE': 13,
        'B-SCREEN#NEUTRAL': 14,
        'B-FEATURES#POSITIVE': 15,
        'I-DESIGN#NEGATIVE': 16,
        'I-BATTERY#NEGATIVE': 17,
        'B-FEATURES#NEGATIVE': 18,
        'I-DESIGN#POSITIVE': 19,
        'I-FEATURES#NEGATIVE': 20,
        'B-CAMERA#NEGATIVE': 21,
        'I-PRICE#NEUTRAL': 22,
        'I-FEATURES#NEUTRAL': 23,
        'B-PRICE#POSITIVE': 24,
        'I-PERFORMANCE#NEUTRAL': 25,
        'I-FEATURES#POSITIVE': 26,
        'I-PERFORMANCE#NEGATIVE': 27,
        'B-SER&ACC#POSITIVE': 28,
        'B-PRICE#NEGATIVE': 29,
        'I-SCREEN#NEUTRAL': 30,
        'B-DESIGN#NEUTRAL': 31,
        'B-BATTERY#POSITIVE': 32,
        'B-STORAGE#POSITIVE': 33,
        'I-GENERAL#POSITIVE': 34,
        'B-CAMERA#POSITIVE': 35,
        'B-PERFORMANCE#NEGATIVE': 36,
        'B-PERFORMANCE#NEUTRAL': 37,
        'B-GENERAL#NEGATIVE': 38,
        'I-CAMERA#POSITIVE': 39,
        'I-BATTERY#POSITIVE': 40,
        'I-GENERAL#NEGATIVE': 41,
        'B-BATTERY#NEUTRAL': 42,
        'I-SER&ACC#POSITIVE': 43,
        'I-SER&ACC#NEUTRAL': 44,
        'I-SER&ACC#NEGATIVE': 45,
        'I-PRICE#NEGATIVE': 46,
        'B-FEATURES#NEUTRAL': 47,
        'B-SCREEN#POSITIVE': 48,
        'B-BATTERY#NEGATIVE': 49,
        'I-SCREEN#NEGATIVE': 50,
        'B-SCREEN#NEGATIVE': 51,
        'O': 52,
        'I-GENERAL#NEUTRAL': 53,
        'I-SCREEN#POSITIVE': 54,
        'B-PERFORMANCE#POSITIVE': 55,
        'B-PRICE#NEUTRAL': 56,
        'I-STORAGE#POSITIVE': 57,
        'B-DESIGN#NEGATIVE': 58,
        'I-PRICE#POSITIVE': 59,
        'B-DESIGN#POSITIVE': 60
    },
    'embedding_dim': 100,
    'hidden_dim': 256,
    'lstm_layers': 1
}
# BiLSTM-CRF model
class BiLSTM_CRF(nn.Module, PyTorchModelHubMixin):
    def __init__(self, config: dict):
        super().__init__()
        self.embedding = nn.Embedding(config["vocab_size"], config["embedding_dim"])
        # Each direction gets hidden_dim // 2 units, so the concatenated
        # BiLSTM output has hidden_dim features per token.
        self.lstm = nn.LSTM(config["embedding_dim"], config["hidden_dim"] // 2,
                            num_layers=config["lstm_layers"], bidirectional=True, batch_first=True)
        # Projects BiLSTM features to one score per tag.
        self.ln = nn.Linear(config["hidden_dim"], len(config["tag_to_idx"]))
        self.crf = CRF(len(config["tag_to_idx"]))

    def forward(self, sentence):
        # Returns per-token emission scores, not decoded tags.
        embeds = self.embedding(sentence)
        lstm_out, _ = self.lstm(embeds)
        emissions = self.ln(lstm_out)
        return emissions

    def summary(self):
        # Print the module tree followed by a per-parameter table.
        print(self)
        print('\n\nModel Summary:')
        print('=================================================================')
        print(f'{"Layer (type)":<30} {"Output Shape":<30} {"Param #":<10}')
        print('=================================================================')
        total_params = 0
        for name, param in self.named_parameters():
            print(f'{name:<30} {str(param.shape):<30} {param.numel():<10}')
            total_params += param.numel()
        print('=================================================================')
        print(f'Total params: {total_params}')
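# Build the model, then load the pretrained weights from the Hugging Face Hub.
loaded_model = BiLSTM_CRF(config=CONFIG).to(device)
weights_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
# map_location moves the saved tensors onto the selected device,
# so loading also works on machines without CUDA.
loaded_model.load_state_dict(torch.load(weights_path, map_location=device))
loaded_model.eval()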
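The model's forward pass returns emission scores, so a decoding step is still needed to obtain tags. With pytorch-crf this is typically `CRF.decode`, which returns lists of tag indices. The snippet above does not show that step or the word-to-index vocabulary, so the sketch below covers only the final mapping from indices back to tag names. The three-entry `tag_to_idx` is a subset copied from CONFIG, and the index path is made up for illustration.

```python
# Subset of CONFIG["tag_to_idx"], copied here so the example is self-contained.
tag_to_idx = {"B-CAMERA#NEGATIVE": 21, "I-CAMERA#NEGATIVE": 10, "O": 52}

# Invert the mapping: index -> tag name.
idx_to_tag = {idx: tag for tag, idx in tag_to_idx.items()}

def indices_to_tags(paths):
    """Map a batch of tag-index paths to tag-name paths."""
    return [[idx_to_tag[i] for i in path] for path in paths]

# Hypothetical decoder output for a batch containing one four-token sentence.
decoded = [[52, 21, 10, 10]]
print(indices_to_tags(decoded))
```

In practice the full CONFIG["tag_to_idx"] mapping would be inverted instead of the subset used here.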
Performance
The model can be evaluated with the standard sequence-labeling metrics: precision, recall, and F1-score. Qualitative analysis of the predicted spans can additionally show how accurately the model extracts sentiment spans from text.
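One common way to compute these metrics for span extraction is exact-match scoring, where a predicted span counts as correct only if its boundaries and label both match a gold span. The sketch below assumes that convention and a (start, end, label) span representation. Both are assumptions, since the evaluation protocol used for this model is not stated here, and the example spans are made up.

```python
def span_prf(gold, pred):
    """Exact-match precision, recall, and F1 over (start, end, label) spans.

    Scores a single sentence; for a corpus, add a sentence id to each span
    so that spans from different sentences cannot match each other.
    """
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative spans: the second prediction has the wrong end boundary.
gold = [(0, 2, "BATTERY#POSITIVE"), (3, 6, "CAMERA#NEGATIVE")]
pred = [(0, 2, "BATTERY#POSITIVE"), (3, 5, "CAMERA#NEGATIVE")]
print(span_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

Exact matching is strict: a prediction that is off by one token earns no credit, so partial-overlap scoring is sometimes reported alongside it.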
Future Work
- Fine-tuning the model on larger datasets to improve performance.
- Experimenting with different variations of the model architecture to further enhance results.
- Exploring techniques for domain adaptation to improve model generalization across different text domains.
License
This project is licensed under the MIT License - see the LICENSE file for details.