Sentiment Span Extraction in Vietnamese using BiLSTM-CRF Model

This repository contains the implementation of a BiLSTM-CRF model for the task of Sentiment Span Extraction in Vietnamese. The model is built using PyTorch and trained on the UIT-ViSD4SA dataset.

Task Description

Sentiment span extraction identifies the spans of text in a document that express sentiment. Compared with sentence- or document-level sentiment classification, it gives a more granular view of where sentiment is expressed and what it refers to.

Dataset

The model is trained on the UIT-ViSD4SA dataset, which is curated for sentiment analysis of Vietnamese text. The dataset consists of annotated text samples in which sentiment spans are marked along with their sentiment labels. As the tag set in the Usage section shows, each label combines an aspect category with a polarity, for example CAMERA#NEGATIVE.
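As a rough illustration of how token-level tags relate to labeled spans, the sketch below groups a BIO tag sequence into (start, end, label) triples. It assumes token-level tags named like the keys of tag_to_idx in the Usage section (a B-/I- prefix followed by ASPECT#POLARITY). The function name bio_to_spans and the example sentence are hypothetical and are not taken from this repository or from the dataset.

```python
from typing import List, Optional, Tuple

# (start_token, end_token_exclusive, label)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Group a BIO tag sequence into labeled spans.

    Design choice in this sketch: an I- tag that does not continue a span
    with the same label is ignored rather than starting a new span.
    """
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        # "B-CAMERA#NEGATIVE" -> ("B", "CAMERA#NEGATIVE"); "O" -> ("O", "")
        prefix, _, tag_label = tag.partition("-")
        if prefix == "I" and label is not None and tag_label == label:
            continue  # still inside the current span
        if label is not None:
            spans.append((start, i, label))  # close the open span
            start, label = None, None
        if prefix == "B":
            start, label = i, tag_label  # open a new span
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


# Hypothetical example sentence and tags, for illustration only.
tokens = ["pin", "tốt", ",", "camera", "hơi", "mờ"]
tags = [
    "B-BATTERY#POSITIVE", "I-BATTERY#POSITIVE", "O",
    "B-CAMERA#NEGATIVE", "I-CAMERA#NEGATIVE", "I-CAMERA#NEGATIVE",
]
for start, end, label in bio_to_spans(tags):
    print(label, tokens[start:end])
```

Each span is reported with an exclusive end index, so tokens[start:end] recovers the span text directly.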

Model Architecture

The model is a Bidirectional Long Short-Term Memory (BiLSTM) network followed by a Conditional Random Field (CRF) layer. The BiLSTM reads each sentence in both directions and produces a score for every tag at every token. The CRF layer models dependencies between adjacent labels (for example, that an I- tag should follow a B- or I- tag of the same type), which generally helps sequence labeling tasks such as sentiment span extraction.
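To make the role of the CRF layer more concrete, here is a minimal pure-Python sketch of Viterbi decoding, the standard way to find the best tag path from per-token emission scores and tag-to-tag transition scores. It is an illustration only and not the pytorch-crf implementation: it omits the start and end transition scores that library also learns, and the three-tag set and all numbers are made up.

```python
from typing import List


def viterbi(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring tag path.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each tag
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for arriving at tag j
            best_prev = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][j] + emissions[t][j])
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag
    path = [max(range(num_tags), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy tag set: 0 = O, 1 = B, 2 = I. The O -> I transition is heavily penalized.
emissions = [[2.0, 1.0, 0.0], [0.0, 1.0, 1.5], [0.0, 0.0, 2.0]]
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

greedy = [max(range(3), key=lambda j: row[j]) for row in emissions]
print("greedy :", greedy)                           # O, I, I (invalid BIO)
print("viterbi:", viterbi(emissions, transitions))  # O, B, I
```

Picking the best tag per token independently yields O, I, I, while the transition penalty steers the joint decoding to O, B, I.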

Usage

NOTE: Running on a CUDA-capable GPU is recommended.

!pip install pytorch-crf huggingface_hub

from huggingface_hub import hf_hub_download, PyTorchModelHubMixin
import torch
import torch.nn as nn
from torchcrf import CRF

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

REPO_ID = "hoangduy0610/uit-cs112-sentiment-span-extraction-vietnamese"
FILENAME = "bilstm_crf_state.pth"
CONFIG = {
    'vocab_size': 18182,
    'tag_to_idx': {
        'I-CAMERA#NEUTRAL': 0,
        'B-SER&ACC#NEGATIVE': 1,
        'B-SER&ACC#NEUTRAL': 2,
        'B-CAMERA#NEUTRAL': 3,
        'B-STORAGE#NEUTRAL': 4,
        'I-STORAGE#NEUTRAL': 5,
        'I-BATTERY#NEUTRAL': 6,
        'B-STORAGE#NEGATIVE': 7,
        'I-STORAGE#NEGATIVE': 8,
        'B-GENERAL#POSITIVE': 9,
        'I-CAMERA#NEGATIVE': 10,
        'B-GENERAL#NEUTRAL': 11,
        'I-DESIGN#NEUTRAL': 12,
        'I-PERFORMANCE#POSITIVE': 13,
        'B-SCREEN#NEUTRAL': 14,
        'B-FEATURES#POSITIVE': 15,
        'I-DESIGN#NEGATIVE': 16,
        'I-BATTERY#NEGATIVE': 17,
        'B-FEATURES#NEGATIVE': 18,
        'I-DESIGN#POSITIVE': 19,
        'I-FEATURES#NEGATIVE': 20,
        'B-CAMERA#NEGATIVE': 21,
        'I-PRICE#NEUTRAL': 22,
        'I-FEATURES#NEUTRAL': 23,
        'B-PRICE#POSITIVE': 24,
        'I-PERFORMANCE#NEUTRAL': 25,
        'I-FEATURES#POSITIVE': 26,
        'I-PERFORMANCE#NEGATIVE': 27,
        'B-SER&ACC#POSITIVE': 28,
        'B-PRICE#NEGATIVE': 29,
        'I-SCREEN#NEUTRAL': 30,
        'B-DESIGN#NEUTRAL': 31,
        'B-BATTERY#POSITIVE': 32,
        'B-STORAGE#POSITIVE': 33,
        'I-GENERAL#POSITIVE': 34,
        'B-CAMERA#POSITIVE': 35,
        'B-PERFORMANCE#NEGATIVE': 36,
        'B-PERFORMANCE#NEUTRAL': 37,
        'B-GENERAL#NEGATIVE': 38,
        'I-CAMERA#POSITIVE': 39,
        'I-BATTERY#POSITIVE': 40,
        'I-GENERAL#NEGATIVE': 41,
        'B-BATTERY#NEUTRAL': 42,
        'I-SER&ACC#POSITIVE': 43,
        'I-SER&ACC#NEUTRAL': 44,
        'I-SER&ACC#NEGATIVE': 45,
        'I-PRICE#NEGATIVE': 46,
        'B-FEATURES#NEUTRAL': 47,
        'B-SCREEN#POSITIVE': 48,
        'B-BATTERY#NEGATIVE': 49,
        'I-SCREEN#NEGATIVE': 50,
        'B-SCREEN#NEGATIVE': 51,
        'O': 52,
        'I-GENERAL#NEUTRAL': 53,
        'I-SCREEN#POSITIVE': 54,
        'B-PERFORMANCE#POSITIVE': 55,
        'B-PRICE#NEUTRAL': 56,
        'I-STORAGE#POSITIVE': 57,
        'B-DESIGN#NEGATIVE': 58,
        'I-PRICE#POSITIVE': 59,
        'B-DESIGN#POSITIVE': 60
    },
    'embedding_dim': 100,
    'hidden_dim': 256,
    'lstm_layers': 1
}

# BiLSTM-CRF model
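class BiLSTM_CRF(nn.Module, PyTorchModelHubMixin):
    def __init__(self, config: dict):
        super().__init__()
        # Token IDs -> dense vectors
        self.embedding = nn.Embedding(config["vocab_size"], config["embedding_dim"])
        # hidden_dim is split across the two directions, so the concatenated
        # BiLSTM output is hidden_dim wide
        self.lstm = nn.LSTM(config["embedding_dim"], config["hidden_dim"] // 2,
                            num_layers=config["lstm_layers"], bidirectional=True, batch_first=True)
        # Project each token's BiLSTM state to one emission score per tag
        self.ln = nn.Linear(config["hidden_dim"], len(config["tag_to_idx"]))
        # Note: pytorch-crf defaults to batch_first=False, unlike the LSTM above
        self.crf = CRF(len(config["tag_to_idx"]))

    def forward(self, sentence):
        # sentence: LongTensor of token IDs, shape (batch, seq_len)
        embeds = self.embedding(sentence)
        lstm_out, _ = self.lstm(embeds)
        # emissions: (batch, seq_len, num_tags); the CRF layer is not applied here
        emissions = self.ln(lstm_out)
        return emissions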
    
    def summary(self):
        print(self)
        print('\n\nModel Summary:')
        print('=================================================================')
        print(f"{'Parameter':<30} {'Shape':<30} {'Param #':<10}")
        print('=================================================================')
        total_params = 0
        for name, param in self.named_parameters():
            print(f'{name:<30} {str(param.shape):<30} {param.numel():<10}')
            total_params += param.numel()
        print('=================================================================')
        print(f'Total params: {total_params}')

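loaded_model = BiLSTM_CRF(config=CONFIG).to(device)
# map_location lets a checkpoint saved on a GPU load on a CPU-only machine
state_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
loaded_model.load_state_dict(torch.load(state_path, map_location=device))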

loaded_model.eval()
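
The forward method above returns emission scores rather than tags. In pytorch-crf, CRF.decode(emissions) returns a list of tag-index lists, one per sequence, and those indices still have to be mapped back to tag names. The sketch below covers only that last mapping step, using hypothetical helper names (invert_tag_map, paths_to_tags) and a made-up decode result. Two things are not stated in this README and are left open here: the vocabulary that maps words to token IDs, and whether emissions must be transposed before decoding (the CRF is built with the default batch_first=False, which expects (seq_len, batch, num_tags), while forward returns (batch, seq_len, num_tags)).

```python
from typing import Dict, List


def invert_tag_map(tag_to_idx: Dict[str, int]) -> Dict[int, str]:
    """Build the index -> tag lookup from a tag -> index mapping."""
    idx_to_tag = {idx: tag for tag, idx in tag_to_idx.items()}
    if len(idx_to_tag) != len(tag_to_idx):
        raise ValueError("tag_to_idx maps two tags to the same index")
    return idx_to_tag


def paths_to_tags(paths: List[List[int]], idx_to_tag: Dict[int, str]) -> List[List[str]]:
    """Convert decoded index paths (one list per sentence) to tag names."""
    return [[idx_to_tag[i] for i in path] for path in paths]


# Excerpt of CONFIG['tag_to_idx'] from above; pass the full mapping in practice.
tag_to_idx_excerpt = {"B-CAMERA#NEGATIVE": 21, "I-CAMERA#NEGATIVE": 10, "O": 52}

# Hypothetical decode output for a single 4-token sentence.
example_paths = [[52, 21, 10, 52]]
print(paths_to_tags(example_paths, invert_tag_map(tag_to_idx_excerpt)))
```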

Performance

The model can be evaluated with the standard sequence labeling metrics: precision, recall, and F1-score. A qualitative review of the predicted spans can complement these numbers by showing where extraction succeeds or fails.
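
One common way to compute these metrics for span extraction is exact-match scoring at the span level, where a predicted span counts as correct only if its boundaries and label both match a gold span. The sketch below shows that variant for a single sentence. Which variant was used for this model is not stated here, and the function name span_prf and the example spans are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# (start_token, end_token_exclusive, label)
Span = Tuple[int, int, str]


def span_prf(gold: Iterable[Span], pred: Iterable[Span]) -> Dict[str, float]:
    """Exact-match span precision, recall, and F1."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up spans: one exact match, one boundary error, one spurious prediction.
gold_spans = [(0, 2, "BATTERY#POSITIVE"), (3, 6, "CAMERA#NEGATIVE")]
pred_spans = [(0, 2, "BATTERY#POSITIVE"), (3, 5, "CAMERA#NEGATIVE"), (7, 8, "PRICE#NEUTRAL")]
print(span_prf(gold_spans, pred_spans))
```

To score a whole corpus with the same function, a sentence identifier can be added to each span tuple so that spans from different sentences never collide.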

Future Work

- Fine-tuning the model on larger datasets to improve performance.
- Experimenting with different variations of the model architecture to further enhance results.
- Exploring techniques for domain adaptation to improve model generalization across different text domains.

License

This project is licensed under the MIT License - see the LICENSE file for details.