Model Card: BARTpho fine-tuned for Vietnamese question answering

BARTpho is a pre-trained sequence-to-sequence language model for Vietnamese. It comes in two versions, BARTpho-word and BARTpho-syllable, which take word-level and syllable-level input respectively, and it performs well on generative tasks such as Vietnamese text summarization. The syllable version, BARTpho-syllable, works as follows:

Syllable-based processing: Unlike word-level models, which need their input to be word-segmented first, BARTpho-syllable works on raw text at the syllable level. In written Vietnamese, whitespace separates syllables rather than words, and a single word often spans several syllables, so syllable-level input does not depend on an external word segmenter.
Improved performance: By modeling the language at the syllable level, BARTpho-syllable can potentially capture the nuances of Vietnamese more effectively. On Vietnamese text summarization, the BARTpho paper reports that BARTpho outperforms the strong multilingual baseline mBART.
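
To make the word-versus-syllable distinction concrete, here is a minimal sketch. It assumes the common convention of Vietnamese word segmenters, where the syllables of one word are joined with underscores. The segmented sentence is hand-written for illustration and the helper names are hypothetical. The sketch only illustrates input granularity; it is not BARTpho's actual tokenizer, which additionally applies subword tokenization.

```python
def split_words(segmented_text: str) -> list[str]:
    """Word-level view: each whitespace-separated unit is one word."""
    return segmented_text.split()


def split_syllables(segmented_text: str) -> list[str]:
    """Syllable-level view: undo the underscore joins, then split on whitespace."""
    return segmented_text.replace("_", " ").split()


# Hand-segmented example (assumption): underscores join the syllables of one word.
segmented = "Trường đại_học đào_tạo công_nghệ thông_tin"

words = split_words(segmented)
syllables = split_syllables(segmented)

print(len(words), words)          # word-level units
print(len(syllables), syllables)  # syllable-level units
```

The same sentence yields fewer, longer units at the word level and more, shorter units at the syllable level; a syllable-level model can consume the raw text without the segmentation step.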

Vietnamese presents particular challenges for natural language processing (NLP) because of its tonal nature and the absence of explicit word boundaries in written text. BARTpho, a pre-trained sequence-to-sequence model designed specifically for Vietnamese, is a strong starting point for a range of NLP applications. This model card describes a BARTpho model fine-tuned for question answering (QA), a key component of systems that need to understand and respond to Vietnamese queries.

Traditional QA approaches often rely on word-based models, which for Vietnamese depend on a separate word segmentation step. Vietnamese is an isolating language in which many words consist of several whitespace-separated syllables, so segmentation errors can carry over into the QA model. The syllable-based version, BARTpho-syllable, avoids this step by processing text at the syllable level, which may lead to more accurate answer extraction.

Dataset

The dataset is UIT-ViQuAD, a Vietnamese QA dataset introduced by Nguyen et al. (2020). It was also used in the VLSP 2021 ViMRC shared task.

Original Version:

Comprises over 23,000 question-answer (QA) pairs.
Sourced from 174 Vietnamese Wikipedia articles.

UIT-ViQuAD 2.0:

Adds over 12,000 unanswerable questions to the same passages.
Includes new fields: is_impossible and plausible_answer.
Together, these additions make the dataset cover both answerable and unanswerable questions.
The dataset has been refined to eliminate a few duplicated questions and answers.

Fields in UIT-ViQuAD 2.0:

Context: The passage from which questions are derived.
Question: The question to be answered based on the context.
Answer: The correct answer extracted from the context for answerable questions.
is_impossible: A boolean indicating if the question is unanswerable (True) or answerable (False).
plausible_answer: For unanswerable questions, this provides a seemingly correct but actually incorrect answer extracted from the context.
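
As a rough illustration of how these fields fit together, the sketch below models one record as a flat Python dataclass. The flat layout, the class name, the helper names, and the example texts are all assumptions made for illustration; the released dataset files may organize the same fields differently (for example, nested SQuAD-style).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAExample:
    context: str
    question: str
    answer: Optional[str]            # gold answer; None when unanswerable
    is_impossible: bool
    plausible_answer: Optional[str]  # only set for unanswerable questions


def gold_answer(example: QAExample) -> str:
    """Text a model should predict: empty string for unanswerable questions."""
    if example.is_impossible:
        return ""
    return example.answer or ""


def is_consistent(example: QAExample) -> bool:
    """Check that the relevant span is extracted from the context."""
    span = example.plausible_answer if example.is_impossible else example.answer
    return span is not None and span in example.context


# Invented records (not taken from UIT-ViQuAD).
context = "The university was established on 08/06/2006 and its campus covers more than 14 hectares."
answerable = QAExample(context, "When was the university established?", "08/06/2006", False, None)
unanswerable = QAExample(context, "When was the campus renovated?", None, True, "08/06/2006")

print(repr(gold_answer(answerable)), repr(gold_answer(unanswerable)))
print(is_consistent(answerable), is_consistent(unanswerable))
```

Note that the plausible answer of an unanswerable question is never returned as the gold answer; it only records the span a model is likely to be fooled by.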

Hyperparameters used for fine-tuning

epochs = 4
batch_size = 16
learning_rate = 2e-5
evaluation_strategy = "steps"
save_total_limit = 1
save_steps = 2000
eval_steps = 2000
gradient_accumulation_steps = 2
eval_accumulation_steps = 2
load_best_model_at_end = True
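
Two consequences of these settings can be checked with simple arithmetic. The sketch below assumes the batch size is per device, that training ran on a single device (the card does not say), and it uses Hugging Face Trainer-style key names as an assumed mapping of the values above. Exact argument names vary between library versions, so treat the dictionary as illustrative rather than as the author's training script.

```python
# Values from the list above; key names are an assumed Trainer-style mapping.
hyperparameters = {
    "num_train_epochs": 4,
    "per_device_train_batch_size": 16,
    "learning_rate": 2e-5,
    "save_steps": 2000,
    "eval_steps": 2000,
    "gradient_accumulation_steps": 2,
    "load_best_model_at_end": True,
}

NUM_DEVICES = 1  # assumption: the number of devices is not stated in the card

# Gradients are accumulated over several batches before each optimizer update,
# so the number of examples per update is larger than the batch size.
effective_batch_size = (
    hyperparameters["per_device_train_batch_size"]
    * hyperparameters["gradient_accumulation_steps"]
    * NUM_DEVICES
)

# Loading the best model at the end only works if every evaluation point
# coincides with a saved checkpoint, i.e. save_steps is a multiple of eval_steps.
checkpoints_align = hyperparameters["save_steps"] % hyperparameters["eval_steps"] == 0

print(effective_batch_size)  # 32
print(checkpoints_align)     # True
```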

Best result

epoch = 3.250207813798838
grad_norm = 136582374572754
learning_rate = 3.3610648918469217e-06
loss = 0.9397
step = 2000
eval_loss = 0.7907648682594299

Inference

Using a pipeline as a high-level helper

from transformers import pipeline

context="""
Trường Đại học Công nghệ Thông tin (ĐH CNTT), Đại học Quốc gia Thành phố Hồ Chí Minh (ĐHQG-HCM) là trường đại học công lập đào tạo về công nghệ thông tin và truyền thông (CNTT&TT) được thành lập theo quyết định số 134/2006/QĐ-TTg ngày 08/06/2006 của Thủ tướng Chính phủ. Là trường thành viên của ĐHQG-HCM, trường ĐH CNTT có nhiệm vụ đào tạo nguồn nhân lực công nghệ thông tin chất lượng cao, góp phần tích cực vào sự phát triển của nền công nghiệp công nghệ thông tin Việt Nam, đồng thời tiến hành nghiên cứu khoa học và chuyển giao công nghệ thông tin tiên tiến, đặc biệt là hướng vào các ứng dụng nhằm góp phần đẩy mạnh sự nghiệp công nghiệp hóa, hiện đại hóa đất nước.
Sau hơn 10 năm xây dựng và phát triển, hiện trường ĐH CNTT sở hữu cơ sở vật chất gồm khu học tập, nghiên cứu và làm việc được đầu tư xây dựng khang trang, hiện đại với tổng diện tích trên 14 hecta trong khuôn viên khu đô thị ĐHQG-HCM.
"""
question="""
Trường UIT mang trong mình nhiệm vụ gì?
"""

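# Build a question-answering pipeline from the fine-tuned checkpoint.
pipe = pipeline("question-answering", model="PhucDanh/Bartpho-fine-tuning-model-for-question-answering")

# The result is a dict with the answer text, its score, and character offsets.
result = pipe(question=question, context=context)
print(result)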

Load model directly

from transformers import AutoTokenizer
from transformers import AutoModelForQuestionAnswering
import torch

context="""
Trường Đại học Công nghệ Thông tin (ĐH CNTT), Đại học Quốc gia Thành phố Hồ Chí Minh (ĐHQG-HCM) là trường đại học công lập đào tạo về công nghệ thông tin và truyền thông (CNTT&TT) được thành lập theo quyết định số 134/2006/QĐ-TTg ngày 08/06/2006 của Thủ tướng Chính phủ. Là trường thành viên của ĐHQG-HCM, trường ĐH CNTT có nhiệm vụ đào tạo nguồn nhân lực công nghệ thông tin chất lượng cao, góp phần tích cực vào sự phát triển của nền công nghiệp công nghệ thông tin Việt Nam, đồng thời tiến hành nghiên cứu khoa học và chuyển giao công nghệ thông tin tiên tiến, đặc biệt là hướng vào các ứng dụng nhằm góp phần đẩy mạnh sự nghiệp công nghiệp hóa, hiện đại hóa đất nước.
Sau hơn 10 năm xây dựng và phát triển, hiện trường ĐH CNTT sở hữu cơ sở vật chất gồm khu học tập, nghiên cứu và làm việc được đầu tư xây dựng khang trang, hiện đại với tổng diện tích trên 14 hecta trong khuôn viên khu đô thị ĐHQG-HCM.
"""
question="""
Trường UIT mang trong mình nhiệm vụ gì?
"""

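tokenizer = AutoTokenizer.from_pretrained("PhucDanh/Bartpho-fine-tuning-model-for-question-answering")
# BART-style models do not accept token_type_ids, so make sure the tokenizer does not return them.
if "token_type_ids" in tokenizer.model_input_names:
    tokenizer.model_input_names.remove("token_type_ids")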

inputs = tokenizer(question, context, return_tensors="pt")

model = AutoModelForQuestionAnswering.from_pretrained("PhucDanh/Bartpho-fine-tuning-model-for-question-answering")
with torch.no_grad():
    outputs = model(**inputs)

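# Take the most likely start and end token positions independently.
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

# Decode the tokens between the predicted start and end (inclusive) back to text.
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
print(tokenizer.decode(predict_answer_tokens))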
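
Taking the argmax of the start and end logits independently can produce an end position that comes before the start, which decodes to an empty answer. A common alternative is to search for the best valid span. The sketch below is one possible way to do that in plain Python; the function name, the `max_answer_len` parameter, and the toy logits are illustrative assumptions, not part of this model's code. With the tensors from the snippet above, the inputs could be obtained with `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start + end logit with start <= end."""
    best = (0, 0, float("-inf"))
    for start, start_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if score > best[2]:
                best = (start, end, score)
    return best


# Toy logits (made up) where the independent argmax is inconsistent.
start_logits = [0.1, 2.0, 0.3, 2.5, 0.2]
end_logits = [0.0, 3.0, 0.5, 1.0, 2.2]

naive = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
start, end, score = best_span(start_logits, end_logits)

print(naive)         # (3, 1): end precedes start
print((start, end))  # (1, 1): best valid span
```

The nested loop is quadratic in the sequence length but bounded by `max_answer_len`, which is usually acceptable for single passages.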

Inference API

An API token is required for authentication; contact the author to obtain one.

import requests

API_URL = "https://api-inference.huggingface.co/models/PhucDanh/Bartpho-fine-tuning-model-for-question-answering"
headers = {"Authorization": "Bearer hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
    
output = query({
    "inputs": {
        "question": "What is my name?",
        "context": "My name is Clara and I live in Berkeley."
    },
})
print(output)
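
The JSON returned by the API should be checked before use. The sketch below assumes a successful response has the same shape as the question-answering pipeline output (keys `score`, `start`, `end`, `answer`) and that failures come back as a dict with an `error` key; both response dicts are mocked so the example runs without network access, and the helper name and `min_score` threshold are hypothetical.

```python
def extract_answer(output, min_score=0.0):
    """Return the answer string, or None for error or low-confidence responses."""
    if not isinstance(output, dict) or "error" in output:
        return None
    if output.get("score", 0.0) < min_score:
        return None
    return output.get("answer")


# Mocked responses (assumptions about the response shape, with made-up scores).
ok_response = {"score": 0.93, "start": 11, "end": 16, "answer": "Clara"}
loading_response = {"error": "Model is currently loading", "estimated_time": 20.0}

print(extract_answer(ok_response))
print(extract_answer(loading_response))
print(extract_answer(ok_response, min_score=0.99))
```

Returning `None` for both failure modes keeps the calling code simple; a caller that needs to retry while the model loads would have to inspect the raw response instead.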

Reference

Model:

@article{tran2021bartpho,
  title={BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese},
  author={Tran, Nguyen Luong and Le, Duong Minh and Nguyen, Dat Quoc},
  journal={arXiv preprint arXiv:2109.09701},
  year={2021}
}

Dataset:

@article{Nguyen_2022,
  title={VLSP 2021-ViMRC Challenge: Vietnamese Machine Reading Comprehension},
  author={Nguyen, Kiet and Tran, Son Quoc and Nguyen, Luan Thanh and Huynh, Tin Van and Luu, Son Thanh and Nguyen, Ngan Luu-Thuy},
  journal={VNU Journal of Science: Computer Science and Communication Engineering},
  publisher={Vietnam National University Journal of Science},
  volume={38},
  number={2},
  ISSN={2615-9260},
  DOI={10.25073/2588-1086/vnucsce.340},
  url={http://dx.doi.org/10.25073/2588-1086/vnucsce.340},
  year={2022},
  month=dec
}
