
Model Card for BARTpho Fine-Tuned for Question Answering

BARTpho is a pre-trained sequence-to-sequence model for Vietnamese that delivers strong results on generative tasks such as Vietnamese text summarization. It comes in two versions, one operating on words and one on syllables.
The syllable version, BARTpho-syllable, takes a distinctive approach to processing Vietnamese text. Here is how it works:

  • Syllable-based processing: Unlike traditional word-based models, BARTpho-syllable breaks Vietnamese text down into individual syllables. This is particularly well suited to Vietnamese, where written words are built from syllables that each carry meaning.
  • Improved performance: By modeling the language at the syllable level, BARTpho-syllable can capture the nuances of Vietnamese more effectively. This has been demonstrated on tasks such as text summarization, where BARTpho-syllable outperforms other strong baselines.
  • Vietnamese poses particular challenges for natural language processing (NLP): it is a tonal language, and word boundaries are not marked in writing, since a single word may span several syllables. BARTpho, a pre-trained sequence-to-sequence model designed specifically for Vietnamese, provides a strong foundation for such applications. This card describes fine-tuning BARTpho for question answering (QA), a key component of systems that understand and respond to Vietnamese queries.
  • Traditional QA approaches often rely on word-based models, which can struggle with Vietnamese because many of its words are compounds of several syllables with no explicit boundaries. BARTpho, particularly the syllable-based version (BARTpho-syllable), offers a distinct advantage: by processing text at the syllable level, it can capture finer nuances of Vietnamese, potentially leading to more accurate answer extraction. A usage sketch for the fine-tuned model follows this list.
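
A minimal usage sketch is shown below. It assumes the checkpoint published as PhucDanh/Bartpho-fine-tuning-on-UIT-Course-information exposes an extractive question-answering head compatible with the transformers question-answering pipeline; the Vietnamese context and question are invented examples, not taken from the datasets described below.

```python
# Minimal usage sketch (assumptions: the checkpoint works with the
# "question-answering" pipeline; the context/question below are made up).
from transformers import pipeline

MODEL_ID = "PhucDanh/Bartpho-fine-tuning-on-UIT-Course-information"

qa = pipeline("question-answering", model=MODEL_ID, tokenizer=MODEL_ID)

context = (
    "Môn Nhập môn lập trình cung cấp các kiến thức cơ bản về lập trình "
    "và kỹ năng giải quyết vấn đề bằng ngôn ngữ C/C++."
)
question = "Môn Nhập môn lập trình cung cấp những kiến thức gì?"

result = qa(question=question, context=context)
print(result["answer"], result["score"])
```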

Datasets

UIT-ViQuAD

UIT-ViQuAD is a Vietnamese QA dataset created by Nguyen et al. (2020) and introduced in their research paper; it has since also been used in a shared task.
Original Version

  • Comprises over 23,000 question-answer (QA) pairs.
  • Sourced from 174 Vietnamese Wikipedia articles.
UIT-ViQuAD 2.0
  • Adds over 12,000 unanswerable questions to the same passages.
  • Includes new fields: is_impossible and plausible_answer.
  • These additions make the dataset more comprehensive by covering both answerable and unanswerable questions.
  • The dataset has been refined to eliminate a few duplicated questions and answers.
Fields in UIT-ViQuAD 2.0
  • Context: The passage from which questions are derived.
  • Question: The question to be answered based on the context.
  • Answer: The correct answer extracted from the context for answerable questions.
  • is_impossible: A boolean indicating if the question is unanswerable (True) or answerable (False).
  • plausible_answer: For unanswerable questions, this provides a seemingly correct but actually incorrect answer extracted from the context.
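
To make the field list concrete, the records below show what SQuAD-style answerable and unanswerable entries look like; the text is invented for illustration and the exact key names are assumptions, not verbatim UIT-ViQuAD 2.0 entries.

```python
# Illustrative UIT-ViQuAD 2.0-style records (invented text; key names assumed).
answerable_example = {
    "context": "Hà Nội là thủ đô của Việt Nam.",
    "question": "Thủ đô của Việt Nam là gì?",
    "answers": {"text": ["Hà Nội"], "answer_start": [0]},
    "is_impossible": False,
}

unanswerable_example = {
    "context": "Hà Nội là thủ đô của Việt Nam.",
    "question": "Thủ đô của Pháp là gì?",
    "answers": {"text": [], "answer_start": []},
    "is_impossible": True,
    # A span that looks relevant but is actually incorrect for this question.
    "plausible_answer": {"text": ["Hà Nội"], "answer_start": [0]},
}
```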

UIT-CourseInfo

A Vietnamese QA dataset about UIT course information that we collected from freely accessible sources.
1. Data Collection Sources
We use data-crawling techniques to collect data automatically, focusing on information about course summaries and study programs at the University of Information Technology - VNU-HCM, gathered from the student.uit website. The initial collected data consists of 422 samples.
2. Data Labeling
To label the data, we use Label Studio, a platform that supports data annotation. To ensure efficient and fair labeling, each member evaluates and labels all of the assigned data samples, and we then apply a voting technique to determine the final label for each sample. This method improves the accuracy of the labeling process by incorporating opinions from multiple people, minimizing errors and ensuring fairness in the final decision; a sketch of the voting step is given below.
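
The sketch below is a hypothetical illustration of majority voting over per-sample annotator labels, not the project's actual Label Studio export handling.

```python
# Hypothetical majority-voting sketch: one list of annotator labels per sample.
from collections import Counter

def majority_vote(labels_per_sample):
    """Pick the most common annotator label for each sample."""
    return [Counter(labels).most_common(1)[0][0] for labels in labels_per_sample]

# Example: three annotators labelled two samples.
print(majority_vote([
    ["correct", "correct", "incorrect"],
    ["incorrect", "incorrect", "correct"],
]))  # -> ['correct', 'incorrect']
```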
3. Data Augmentation
With the 422 collected contexts, we use GPT as a tool for data augmentation, applying the few-shot prompting technique to generate question-answer pairs for the question-answering task.
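
The sketch below shows what such a few-shot prompt might look like using the OpenAI Python client; the model name, the demonstration pair, and the prompt wording are assumptions, not the prompts actually used for augmentation.

```python
# Hypothetical few-shot prompting sketch for generating a QA pair from a course
# context; model name, demonstration, and prompt wording are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT_PROMPT = """Generate a question-answer pair from the Vietnamese course description.
The answer must be a span copied verbatim from the context.

Context: Môn Cấu trúc dữ liệu và giải thuật trang bị kiến thức về danh sách, cây và đồ thị.
Q: Môn Cấu trúc dữ liệu và giải thuật trang bị kiến thức về gì?
A: danh sách, cây và đồ thị

Context: {context}
Q:"""

def generate_qa(context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": FEW_SHOT_PROMPT.format(context=context)}],
    )
    return response.choices[0].message.content
```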
4. Automated Data Verification
To verify the question-answer pairs generated by GPT, we use GPT itself for evaluation, designing prompts that ask whether an answer is appropriate for the given context. Additionally, we employ several Python logic functions to check that the answers do not exceed the scope of the provided context.
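
One such logic check can be as simple as requiring every generated answer to be a literal span of its context; the helper below sketches this idea and assumes each sample is stored as a dict with plain-string answer and context fields.

```python
# Sketch of a scope check: keep only pairs whose answer text occurs verbatim
# in the context it was generated from (field names are assumptions).
def answer_within_context(sample: dict) -> bool:
    answer = sample["answer"].strip()
    return bool(answer) and answer in sample["context"]

def filter_generated_pairs(samples: list[dict]) -> list[dict]:
    return [s for s in samples if answer_within_context(s)]
```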
5. Data Statistics and Observations
We split the data into training, validation, and test sets in a 7/2/1 ratio, corresponding to 2,961 samples for the training set, 846 samples for the validation set, and 423 samples for the test set.
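
The 7/2/1 split can be reproduced with two successive train_test_split calls, as in the sketch below; the use of scikit-learn and the random seed are assumptions, only the ratios come from this card.

```python
# Sketch of the 7/2/1 split (scikit-learn and the seed are assumptions).
from sklearn.model_selection import train_test_split

def split_dataset(samples):
    train, rest = train_test_split(samples, test_size=0.3, random_state=42)
    valid, test = train_test_split(rest, test_size=1 / 3, random_state=42)
    return train, valid, test  # ~70% / 20% / 10%

# With 4,230 samples this yields 2,961 / 846 / 423, matching the figures above.
```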

Model size: 396M parameters (Safetensors, F32 tensors)
