metadata

language:
  - vi
  - vn
  - en
tags:
  - question-answering
  - pytorch
datasets:
  - squad
license: cc-by-nc-4.0
pipeline_tag: question-answering
metrics:
  - squad
widget:
  - text: Bình là chuyên gia về gì ?
    context: >-
      Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh
      nhận chứng chỉ Google Developer Expert năm 2020
  - text: Bình được công nhận với danh hiệu gì ?
    context: >-
      Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh
      nhận chứng chỉ Google Developer Expert năm 2020

Model Description

Language model: XLM-RoBERTa
Fine-tune: MRCQuestionAnswering
Language: Vietnamese, English
Downstream-task: Extractive QA
Dataset (combine English and Vietnamese):

This model is intended for QA in Vietnamese, so the validation set is Vietnamese only (English also works fine). The evaluation results below are on the VLSP MRC 2021 test set. This experiment ranked TOP 1 on the leaderboard.

Model	EM	F1
large public_test_set	85.847	83.826
large private_test_set	82.072	78.071
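
For readers unfamiliar with the two columns, here is a minimal sketch of SQuAD-style exact match (EM) and token-level F1 for a single prediction. It assumes lowercasing and whitespace tokenization only. It is not the official VLSP MRC 2021 scoring script, and the function names `exact_match` and `token_f1` are illustrative.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the lowercased, whitespace-split strings are equal."""
    return float(prediction.lower().split() == reference.lower().split())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: a shared token counts at most as often
    # as it occurs in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "Google Developer Expert"
print(exact_match("Google Developer Expert", reference))
print(token_f1("chứng chỉ Google Developer Expert", reference))
```

A dataset-level score would average these per-question values and report them as percentages.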

[Images: Public leaderboard, Private leaderboard]

MRCQuestionAnswering uses XLM-RoBERTa as the pre-trained language model. By default, XLM-RoBERTa splits words into sub-words. In my implementation, I recombine the sub-word representations (after they are encoded by the BERT layer) into word representations using a sum strategy.
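
The sum strategy can be illustrated with a small, dependency-free sketch. It assumes each sub-word already has an encoder output vector and that a `word_ids` list maps every sub-word to the index of its word. Both inputs and the function name `sum_subwords_to_words` are hypothetical. The actual implementation lives in the project repository and may differ.

```python
def sum_subwords_to_words(subword_vectors, word_ids):
    """Sum the vectors of sub-words that belong to the same word.

    subword_vectors: list of equal-length lists of floats, one per sub-word.
    word_ids: list of ints; word_ids[i] is the index of the word that
              sub-word i belongs to (hypothetical alignment input).
    """
    if len(subword_vectors) != len(word_ids):
        raise ValueError("need exactly one word id per sub-word vector")
    if not subword_vectors:
        return []
    dim = len(subword_vectors[0])
    # One zero vector per word, filled by accumulation below.
    word_vectors = [[0.0] * dim for _ in range(max(word_ids) + 1)]
    for vector, word_id in zip(subword_vectors, word_ids):
        for k, value in enumerate(vector):
            word_vectors[word_id][k] += value
    return word_vectors


# Toy example: three sub-words, the first two forming one word.
subword_vectors = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]]
word_ids = [0, 0, 1]
print(sum_subwords_to_words(subword_vectors, word_ids))
# [[1.5, 2.5], [3.0, 1.0]]
```

With word-level vectors, start and end positions can be predicted per word instead of per sub-word, so an answer span cannot begin or end in the middle of a word.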

Using pre-trained model

Hugging Face pipeline style (NOT using sum features strategy).

from transformers import pipeline
# model_checkpoint = "nguyenvulebinh/vi-mrc-large"
model_checkpoint = "nguyenvulebinh/vi-mrc-base"
nlp = pipeline('question-answering', model=model_checkpoint,
               tokenizer=model_checkpoint)
QA_input = {
  'question': "Bình là chuyên gia về gì ?",
  'context': "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
}
res = nlp(QA_input)
print('pipeline: {}'.format(res))
#{'score': 0.5782045125961304, 'start': 45, 'end': 68, 'answer': 'xử lý ngôn ngữ tự nhiên'}
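
The `start` and `end` values in the pipeline result are character offsets into the context, with `end` exclusive. The short check below reuses the result shown above and slices the context by hand. The NFC normalization step is an added precaution, so that each accented Vietnamese letter is a single code point and the offsets line up.

```python
import unicodedata

# Normalize to NFC so each accented letter is one code point.
context = unicodedata.normalize(
    "NFC",
    "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . "
    "Anh nhận chứng chỉ Google Developer Expert năm 2020",
)

# Result dictionary copied from the pipeline example output.
res = {"score": 0.5782045125961304, "start": 45, "end": 68,
       "answer": "xử lý ngôn ngữ tự nhiên"}

# Slice the context with the reported offsets and compare with the answer.
span = context[res["start"]:res["end"]]
print(span == unicodedata.normalize("NFC", res["answer"]))
```

This is handy when the answer needs to be highlighted in the original text rather than just printed.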

More accurate inference process (using sum features strategy)

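# infer and model.mrc_model are not part of transformers; they are assumed
# to be modules from the project repository, which must be on the Python path.
from infer import tokenize_function, data_collator, extract_answer
from model.mrc_model import MRCQuestionAnswering
from transformers import AutoTokenizer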

model_checkpoint = "nguyenvulebinh/vi-mrc-large"
#model_checkpoint = "nguyenvulebinh/vi-mrc-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = MRCQuestionAnswering.from_pretrained(model_checkpoint)

QA_input = {
  'question': "Bình được công nhận với danh hiệu gì ?",
  'context': "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
}

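# NOTE: *QA_input unpacks the dict keys ('question', 'context'), not their values;
# check the signature of tokenize_function in the project repository before running.
inputs = [tokenize_function(*QA_input)]
inputs_ids = data_collator(inputs)
outputs = model(**inputs_ids)
answer = extract_answer(inputs, outputs, tokenizer)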

print(answer)
# answer: Google Developer Expert. Score start: 0.9926977753639221, Score end: 0.9909810423851013

About

Built by Binh Nguyen. For more details, visit the project repository.