|
# <a name="introduction"></a> ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining |
|
|
|
ViHealthBERT is a strong baseline language model for Vietnamese in the healthcare domain.
|
|
|
We empirically investigate our model with different training strategies, achieving state-of-the-art (SOTA) performance on three downstream tasks: NER (COVID-19 & ViMQ), Acronym Disambiguation, and Summarization.
|
|
|
We introduce two Vietnamese datasets in the healthcare domain: an acronym dataset (acrDrAid) and an FAQ summarization dataset. Our acrDrAid dataset is annotated with 135 sets of keywords.
|
The general approaches and experimental results of ViHealthBERT can be found in our LREC 2022 poster [paper]() (to be updated soon):
|
|
|
@article{vihealthbert,
  title   = {{ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining}},
  author  = {Minh Phuc Nguyen and Vu Hoang Tran and Vu Hoang and Ta Duc Huy and Trung H. Bui and Steven Q. H. Truong},
  journal = {Proceedings of the 13th Language Resources and Evaluation Conference},
  year    = {2022}
}
|
|
|
### Installation <a name="install2"></a> |
|
- Python 3.6+ and PyTorch >= 1.6
|
- Install `transformers`: |
|
`pip install transformers==4.2.0` |
|
|
|
### Pre-trained models <a name="models2"></a> |
|
|
|
Model | #params | Arch. | Tokenizer |
|
---|---|---|--- |
|
`demdecuong/vihealthbert-base-word` | 135M | base | Word-level |
|
`demdecuong/vihealthbert-base-syllable` | 135M | base | Syllable-level |
|
|
|
### Example usage <a name="usage1"></a> |
|
|
|
```python |
|
import torch |
|
from transformers import AutoModel, AutoTokenizer |
|
|
|
vihealthbert = AutoModel.from_pretrained("demdecuong/vihealthbert-base-word") |
|
tokenizer = AutoTokenizer.from_pretrained("demdecuong/vihealthbert-base-word") |
|
|
|
# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED! |
|
line = "Tôi là sinh_viên trường đại_học Công_nghệ ." |
|
|
|
input_ids = torch.tensor([tokenizer.encode(line)]) |
|
with torch.no_grad(): |
|
    features = vihealthbert(input_ids)  # the output object contains last_hidden_state, etc.
|
``` |
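If a single fixed-size sentence embedding is needed from the token-level features above, one common (model-agnostic) approach is mean pooling over `last_hidden_state` using the attention mask. A minimal sketch, using dummy tensors in place of the real model output so it runs without downloading the checkpoint:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over the sequence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # (batch, 1), avoid division by zero
    return summed / counts

# Dummy stand-ins for features.last_hidden_state and the tokenizer's attention mask
hidden = torch.randn(1, 10, 768)                 # base models use hidden size 768
attn = torch.ones(1, 10, dtype=torch.long)       # 1 = real token, 0 = padding
sentence_emb = mean_pool(hidden, attn)
print(sentence_emb.shape)  # torch.Size([1, 768])
```

With the real model, `hidden` would be `features.last_hidden_state` and `attn` the `attention_mask` returned by the tokenizer.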
|
|
|
### Example usage for raw text <a name="usage2"></a> |
|
ViHealthBERT used the [RDRSegmenter](https://github.com/datquocnguyen/RDRsegmenter) from [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) to pre-process its pre-training data, so we highly recommend using the same word segmenter for ViHealthBERT downstream applications.
|
|
|
#### Installation |
|
```shell
|
# Install the vncorenlp python wrapper |
|
pip3 install vncorenlp |
|
|
|
# Download VnCoreNLP-1.1.1.jar & its word segmentation component (i.e. RDRSegmenter) |
|
mkdir -p vncorenlp/models/wordsegmenter |
|
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/VnCoreNLP-1.1.1.jar |
|
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/vi-vocab |
|
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/wordsegmenter.rdr |
|
mv VnCoreNLP-1.1.1.jar vncorenlp/ |
|
mv vi-vocab vncorenlp/models/wordsegmenter/ |
|
mv wordsegmenter.rdr vncorenlp/models/wordsegmenter/ |
|
``` |
|
|
|
`VnCoreNLP-1.1.1.jar` (27MB) and the `models/` folder must be placed in the same working folder.
|
|
|
#### Example usage |
|
```python
|
# See more details at: https://github.com/vncorenlp/VnCoreNLP |
|
|
|
# Load rdrsegmenter from VnCoreNLP |
|
from vncorenlp import VnCoreNLP |
|
rdrsegmenter = VnCoreNLP("/Absolute-path-to/vncorenlp/VnCoreNLP-1.1.1.jar", annotators="wseg", max_heap_size='-Xmx500m') |
|
|
|
# Input |
|
text = "Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây." |
|
|
|
# To perform word (and sentence) segmentation |
|
sentences = rdrsegmenter.tokenize(text) |
|
for sentence in sentences: |
|
print(" ".join(sentence)) |
|
``` |
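To feed raw text into the word-level model, the segmenter's output can be joined back into the word-segmented string that `tokenizer.encode` expects (as in the word-level example above). A minimal helper, shown with a pre-segmented dummy result in place of a live VnCoreNLP instance:

```python
def to_segmented_lines(sentences):
    """Join VnCoreNLP token lists into word-segmented strings for the tokenizer."""
    return [" ".join(sentence) for sentence in sentences]

# Dummy output in the shape returned by rdrsegmenter.tokenize(text)
sentences = [["Ông", "Nguyễn_Khắc_Chúc", "đang", "làm_việc", "tại",
              "Đại_học", "Quốc_gia", "Hà_Nội", "."]]
lines = to_segmented_lines(sentences)
print(lines[0])  # Ông Nguyễn_Khắc_Chúc đang làm_việc tại Đại_học Quốc_gia Hà_Nội .
```

Each resulting line can then be passed to the word-level tokenizer exactly as in the example usage section.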