# ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining

ViHealthBERT is a strong baseline language model for Vietnamese in the healthcare domain. We empirically investigate our model with different training strategies, achieving state-of-the-art (SOTA) performance on three downstream tasks: NER (COVID-19 and ViMQ), acronym disambiguation, and summarization.

We introduce two Vietnamese datasets: the acronym dataset (acrDrAid) and the FAQ summarization dataset in the healthcare domain. Our acrDrAid dataset is annotated with 135 sets of keywords.

The general approaches and experimental results of ViHealthBERT can be found in our LREC-2022 Poster [paper]() (updated soon):

    @article{vihealthbert,
    title     = {{ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining}},
    author    = {Minh Phuc Nguyen and Vu Hoang Tran and Vu Hoang and Ta Duc Huy and Trung H. Bui and Steven Q. H. Truong},
    journal   = {13th Edition of its Language Resources and Evaluation Conference},
    year      = {2022}
    }

### Installation

- Python 3.6+ and PyTorch >= 1.6
- Install `transformers`: `pip install transformers==4.2.0`

### Pre-trained models

Model | #params | Arch. | Tokenizer
---|---|---|---
`demdecuong/vihealthbert-base-word` | 135M | base | Word-level
`demdecuong/vihealthbert-base-syllable` | 135M | base | Syllable-level

### Example usage

```python
import torch
from transformers import AutoModel, AutoTokenizer

vihealthbert = AutoModel.from_pretrained("demdecuong/vihealthbert-base-word")
tokenizer = AutoTokenizer.from_pretrained("demdecuong/vihealthbert-base-word")

# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
line = "Tôi là sinh_viên trường đại_học Công_nghệ ."
input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = vihealthbert(input_ids)  # Model outputs are tuple-like; features[0] is the last hidden state
```

### Example usage for raw text

Since ViHealthBERT used the [RDRSegmenter](https://github.com/datquocnguyen/RDRsegmenter) from [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) to pre-process the pre-training data, we highly recommend using the same word segmenter for ViHealthBERT downstream applications.

#### Installation

```
# Install the vncorenlp python wrapper
pip3 install vncorenlp

# Download VnCoreNLP-1.1.1.jar & its word segmentation component (i.e. RDRSegmenter)
mkdir -p vncorenlp/models/wordsegmenter
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/VnCoreNLP-1.1.1.jar
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/vi-vocab
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/wordsegmenter.rdr
mv VnCoreNLP-1.1.1.jar vncorenlp/
mv vi-vocab vncorenlp/models/wordsegmenter/
mv wordsegmenter.rdr vncorenlp/models/wordsegmenter/
```

`VnCoreNLP-1.1.1.jar` (27MB) and the `models/` folder must be placed in the same working folder.

#### Example usage

```
# See more details at: https://github.com/vncorenlp/VnCoreNLP

# Load rdrsegmenter from VnCoreNLP
from vncorenlp import VnCoreNLP
rdrsegmenter = VnCoreNLP("/Absolute-path-to/vncorenlp/VnCoreNLP-1.1.1.jar", annotators="wseg", max_heap_size='-Xmx500m')

# Input
text = "Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."

# To perform word (and sentence) segmentation
sentences = rdrsegmenter.tokenize(text)
for sentence in sentences:
    print(" ".join(sentence))
```
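
The two examples above can be chained so that raw text is segmented first and then encoded by the word-level model. The following is a minimal sketch rather than part of the released code: the helper names `join_segmented` and `extract_features` are hypothetical, and it assumes that `segmenter.tokenize(text)` returns one list of word tokens per sentence, as `rdrsegmenter` does above, and that `tokenizer` and `model` are the objects loaded with `AutoTokenizer` and `AutoModel` in the first example.

```python
from typing import Iterable, List


def join_segmented(sentences: Iterable[Iterable[str]]) -> List[str]:
    """Turn segmenter output (one token list per sentence) into one
    whitespace-separated, word-segmented string per sentence."""
    return [" ".join(tokens) for tokens in sentences]


def extract_features(text, segmenter, tokenizer, model):
    """Segment raw text, then encode each sentence separately.

    Hypothetical helper: `segmenter` must expose `tokenize(text)`;
    `tokenizer` and `model` are assumed to be Hugging Face objects.
    """
    import torch  # imported lazily so join_segmented works without PyTorch

    features = []
    for line in join_segmented(segmenter.tokenize(text)):
        input_ids = torch.tensor([tokenizer.encode(line)])
        with torch.no_grad():
            # Index 0 is the last-layer hidden states for this sentence.
            features.append(model(input_ids)[0])
    return features
```

With the objects from the earlier examples, a call would look like `extract_features(text, rdrsegmenter, tokenizer, vihealthbert)`, which yields one tensor per segmented sentence.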