PhoBERT: Pre-trained language models for Vietnamese
Pre-trained PhoBERT models are the state-of-the-art language models for Vietnamese (Pho, i.e. "Phở", is a popular food in Vietnam):
- Two PhoBERT versions of "base" and "large" are the first public large-scale monolingual language models pre-trained for Vietnamese. PhoBERT pre-training approach is based on RoBERTa which optimizes the BERT pre-training procedure for more robust performance.
- PhoBERT outperforms previous monolingual and multilingual approaches, obtaining new state-of-the-art performances on four downstream Vietnamese NLP tasks of Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference.
The general architecture and experimental results of PhoBERT can be found in our EMNLP-2020 Findings paper:
@article{phobert,
title = {{PhoBERT: Pre-trained language models for Vietnamese}},
author = {Dat Quoc Nguyen and Anh Tuan Nguyen},
journal = {Findings of EMNLP},
year = {2020}
}
Please CITE our paper when PhoBERT is used to help produce published results or is incorporated into other software.
For further information or requests, please go to PhoBERT's homepage!
Installation
- Python 3.6+, and PyTorch 1.1.0+ (or TensorFlow 2.0+)
- Install
transformers
: -git clone https://github.com/huggingface/transformers.git
-cd transformers
-pip3 install --upgrade .
Pre-trained models
Model | #params | Arch. | Pre-training data |
---|---|---|---|
vinai/phobert-base |
135M | base | 20GB of texts |
vinai/phobert-large |
370M | large | 20GB of texts |
Example usage
import torch
from transformers import AutoModel, AutoTokenizer
phobert = AutoModel.from_pretrained("vinai/phobert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
line = "Tôi là sinh_viên trường đại_học Công_nghệ ."
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
features = phobert(input_ids) # Models outputs are now tuples
## With TensorFlow 2.0+:
# from transformers import TFAutoModel
# phobert = TFAutoModel.from_pretrained("vinai/phobert-base")