---
license: mit
---
# Vietnamese Legal Text BERT

#### Table of contents
1. [Introduction](#introduction)
2. [Using Vietnamese Legal Text BERT](#transformers)
    - [Installation](#install2)
    - [Pre-trained models](#models2)
    - [Example usage](#usage2)

## <a name="introduction"></a> Introduction

`hmthanh/VietnamLegalText-SBERT` is a BERT model for Vietnamese legal text.

Pre-trained PhoBERT models are state-of-the-art language models for Vietnamese ([Pho](https://en.wikipedia.org/wiki/Pho), i.e. "Phở", is a popular food in Vietnam).

## <a name="transformers"></a> Using Vietnamese Legal Text BERT with `transformers`

### Installation <a name="install2"></a>
- Install `transformers` with pip: `pip install transformers`
- Install `tokenizers` with pip: `pip install tokenizers`
- The example below also imports PyTorch: `pip install torch`

### Pre-trained models <a name="models2"></a>

Model | #params | Arch. | Max length | Pre-training data
---|---|---|---|---
`hmthanh/VietnamLegalText-SBERT` | 135M | base | 256 | 20GB of text

### Example usage <a name="usage2"></a>

```python
import torch
from transformers import AutoModel, AutoTokenizer

phobert = AutoModel.from_pretrained("hmthanh/VietnamLegalText-SBERT")
tokenizer = AutoTokenizer.from_pretrained("hmthanh/VietnamLegalText-SBERT")

# The example sentence is already word-segmented: syllables of one word are joined with "_".
sentence = 'Chúng_tôi là những nghiên_cứu_viên .'

# encode() adds the special tokens; wrapping the ids in a list gives a batch of size 1.
input_ids = torch.tensor([tokenizer.encode(sentence)])

with torch.no_grad():
    # Depending on the transformers version, the result is a tuple or a ModelOutput;
    # in both cases features[0] holds the per-token hidden states.
    features = phobert(input_ids)
```
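
The model table lists a maximum length of 256, and inputs can be longer than that. Below is a minimal sketch of encoding several sentences as one padded, truncated batch. It assumes the limit is counted in tokens and that the tokenizer accepts the usual `transformers` call arguments (`padding`, `truncation`, `max_length`, `return_tensors`). The helper name `encode_batch` and the constant `MAX_LENGTH` are hypothetical and not part of this model's code.

```python
from typing import Any, Callable, List

MAX_LENGTH = 256  # "Max length" from the model table; assumed to be counted in tokens


def encode_batch(
    tokenizer: Callable[..., Any],
    sentences: List[str],
    max_length: int = MAX_LENGTH,
) -> Any:
    """Tokenize word-segmented sentences into one padded, truncated batch.

    `tokenizer` is expected to be a Hugging Face tokenizer, for example the one
    loaded with AutoTokenizer.from_pretrained("hmthanh/VietnamLegalText-SBERT").
    """
    return tokenizer(
        sentences,
        padding=True,          # pad shorter sentences to the longest one in the batch
        truncation=True,       # cut sentences that exceed max_length
        max_length=max_length,
        return_tensors="pt",   # return PyTorch tensors
    )


# Usage with the objects from the example above (this downloads the model):
#   batch = encode_batch(tokenizer, ['Chúng_tôi là những nghiên_cứu_viên .'])
#   with torch.no_grad():
#       features = phobert(**batch)
```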
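
The model name ends in `SBERT`, which suggests it is meant for sentence embeddings, while the example above returns one hidden state per token. A minimal sketch of masked mean pooling, a common way to reduce token states to one vector per sentence, follows. This card does not say which pooling the model was trained with, so mean pooling is an assumption here; the helper `mean_pool` and the dummy tensors are illustrative only.

```python
import torch


def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over the sequence, ignoring padded positions.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)   # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)        # avoid division by zero
    return summed / counts


# Dummy data standing in for model output: 2 sentences, 3 positions, hidden size 2.
# The last position of the first sentence is padding and must not affect the mean.
dummy_embeddings = torch.tensor(
    [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
     [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]]
)
dummy_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])
sentence_vectors = mean_pool(dummy_embeddings, dummy_mask)
```

For the single unpadded sentence in the example above, `mean_pool(features[0], torch.ones_like(input_ids))` would give one vector of the model's hidden size.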