Migrate model card from transformers-repo
Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/NlpHUST/vibert4news-base-cased/README.md
README.md
ADDED
@@ -0,0 +1,55 @@
---
language: vi
---
# BERT for Vietnamese, trained on a news dataset of more than 20 GB

Applied to the sentiment analysis task using [AIViVN's comments dataset](https://www.aivivn.com/contests/6).

The model achieved 0.90268 on the public leaderboard (the winner's score is 0.90087).
Bert4news is also used in a Vietnamese toolkit (segmentation and Named Entity Recognition), [ViNLPtoolkit](https://github.com/bino282/ViNLP).

***New: Mar 11, 2020***

**[BERT](https://github.com/google-research/bert)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).

We use word sentencepiece with basic BERT tokenization, and the same config as BERT base with lowercase = False.
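
The card does not say why lowercase = False was chosen. One likely consideration is that BERT's original uncased preprocessing lowercases the text and also strips combining accent marks, which removes Vietnamese tone and vowel diacritics. The sketch below imitates that normalization with the standard library only; `uncased_normalize` is a hypothetical helper written for illustration, not part of this model or of `transformers`.

```python
import unicodedata


def uncased_normalize(text: str) -> str:
    """Lowercase and drop combining marks, imitating BERT's uncased preprocessing."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Category "Mn" covers the combining accents produced by NFD decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


line = "Tôi là sinh viên trường Bách Khoa Hà Nội ."
print(uncased_normalize(line))  # toi la sinh vien truong bach khoa ha noi .

# Six different Vietnamese words collapse into one string once accents are stripped.
words = ["ma", "má", "mà", "mả", "mã", "mạ"]
print({uncased_normalize(w) for w in words})  # {'ma'}
```

With a cased setup, these distinctions are kept in the input text instead of being normalized away.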

You can download the trained model:
- [tensorflow](https://drive.google.com/file/d/1X-sRDYf7moS_h61J3L79NkMVGHP-P-k5/view?usp=sharing)
- [pytorch](https://drive.google.com/file/d/11aFSTpYIurn-oI2XpAmcCTccB_AonMOu/view?usp=sharing)

Use with huggingface/transformers:
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("NlpHUST/vibert4news-base-cased")
bert_model = AutoModel.from_pretrained("NlpHUST/vibert4news-base-cased")

line = "Tôi là sinh viên trường Bách Khoa Hà Nội ."
# Encode the sentence, adding the special tokens the model expects.
input_id = tokenizer.encode(line, add_special_tokens=True)
# Attention mask: 1 for real tokens, 0 for padding (assumes padding uses token id 0).
att_mask = [int(token_id > 0) for token_id in input_id]
input_ids = torch.tensor([input_id])
att_masks = torch.tensor([att_mask])
with torch.no_grad():
    features = bert_model(input_ids, attention_mask=att_masks)

print(features)
```
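
In the snippet above the attention mask is all ones, because a single sentence needs no padding. The mask starts to matter when several sentences of different lengths are batched together. A minimal sketch of that step, assuming padding uses token id 0 (as the `token_id > 0` check implies); `pad_batch` is a hypothetical helper, and the token ids are made up rather than real tokenizer output:

```python
from typing import List, Tuple

PAD_ID = 0  # assumption: padding uses token id 0, matching the `token_id > 0` check


def pad_batch(
    sequences: List[List[int]], pad_id: int = PAD_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    att_masks = [[int(token_id != pad_id) for token_id in row] for row in input_ids]
    return input_ids, att_masks


# Hypothetical token ids, for illustration only.
batch = [[2, 15, 8, 3], [2, 42, 3]]
input_ids, att_masks = pad_batch(batch)
print(input_ids)  # [[2, 15, 8, 3], [2, 42, 3, 0]]
print(att_masks)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The two lists can then be wrapped in `torch.tensor(...)` and passed to the model as `input_ids` and `attention_mask`. Calling the tokenizer directly on a list of sentences with `padding=True` and `return_tensors="pt"` should produce equivalent tensors without manual padding.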

Run training with base config:

```bash
python train_pytorch.py \
  --model_path=bert4news.pytorch \
  --max_len=200 \
  --batch_size=16 \
  --epochs=6 \
  --lr=2e-5
```
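
`train_pytorch.py` itself is not included in this card. As a rough, hypothetical sketch of how a script could accept the flags shown in the command above, the following uses `argparse`; the parser, the defaults, and the help strings are assumptions for illustration, not the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the training command."""
    parser = argparse.ArgumentParser(description="Fine-tuning flags (illustrative sketch)")
    parser.add_argument("--model_path", type=str, required=True, help="path to the pretrained weights")
    parser.add_argument("--max_len", type=int, default=200, help="maximum sequence length in tokens")
    parser.add_argument("--batch_size", type=int, default=16, help="examples per training step")
    parser.add_argument("--epochs", type=int, default=6, help="number of passes over the training data")
    parser.add_argument("--lr", type=float, default=2e-5, help="learning rate")
    return parser


# Parse the same flags as the command above, passed as a list instead of sys.argv.
args = build_parser().parse_args(
    ["--model_path=bert4news.pytorch", "--max_len=200", "--batch_size=16", "--epochs=6", "--lr=2e-5"]
)
print(args.model_path, args.max_len, args.batch_size, args.epochs, args.lr)
```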

### Contact information
For personal communication related to this project, please contact Nha Nguyen Van (nha282@gmail.com).