julien-c HF staff committed on
Commit 27e7bc5 · 1 Parent(s): 9501b17

Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/NlpHUST/vibert4news-base-cased/README.md

Files changed (1)
  1. README.md +55 -0
README.md ADDED
@@ -0,0 +1,55 @@
+ ---
+ language: vi
+ ---
+ # BERT for Vietnamese, trained on more than 20 GB of news data
+
+ The model was applied to the sentiment analysis task using [AIViVN's comments dataset](https://www.aivivn.com/contests/6).
+
+ The model achieved 0.90268 on the public leaderboard (the winner's score was 0.90087).
+ Bert4news is also used in a Vietnamese toolkit (segmentation and named entity recognition), [ViNLPtoolkit](https://github.com/bino282/ViNLP).
+
+ ***************New Mar 11, 2020 ***************
+
+ **[BERT](https://github.com/google-research/bert)** (from Google Research) was released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
+
+ We use word SentencePiece with basic BERT tokenization and the same configuration as BERT base, with `lowercase = False`.
+
+ You can download the trained model:
+ - [TensorFlow](https://drive.google.com/file/d/1X-sRDYf7moS_h61J3L79NkMVGHP-P-k5/view?usp=sharing)
+ - [PyTorch](https://drive.google.com/file/d/11aFSTpYIurn-oI2XpAmcCTccB_AonMOu/view?usp=sharing)
+
+ Use with huggingface/transformers:
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+
+ # Load the tokenizer and the pretrained encoder from the Hugging Face Hub.
+ tokenizer = AutoTokenizer.from_pretrained("NlpHUST/vibert4news-base-cased")
+ bert_model = AutoModel.from_pretrained("NlpHUST/vibert4news-base-cased")
+
+ line = "Tôi là sinh viên trường Bách Khoa Hà Nội ."
+ # Encode the sentence, adding the special tokens the model expects.
+ input_id = tokenizer.encode(line, add_special_tokens=True)
+ # Mark every token with a positive id as real (padding is assumed to be id 0).
+ att_mask = [int(token_id > 0) for token_id in input_id]
+ input_ids = torch.tensor([input_id])
+ att_masks = torch.tensor([att_mask])
+ # Run a forward pass without tracking gradients.
+ with torch.no_grad():
+     features = bert_model(input_ids, attention_mask=att_masks)
+
+ print(features)
+ ```
+
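The snippet above encodes a single sentence, so its attention mask is all ones. For a batch of sentences with different lengths, the same `token_id > 0` rule only works after padding. The following is a minimal sketch of that step, assuming the padding token id is 0 (which the mask rule above implies but the model card does not state). `pad_batch` and the token ids are hypothetical and only stand in for real `tokenizer.encode(...)` output.

```python
from typing import List, Tuple

PAD_ID = 0  # assumption: padding token id is 0, as the `token_id > 0` check implies


def pad_batch(
    batch: List[List[int]], pad_id: int = PAD_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 for real tokens, 0 for padding positions.
    att_masks = [[int(tok != pad_id) for tok in ids] for ids in input_ids]
    return input_ids, att_masks


# Hypothetical token ids, standing in for tokenizer.encode(...) output.
ids, masks = pad_batch([[2, 15, 27, 3], [2, 8, 3]])
print(ids)    # [[2, 15, 27, 3], [2, 8, 3, 0]]
print(masks)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The two lists can then be wrapped in `torch.tensor(...)` and passed to the model as `input_ids` and `attention_mask`, exactly as in the single-sentence case.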
+ Run training with the base config:
+
+ ```bash
+ python train_pytorch.py \
+   --model_path=bert4news.pytorch \
+   --max_len=200 \
+   --batch_size=16 \
+   --epochs=6 \
+   --lr=2e-5
+ ```
+
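The contents of `train_pytorch.py` are not shown in this model card. As a rough illustration only, a script accepting the flags above could declare them with `argparse` as sketched below. The parser, its defaults, and the help texts are assumptions made for this example, not the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: mirrors the flags in the command above, not the real script.
    parser = argparse.ArgumentParser(description="Fine-tuning options (illustrative)")
    parser.add_argument("--model_path", type=str, required=True,
                        help="path to the pretrained weights")
    parser.add_argument("--max_len", type=int, default=200,
                        help="assumed meaning: maximum sequence length in tokens")
    parser.add_argument("--batch_size", type=int, default=16,
                        help="examples per training step")
    parser.add_argument("--epochs", type=int, default=6,
                        help="passes over the training set")
    parser.add_argument("--lr", type=float, default=2e-5,
                        help="learning rate")
    return parser


# Parse the same arguments as the command line shown above.
args = build_parser().parse_args([
    "--model_path=bert4news.pytorch",
    "--max_len=200",
    "--batch_size=16",
    "--epochs=6",
    "--lr=2e-5",
])
print(args)
```

Typed arguments mean that values such as `2e-5` arrive as a `float` rather than a string, so they can be handed to an optimizer without further conversion.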
+ ### Contact information
+ For personal communication related to this project, please contact Nha Nguyen Van (nha282@gmail.com).
+