julien-c HF staff committed on
Commit
3cb884d
1 Parent(s): d5443e3

Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/voidful/albert_chinese_base/README.md

Files changed (1)
  1. README.md +43 -0
README.md ADDED
@@ -0,0 +1,43 @@
+ ---
+ language: zh
+ ---
+
+ # albert_chinese_base
+
+ This is an albert_chinese_base model from [Google's GitHub repository](https://github.com/google-research/ALBERT),
+ converted with Hugging Face's [conversion script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py).
+
+ ## Attention
+
+ The albert_chinese_base model does not use sentencepiece, so AlbertTokenizer cannot load its vocabulary.
+ Use BertTokenizer instead of AlbertTokenizer.
+ A MaskedLM prediction can be used to check that this pairing works.
+
+ ## Justify (verifying that it works)
+ [Colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)
+ ```python
+ import torch
+ from torch.nn.functional import softmax
+ from transformers import AlbertForMaskedLM, BertTokenizer
+
+ pretrained = 'voidful/albert_chinese_base'
+ tokenizer = BertTokenizer.from_pretrained(pretrained)  # not AlbertTokenizer: no sentencepiece model
+ model = AlbertForMaskedLM.from_pretrained(pretrained)
+
+ inputtext = "今天[MASK]情很好"
+
+ # Encode once, then locate [MASK] by its token id instead of hard-coding 103
+ token_ids = tokenizer.encode(inputtext, add_special_tokens=True)
+ maskpos = token_ids.index(tokenizer.mask_token_id)
+
+ input_ids = torch.tensor(token_ids).unsqueeze(0)  # Batch size 1
+ with torch.no_grad():
+     outputs = model(input_ids)  # no labels are needed for prediction
+ prediction_scores = outputs[0]  # logits, shape (batch, sequence, vocab)
+ logit_prob = softmax(prediction_scores[0, maskpos], dim=-1).tolist()
+ predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
+ predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
+ print(predicted_token, logit_prob[predicted_index])
+ ```
+ Result: `感 0.36333346366882324`
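
The number printed next to the predicted token is the softmax probability of that token's score at the masked position. As a rough illustration of that step only, here is a minimal pure-Python sketch that turns raw scores into probabilities and ranks the candidates. The names `top_k`, `toy_tokens`, and `toy_scores` are hypothetical, and the scores are invented for illustration; they are not output from voidful/albert_chinese_base.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores (logits)."""
    peak = max(scores)  # subtract the maximum so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, tokens, k=3):
    """Return the k most probable (token, probability) pairs, best first."""
    probs = softmax(scores)
    ranked = sorted(zip(tokens, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical scores for a four-token toy vocabulary (not real model output).
toy_tokens = ["感", "心", "天", "好"]
toy_scores = [3.0, 2.0, 0.5, 1.0]

for token, prob in top_k(toy_scores, toy_tokens, k=2):
    print(token, round(prob, 4))
```

With the real model, the same ranking could presumably be obtained by applying `torch.topk` to the softmax output at the masked position, which shows runner-up candidates as well as the single best token.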