voidful committed on
Commit 86cad98
1 Parent(s): a28d99d

update auto tokenizer support

Files changed (4)
  1. README.md +13 -9
  2. config.json +3 -0
  3. special_tokens_map.json +1 -0
  4. tokenizer_config.json +1 -0
README.md CHANGED
@@ -1,31 +1,35 @@
  ---
  language: zh
  ---
  
  # albert_chinese_xlarge
  
  This is an albert_chinese_xlarge model from [Google's github](https://github.com/google-research/ALBERT),
  converted by huggingface's [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py).
  
- ## Attention (注意)
  
- Since sentencepiece is not used in the albert_chinese_xlarge model,
  you have to call BertTokenizer instead of AlbertTokenizer !!!
  We can eval it using an example on MaskedLM.
  
- Since the albert_chinese_xlarge model does not use sentencepiece,
  loading it with AlbertTokenizer fails to read the vocabulary, so BertTokenizer must be used instead !!!
  We can run a MaskedLM prediction to check that this approach is correct.
  
  ## Justify (驗證有效性)
- [colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)
  ```python
- from transformers import *
  import torch
  from torch.nn.functional import softmax
  
  pretrained = 'voidful/albert_chinese_xlarge'
- tokenizer = BertTokenizer.from_pretrained(pretrained)
  model = AlbertForMaskedLM.from_pretrained(pretrained)
  
  inputtext = "今天[MASK]情很好"
@@ -33,11 +37,11 @@ inputtext = "今天[MASK]情很好"
  maskpos = tokenizer.encode(inputtext, add_special_tokens=True).index(103)
  
  input_ids = torch.tensor(tokenizer.encode(inputtext, add_special_tokens=True)).unsqueeze(0)  # Batch size 1
- outputs = model(input_ids, masked_lm_labels=input_ids)
  loss, prediction_scores = outputs[:2]
- logit_prob = softmax(prediction_scores[0, maskpos]).data.tolist()
  predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
  predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
- print(predicted_token, logit_prob[predicted_index])
  ```
  Result: `心 0.9942440390586853`
 
  ---
  language: zh
+ pipeline_tag: fill-mask
+ widget:
+ - text: "今天[MASK]情很好"
  ---
  
+ 
  # albert_chinese_xlarge
  
  This is an albert_chinese_xlarge model from [Google's github](https://github.com/google-research/ALBERT),
  converted by huggingface's [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py).
  
+ ## Notice
+ *Supports AutoTokenizer*
  
+ Since sentencepiece is not used in the albert_chinese_xlarge model,
  you have to call BertTokenizer (or AutoTokenizer, which now resolves to it) instead of AlbertTokenizer !!!
  We can eval it using an example on MaskedLM.
  
+ Since the albert_chinese_xlarge model does not use sentencepiece,
  loading it with AlbertTokenizer fails to read the vocabulary, so BertTokenizer must be used instead !!!
  We can run a MaskedLM prediction to check that this approach is correct.
  
  ## Justify (驗證有效性)
  
  ```python
+ from transformers import AutoTokenizer, AlbertForMaskedLM
  import torch
  from torch.nn.functional import softmax
  
  pretrained = 'voidful/albert_chinese_xlarge'
+ tokenizer = AutoTokenizer.from_pretrained(pretrained)
  model = AlbertForMaskedLM.from_pretrained(pretrained)
  
  inputtext = "今天[MASK]情很好"
  
  maskpos = tokenizer.encode(inputtext, add_special_tokens=True).index(103)
  
  input_ids = torch.tensor(tokenizer.encode(inputtext, add_special_tokens=True)).unsqueeze(0)  # Batch size 1
+ outputs = model(input_ids, labels=input_ids)
  loss, prediction_scores = outputs[:2]
+ logit_prob = softmax(prediction_scores[0, maskpos], dim=-1).data.tolist()
  predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
  predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
+ print(predicted_token, logit_prob[predicted_index])
  ```
  Result: `心 0.9942440390586853`
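The post-processing in the snippet above (find the `[MASK]` position via token id 103, softmax the logits at that position, take the argmax) can be traced without downloading the model. The token ids and logits below are illustrative stand-ins, not real vocabulary ids or model output; the only assumption carried over from the README is that 103 is the `[MASK]` id, matching the `.index(103)` call:

```python
import math

# Toy encoding of "今天[MASK]情很好": [CLS]=101 ... [MASK]=103 ... [SEP]=102.
# All ids here are made up for illustration, not taken from the real vocab.
input_ids = [101, 791, 1921, 103, 2658, 2523, 1962, 102]
maskpos = input_ids.index(103)  # same lookup as in the README snippet

def softmax(logits):
    # Numerically stable softmax: shift by the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits over a tiny 4-word vocabulary at the mask position.
mask_logits = [1.0, 5.0, 2.0, 0.5]
logit_prob = softmax(mask_logits)
predicted_index = max(range(len(logit_prob)), key=logit_prob.__getitem__)

print(maskpos, predicted_index, logit_prob[predicted_index])
```

This mirrors the README's `softmax(...).tolist()` plus `torch.argmax` steps: the probabilities sum to 1, and the predicted index is simply the position of the largest logit.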
config.json CHANGED
@@ -1,4 +1,7 @@
  {
+ "architectures": [
+ "AlbertForMaskedLM"
+ ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
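The added `"architectures"` field is what lets the Auto classes infer a model class from config.json alone. A minimal sketch of that lookup, run on a fragment reconstructed from the diff (not the complete config file):

```python
import json

# Fragment of config.json after this commit; the remaining keys are omitted.
config = json.loads("""
{
  "architectures": ["AlbertForMaskedLM"],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1
}
""")

# Auto classes read the first entry to decide which model class to instantiate.
architecture = config["architectures"][0]
print(architecture)
```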
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "tokenizer_file": null, "name_or_path": "voidful/albert_chinese_xlarge", "tokenizer_class": "BertTokenizer"}
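The decisive key in the new tokenizer_config.json is `"tokenizer_class": "BertTokenizer"` — this is how `AutoTokenizer.from_pretrained` ends up returning a BertTokenizer for this repository even though the model itself is an ALBERT. A sketch of that lookup on a subset of the JSON above:

```python
import json

# Subset of the tokenizer_config.json added in this commit.
tokenizer_config = json.loads(
    '{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", '
    '"pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", '
    '"name_or_path": "voidful/albert_chinese_xlarge", '
    '"tokenizer_class": "BertTokenizer"}'
)

# AutoTokenizer consults this field before falling back to the model config,
# so callers no longer need to import BertTokenizer explicitly.
print(tokenizer_config["tokenizer_class"])
```

Without this file, AutoTokenizer would have guessed AlbertTokenizer from the model type and failed to load the vocabulary, which is exactly the pitfall the README warns about.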