voidful committed on
Commit 0a8fa99
1 Parent(s): 3cb884d

update auto tokenizer support

Files changed (4)
  1. README.md +12 -6
  2. config.json +3 -0
  3. special_tokens_map.json +1 -0
  4. tokenizer_config.json +1 -0
README.md CHANGED
@@ -1,5 +1,8 @@
 ---
 language: zh
+pipeline_tag: fill-mask
+widget:
+- text: "今天[MASK]情很好"
 ---
 
 # albert_chinese_base
@@ -7,6 +10,9 @@ language: zh
 This a albert_chinese_base model from [Google's github](https://github.com/google-research/ALBERT)
 converted by huggingface's [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py)
 
+## Update
+Support AutoTokenizer
+
 ## Attention (注意)
 
 Since sentencepiece is not used in albert_chinese_base model
@@ -20,12 +26,12 @@ we can eval it using an example on MaskedLM
 ## Justify (驗證有效性)
 [colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)
 ```python
-from transformers import *
+from transformers import AutoTokenizer, AlbertForMaskedLM
 import torch
 from torch.nn.functional import softmax
 
-pretrained = 'voidful/albert_chinese_base'
-tokenizer = BertTokenizer.from_pretrained(pretrained)
+pretrained = './albert_chinese_base'
+tokenizer = AutoTokenizer.from_pretrained(pretrained)
 model = AlbertForMaskedLM.from_pretrained(pretrained)
 
 inputtext = "今天[MASK]情很好"
@@ -33,11 +39,11 @@ inputtext = "今天[MASK]情很好"
 maskpos = tokenizer.encode(inputtext, add_special_tokens=True).index(103)
 
 input_ids = torch.tensor(tokenizer.encode(inputtext, add_special_tokens=True)).unsqueeze(0) # Batch size 1
-outputs = model(input_ids, masked_lm_labels=input_ids)
+outputs = model(input_ids, labels=input_ids)
 loss, prediction_scores = outputs[:2]
-logit_prob = softmax(prediction_scores[0, maskpos]).data.tolist()
+logit_prob = softmax(prediction_scores[0, maskpos],dim=-1).data.tolist()
 predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
 predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
-print(predicted_token,logit_prob[predicted_index])
+print(predicted_token, logit_prob[predicted_index])
 ```
 Result: `感 0.36333346366882324`
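
With `pipeline_tag: fill-mask` and a widget example added to the README front matter, the checkpoint is advertised for the fill-mask task. A minimal sketch of that usage, assuming the hub id `voidful/albert_chinese_base` and a recent `transformers` release; the expected top prediction matches the `感 0.363...` result quoted above:

```python
from transformers import AlbertForMaskedLM, AutoTokenizer, pipeline

# Sketch only: load tokenizer and model explicitly, then wrap them in the
# fill-mask pipeline that the new pipeline_tag metadata points to.
pretrained = "voidful/albert_chinese_base"
tokenizer = AutoTokenizer.from_pretrained(pretrained)
model = AlbertForMaskedLM.from_pretrained(pretrained)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for candidate in fill_mask("今天[MASK]情很好"):
    print(candidate["token_str"], candidate["score"])  # top candidate should be 感
```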
config.json CHANGED
@@ -1,4 +1,7 @@
 {
+  "architectures": [
+    "AlbertForMaskedLM"
+  ],
   "attention_probs_dropout_prob": 0,
   "bos_token_id": 2,
   "classifier_dropout_prob": 0.1,
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+{"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "tokenizer_file": null, "name_or_path": "voidful/albert_chinese_base", "tokenizer_class": "BertTokenizer"}