voidful's picture
update auto tokenizer support
d99b523
|
raw
history blame
1.76 kB
---
language: zh
pipeline_tag: fill-mask
widget:
- text: "今天[MASK]情很好"
---
# albert_chinese_small
This a albert_chinese_small model from [brightmart/albert_zh project](https://github.com/brightmart/albert_zh), albert_small_google_zh model
converted by huggingface's [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py)
## Notice
*Support AutoTokenizer*
Since sentencepiece is not used in albert_chinese_base model
you have to call BertTokenizer instead of AlbertTokenizer !!!
we can eval it using an example on MaskedLM
由於 albert_chinese_base 模型沒有用 sentencepiece
用AlbertTokenizer會載不進詞表,因此需要改用BertTokenizer !!!
我們可以跑MaskedLM預測來驗證這個做法是否正確
## Justify (驗證有效性)
```python
from transformers import AutoTokenizer, AlbertForMaskedLM
import torch
from torch.nn.functional import softmax
pretrained = 'voidful/albert_chinese_small'
tokenizer = AutoTokenizer.from_pretrained(pretrained)
model = AlbertForMaskedLM.from_pretrained(pretrained)
inputtext = "今天[MASK]情很好"
maskpos = tokenizer.encode(inputtext, add_special_tokens=True).index(103)
input_ids = torch.tensor(tokenizer.encode(inputtext, add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids, labels=input_ids)
loss, prediction_scores = outputs[:2]
logit_prob = softmax(prediction_scores[0, maskpos],dim=-1).data.tolist()
predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token, logit_prob[predicted_index])
```
Result: `感 0.6390823125839233`