
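Extended Mistral vocabulary. The added tokens are limited to the intersection with the Ministry of Education's list of 4,808 commonly used characters.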

  • Removed dummy tokens
  • Added <|func_start|> and <|func_end|>
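The intersection rule described above can be sketched in a few lines. This is only an illustration, not the script used to build this tokenizer: the function name `select_new_characters`, the candidate source, and the toy character sets are all hypothetical, and it assumes that characters already in the base vocabulary are skipped.

```python
def select_new_characters(candidates, common_chars, base_vocab):
    """Keep candidates that are on the common-character list and not yet in the base vocab.

    Order of first appearance is preserved and duplicates are dropped.
    """
    common = set(common_chars)
    seen = set()
    selected = []
    for ch in candidates:
        if ch in common and ch not in base_vocab and ch not in seen:
            seen.add(ch)
            selected.append(ch)
    return selected


# Toy data only (hypothetical): these are NOT the real lists.
candidates = ['氣', '天', '龘', '灣', '氣']   # characters proposed for addition
common_chars = ['天', '氣', '灣', '好']       # stand-in for the 4,808-character list
base_vocab = {'天', '好'}                     # stand-in for characters already covered

new_chars = select_new_characters(candidates, common_chars, base_vocab)
print(new_chars)  # ['氣', '灣']
```

In this toy run, '龘' is dropped because it is not on the common-character list, and '天' is dropped because the stand-in base vocabulary already covers it.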
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    'ocisd4/mistral_tokenizer_ext',
    pad_token='<unk>',     # reuse <unk> as the padding token
    add_bos_token=True,    # prepend <s> when encoding
    add_eos_token=False    # do not append </s>
)

print('vocab size:', tokenizer.vocab_size)
# vocab size: 35686

print(tokenizer.tokenize('今天天氣真好!'))
# ['▁', '今', '天', '天', '氣', '真', '好', '!']

print(tokenizer.encode('今天天氣真好!'))
# [1, 28705, 30316, 29354, 29354, 32004, 29974, 29530, 29267]

print(tokenizer.decode(tokenizer.encode('今天天氣真好!')))
# <s> 今天天氣真好!
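
To make the relationship between the `tokenize` and `encode` outputs above easier to see, the following sketch pairs each token with its id. It uses only the values printed above, so it runs without downloading the tokenizer. The constant `BASE_VOCAB_SIZE = 32000` is an assumption (the vocabulary size of the original Mistral tokenizer, with new tokens appended after it), not something stated in this card.

```python
# Values copied from the printed output above.
tokens = ['▁', '今', '天', '天', '氣', '真', '好', '!']
ids = [1, 28705, 30316, 29354, 29354, 32004, 29974, 29530, 29267]

BOS_ID = 1               # id of <s>, prepended because add_bos_token=True
BASE_VOCAB_SIZE = 32000  # assumption: size of the original Mistral vocabulary

# encode() returns one more id than tokenize() returns tokens: the leading <s>.
assert ids[0] == BOS_ID
assert len(ids) == len(tokens) + 1

# Align each token with its id, skipping the BOS id.
pairs = list(zip(tokens, ids[1:]))

# Under the assumption above, ids at or beyond BASE_VOCAB_SIZE are added tokens.
added = [tok for tok, tok_id in pairs if tok_id >= BASE_VOCAB_SIZE]

for tok, tok_id in pairs:
    origin = 'added' if tok_id >= BASE_VOCAB_SIZE else 'base'
    print(f'{tok}\t{tok_id}\t{origin}')
```

Under that assumption, only '氣' (id 32004) in this sentence comes from the extended part of the vocabulary; the other characters already have ids below 32000.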