tokenization mismatch

#9
by jq460494839 - opened

I'm not sure if something is wrong with the tokenizer: I keep getting "WARNING: tokenization mismatch: 501 vs. 503. #turn = 1. (ignored)" when fine-tuning the model with FastChat.

python=3.9
transformers=4.34.1
fastchat=current master branch

command:
torchrun --nproc_per_node=2 --master_port=20001 fastchat/train/train.py \
    --model_name_or_path /opt/jq/big_models/Yi-6B/ \
    --data_path /opt/jq/nlp_data/evol-instruct-chinese-subset.json \
    --fp16 True \
    --output_dir output_Yi \
    --num_train_epochs 2 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 200 \
    --learning_rate 5e-5 \
    --weight_decay 0. \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --model_max_length 512 \
    --gradient_checkpointing True \
    --lazy_preprocess True

log:
WARNING: tokenization mismatch: 501 vs. 503. #turn = 1. (ignored)
WARNING: tokenization mismatch: 232 vs. 234. #turn = 1. (ignored)
WARNING: tokenization mismatch: 329 vs. 331. #turn = 1. (ignored)
{'loss': 0.0, 'learning_rate': 4.9918228349595606e-05, 'epoch': 0.05}
3%|▌ | 190/7378 [09:17<5:51:33, 2.93s/it]WARNING: tokenization mismatch: 288 vs. 290. #turn = 1. (ignored)
WARNING: tokenization mismatch: 339 vs. 341. #turn = 1. (ignored)
WARNING: tokenization mismatch: 300 vs. 302. #turn = 1. (ignored)
WARNING: tokenization mismatch: 274 vs. 276. #turn = 1. (ignored)
WARNING: tokenization mismatch: 171 vs. 173. #turn = 1. (ignored)
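
A minimal diagnostic sketch, for anyone who wants to see where the constant off-by-two comes from: FastChat's preprocess() sums the lengths of per-turn encodings and warns when that sum disagrees with the length of the full encoding, so comparing the two directly usually shows which tokens (e.g. BOS/EOS or the separator) are counted differently. The prompt string and the way it is split below are only illustrative, not the exact template FastChat builds; the model path is the one from the command above.

# Diagnostic sketch (assumptions noted above); not part of FastChat itself.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/opt/jq/big_models/Yi-6B/", trust_remote_code=True, use_fast=False
)

prompt = "USER: hello ASSISTANT: hi there</s>"        # illustrative single turn
parts = ["USER: hello ASSISTANT:", " hi there</s>"]   # how a turn might be split

full_len = len(tokenizer(prompt).input_ids)
part_len = sum(len(tokenizer(p, add_special_tokens=False).input_ids) for p in parts)

print("bos/eos tokens:", tokenizer.bos_token, tokenizer.eos_token)
print("full encoding:", full_len, "| sum of parts:", part_len)
# A constant gap between the two numbers (showing up in training as e.g.
# 501 vs. 503) points at special-token or separator handling rather than
# at the data itself.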

Could you post it at https://github.com/01-ai/Yi/issues?
One of our team members will look into it with you.

Yeah, it may be caused by this:
https://github.com/01-ai/Yi/issues/24
We ran into the same issue.
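
Since the linked issue points at the tokenizer, one quick check (a hedged sketch, not an official fix) is whether the tokenizer silently inserts special tokens that FastChat's length bookkeeping does not expect. The model path is again the local one from the command above.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/opt/jq/big_models/Yi-6B/", trust_remote_code=True, use_fast=False
)

with_special = tokenizer("hello world").input_ids
without_special = tokenizer("hello world", add_special_tokens=False).input_ids
print("with special tokens:   ", len(with_special), with_special[:3])
print("without special tokens:", len(without_special), without_special[:3])
# If the lengths differ, the difference is the number of special tokens the
# tokenizer adds on its own, which is a common source of a constant per-turn
# offset in the mismatch warning.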

FancyZhao changed discussion status to closed
