TypeError: 'NoneType' object cannot be interpreted as an integer

#3
by tanliboy - opened

Hi Qwen2 team,

I am trying to run the Zephyr DPO recipe (https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-7b-beta) to fine-tune this model, but I consistently run into this error (SFT training works fine). Does this model use a special checkpoint configuration that I need to set up? Any thoughts on the potential cause?

" [rank6]: TypeError: 'NoneType' object cannot be interpreted as an integer
[rank5]: Traceback (most recent call last):
[rank5]: File "/home/litan/alignment-handbook/scripts/run_dpo.py", line 261, in <module>
[rank5]: main()
[rank5]: File "/home/litan/alignment-handbook/scripts/run_dpo.py", line 214, in main
[rank5]: train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank5]: File "/opt/conda/envs/handbook/lib/python3.10/site-packages/transformers/trainer.py", line 1850, in train
[rank5]: return inner_training_loop(
[rank5]: File "/opt/conda/envs/handbook/lib/python3.10/site-packages/transformers/trainer.py", line 2165, in _inner_training_loop
[rank5]: for step, inputs in enumerate(epoch_iterator):
[rank5]: File "/opt/conda/envs/handbook/lib/python3.10/site-packages/accelerate/data_loader.py", line 454, in __iter__
[rank5]: current_batch = next(dataloader_iter)
[rank5]: File "/opt/conda/envs/handbook/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
[rank5]: data = self._next_data()
[rank5]: File "/opt/conda/envs/handbook/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
[rank5]: data = self._dataset_fetcher.fetch(index) # may raise StopIteration
[rank5]: File "/opt/conda/envs/handbook/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
[rank5]: return self.collate_fn(data)
[rank5]: File "/opt/conda/envs/handbook/lib/python3.10/site-packages/trl/trainer/utils.py", line 338, in __call__
[rank5]: to_pad = [torch.LongTensor(ex[k]) for ex in features]
[rank5]: File "/opt/conda/envs/handbook/lib/python3.10/site-packages/trl/trainer/utils.py", line 338, in <listcomp>
[rank5]: to_pad = [torch.LongTensor(ex[k]) for ex in features]
[rank5]: TypeError: 'NoneType' object cannot be interpreted as an integer
[2024-06-15 02:51:57,401] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer"

In case anyone runs into the same problem: I figured out that it is related to the inconsistency between bos_token_id and bos_token.
I worked around it by changing
"bos_token": null to "bos_token": "<|endoftext|>" in the tokenizer_config.json file.

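For anyone who would rather script that workaround than edit the file by hand, here is a minimal sketch using only the Python standard library. The helper name `set_bos_token` and the path are hypothetical. It assumes `tokenizer_config.json` is a plain JSON object in a local copy of the model, and it only touches `bos_token` when the value is null.

```python
import json
from pathlib import Path


def set_bos_token(config_path, bos_token="<|endoftext|>"):
    """Set "bos_token" in a tokenizer_config.json if it is currently null.

    Hypothetical helper for illustration; not part of transformers or trl.
    Returns True if the file was changed, False otherwise.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))

    # Leave the file alone if a BOS token is already configured.
    if config.get("bos_token") is not None:
        return False

    config["bos_token"] = bos_token
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return True
```

Calling `set_bos_token("path/to/local/model/tokenizer_config.json")` (placeholder path) applies the same change as the manual edit. It is worth keeping a backup of the original file first.
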
Please also refer to this comment. Changing the config file is no longer needed after the related PR in trl.
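If you are unsure whether your installed trl version still has the problem, one way to check is to scan a few tokenized examples for `None` entries before they reach the collator. The sketch below is plain Python. The helper name `find_none_token_ids` and the field name in the usage line are illustrative. It also assumes (my reading of the traceback, not something confirmed here) that the error comes from a `None` token id sitting inside one of the id lists passed to `torch.LongTensor`.

```python
def find_none_token_ids(features):
    """Return (example_index, key, position) for every None found
    inside a list- or tuple-valued field of the given examples.

    Hypothetical diagnostic helper; `features` is assumed to be a
    list of dicts like the ones handed to the data collator.
    """
    problems = []
    for i, example in enumerate(features):
        for key, value in example.items():
            # Only inspect sequence fields such as token id lists.
            if isinstance(value, (list, tuple)):
                for pos, item in enumerate(value):
                    if item is None:
                        problems.append((i, key, pos))
    return problems
```

For example, `find_none_token_ids([{"chosen_input_ids": [None, 1, 2]}])` reports the `None` at position 0 of that made-up field. An empty result on your real data suggests the fix is in effect.
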

jklj077 changed discussion status to closed
