Issues in the tokenizer

#25
opened by Imran1

How does the <sep> token id 32002 cause segfaults due to out-of-bounds accesses?

Can confirm this during DPO training as well: the tokenizer adds <sep> with id 32002, but the model's embedding matrix has only 32002 rows, so the maximum valid id is 32001.
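
To make the mismatch concrete, here is a minimal sketch of the failure mode (the sizes come from this thread; the embedding dim is arbitrary):

```python
import torch
import torch.nn as nn

# Embedding matrix with 32002 rows: valid ids are 0..32001.
emb = nn.Embedding(num_embeddings=32002, embedding_dim=16)

ids = torch.tensor([32002])  # the <sep> id, one past the last row
emb(ids)  # IndexError on CPU; on CUDA it surfaces as a device-side assert
```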

Yeah, this also shows up in DPO training.
I just added unk and pad tokens, and the tokenizer length became 32004.
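
Adding tokens grows the tokenizer but not the model, so any of the new ids will index out of bounds. A minimal sketch of the usual fix, assuming a standard transformers setup (the model name is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add any missing special tokens (this grows the tokenizer only).
tokenizer.add_special_tokens({"pad_token": "<pad>", "unk_token": "<unk>"})

# Grow the embedding matrix to match, so the new ids have rows.
if len(tokenizer) != model.get_input_embeddings().num_embeddings:
    model.resize_token_embeddings(len(tokenizer))
```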

The DPO run also goes for a few steps and then fails with this error:

```
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
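
As the message says, the reported stack trace may point at the wrong op because CUDA errors are reported asynchronously. A sketch of how to localize the failing kernel (the variable must be set before torch is imported; alternatively, rerun on CPU to get a plain IndexError at the exact line):

```python
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before importing torch

import torch
# ... run the failing training step; the stack trace now points at the
# kernel that actually asserted (here, the embedding lookup).
```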

Here is the DPO code:
https://colab.research.google.com/drive/1uC7LohnGJF-Y4vzPz14z6OgZknkeZqD2?usp=sharing

I am training this with the unsloth library.
