Issues in the tokenizer

#25
opened by Imran1

How does the <sep> token id 32002 cause segfaults due to out-of-bounds accesses?

Can confirm this during DPO training as well: the tokenizer adds <sep> with id 32002, but the model's embedding matrix has only 32002 rows, so the maximum valid id is 32001.
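
To make the mismatch concrete, here is a minimal sketch of the failure mode (the sizes come from this thread; the embedding dim is arbitrary):

```python
import torch
import torch.nn as nn

# Embedding matrix with 32002 rows: valid ids are 0..32001.
emb = nn.Embedding(num_embeddings=32002, embedding_dim=16)

ids = torch.tensor([32002])  # the <sep> id, one past the last row
emb(ids)  # IndexError on CPU; on CUDA it surfaces as a device-side assert
```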

Yeah, this also shows up in DPO training.
I just added unk and pad tokens, and the tokenizer length became 32004.
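
Adding tokens grows the tokenizer but not the model, so any of the new ids will index out of bounds. A minimal sketch of the usual fix, assuming a standard transformers setup (the model name is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add any missing special tokens (this grows the tokenizer only).
tokenizer.add_special_tokens({"pad_token": "<pad>", "unk_token": "<unk>"})

# Grow the embedding matrix to match, so the new ids have rows.
if len(tokenizer) != model.get_input_embeddings().num_embeddings:
    model.resize_token_embeddings(len(tokenizer))
```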

The DPO run also goes for a few steps and then fails with this error:

```
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
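
As the message says, the reported stack trace may point at the wrong op because CUDA errors are reported asynchronously. A sketch of how to localize the failing kernel (the variable must be set before torch is imported; alternatively, rerun on CPU to get a plain IndexError at the exact line):

```python
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before importing torch

import torch
# ... run the failing training step; the stack trace now points at the
# kernel that actually asserted (here, the embedding lookup).
```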

Here is the DPO code:
https://colab.research.google.com/drive/1uC7LohnGJF-Y4vzPz14z6OgZknkeZqD2?usp=sharing

I am training this with the unsloth library.
