Words are not merged during tokenization

by gaetokk - opened Apr 2

gaetokk, Apr 2 (edited Apr 3)

First of all, thank you so much for your contributions and sharing.

I think I found an issue while trying to use the model you trained.

Unlike with the original phi-2 model, a word does not get converted to a single token ID even though the corresponding merge exists in merges.txt. I think this could make training less efficient, because the same text would take more tokens. I’m wondering whether this is something you intended or whether it is a bug.
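To make clear what I would expect, here is a toy sketch of how a BPE merge list is normally applied. It is only an illustration under simplifying assumptions, not the model's tokenizer: the function name `apply_merges` and the merge list `toy_merges` are made up for the example, and it ignores the byte-level encoding and pre-tokenization that real tokenizers apply first.

```python
def apply_merges(word, merges):
    """Toy BPE: repeatedly apply the earliest-listed merge found in the word."""
    # Earlier entries in the merge list have higher priority (lower rank).
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge is left
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge list, in the same order of priority as lines in merges.txt.
toy_merges = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]

print(apply_merges("hello", toy_merges))  # ['hello']
print(apply_merges("help", toy_merges))   # ['he', 'l', 'p']
```

When every merge needed to build a word is in the list, the word ends up as one token, which is the behavior I see with the original phi-2 tokenizer but not with this model.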

Here is the code I tried to run.
Thank you.
