Special tokens for instruction template
#3
by
Weker
- opened
I hope this is a good place to ask this, but when I run the original Llama 3 model, the instruction tokens like
- '<|begin_of_text|>'
- '<|start_header_id|>'
- '<|end_header_id|>'
get tokenized into one token each (128000, 128006, and 128007 respectively), but when running this model every instruction token gets tokenized into many more tokens. For example, '<|start_header_id|>' gets translated to:
- 128000 - ''
- 27 - '<'
- 91 - '|'
- 2527 - 'start'
- 8932 - '_header'
- 851 - '_id'
- 91 - '|'
- 29 - '>'
Is that intended behavior, or am I doing something wrong? I noticed this when I used a lot of short "user" sections and ran out of context fast.
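To illustrate what I mean, here is a minimal sketch (not the real BPE tokenizer — the per-character fallback ids are placeholders) of why a marker that isn't registered as a special token blows up the token count:

```python
# Hypothetical special-token table mirroring the Llama 3 ids reported above.
SPECIAL_TOKENS = {
    "<|begin_of_text|>": 128000,
    "<|start_header_id|>": 128006,
    "<|end_header_id|>": 128007,
}

def tokenize(text, specials=SPECIAL_TOKENS):
    """Greedy pass: match registered special tokens first, otherwise fall
    back to one placeholder id per character (a real tokenizer would merge
    characters into sub-word pieces, but the count difference is the point)."""
    ids = []
    i = 0
    while i < len(text):
        for tok, tok_id in specials.items():
            if text.startswith(tok, i):
                ids.append(tok_id)
                i += len(tok)
                break
        else:
            ids.append(ord(text[i]))  # placeholder id for a plain character
            i += 1
    return ids

# With the special token registered: a single id.
print(tokenize("<|start_header_id|>"))             # [128006]
# Without it: one id per piece, which is what eats the context window.
print(len(tokenize("<|start_header_id|>", specials={})))  # 19
```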
Edit: I am using Text generation web UI. I don't know if that is relevant.
Steelskull
changed discussion status to
closed
Why close the discussion?