Question about Tokenizer.

#7
by wonhosong - opened

@beomi
Thank you for sharing these impressive results!
This repo is really helpful for developing new models.

I have a couple of questions regarding the tokenizer.

  1. Could you specify which Korean vocabulary or tokenizer you are using? I am under the impression that the foundational vocabulary might be Llama-2's. Is the new one derived from the polyglot-ko vocabulary or something else?

  2. Could you elaborate on the process of merging two tokenizers while keeping the original one?

Thanks.

Hi, I created a new SentencePiece tokenizer using the HF tokenizers library, trained on a Korean+English+Code corpus drawn from various sources.
(*I used ko/en/code to limit the merges.)
Using a regex, I extracted the Korean-only vocab and merges and appended them to the original Llama-2 vocab.

It is totally unrelated to the polyglot-ko models :)
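The vocab-merge step above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: it assumes both vocabs are plain token-to-id dicts, uses a Unicode Hangul range regex to pick out Korean tokens, and appends only the tokens the base vocab lacks, assigning them fresh ids after the existing ones.

```python
import re

# Hangul syllables plus jamo ranges; assumed to be what "Korean only" means here.
HANGUL_RE = re.compile(r"[\uAC00-\uD7A3\u1100-\u11FF\u3130-\u318F]")

def merge_korean_vocab(base_vocab, new_vocab):
    """Append Korean-containing tokens from new_vocab onto base_vocab,
    keeping all original ids intact (sketch of the merge step)."""
    merged = dict(base_vocab)
    next_id = max(base_vocab.values()) + 1
    for token in new_vocab:
        if token not in merged and HANGUL_RE.search(token):
            merged[token] = next_id
            next_id += 1
    return merged

# Toy vocabs standing in for Llama-2 and the new ko/en/code tokenizer.
base = {"<s>": 0, "hello": 1, "world": 2}
new = {"hello": 0, "안녕": 1, "하세요": 2, "code": 3}
merged = merge_korean_vocab(base, new)  # Korean tokens get ids 3 and 4
```

The same filter-then-append idea applies to the BPE merge rules, which would likewise be kept only when they involve Korean pieces.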

beomi changed discussion status to closed
