Question about Tokenizer.
@beomi
Thank you for sharing these impressive results!
This repo is really helpful for developing new models.
I have a couple of questions regarding the tokenizer.
1. Could you specify which Korean vocabulary or tokenizer you are using? I am under the impression that the foundational vocabulary might be llama-2. Is the new one derived from the polyglot-ko vocabulary or something else?
2. Could you elaborate on the process of merging the two tokenizers while keeping the original one intact?
Thanks.
Hi, I created a new SentencePiece tokenizer using the HF tokenizers library, trained on a Korean+English+Code corpus drawn from various sources.
(The reason I used ko/en/code was to limit the merges.)
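Roughly, the training step looks like the sketch below. The corpus file names and hyperparameters are just placeholders, not the actual settings:

```python
# Minimal sketch: train a SentencePiece-style BPE tokenizer with HF tokenizers.
from tokenizers import SentencePieceBPETokenizer

tokenizer = SentencePieceBPETokenizer()

# Hypothetical corpus files: Korean, English, and code text.
corpus_files = ["ko_corpus.txt", "en_corpus.txt", "code_corpus.txt"]

tokenizer.train(
    files=corpus_files,
    vocab_size=32000,  # illustrative, not the repo's actual setting
    min_frequency=2,
    special_tokens=["<unk>", "<s>", "</s>"],
)

# Save vocab + merges so they can be filtered and merged later.
tokenizer.save("ko_en_code_tokenizer.json")
```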
Using regex, I extracted the Korean-only vocab and merges and appended them to the original Llama-2 vocab.
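As a rough sketch, that extract-and-append step can be done directly on the fast-tokenizer JSON files. The Hangul regex range and file paths here are assumptions, not the exact code used:

```python
# Rough sketch: filter Hangul-containing tokens/merges out of the new
# tokenizer and append them to the Llama-2 tokenizer's vocab/merges.
import json
import re

HANGUL = re.compile(r"[\u3131-\u318E\uAC00-\uD7A3]")  # Hangul jamo + syllables

with open("ko_en_code_tokenizer.json", encoding="utf-8") as f:
    new_tok = json.load(f)
with open("llama2_tokenizer.json", encoding="utf-8") as f:
    base_tok = json.load(f)

base_vocab = base_tok["model"]["vocab"]    # token -> id
base_merges = base_tok["model"]["merges"]  # list of merge rules

# Append tokens that contain at least one Hangul character
# and are not already in the Llama-2 vocab.
next_id = max(base_vocab.values()) + 1
for token in new_tok["model"]["vocab"]:
    if HANGUL.search(token) and token not in base_vocab:
        base_vocab[token] = next_id
        next_id += 1

# Append the corresponding Hangul-containing merge rules.
existing = set(m if isinstance(m, str) else " ".join(m) for m in base_merges)
for merge in new_tok["model"]["merges"]:
    pair = merge if isinstance(merge, str) else " ".join(merge)
    if HANGUL.search(pair) and pair not in existing:
        base_merges.append(merge)
        existing.add(pair)

with open("llama2_ko_tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(base_tok, f, ensure_ascii=False)
```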
It is totally unrelated to the polyglot-ko models :)