Question about Tokenizer.

#7
by wonhosong - opened

@beomi
Thank you for sharing these impressive results!
This repo is really helpful for developing new models.

I have a couple of questions regarding the tokenizer.

  1. Could you specify which Korean vocabulary or tokenizer you are using? I am under the impression that the foundational vocabulary might be Llama-2's. Is the new one derived from the polyglot-ko vocabulary or something else?

  2. Could you elaborate on the process of merging two tokenizers while keeping the original one?

Thanks.

Hi, I created a new SentencePiece tokenizer using the HF tokenizers library, trained on a Korean+English+Code corpus drawn from various sources.
(*I used ko/en/code to limit the merges.)
Using a regex, I extracted the Korean-only vocab and merges and appended them to the original Llama-2 vocab.

It is totally unrelated to the polyglot-ko models :)
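The vocab-merge step above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: it assumes both vocabs are plain token-to-id dicts, uses a Unicode Hangul range regex to pick out Korean tokens, and appends only the tokens the base vocab lacks, assigning them fresh ids after the existing ones.

```python
import re

# Hangul syllables plus jamo ranges; assumed to be what "Korean only" means here.
HANGUL_RE = re.compile(r"[\uAC00-\uD7A3\u1100-\u11FF\u3130-\u318F]")

def merge_korean_vocab(base_vocab, new_vocab):
    """Append Korean-containing tokens from new_vocab onto base_vocab,
    keeping all original ids intact (sketch of the merge step)."""
    merged = dict(base_vocab)
    next_id = max(base_vocab.values()) + 1
    for token in new_vocab:
        if token not in merged and HANGUL_RE.search(token):
            merged[token] = next_id
            next_id += 1
    return merged

# Toy vocabs standing in for Llama-2 and the new ko/en/code tokenizer.
base = {"<s>": 0, "hello": 1, "world": 2}
new = {"hello": 0, "안녕": 1, "하세요": 2, "code": 3}
merged = merge_korean_vocab(base, new)  # Korean tokens get ids 3 and 4
```

The same filter-then-append idea applies to the BPE merge rules, which would likewise be kept only when they involve Korean pieces.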

beomi changed discussion status to closed
