Problem with tokenization

#3
by jhflow - opened

Thank you for your remarkable work.

However, I have found a problem with the tokenizer.

When I tokenize text with the provided tokenizer, I find that it inserts too many spaces, and the results are not grammatically correct.

Do you have any idea how to remedy this problem?


Hi, thank you for the great work.

Despite your great work, I encountered the same strange results when encoding text with the tokenizer.

Here is an example:

sys_message = "당신은 도움이 되고 μ •μ€‘ν•˜λ©° μ •μ§ν•œ μ‘°μˆ˜μž…λ‹ˆλ‹€. μ•ˆμ „μ„ μœ μ§€ν•˜λ©΄μ„œ 항상 κ°€λŠ₯ν•œ ν•œ 도움이 λ˜λŠ” 닡변을 ν•΄μ£Όμ„Έμš”. κ·€ν•˜μ˜ λ‹΅λ³€μ—λŠ” μœ ν•΄ν•˜κ±°λ‚˜, λΉ„μœ€λ¦¬μ μ΄κ±°λ‚˜, μΈμ’…μ°¨λ³„μ μ΄κ±°λ‚˜, μ„±μ°¨λ³„μ μ΄κ±°λ‚˜, 독성이 μžˆκ±°λ‚˜, μœ„ν—˜ν•˜κ±°λ‚˜ λΆˆλ²•μ μΈ μ½˜ν…μΈ κ°€ ν¬ν•¨λ˜μ–΄μ„œλŠ” μ•ˆ λ©λ‹ˆλ‹€. κ·€ν•˜μ˜ 응닡은 μ‚¬νšŒμ μœΌλ‘œ 편견이 μ—†κ³  긍정적인 λ‚΄μš©μ΄μ–΄μ•Ό ν•©λ‹ˆλ‹€."

tokenizer.decode(tokenizer(sys_message)['input_ids'])
'<s>λ‹Ή 신은 도 움이 되고 정쀑 ν•˜λ©° μ • μ§ν•œ μ‘° μˆ˜μž…λ‹ˆλ‹€ . μ•ˆμ „ 을 μœ μ§€ ν•˜λ©΄μ„œ 항상 κ°€λŠ₯ν•œ ν•œ 도 움이 λ˜λŠ” λ‹΅ 변을 ν•΄μ£Όμ„Έμš” . κ·€ ν•˜μ˜ λ‹΅λ³€ μ—λŠ” μœ ν•΄ ν•˜κ±°λ‚˜ , λΉ„ 윀 리 적이 κ±°λ‚˜ , 인쒅 차별 적이 κ±°λ‚˜ , μ„± 차별 적이 κ±°λ‚˜ , 독 성이 있 κ±°λ‚˜ , μœ„ν—˜ ν•˜κ±°λ‚˜ λΆˆλ²• 적인 μ½˜ν…μΈ  κ°€ 포함 λ˜μ–΄ μ„œλŠ” μ•ˆ λ©λ‹ˆλ‹€ . κ·€ ν•˜μ˜ 응 닡은 μ‚¬νšŒμ  으둜 편 견이 μ—†κ³  긍 μ • 적인 λ‚΄μš© 이어야 ν•©λ‹ˆλ‹€ .'

Moreover, when using this model in conversation or text-generation pipelines, it is slower than other models with similar generation configs and parameter counts, such as beomi/llama-2-koen-13b.

Could this slowness be related to the tokenizer?
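For what it's worth, here is a rough sketch of how one could time just the tokenizers and compare the encoded lengths, to separate tokenizer behavior from the model itself (the KoSOLAR repo id below is an assumption and may need adjusting). A longer encoding also means more forward passes per response, which would slow generation even if encoding itself is fast.

import time
from transformers import AutoTokenizer

# beomi/llama-2-koen-13b is the model mentioned above; the KoSOLAR repo id is assumed.
for name in ["yanolja/KoSOLAR-10.7B-v0.1", "beomi/llama-2-koen-13b"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok(sys_message)["input_ids"]  # sys_message as defined above
    start = time.perf_counter()
    for _ in range(1000):
        tok(sys_message)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(ids)} tokens, {elapsed:.2f}s for 1000 encodes")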

Thanks.

Yanolja org

Hello Jeonghwan and Young Woo,

Thank you for pointing out the issue. Yes, I'm aware of it. The problem seems to be that all the tokens from the "added_tokens.json" file are treated as special tokens. I'm not sure whether this is intentional, since there is a separate "special_tokens_map.json" file for special tokens. This causes the tokenizer to insert a space after each token I added.
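You can see this by inspecting the added vocabulary and the decode behavior. The snippet below is only a quick check (the repo id is illustrative); note that spaces_between_special_tokens is honored by the slow tokenizer's decode and may be ignored by the fast one, depending on the transformers version.

from transformers import AutoTokenizer

# Repo id is illustrative; point it at the checkpoint you are using.
tok = AutoTokenizer.from_pretrained("yanolja/KoSOLAR-10.7B-v0.1")

# Tokens registered via added_tokens.json show up here as {token: id}.
added = tok.get_added_vocab()
print(len(added), sorted(added.values())[:5])

ids = tok("당신은 도움이 되고 μ •μ€‘ν•˜λ©° μ •μ§ν•œ μ‘°μˆ˜μž…λ‹ˆλ‹€.")["input_ids"]
# If the added tokens are treated as special, the default decode joins them
# with spaces; this flag suppresses that on the slow tokenizer.
print(tok.decode(ids, spaces_between_special_tokens=False))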

Here's a workaround I've been using:

if prev_tokens is not None:
    # ID of the most recently emitted token.
    last = tokenizer.convert_tokens_to_ids(prev_tokens[-1:])
    # Token IDs above 32000 are the ones added on top of the base vocabulary.
    if last[0] > 32000:
        next_token = new_tokens[-1]
        # "▁" is the SentencePiece word-boundary marker; if the current token
        # starts with it right after an added token, the decoded text picked up
        # a spurious leading space.
        if next_token[0] == "▁":
            suffix = ""
            if len(next_token) > 1:
                suffix = new_text[-(len(next_token) - 1):]
            # Remove the unwanted space from the decoded text.
            new_text = new_text[:-len(next_token)] + suffix
            # Strip the marker from the token itself.
            new_tokens[-1] = next_token[1:]
            if new_tokens[-1] == "":
                # The token was only the marker; drop it entirely.
                new_tokens = new_tokens[:-1]
                output_tokens = output_tokens[:-1]
                prefix_offset -= 1
            else:
                output_tokens[-1] = new_tokens[-1]

It's a bit of a quick fix but it's working for now. I plan to address this issue by adding the tokens directly into the tokenizer. Sorry for any trouble this has caused.

Thanks,
Seungduk

Yanolja org

It looks like my previous answer was wrong; I totally misunderstood how the tokenizer works. I first need to merge the added tokens into the original tokenizer model, but that is not straightforward. Also, the slowness that Young Woo mentioned could be related to the numerous added tokens. Let me get back to you with a solution as soon as possible. Thank you for your understanding.
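For context, the merging involved looks roughly like the sketch below. This is only the common approach for extending a SentencePiece-based Llama tokenizer by editing the model proto (the file names and new tokens are illustrative); it does not cover the fast tokenizer's vocab and merges, which is where the real difficulty lies.

# Sketch: append new pieces to a SentencePiece model proto. File names and the
# token list are illustrative; the fast tokenizer's merges need separate handling.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")  # original base tokenizer model file

proto = sp_pb2.ModelProto()
proto.ParseFromString(sp.serialized_model_proto())
existing = {p.piece for p in proto.pieces}

new_tokens = ["▁당신은", "▁도움이"]  # example additions
for token in new_tokens:
    if token not in existing:
        piece = sp_pb2.ModelProto.SentencePiece()
        piece.piece = token
        piece.score = 0.0
        proto.pieces.append(piece)

with open("merged_tokenizer.model", "wb") as f:
    f.write(proto.SerializeToString())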

Yanolja org

Hi Jeonghwan and Young Woo,

I've been investigating this issue for some time and realized that the 'merges' in the tokenizer configuration were the cause. Jaewon helped me solve this issue and also wrote a blog post about it here: https://seen-point-bd9.notion.site/Tokenizer-Expansion-ecb6d78211a54ba6b3cf8ebc0ec1d105
As you can see in the blog post, KoSOLAR v0.1's tokenizer does not function well, and the length of the encoded result is longer than it should be, although it is still shorter than that produced by the original tokenizer.

<s> λ‹ΉλΆ„κ°„ 주택 가격에 큰 쑰정이 μΌμ–΄λ‚˜κ±°λ‚˜ ν•˜λŠ” 계기가 λ°œμƒν•˜μ§€ μ•ŠλŠ” ν•œ μ΄λ“€μ˜ 주택 λ³΅κ·€λŠ” λ‹ΉλΆ„κ°„ μ–΄λ €μ›Œ λ³΄μΈλ‹€λŠ” 것이 쀑둠이닀

# KoSOLAR v0.1 tokenizer
[1, 28705, 30287, 41768, 259, 34740, 259, 35790, 28705, 29148, 28705, 31694, 28705, 37585, 28705, 29015, 28705, 29415, 32633, 32400, 259, 32029, 259, 30106, 32453, 259, 46354, 32208, 259, 30104, 29175, 28705, 29282, 28705, 29015, 32173, 259, 34740, 259, 30357, 46682, 28705, 29175, 28705, 30287, 41768, 259, 29433, 30710, 31126, 28705, 29477, 33020, 28705, 29175, 28705, 38655, 259, 30027, 39265, 28705, 29043]

# KoSOLAR v0.2 tokenizer
[1, 32119, 41768, 34375, 42984, 32386, 32052, 33335, 33725, 32400, 32254, 39212, 32512, 32208, 32440, 32026, 35964, 34375, 34822, 29175, 32119, 41768, 38294, 39093, 32264, 32212, 32039, 46611, 32034]

As demonstrated, the revised tokenizer outputs a much shorter list of token IDs, most of which are newly added tokens (>= 32000). This also means that many embeddings in embed_tokens and lm_head were not sufficiently trained in KoSOLAR v0.1, because the corresponding token IDs were not generated frequently enough by the tokenizer. Therefore, if I simply replace the tokenizer, the model's performance will degrade significantly; I confirmed this by running an eval with the new tokenizer. I had hoped that my mistake would only affect the decoding process, but it turned out to be the opposite: it was actually the encoding process that was affected.
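To make the comparison concrete, the share of newly added IDs in each encoding can be counted with a few lines (32000 is the size of the base vocabulary, so anything at or above it is an added token; v01_ids and v02_ids stand for the two lists printed above):

# v01_ids / v02_ids are placeholders for the two ID lists shown above.
def added_token_stats(ids, base_vocab_size=32000):
    # Count how many IDs in an encoding refer to newly added tokens.
    added = sum(1 for i in ids if i >= base_vocab_size)
    return added, len(ids)

for name, ids in [("v0.1", v01_ids), ("v0.2", v02_ids)]:
    added, total = added_token_stats(ids)
    print(f"{name}: {added}/{total} IDs are added tokens")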

I will upload a new version with a fix, but it will take some time. I hope to upload the new version by January 12.

Thanks,
Seungduk
