ByT5-Korean - large

ByT5-Korean is a Korean-specific extension of Google's ByT5.

A Korean syllable has three components (called Jamo): a beginning consonant, a middle vowel, and an optional final consonant. Jamo play the role that individual letters play in an alphabet. ByT5's UTF-8 byte encoding works generically across languages, but it is unnatural for Korean: each syllable becomes three bytes, and the byte boundaries cut through the bits that represent each Jamo, so no single byte corresponds to a single Jamo.
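The mismatch can be seen with a few lines of Python. This is a minimal sketch that uses the standard Unicode arithmetic for precomposed Hangul syllables (U+AC00 to U+D7A3); the helper name `decompose` and the table names are illustrative and are not part of the ByT5-Korean code.

```python
# Jamo tables in standard Unicode order (compatibility Jamo, used for display only).
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"  # 19 beginning consonants
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"  # 21 middle vowels
# Index 0 means "no final consonant", followed by the 27 final consonants.
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable


def decompose(syllable):
    """Split one precomposed Hangul syllable into (initial, medial, final) Jamo."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < 19 * 21 * 28:
        raise ValueError("not a precomposed Hangul syllable")
    initial, rest = divmod(index, 21 * 28)
    medial, final = divmod(rest, 28)
    return INITIALS[initial], MEDIALS[medial], FINALS[final]


# One syllable is three Jamo, but its three UTF-8 bytes do not line up with them.
print(decompose("한"))
print("한".encode("utf-8").hex())
```

Here the syllable 한 decomposes into ㅎ, ㅏ, ㄴ, while its UTF-8 form is the three bytes `ed 95 9c`, none of which identifies a Jamo on its own.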

ByT5-Korean extends ByT5's UTF-8 encoding with special handling for Korean syllables: each Jamo is represented by an extra token. ByT5-Korean was pre-trained on mC4 with a mix of 70% Korean and 30% English.

Encoding Scheme

id: token
0: <pad>
1: <eos>
2: <unk>
3~258: UTF-8 bytes
259~277: beginning consonants (초성), 19 tokens (ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ)
278~298: middle vowels (중성), 21 tokens (ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ)
299~326: final consonants (종성), no final consonant + 27 tokens (ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ)
327~384: from <extra_id_0> to <extra_id_57>
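
The table above can be turned into a small encoder and decoder. The following is a rough sketch derived only from the id ranges listed here, not a copy of the project's tokenizer.py. It assumes that an open syllable emits id 299 for "no final consonant", that bytes map to ids with an offset of 3 as in ByT5, and that standalone Jamo and other non-syllable characters fall back to UTF-8 bytes. The function and constant names are hypothetical, and special tokens (<pad>, <eos>, <unk>, <extra_id_*>) are simply skipped on decode.

```python
HANGUL_BASE, HANGUL_LAST = 0xAC00, 0xD7A3  # precomposed Hangul syllable block
BYTE_OFFSET = 3       # ids 0..2 are <pad>, <eos>, <unk>
INITIAL_OFFSET = 259  # 19 beginning consonants
MEDIAL_OFFSET = 278   # 21 middle vowels
FINAL_OFFSET = 299    # assumed: 299 itself means "no final consonant"


def encode(text):
    """Map text to ids: three Jamo ids per Hangul syllable, byte ids otherwise."""
    ids = []
    for ch in text:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            index = code - HANGUL_BASE
            ids.append(INITIAL_OFFSET + index // (21 * 28))
            ids.append(MEDIAL_OFFSET + (index % (21 * 28)) // 28)
            ids.append(FINAL_OFFSET + index % 28)
        else:
            ids.extend(b + BYTE_OFFSET for b in ch.encode("utf-8"))
    return ids


def decode(ids):
    """Inverse of encode; assumes Jamo ids arrive as complete triples."""
    parts, pending = [], bytearray()
    i = 0
    while i < len(ids):
        tok = ids[i]
        if BYTE_OFFSET <= tok < INITIAL_OFFSET:
            pending.append(tok - BYTE_OFFSET)  # collect raw UTF-8 bytes
            i += 1
            continue
        if pending:  # flush buffered bytes before handling a non-byte id
            parts.append(pending.decode("utf-8", errors="ignore"))
            pending = bytearray()
        if INITIAL_OFFSET <= tok < MEDIAL_OFFSET and i + 2 < len(ids):
            initial = tok - INITIAL_OFFSET
            medial = ids[i + 1] - MEDIAL_OFFSET
            final = ids[i + 2] - FINAL_OFFSET
            parts.append(chr(HANGUL_BASE + (initial * 21 + medial) * 28 + final))
            i += 3
        else:
            i += 1  # skip special tokens and stray ids in this sketch
    if pending:
        parts.append(pending.decode("utf-8", errors="ignore"))
    return "".join(parts)
```

Under these assumptions, `encode("한")` gives `[277, 278, 303]` (ㅎ, ㅏ, ㄴ) instead of three byte ids, and `decode(encode(text))` returns the original text for ordinary Korean and English input.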

Example Inference

import torch
from tokenizer import ByT5KoreanTokenizer # https://huggingface.co/everdoubling/byt5-Korean-large/blob/main/tokenizer.py
from transformers import T5ForConditionalGeneration

tokenizer_jamo = ByT5KoreanTokenizer()
model = T5ForConditionalGeneration.from_pretrained('everdoubling/byt5-Korean-large')

# Korean Wikipedia introduction with two masked spans (<extra_id_0> and <extra_id_1>)
input_sentence = '한국어 위키백과(영어: Korean Wikipedia)는 한국어로 운영되는 위키백과의 다언어판 가운데 하나로서, 2002년 10월 11일에 <extra_id_0>. 또한 현재 한국어 위키백과에는 넘겨주기, 토론, 그림 등 페이지로 불리는 모든 문서를 포함하면 총 2,629,860개가 <extra_id_1>되어 있으며, 넘겨주기를 포함한 일반 문서 수는 1,278,560개,[1] 그중 넘겨주기, 막다른 문서를 제외한 일반 문서 수는 573,149개이다.'

input_ids_jamo = tokenizer_jamo(input_sentence).input_ids
# Wrap the ids in a list to form a batch of size 1
outputs_jamo = model.generate(torch.tensor([input_ids_jamo]))
print(tokenizer_jamo.decode(outputs_jamo[0]))
# <pad><extra_id_0>설립되었다<extra_id_1>đě
# (설립되었다 means "was established")

Additional information coming soon...
