
๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋ฒ„์ „

  • transformers: 4.21.1
  • datasets: 2.4.0
  • tokenizers: 0.12.1

Training code

from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer(unicode_normalizer="nfkc", trim_offsets=True)
ds = load_dataset("Bingsu/my-korean-training-corpus", use_auth_token=True)
# ๊ณต๊ฐœ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•  ๊ฒฝ์šฐ
# ds = load_dataset("cc100", lang="ko")  # 50GB


# ์ด ๋ฐ์ดํ„ฐ๋Š” 35GB์ด๊ณ , ๋ฐ์ดํ„ฐ๊ฐ€ ๋„ˆ๋ฌด ๋งŽ์œผ๋ฉด ์ปดํ“จํ„ฐ๊ฐ€ ํ„ฐ์ ธ์„œ ์ผ๋ถ€๋งŒ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.
ds_sample = ds["train"].train_test_split(0.35, seed=20220819)["test"]


def gen_text(batch_size: int = 5000):
    # Yield the sampled corpus as batches of raw text for streaming training.
    for i in range(0, len(ds_sample), batch_size):
        yield ds_sample[i : i + batch_size]["text"]


tokenizer.train_from_iterator(
    gen_text(),
    vocab_size=50265,  # same size as roberta-base
    min_frequency=2,
    special_tokens=[
        "<s>",
        "<pad>",
        "</s>",
        "<unk>",
        "<mask>",
    ],
)
tokenizer.save("my_tokenizer.json")

Training took about 7 hours (i5-12600, non-K).

Afterwards, replace the tokenizer's post-processor with RobertaProcessing.

from tokenizers import Tokenizer
from tokenizers.processors import RobertaProcessing

tokenizer = Tokenizer.from_file("my_tokenizer.json")
tokenizer.post_processor = RobertaProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),  # sep token
    ("<s>", tokenizer.token_to_id("<s>")),    # cls token
    add_prefix_space=False,
)

tokenizer.save("my_tokenizer2.json")

The add_prefix_space=False option is there to follow roberta-base exactly.
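
A quick way to verify the effect (a sketch; the input sentence is just an arbitrary example):

from tokenizers import Tokenizer

tok = Tokenizer.from_file("my_tokenizer2.json")
print(tok.encode("안녕하세요").tokens)
# the sequence should now be wrapped as <s> ... </s>, as roberta-base does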

I also set model_max_length.

from transformers import RobertaTokenizerFast

rt = RobertaTokenizerFast(tokenizer_file="my_tokenizer2.json")  # the file saved above
rt.save_pretrained("./my_roberta_tokenizer")

์ €์žฅ๋œ ํด๋”์˜ tokenizer_config.json ํŒŒ์ผ์— "model_max_length": 512,๋ฅผ ์ถ”๊ฐ€.

Usage

1.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Bingsu/BBPE_tokenizer_test")

# tokenizer is an instance of RobertaTokenizerFast.
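
If the tokenizer_config.json edit above made it into the repo, the length limit comes along for free:

print(tokenizer.model_max_length)  # expected: 512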

2.

First, download the tokenizer.json file.

from transformers import BartTokenizerFast, BertTokenizerFast

bart_tokenizer = BartTokenizerFast(tokenizer_file="tokenizer.json")
bert_tokenizer = BertTokenizerFast(tokenizer_file="tokenizer.json")

It can be loaded not only into BART, which uses byte-level BPE just like RoBERTa, but also into BERT. However, a tokenizer loaded this way has no model_max_length set, so you need to set it yourself.
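
For example, to match the 512 limit used above:

bart_tokenizer.model_max_length = 512
bert_tokenizer.model_max_length = 512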
