Transformers documentation

Use tokenizers from 🤗 Tokenizers

You are viewing version v4.44.0. A newer version, v4.46.2, is available.

PreTrainedTokenizerFast is built on the 🤗 Tokenizers library. Tokenizers from the 🤗 Tokenizers library can be loaded into 🤗 Transformers very easily.

Before getting into the specifics, let's create a dummy tokenizer in a few lines of code:

>>> from tokenizers import Tokenizer
>>> from tokenizers.models import BPE
>>> from tokenizers.trainers import BpeTrainer
>>> from tokenizers.pre_tokenizers import Whitespace

>>> tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
>>> trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

>>> tokenizer.pre_tokenizer = Whitespace()
>>> files = [...]  # list of paths to plain-text training files
>>> tokenizer.train(files, trainer)

We now have a tokenizer trained on the files we defined. We can keep using it in the current runtime, or save it to a JSON file for later use.
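The snippet above leaves the training files unspecified, so it cannot be run as written. As a minimal sketch, the same tokenizer can be trained on in-memory strings instead. The `corpus` list is a hypothetical toy dataset, and `train_from_iterator` is swapped in for `train(files, trainer)` only to keep the example self-contained:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical toy corpus standing in for the training files.
corpus = [
    "hello world",
    "hello tokenizers",
    "tokenizers load into transformers",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    show_progress=False,  # keep the output quiet for such a small corpus
)

# train_from_iterator consumes an iterator of strings rather than file paths.
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Encode a sentence with the freshly trained tokenizer.
encoding = tokenizer.encode("hello world")
print(encoding.tokens)
print(encoding.ids)
```

The special tokens are added to the vocabulary first, in the order they are listed, so `[UNK]` receives ID 0 here.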

Loading directly from the tokenizer object

Let's see how to use this tokenizer object in the 🤗 Transformers library. The PreTrainedTokenizerFast class can be instantiated easily by passing the instantiated tokenizer object as an argument:

>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

The fast_tokenizer object can now be used with all the methods shared by the 🤗 Transformers tokenizers! See the tokenizer page for more information.
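To illustrate what using those shared methods can look like, here is a minimal sketch that wraps a toy tokenizer and pads a small batch. The `corpus` list is a hypothetical dataset. Passing `pad_token`, `unk_token`, and the other special tokens to the wrapper is an addition not shown on this page, made here because padding requires the wrapper to know which token to pad with:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

# Hypothetical toy corpus, used only to make the example runnable.
corpus = ["hello world", "hello tokenizers", "tokenizers load into transformers"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Tell the wrapper which vocabulary entries play the special-token roles.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    mask_token="[MASK]",
)

# Calling the wrapper on a batch pads the shorter sequence to the longest one.
batch = fast_tokenizer(["hello world", "hello"], padding=True)
print(batch["input_ids"])
print(batch["attention_mask"])
```

In the second sequence, the padded positions are marked with 0 in `attention_mask`.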

Loading from a JSON file

To load a tokenizer from a JSON file, let's first save our tokenizer:

>>> tokenizer.save("tokenizer.json")

The path to the saved JSON file can be passed to the PreTrainedTokenizerFast initialization method using the tokenizer_file parameter:

>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

The fast_tokenizer object can now be used with all the methods shared by the 🤗 Transformers tokenizers! See the tokenizer page for more information.
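One way to check the save-and-reload path end to end is to compare the IDs produced before and after the round trip. This is a sketch under a few assumptions: the `corpus` list is hypothetical, and a temporary directory replaces the `tokenizer.json` path in the working directory so the example leaves no files behind:

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

# Hypothetical toy corpus, used only to make the example runnable.
corpus = ["hello world", "hello tokenizers", "tokenizers load into transformers"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenizer.json")
    tokenizer.save(path)  # serialize the whole pipeline to one JSON file
    fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file=path)

# The reloaded tokenizer should produce the same IDs as the original object.
text = "hello world"
original_ids = tokenizer.encode(text).ids
reloaded_ids = fast_tokenizer(text)["input_ids"]
print(original_ids == reloaded_ids)
```

The file is read once at construction time, so the wrapper keeps working after the temporary directory is deleted.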
