Transformers documentation

Use tokenizers from 🤗 Tokenizers



PreTrainedTokenizerFast depends on the 🤗 Tokenizers library. Tokenizers obtained from the 🤗 Tokenizers library can be loaded very simply into 🤗 Transformers.

๊ตฌ์ฒด์ ์ธ ๋‚ด์šฉ์— ๋“ค์–ด๊ฐ€๊ธฐ ์ „์—, ๋ช‡ ์ค„์˜ ์ฝ”๋“œ๋กœ ๋”๋ฏธ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ๋งŒ๋“ค์–ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

>>> from tokenizers import Tokenizer
>>> from tokenizers.models import BPE
>>> from tokenizers.trainers import BpeTrainer
>>> from tokenizers.pre_tokenizers import Whitespace

>>> tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
>>> trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

>>> tokenizer.pre_tokenizer = Whitespace()
>>> files = [...]  # paths to your training text files
>>> tokenizer.train(files, trainer)

์šฐ๋ฆฌ๊ฐ€ ์ •์˜ํ•œ ํŒŒ์ผ์„ ํ†ตํ•ด ์ด์ œ ํ•™์Šต๋œ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ๊ฐ–๊ฒŒ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ด ๋Ÿฐํƒ€์ž„์—์„œ ๊ณ„์† ์‚ฌ์šฉํ•˜๊ฑฐ๋‚˜ JSON ํŒŒ์ผ๋กœ ์ €์žฅํ•˜์—ฌ ๋‚˜์ค‘์— ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ† ํฌ๋‚˜์ด์ € ๊ฐ์ฒด๋กœ๋ถ€ํ„ฐ ์ง์ ‘ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

๐Ÿค— Transformers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์—์„œ ์ด ํ† ํฌ๋‚˜์ด์ € ๊ฐ์ฒด๋ฅผ ํ™œ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. PreTrainedTokenizerFast ํด๋ž˜์Šค๋Š” ์ธ์Šคํ„ด์Šคํ™”๋œ ํ† ํฌ๋‚˜์ด์ € ๊ฐ์ฒด๋ฅผ ์ธ์ˆ˜๋กœ ๋ฐ›์•„ ์‰ฝ๊ฒŒ ์ธ์Šคํ„ด์Šคํ™”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

The fast_tokenizer object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head to the tokenizer page for more information.
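For instance, the wrapped tokenizer supports the standard `__call__` API, including batching and padding. A self-contained sketch (the in-memory training corpus is a stand-in, and the special-token kwargs are needed here so that padding works):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

# Train a tiny tokenizer in memory (stand-in for the one trained above).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(["hello world", "hello tokenizers"], trainer)

# Wrap it; declaring pad_token lets the fast tokenizer pad batches.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
)

# The familiar Transformers __call__ API now works, padding included.
batch = fast_tokenizer(["hello world", "hello"], padding=True)
print(batch["input_ids"])
```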

JSON ํŒŒ์ผ์—์„œ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

JSON ํŒŒ์ผ์—์„œ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ ์œ„ํ•ด, ๋จผ์ € ํ† ํฌ๋‚˜์ด์ €๋ฅผ ์ €์žฅํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

>>> tokenizer.save("tokenizer.json")

JSON ํŒŒ์ผ์„ ์ €์žฅํ•œ ๊ฒฝ๋กœ๋Š” tokenizer_file ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ PreTrainedTokenizerFast ์ดˆ๊ธฐํ™” ๋ฉ”์†Œ๋“œ์— ์ „๋‹ฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

This fast_tokenizer object can likewise be used with all the methods shared by the 🤗 Transformers tokenizers! Head to the tokenizer page for more information.
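A minimal round-trip sketch of this save-and-reload flow, with a tiny in-memory training corpus standing in for the files used above, and a temporary directory standing in for wherever you actually keep tokenizer.json:

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

# Train a tiny tokenizer in memory (stand-in for the one trained above).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(["hello world", "hello tokenizers"], trainer)

with tempfile.TemporaryDirectory() as tmp:
    # Round trip: save to JSON, then reload via tokenizer_file.
    json_path = os.path.join(tmp, "tokenizer.json")
    tokenizer.save(json_path)
    fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file=json_path)

# The reloaded tokenizer is fully loaded into memory and usable.
ids = fast_tokenizer("hello world")["input_ids"]
print(ids)
```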
