---
language:
- ko
tags:
- roberta
- tokenizer only
license:
- mit
---
## ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋ฒ„์ „
- transformers: 4.21.1
- datasets: 2.4.0
- tokenizers: 0.12.1
## Training code
```python
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer(unicode_normalizer="nfkc", trim_offsets=True)
ds = load_dataset("Bingsu/my-korean-training-corpus", use_auth_token=True)
# To use a public dataset instead:
# ds = load_dataset("cc100", lang="ko")  # 50GB

# The corpus is 35GB; using all of it would exhaust the machine's memory,
# so only a 35% sample (the "test" side of the split below) was used.
ds_sample = ds["train"].train_test_split(0.35, seed=20220819)["test"]
def gen_text(batch_size: int = 5000):
    # Yield the corpus in batches so the whole dataset never has to be
    # held in memory at once.
    for i in range(0, len(ds_sample), batch_size):
        yield ds_sample[i : i + batch_size]["text"]
tokenizer.train_from_iterator(
    gen_text(),
    vocab_size=50265,  # same vocab size as roberta-base
    min_frequency=2,
    special_tokens=[
        "<s>",
        "<pad>",
        "</s>",
        "<unk>",
        "<mask>",
    ],
)
tokenizer.save("my_tokenizer.json")
```
Training took about 7 hours (i5-12600, non-K).
![image](https://i.imgur.com/LNNbtGH.png)
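At this point the tokenizer only produces byte-level BPE pieces and does not yet wrap sequences in RoBERTa's special tokens. A quick sanity check (the sample sentence is made up) might look like:

```python
enc = tokenizer.encode("안녕하세요")
print(enc.tokens)  # BPE pieces only; no <s> or </s> yet
```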
Next, the tokenizer's post-processor is replaced with `RobertaProcessing`.
```python
from tokenizers import Tokenizer
from tokenizers.processors import RobertaProcessing
tokenizer = Tokenizer.from_file("my_tokenizer.json")
tokenizer.post_processor = RobertaProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),  # sep token
    ("<s>", tokenizer.token_to_id("<s>")),  # cls token
    add_prefix_space=False,
)
tokenizer.save("my_tokenizer2.json")
```
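After this change, `encode` wraps every sequence in the RoBERTa special tokens. A quick check (again with a made-up sentence):

```python
enc = tokenizer.encode("토크나이저 테스트")
print(enc.tokens[0], enc.tokens[-1])  # <s> </s>
```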
The `add_prefix_space=False` option is there to match [roberta-base](https://huggingface.co/roberta-base) exactly.
Then the tokenizer was wrapped in `RobertaTokenizerFast` and `model_max_length` was set:
```python
from transformers import RobertaTokenizerFast

# Load the tokenizer file saved in the previous step.
rt = RobertaTokenizerFast(tokenizer_file="my_tokenizer2.json")
rt.save_pretrained("./my_roberta_tokenizer")
```
Then add `"model_max_length": 512,` to the `tokenizer_config.json` file in the saved folder.
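Alternatively (a sketch, not necessarily the step used here), `model_max_length` can be passed directly to the constructor, and `save_pretrained` will write it into `tokenizer_config.json`:

```python
rt = RobertaTokenizerFast(tokenizer_file="my_tokenizer2.json", model_max_length=512)
rt.save_pretrained("./my_roberta_tokenizer")
```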
## Usage
#### 1. With `AutoTokenizer`
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Bingsu/BBPE_tokenizer_test")
# tokenizer is an instance of RobertaTokenizerFast
```
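A quick smoke test (the sentence is made up):

```python
enc = tokenizer("안녕하세요")
print(enc.input_ids)  # wrapped in <s> ... </s>
print(tokenizer.model_max_length)  # 512, if tokenizer_config.json was set as above
```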
#### 2. From `tokenizer.json` directly
First, download the `tokenizer.json` file from this repository.
```python
from transformers import BartTokenizerFast, BertTokenizerFast
bart_tokenizer = BartTokenizerFast(tokenizer_file="tokenizer.json")
bert_tokenizer = BertTokenizerFast(tokenizer_file="tokenizer.json")
```
The file can be loaded not only into BART, which uses byte-level BPE like RoBERTa, but also into BERT.
Note, however, that when loaded this way `model_max_length` is not set, so it has to be set manually, as in the sketch below.
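One way to do that (a minimal sketch) is to pass `model_max_length` at construction time:

```python
from transformers import BartTokenizerFast

bart_tokenizer = BartTokenizerFast(
    tokenizer_file="tokenizer.json", model_max_length=512
)
```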