# RoBERTa-base Korean
## λͺ¨λΈ μ„€λͺ…
이 RoBERTa λͺ¨λΈμ€ λ‹€μ–‘ν•œ ν•œκ΅­μ–΄ ν…μŠ€νŠΈ λ°μ΄ν„°μ…‹μ—μ„œ **음절** λ‹¨μœ„λ‘œ 사전 ν•™μŠ΅λ˜μ—ˆμŠ΅λ‹ˆλ‹€.
자체 κ΅¬μΆ•ν•œ ν•œκ΅­μ–΄ 음절 λ‹¨μœ„ vocab을 μ‚¬μš©ν•˜μ˜€μŠ΅λ‹ˆλ‹€.
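For intuition, syllable-level tokenization splits Korean text into individual Hangul syllables rather than subwords. A minimal plain-Python illustration (this is not the actual tokenizer, which additionally handles special tokens and out-of-vocabulary characters):

```python
text = "ν•œκ΅­μ–΄ λ¬Έμž₯"
# Split into syllable (character) units; the real 1,428-entry vocab covers
# frequent Hangul syllables plus special tokens.
syllables = [ch for ch in text if not ch.isspace()]
print(syllables)  # ['ν•œ', 'κ΅­', 'μ–΄', 'λ¬Έ', 'μž₯']
```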
## μ•„ν‚€ν…μ²˜
- **λͺ¨λΈ μœ ν˜•**: RoBERTa
- **μ•„ν‚€ν…μ²˜**: RobertaForMaskedLM
- **λͺ¨λΈ 크기**: 256 hidden size, 8 hidden layers, 8 attention heads
- **max_position_embeddings**: 514
- **intermediate_size**: 2048
- **vocab_size**: 1428
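For reference, here is a minimal sketch of how these sizes map onto a `RobertaConfig` (fields not listed above are assumed to keep their defaults; the config shipped with the checkpoint is authoritative):

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Configuration matching the sizes listed above (all other fields left at defaults).
config = RobertaConfig(
    vocab_size=1428,
    hidden_size=256,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)
```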
## Training Data
The following datasets were used:
- **Modu Corpus (λͺ¨λ‘μ˜λ§λ­‰μΉ˜)**: chat, message boards, everyday conversation, news, broadcast scripts, books, etc.
- **AIHUB**: SNS, YouTube comments, book sentences
- **Other**: Namuwiki, Korean Wikipedia
The combined data amounts to roughly 11 GB.
## Training Details
- **BATCH_SIZE**: 112 (per GPU)
- **ACCUMULATE**: 36
- **MAX_STEPS**: 12,500
- **Train Steps Γ— Batch Size**: **100M**
- **WARMUP_STEPS**: 2,400
- **Optimizer**: AdamW, LR 1e-3, BETA (0.9, 0.98), eps 1e-6
- **LR decay**: linear
- **Hardware**: 2x RTX 8000 GPUs
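As a rough, non-authoritative sketch, the optimizer and schedule described above could be set up with PyTorch and `transformers` utilities roughly as follows (the actual training script is not part of this repo; `model` reuses the configuration sketch from the Architecture section):

```python
import torch
from transformers import RobertaConfig, RobertaForMaskedLM, get_linear_schedule_with_warmup

# Model from the configuration sketch above.
model = RobertaForMaskedLM(RobertaConfig(
    vocab_size=1428, hidden_size=256, num_hidden_layers=8,
    num_attention_heads=8, intermediate_size=2048, max_position_embeddings=514,
))

MAX_STEPS = 12_500
WARMUP_STEPS = 2_400

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.98), eps=1e-6)
# Linear warmup for 2,400 steps, then linear decay over the remaining steps.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=MAX_STEPS
)
```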
![Evaluation Loss Graph](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/-64jKdcJAavwgUREwaywe.png)
![Evaluation Accuracy Graph](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/LPq5M6S8LTwkFSCepD33S.png)
## How to Use
```python
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer
model = AutoModel.from_pretrained("your_model_name")
tokenizer = AutoTokenizer.from_pretrained("your_tokenizer_name")

# Tokenize text and run a forward pass to obtain hidden states
inputs = tokenizer("여기에 ν•œκ΅­μ–΄ ν…μŠ€νŠΈ μž…λ ₯", return_tensors="pt")
outputs = model(**inputs)
```
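Since the checkpoint was trained with a masked-language-modeling head (RobertaForMaskedLM), masked-syllable prediction can also be tried via the `fill-mask` pipeline. This is a sketch reusing the placeholder names above; the mask token string is taken from the tokenizer rather than hard-coded:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="your_model_name", tokenizer="your_tokenizer_name")
mask = fill_mask.tokenizer.mask_token

# Predict the masked syllable in a Korean sentence.
print(fill_mask(f"ν•œκ΅­{mask} ν…μŠ€νŠΈ"))
```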