
Model Card for KorSciDeBERTa

KorSciDeBERTa is a language model pretrained on a total of 146 GB of Korean text, consisting of papers, research reports, patents, news, and Korean Wikipedia, based on the architecture of Microsoft's DeBERTa model.

๋งˆ์Šคํ‚น๋œ ์–ธ์–ด ๋ชจ๋ธ๋ง ๋˜๋Š” ๋‹ค์Œ ๋ฌธ์žฅ ์˜ˆ์ธก์— ์‚ฌ์ „ํ•™์Šต ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๊ณ , ์ถ”๊ฐ€๋กœ ๋ฌธ์žฅ ๋ถ„๋ฅ˜, ๋‹จ์–ด ํ† ํฐ ๋ถ„๋ฅ˜ ๋˜๋Š” ์งˆ์˜์‘๋‹ต๊ณผ ๊ฐ™์€ ๋‹ค์šด์ŠคํŠธ๋ฆผ ์ž‘์—…์—์„œ ๋ฏธ์„ธ ์กฐ์ •์„ ํ†ตํ•ด ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Model Details

Model Description

  • Developed by: KISTI
  • Model type: deberta-v2
  • Language(s) (NLP): Korean (ko)

Uses

Downstream Use

Load the Hugging Face model directly

  1. Installing the morphological analyzer (Mecab) and related tools is required - see KorSciDeBERTa ํ™˜๊ฒฝ์„ค์น˜+ํŒŒ์ธํŠœ๋‹.pdf
  • For Mecab installation, see '์‚ฌ์šฉ๋ฐฉ๋ฒ•' (How to use) at the following link: https://aida.kisti.re.kr/model/9bbabd2d-6ce8-44cc-b2a3-69578d23970a

  • When using Colab, install Mecab as follows (if the custom user dictionary mentioned above is not installed, baseline accuracy drops to 0.786):


!git clone https://github.com/SOMJANG/Mecab-ko-for-Google-Colab.git
%cd Mecab-ko-for-Google-Colab/
!bash install_mecab-ko_on_colab_light_220429.sh
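
Once the install script finishes, a quick sanity check can confirm Mecab is usable. This is a minimal sketch; it assumes the script above installed mecab-ko together with the konlpy bindings, and the sample sentence is only illustrative:

from konlpy.tag import Mecab

# Verify that Mecab loads and segments Korean text into morphemes.
mecab = Mecab()
print(mecab.morphs("ํ•œ๊ตญ์–ด ๊ณผํ•™๊ธฐ์ˆ  ๋ฌธ์„œ๋ฅผ ํ˜•ํƒœ์†Œ ๋‹จ์œ„๋กœ ๋ถ„์„ํ•ฉ๋‹ˆ๋‹ค."))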

  • Fix for the ImportError: accelerate>=0.20.1 error:

!pip install -U accelerate; pip install -U transformers; pip install pydantic==1.8 (restart the runtime after installation)

  • ํ† ํฌ๋‚˜์ด์ € ๋กœ๋“œ ์—๋Ÿฌ ๋ฐœ์ƒ์‹œ ํ•ด๊ฒฐ๋ฒ•

Check that git-lfs is installed and that spm.model downloaded correctly, verifying its file size (2.74 MB) (apt-get install git git-lfs).

Make sure you have git-lfs installed (git lfs install)

  2. apt-get install git-lfs; git clone https://huggingface.co/kisti/korscideberta; cd korscideberta
  • The following is from korscideberta-abstractcls.ipynb:

!pip install transformers==4.36.0
from tokenization_korscideberta_v2 import DebertaV2Tokenizer  # shipped with the korscideberta repository
from transformers import AutoModelForSequenceClassification


tokenizer = DebertaV2Tokenizer.from_pretrained("kisti/korscideberta")
model = AutoModelForSequenceClassification.from_pretrained(
    "kisti/korscideberta",
    num_labels=7,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
)
# model = AutoModelForMaskedLM.from_pretrained("kisti/korscideberta")

# ... dataset preparation and Trainer construction omitted; see korscideberta-abstractcls.ipynb ...
train_metrics = trainer.train().metrics
trainer.save_metrics("train", train_metrics)
trainer.push_to_hub()
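
After fine-tuning, inference can look like the following minimal sketch (the input sentence is illustrative, and the predicted id maps to one of the 7 classes defined by the fine-tuning dataset):

import torch

# Classify one sentence with the fine-tuned model from above.
inputs = tokenizer("๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง ๊ธฐ๋ฐ˜์˜ ๋ฌธ์„œ ๋ถ„๋ฅ˜ ๋ชจ๋ธ์„ ์ œ์•ˆํ•œ๋‹ค.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # predicted class id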
  

KorSciDeBERTa native code

See KorSciDeBERTa ํ™˜๊ฒฝ์„ค์น˜+ํŒŒ์ธํŠœ๋‹.pdf.


apt-get install git git-lfs
git clone https://huggingface.co/kisti/korscideberta; cd korscideberta; unzip korscideberta.zip -d korscideberta

# ... environment setup omitted; see the PDF referenced above ...
cd korscideberta/experiments/glue; chmod 777 *.sh
./mnli.sh

Out-of-Scope Use

์ด ๋ชจ๋ธ์€ ์˜๋„์ ์œผ๋กœ ์‚ฌ๋žŒ๋“ค์—๊ฒŒ ์ ๋Œ€์ ์ด๋‚˜ ์†Œ์™ธ๋œ ํ™˜๊ฒฝ์„ ์กฐ์„ฑํ•˜๋Š”๋ฐ ์‚ฌ์šฉ๋˜์–ด์„œ๋Š” ์•ˆ ๋ฉ๋‹ˆ๋‹ค.

์ด ๋ชจ๋ธ์€ '๊ณ ์œ„ํ—˜ ์„ค์ •'์—์„œ ์‚ฌ์šฉ๋  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ์ด ๋ชจ๋ธ์€ ์‚ฌ๋žŒ์ด๋‚˜ ์‚ฌ๋ฌผ์— ๋Œ€ํ•œ ์ค‘์š”ํ•œ ๊ฒฐ์ •์„ ๋‚ด๋ฆด ์ˆ˜ ์žˆ๊ฒŒ ์„ค๊ณ„๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค. ๋ชจ๋ธ์˜ ์ถœ๋ ฅ๋ฌผ์€ ์‚ฌ์‹ค์ด ์•„๋‹ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

'High-risk settings' include the following:

use in the medical, political, legal, or financial domains; evaluation of people for employment, education, or credit; automated decision-making on important matters; generation of (fake) facts; generation of summaries presented as trustworthy; making predictions that must always be correct; and so on.

Bias, Risks, and Limitations

Only corpus data free of copyright issues was used, for research purposes. Users of this model should be aware of the risk factors below.

์‚ฌ์šฉ๋œ ๋ง๋ญ‰์น˜๋Š” ๋Œ€๋ถ€๋ถ„ ์ค‘๋ฆฝ์ ์ธ ์„ฑ๊ฒฉ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋Š”๋ฐ๋„ ๋ถˆ๊ตฌํ•˜๊ณ , ์–ธ์–ด ๋ชจ๋ธ์˜ ํŠน์„ฑ์ƒ ์•„๋ž˜์™€ ๊ฐ™์€ ์œค๋ฆฌ ๊ด€๋ จ ์š”์†Œ๋ฅผ ์ผ๋ถ€ ํฌํ•จํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

ํŠน์ • ๊ด€์ ์— ๋Œ€ํ•œ ๊ณผ๋Œ€/๊ณผ์†Œ ํ‘œํ˜„, ๊ณ ์ • ๊ด€๋…, ๊ฐœ์ธ ์ •๋ณด, ์ฆ์˜ค/๋ชจ์š• ๋˜๋Š” ํญ๋ ฅ์ ์ธ ์–ธ์–ด, ์ฐจ๋ณ„์ ์ด๊ฑฐ๋‚˜ ํŽธ๊ฒฌ์ ์ธ ์–ธ์–ด, ๊ด€๋ จ์ด ์—†๊ฑฐ๋‚˜ ๋ฐ˜๋ณต์ ์ธ ์ถœ๋ ฅ ์ƒ์„ฑ ๋“ฑ.

Training Details

Training Data

A total of 146 GB of text: papers, research reports, patents, news, and Korean Wikipedia.

Training Procedure

Trained for 1,600,000 steps over 2.5 months on 24 NVIDIA A100 80 GB GPUs on the KISTI HPC.

Preprocessing

  • Science and technology domain tokenizer (KorSci Tokenizer)
  • The corpus was preprocessed with a tokenizer that merges a Mecab-ko tokenizer, extended with a user dictionary of about 6 million nouns and compound nouns drawn from the pretraining corpus, with the standard SentencePiece-BPE (see the tokenization sketch after this list).
  • Total: 128,100 words
  • Included special tokens: <unk>, <cls>, <s>, <mask>
  • File names: spm.model, vocab.txt
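
A minimal tokenization sketch, assuming Mecab and the repository files are installed as described under Downstream Use (the sample sentence is illustrative):

from tokenization_korscideberta_v2 import DebertaV2Tokenizer

# Load the merged Mecab-ko + SentencePiece-BPE tokenizer and inspect its output.
tokenizer = DebertaV2Tokenizer.from_pretrained("kisti/korscideberta")
print(tokenizer.tokenize("๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง ๊ธฐ๋ฐ˜์˜ ๋ถ„๋ฅ˜ ๋ชจ๋ธ์„ ์ œ์•ˆํ•œ๋‹ค."))  # subword tokens
print(len(tokenizer))  # vocabulary size: 128,100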

Training Hyperparameters

  • model_type: deberta-v2
  • model_size: base
  • parameters: 180M
  • hidden_size: 768
  • num_hidden_layers: 12
  • num_attention_heads: 12
  • num_train_steps: 1,600,000
  • train_batch_size: 4,096 × 4 (gradient accumulation) = 16,384
  • learning_rate: 1e-4
  • max_seq_length: 512
  • vocab_size: 128,100
  • Training regime: fp16 mixed precision

Evaluation

Testing Data, Factors & Metrics

Testing Data

๋ณธ ์–ธ์–ด๋ชจ๋ธ์˜ ์„ฑ๋Šฅํ‰๊ฐ€๋Š” ๋…ผ๋ฌธ ์—ฐ๊ตฌ๋ถ„์•ผ ๋ถ„๋ฅ˜ ๋ฐ์ดํ„ฐ์— ํŒŒ์ธํŠœ๋‹ํ•˜์—ฌ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜์˜€์œผ๋ฉฐ, ๊ทธ ๊ฒฐ๊ณผ๋Š” ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

  • Paper research-field classification dataset (doi.org/10.23057/50): 30,000 papers; number of categories - major: 33, middle: 372, minor: 2,898

Metrics

F1-micro/macro: a sample counts as a success if at least one of its top-3 ground-truth labels is predicted.

F1-strict: credit is given in proportion to how many of the top-3 ground-truth labels are predicted.
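
A minimal sketch of one way to read these criteria (this is an interpretation, not the authors' evaluation code; gold and pred are hypothetical per-sample label sets):

# Hedged sketch of the two matching criteria described above.
def lenient_hit(gold, pred):
    # F1-micro/macro criterion: success if at least one predicted label
    # appears among the (up to 3) gold labels of the sample.
    return len(set(gold) & set(pred)) > 0

def strict_credit(gold, pred):
    # F1-strict criterion: partial credit for each gold label predicted.
    return len(set(gold) & set(pred)) / len(gold)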

Results

F1-micro: 0.85, F1-macro: 0.52, F1-strict: 0.71

Technical Specifications

Model Objective

MLM is a technique in which you take your tokenized sample, replace some of the tokens with the <mask> token, and train your model on it. The model then tries to predict what should appear in place of each <mask> token and gradually learns about the data. MLM teaches the model the relationships between words.

E.g., suppose you have the sentence 'Deep Learning is so cool! I love neural networks.' and replace a few words with the <mask> token.

Masked sentence: 'Deep Learning is so <mask>! I love <mask> networks.'
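
A minimal masked-prediction sketch, assuming the environment from the Downstream Use section (the Korean sample sentence is illustrative):

import torch
from tokenization_korscideberta_v2 import DebertaV2Tokenizer
from transformers import AutoModelForMaskedLM

tokenizer = DebertaV2Tokenizer.from_pretrained("kisti/korscideberta")
model = AutoModelForMaskedLM.from_pretrained("kisti/korscideberta")

# Mask one word and let the model predict it.
text = f"์‹ฌ์ธต ์‹ ๊ฒฝ๋ง์€ {tokenizer.mask_token} ํ•™์Šต์— ๋„๋ฆฌ ์‚ฌ์šฉ๋œ๋‹ค."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Read off the highest-scoring token at the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
print(tokenizer.convert_ids_to_tokens(int(logits[0, mask_pos].argmax())))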

Compute Infrastructure

KISTI National Supercomputing Center NEURON system: HPE ClusterStor E1000, HP Apollo 6500 Gen10 Plus, Lustre, Slurm, CentOS 7.9

Hardware

24× NVIDIA A100 80 GB GPUs

Software

Python 3.8, CUDA 10.2, PyTorch 1.10

Citation

Korea Institute of Science and Technology Information (2023): A pretrained DeBERTa model for the Korean science and technology domain (KorSciDeBERTa). Version 1.0. Korea Institute of Science and Technology Information.

Model Card Authors

๊น€์„ฑ์ฐฌ, ๊น€๊ฒฝ๋ฏผ, ๊น€์€ํฌ, ์ด๋ฏผํ˜ธ, ์ด์Šน์šฐ. AI Data Research Group, Korea Institute of Science and Technology Information (KISTI)

Model Card Contact

๊น€์„ฑ์ฐฌ: sckim (at) kisti.re.kr; ๊น€๊ฒฝ๋ฏผ: kkmkorea (at) kisti.re.kr
