
2023 Sungkyunkwan University Summer Intensive Industry-Academia Collaboration Project: VAIV

GPT ๊ธฐ๋ฐ˜์˜ ์ž์—ฐ์Šค๋Ÿฝ๊ณ (Friendly) ์œค๋ฆฌ์ ์ธ(Harmless) ์ผ์ƒ ๋Œ€ํ™”ํ˜• ์ฑ—๋ด‡ ๋ชจ๋ธ

GitHub: https://github.com/VAIV-2023/RLHF-Korean-Friendly-LLM

Research Background and Objective

Implementation of a natural, ethical, Korean-language everyday conversational chatbot model based on GPT-NeoX (Polyglot-ko)


Development Overview

  • Self-Instruct: data augmentation using GPT-4

  • RLHF (Reinforcement Learning from Human Feedback): reinforcement learning that incorporates human preferences

  • DeepSpeed: a memory-optimization technology for large-scale distributed deep learning

    • Task 1: Building datasets for each reinforcement-learning stage
    • Task 2: Instruction-tuning the SFT model
    • Task 3: Implementing Reward Model ver1, ver2, and ver3
    • Task 4: Building the final model with RLHF and DeepSpeedChat (https://huggingface.co/Trofish/KULLM-RLHF)

Task1. Building Datasets for Each Reinforcement-Learning Stage

[Figures: datasets built for each reinforcement-learning stage]

Task2. SFT Model Fine-tuning

Baseline Model

  • Used "KULLM", a Korean LLM developed by Korea University's NLP & AI Lab and the HIAI Research Institute

Datasets

[Figure: SFT training datasets]

SFT Model Fine-tuning

[Figure: SFT fine-tuning configuration]

  • ๋ชจ๋ธํ•™์Šต์—๋Š” Google Colab์—์„œ ์ œ๊ณตํ•˜๋Š” A100 40GB GPU ์‚ฌ์šฉ

SFT Model Evaluation

[Figures: SFT model evaluation results]

Task3-1. Reward Model ver1 Implementation

Baseline Model

  • Used Polyglot-Ko, the large-scale Korean language model developed by EleutherAI
  • Experimented with both the 1.3b and the 5.8b model

Datasets

[Figure: Reward model ver1 training datasets]

  • InstructGPT์˜ ๋ฐ์ดํ„ฐ์…‹ ๊ตฌ์ถ• ๋ฐฉ๋ฒ•
    • Reward ๋ชจ๋ธ ํ•™์Šต ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ SFT ํ•™์Šต์— ์‚ฌ์šฉํ•œ prompt(1,500๊ฐœ - ์ผ์ƒ๋Œ€ํ™”:ํ˜์˜คํ‘œํ˜„=2:1)์™€ ์ƒˆ๋กœ์šด prompt(1,000๊ฐœ - DeepSpeedChat ๋ฒˆ์—ญ ๋ฐ์ดํ„ฐ์…‹) ์‚ฌ์šฉ
    • SFT ๋ชจ๋ธ์—์„œ ํ•œ๊ฐœ์˜ prompt๋‹น K๊ฐœ์˜ Response๋ฅผ ์ƒ์„ฑํ•˜๊ณ , ์ˆœ์œ„๋ฅผ Labeling
  • ๋ฐ์ดํ„ฐ์…‹ ๋ผ๋ฒจ๋ง
    • Instruct GPT์˜ ๊ฒฝ์šฐ ์‚ฌ๋žŒ์ด ์ง์ ‘ Labeling์„ ํ•˜์—ฟ์ง€๋งŒ, ์ผ๊ด€๋œ ํ‰๊ฐ€์™€ ์‹œ๊ฐ„ ๋‹จ์ถ•์„ ์œ„ํ•ด GPt-4์™€ G-Eval์„ ์ด์šฉ
    • SFT์—์„œ ์ƒ์„ฑํ•œ ๋‘ Response ์ค‘ G-Eval ํ‰๊ฐ€ ์ ์ˆ˜ ํ•ฉ์ด ๋†’์€ ๊ฒƒ์„ Chosen response๋กœ ๊ฒฐ์ •
    • ๋ฐ์ดํ„ฐ์…‹ ์œ ํ˜•๋ณ„๋กœ G-Eval ํ‰๊ฐ€ Prompt์— ์ฐจ์ด๋ฅผ ๋‘์—ˆ์Œ
    • [Figure: G-Eval evaluation prompts by dataset type]
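
A minimal sketch of the labeling rule described above: the response with the higher G-Eval score sum becomes the chosen response. The `geval_scores` helper is hypothetical; in the project it would wrap a GPT-4 call using the dataset-type-specific evaluation prompt.

```python
# Sketch of G-Eval-based chosen/rejected labeling.
# `geval_scores` is a hypothetical helper that asks GPT-4 to score one
# response and returns the per-criterion scores.
from typing import Callable, List, Tuple

def label_pair(prompt: str, resp_a: str, resp_b: str,
               geval_scores: Callable[[str, str], List[float]]) -> Tuple[str, str]:
    """Return (chosen, rejected): the higher G-Eval score sum wins."""
    if sum(geval_scores(prompt, resp_a)) >= sum(geval_scores(prompt, resp_b)):
        return resp_a, resp_b
    return resp_b, resp_a
```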

Reward v1 Model Fine-tuning

[Figure: Reward v1 fine-tuning configuration]

  • According to the InstructGPT paper, a reward model's performance degrades sharply once it overfits, so the number of epochs was set to 1
  • Other hyper-parameters such as batch size and learning rate reportedly have little effect on performance
  • Total training time was 4 minutes on a Colab A100 40GB (the pairwise training loss is sketched below)
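
For reference, the pairwise ranking loss that InstructGPT-style reward training optimizes can be sketched as follows; the reward model itself (a Polyglot-Ko backbone with a scalar head) is assumed to be defined elsewhere.

```python
# Pairwise ranking loss from the InstructGPT recipe: push the reward of
# the chosen response above the reward of the rejected one.
import torch
import torch.nn.functional as F

def pairwise_loss(chosen_rewards: torch.Tensor,
                  rejected_rewards: torch.Tensor) -> torch.Tensor:
    """loss = -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```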

Reward v1 Model Evaluation

[Figure: Reward v1 evaluation results]

  • Reward Model Template
    • "์•„๋ž˜๋Š” ์ž‘์—…์„ ์„ค๋ช…ํ•˜๋Š” ๋ช…๋ น์–ด์ž…๋‹ˆ๋‹ค. ์š”์ฒญ์„ ์ ์ ˆํžˆ ์™„๋ฃŒํ•˜๋Š” ์‘๋‹ต์„ ์ž‘์„ฑํ•˜์„ธ์š”. \n\n ### ๋ช…๋ น์–ด:\n{prompt}\n\n ### ์‘๋‹ต:\n"

Task3-2. Reward Model ver2 Implementation

Reward Model ver1 Issues

  • ๊ตฌํ˜„๋œ Reward Model์˜ ์„ฑ๋Šฅ์ด ์ข‹์ง€ ์•Š์Œ (Accuracy 0.65)
  • Reward Model ver1์„ ์‚ฌ์šฉํ•˜์—ฌ Step3 ํ•™์Šต์‹œ ํ˜์˜คํ‘œํ˜„์ด ์•„๋‹Œ๋ฐ๋„ ํ˜์˜คํ‘œํ˜„์ด๋ผ๊ณ  ์ธ์‹ํ•˜๊ณ  ๋‹ต๋ณ€ํ•˜๋Š” ๋ฌธ์ œ ๋ฐœ์ƒ

Issue Resolutions

[Figure: Reward model ver2 dataset construction]

  • SFT ๋ชจ๋ธ๋กœ ๋‹ต๋ณ€์„ 2๊ฐœ ์ƒ์„ฑํ•˜์˜€์„ ๋•Œ(Ver1), Chosen, Rejected ๋‹ต๋ณ€์˜ ์ฐจ์ด๊ฐ€ ํฌ๊ฒŒ ์—†์–ด ๋ชจ๋ธ์ด ํ•™์Šต๋˜์ง€ ์•Š๋Š” ํ˜„์ƒ์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ 2๊ฐœ์˜ ๋ชจ๋ธ **(ChatGPT, SFT)**๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ต๋ณ€์„ ์ƒ์„ฑ(Ver2)
  • General Task ๋‹ต๋ณ€์— ๋Œ€ํ•œ ํ‰๊ฐ€ ์„ฑ๋Šฅ์„ ๋†’์ด๊ธฐ ์œ„ํ•ด Evol-instruct ๋ฐ์ดํ„ฐ ์ถ”๊ฐ€
  • ํ•™์Šต์— ์‚ฌ์šฉํ•œ ๋ชจ๋“  ๋ฐ์ดํ„ฐ์…‹์€ 15 token ์ดํ•˜, cosine ์œ ์‚ฌ๋„ 0.5 ์ด์ƒ์ผ ๊ฒฝ์šฐ ์ œ๊ฑฐํ•˜๋Š” Filtering ์ž‘์—… ์ˆ˜ํ–‰
  • ํ˜์˜คํ‘œํ˜„ ํ•™์Šต์‹œ(Ver1) Step3 ๊ฐ•ํ™”ํ•™์Šต ์ดํ›„์— ๋‹ต๋ณ€์ด ์ด์ƒํ•˜๊ฒŒ ์ƒ์„ฑ๋˜๋Š” Issue๊ฐ€ ์žˆ์–ด, ํ˜์˜คํ‘œํ˜„์„ ๋ฐ์ดํ„ฐ๋ฅผ ์ œ๊ฑฐํ•˜๊ณ  ํ•™์Šต(Ver2)
  • RM-ver1์€ GPT4๊ฐ€ Chosen, Rejected ๋ ˆ์ด๋ธ”๋ง์„ ์ง„ํ–‰ํ•˜์˜€์ง€๋งŒ, Resource ์ด์Šˆ๋กœ ์ธํ•ด ์ผ๋ถ€๋งŒ ์‚ฌ๋žŒ์ด ๋ผ๋ฒจ๋ง ์ง„ํ–‰
    • ์ผ์ƒ๋Œ€ํ™” ๋ฐ์ดํ„ฐ์…‹
      • ChatGPT์™€ SFT ๋ชจ๋‘ ์ผ๊ด€๋˜๊ฒŒ ๋†’์€ ํ€„๋ฆฌํ‹ฐ์˜ ๋‹ต๋ณ€์„ ์ƒ์„ฑํ•˜์ง€ ์•Š์•„, ์‚ฌ๋žŒ์ด ์ง์ ‘ ๋ผ๋ฒจ๋ง ์ง„ํ–‰
    • RLHF ํ•œ๊ตญ์–ด ๋ฒˆ์—ญ, Evol-Instruct ๋ฐ์ดํ„ฐ์…‹
      • ChatGPT๊ฐ€ ์ผ๊ด€๋˜๊ฒŒ ๋†’์€ ํ€„๋ฆฌํ‹ฐ์˜ ๋‹ต๋ณ€์„ ์ƒ์„ฑํ•˜์—ฌ ChatGPT๋ฅผ Chosen, SFT๋ฅผ Rejected๋กœ ๋ผ๋ฒจ๋ง ์ง„ํ–‰
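
A minimal sketch of the filtering rule from the list above, using whitespace tokenization and TF-IDF cosine similarity as stand-ins, since this card does not specify the project's actual tokenizer or embedding method.

```python
# Sketch of the filter: drop samples of 15 tokens or fewer and pairs whose
# responses are too similar (cosine similarity >= 0.5). Whitespace tokens
# and TF-IDF vectors are assumptions, not the project's actual method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def keep_pair(chosen: str, rejected: str,
              min_tokens: int = 15, max_sim: float = 0.5) -> bool:
    if len(chosen.split()) <= min_tokens or len(rejected.split()) <= min_tokens:
        return False  # too short: removed
    tfidf = TfidfVectorizer().fit_transform([chosen, rejected])
    sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    return sim < max_sim  # too similar: removed
```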

Reward Model ver2 Evaluation

[Figure: Reward Model ver2 evaluation results]

Task4. Building the Final Model with RLHF and DeepSpeedChat

  • Used DeepSpeedChat, which applies DeepSpeed, Microsoft's memory-optimization technology for large-scale distributed deep learning, to the RLHF process
  • Reinforcement learning against a reward model trained on human preference aligns the SFT model with human preferences, yielding a natural (FRIENDLY) and ethical (HARMLESS) chatbot (the per-step reward is sketched below)
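
A minimal sketch of the per-step reward used in PPO-style RLHF of this kind (the scheme DeepSpeedChat implements): a KL penalty at every generated token keeps the actor close to the frozen SFT policy, and the reward model's scalar score is added at the final token. The 1-D tensor shapes and the beta value are illustrative assumptions.

```python
# Per-step RLHF reward: per-token KL penalty plus the reward-model score
# at the last token. Shapes and beta are illustrative assumptions.
import torch

def rlhf_reward(rm_score: torch.Tensor,        # scalar score from the reward model
                actor_logprobs: torch.Tensor,  # log-probs of generated tokens (actor)
                ref_logprobs: torch.Tensor,    # log-probs under the frozen SFT model
                beta: float = 0.1) -> torch.Tensor:
    rewards = -beta * (actor_logprobs - ref_logprobs)  # per-token KL penalty
    rewards[-1] = rewards[-1] + rm_score               # RM score at the final token
    return rewards
```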

Baseline Models

  • Actor Model: KULLM-SFT-V2
  • Reward Model: Polyglot-Ko-Reward-V3

Training Options

[Figure: RLHF training options]

RLHF Training

[Figure: reward curve during RLHF training]

  • Training confirmed that the reward, which measures the quality of the SFT model's responses, increased over the course of training (the model generates responses with higher human preference)

RLHF Model Evaluation

[Figures: RLHF model evaluation results]

Final RLHF Model
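
The final checkpoint is published at https://huggingface.co/Trofish/KULLM-RLHF. A usage sketch, assuming it loads as a standard causal LM with transformers; the generation parameters are illustrative.

```python
# Loading and querying the released model; sampling settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Trofish/KULLM-RLHF"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")

prompt = ("์•„๋ž˜๋Š” ์ž‘์—…์„ ์„ค๋ช…ํ•˜๋Š” ๋ช…๋ น์–ด์ž…๋‹ˆ๋‹ค. ์š”์ฒญ์„ ์ ์ ˆํžˆ ์™„๋ฃŒํ•˜๋Š” ์‘๋‹ต์„ ์ž‘์„ฑํ•˜์„ธ์š”. "
          "\n\n ### ๋ช…๋ น์–ด:\n์•ˆ๋…•! ์˜ค๋Š˜ ํ•˜๋ฃจ ์–ด๋• ์–ด?\n\n ### ์‘๋‹ต:\n")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```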

Contributors ๐Ÿ™Œ

  • ๋ฐ•์„ฑ์™„ (Sungkyunkwan University, Dept. of Software, class of '20, waniboyy@gmail.com)
  • ์†กํ˜„๋นˆ (Sungkyunkwan University, Dept. of Software, class of '20, shbin0519@gmail.com)
  • ํ—ˆ์œ ๋ฏผ (Sungkyunkwan University, Dept. of Software, class of '21, ymheo1123@gmail.com)
  • ํ™์—ฌ์› (Sungkyunkwan University, Dept. of Software, class of '20, ryeowon13@gmail.com)