
2023 Sungkyunkwan University Summer Intensive Industry-Academia Collaboration Project, VAIV

A GPT-based everyday conversational chatbot model that is natural (Friendly) and ethical (Harmless)

Github : https://github.com/VAIV-2023/RLHF-Korean-Friendly-LLM

Research Background and Objective

Implement a natural and ethical Korean everyday conversational chatbot model based on GPT-NeoX (Polyglot-ko)

image

Development Details

  • Self-Instruct: data augmentation using GPT-4

  • RLHF (Reinforcement Learning from Human Feedback): reinforcement learning that reflects human preferences

  • DeepSpeed: a new memory optimization technology for large-scale distributed deep learning

    • Task 1: Build a dataset for each reinforcement learning stage
    • Task 2: Instruction-tune the SFT model
    • Task 3: Implement Reward models ver1, 2, 3
    • Task 4: Implement the final model with RLHF and DeepSpeedChat (https://huggingface.co/Trofish/KULLM-RLHF)

Task1. Building a Dataset for Each Reinforcement Learning Stage

image image image

Task2. SFT Model Fine-tuning

Baseline Model

- Uses "KULLM", a Korean LLM developed by the Korea University NLP & AI Lab and the HIAI Research Institute

Datasets

image

SFT Model Finetuning

image

  • The model was trained on an A100 40GB GPU provided by Google Colab

SFT Model Evaluation

image image

Task3-1. Reward Model ver1 Implementation

Baseline Model

  • Uses Polyglot-Ko, a large-scale Korean language model developed by EleutherAI
  • The 1.3b and 5.8b models were each tested separately

Datasets

image

  • Dataset construction following InstructGPT
    • The Reward model training set combines prompts already used for SFT training (1,500 prompts, everyday conversation : hate speech = 2:1) with new prompts (1,000 prompts from a translated DeepSpeedChat dataset)
    • The SFT model generates K responses per prompt, and the responses are labeled with a ranking
  • Dataset labeling
    • InstructGPT relied on human labelers, but GPT-4 and G-Eval were used here for consistent evaluation and to save time
    • Of the two responses generated by the SFT model, the one with the higher total G-Eval score is taken as the Chosen response
    • The G-Eval evaluation prompt differs by dataset type
    • image
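The labeling rule above (pick the response with the higher summed G-Eval score) can be sketched as follows. This is a minimal illustration only: the criterion names `naturalness` and `harmlessness` and the tie-breaking rule are assumptions for the example, not the project's actual G-Eval rubric.

```python
from typing import Dict, Tuple


def label_pair(
    resp_a: str, scores_a: Dict[str, float],
    resp_b: str, scores_b: Dict[str, float],
) -> Tuple[str, str]:
    """Return (chosen, rejected) by comparing summed per-criterion scores."""
    total_a = sum(scores_a.values())
    total_b = sum(scores_b.values())
    # Ties go to the first response; this is an arbitrary choice for the sketch.
    if total_a >= total_b:
        return resp_a, resp_b
    return resp_b, resp_a


# Hypothetical scores for two SFT responses to the same prompt.
chosen, rejected = label_pair(
    "response A", {"naturalness": 4, "harmlessness": 5},
    "response B", {"naturalness": 3, "harmlessness": 4},
)
print(chosen, "|", rejected)
```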

Reward v1 Model Finetuning

image

  • According to the InstructGPT paper, a Reward model's performance degrades sharply once it overfits, so the number of epochs was set to 1
  • The paper also reports that other hyperparameters, such as batch size and learning rate, have little effect on performance
  • Total training time was 4 minutes on a Colab A100 40GB
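For context, the objective typically used for this kind of Reward model is the pairwise ranking loss from the InstructGPT paper, -log(sigmoid(r_chosen - r_rejected)). The card does not state the exact loss used, so the sketch below is an assumption, written in plain Python for a single pair of scalar rewards.

```python
import math


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected))."""
    diff = r_chosen - r_rejected
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch to avoid overflow in exp.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# The loss shrinks as the chosen response is scored further above the rejected one.
print(pairwise_loss(2.0, 0.0) < pairwise_loss(0.0, 0.0) < pairwise_loss(0.0, 2.0))
```

Because the loss only depends on the score difference, a model that scores near-identical responses has little signal to learn from, which is relevant to the ver1 issues described later.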

Reward v1 Model Evaluation

image

  • Reward Model Template
    • "아래는 작업을 설명하는 명령어입니다. 요청을 적절히 완료하는 응답을 작성하세요. \n\n ### 명령어:\n{prompt}\n\n ### 응답:\n" (in English: "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: {prompt} ### Response:")
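A minimal sketch of how a prompt and a candidate response might be combined with the template above before scoring. The helper name `build_reward_input` and the choice to append the response directly after the template are assumptions for illustration; only the template string comes from this card.

```python
# Template string as given in this model card (Korean instruction format).
REWARD_TEMPLATE = (
    "아래는 작업을 설명하는 명령어입니다. 요청을 적절히 완료하는 응답을 작성하세요. "
    "\n\n ### 명령어:\n{prompt}\n\n ### 응답:\n"
)


def build_reward_input(prompt: str, response: str) -> str:
    """Fill the template with the prompt and append the response (hypothetical helper)."""
    return REWARD_TEMPLATE.format(prompt=prompt) + response


text = build_reward_input("안녕", "반가워요")
print(text)
```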

Task3-2. Reward Model ver2 Implementation

Reward Model ver1 Issues

  • The resulting Reward Model performed poorly (Accuracy 0.65)
  • When Reward Model ver1 was used for Step3 training, the model treated inputs that were not hate speech as hate speech and answered accordingly
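The accuracy figure above is not defined in the card. A common definition for Reward models, assumed here, is pairwise accuracy: the fraction of held-out pairs in which the Chosen response receives a strictly higher score than the Rejected one. The scores below are made up for illustration.

```python
from typing import Iterable, Tuple


def pairwise_accuracy(score_pairs: Iterable[Tuple[float, float]]) -> float:
    """Fraction of (chosen_score, rejected_score) pairs ranked correctly."""
    pairs = list(score_pairs)
    if not pairs:
        raise ValueError("need at least one (chosen, rejected) score pair")
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


# Hypothetical reward scores for four evaluation pairs; ties count as wrong.
scores = [(1.2, 0.3), (0.1, 0.4), (2.0, 1.5), (0.0, 0.0)]
print(pairwise_accuracy(scores))
```

Under this definition a model that scores at random sits near 0.5, so 0.65 is only modestly better than chance.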

Solutions to the Issues

image

  • When both responses were generated by the SFT model (Ver1), the Chosen and Rejected responses differed too little for the model to learn. To prevent this, responses are generated with two models **(ChatGPT, SFT)** (Ver2)
  • Evol-Instruct data was added to improve evaluation performance on General Task responses
  • All datasets used for training were filtered to remove examples with 15 tokens or fewer or with a cosine similarity of 0.5 or higher
  • Training on hate speech data (Ver1) caused odd responses after Step3 reinforcement learning, so the hate speech data was removed before training (Ver2)
  • In RM-ver1, GPT-4 labeled the Chosen and Rejected responses; in Ver2, because of resource constraints, humans labeled only part of the data
    • Everyday conversation dataset
      • Neither ChatGPT nor the SFT model consistently produced high-quality responses, so humans labeled these directly
    • RLHF Korean translation and Evol-Instruct datasets
      • ChatGPT consistently produced high-quality responses, so the ChatGPT response was labeled Chosen and the SFT response Rejected
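The filtering step listed above can be sketched as follows. The card does not say how tokens are counted, how texts are vectorized, or which texts are compared, so this example makes three assumptions: whitespace tokenization, bag-of-words count vectors, and comparison of each example against the examples already kept (near-duplicate removal). The thresholds default to the values in the card.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between whitespace bag-of-words count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_examples(
    texts: List[str], max_short_tokens: int = 15, sim_threshold: float = 0.5
) -> List[str]:
    """Drop short texts and texts too similar to one already kept."""
    kept: List[str] = []
    for text in texts:
        if len(text.split()) <= max_short_tokens:
            continue  # 15 tokens or fewer by default
        if any(cosine_similarity(text, k) >= sim_threshold for k in kept):
            continue  # similarity of 0.5 or higher to a kept example
        kept.append(text)
    return kept


# Small thresholds so the toy inputs exercise both rules.
demo = ["a b c", "a b c d e", "a b c d f", "x y z w v"]
print(filter_examples(demo, max_short_tokens=3, sim_threshold=0.5))
```

A real pipeline would more likely use the model tokenizer and sentence embeddings; the structure of the two checks stays the same.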

Reward Model ver2 Evaluation

image

Task4. Final Model Implementation with RLHF and DeepSpeedChat

  • Uses DeepSpeedChat, which applies DeepSpeed (Microsoft's memory optimization technology for large-scale distributed deep learning) to the RLHF process
  • A Reward model trained on human preferences is combined with reinforcement learning so that the SFT model reflects those preferences, producing a natural (FRIENDLY) and ethical (HARMLESS) chatbot

Baseline Models

  • Actor Model: KULLM-SFT-V2
  • Reward Model: Polyglot-Ko-Reward-V3

Training Options

image

RLHF Training

image

  • Training showed that the Reward, which measures the quality of the SFT model's responses, increased (the model generates responses that people prefer more)

RLHF Model Evaluation

image image

Final RLHF Model

Contributors 🙌

  • 박성완 (Sungkyunkwan University, Department of Software, entering class of 2020, waniboyy@gmail.com)
  • 송현빈 (Sungkyunkwan University, Department of Software, entering class of 2020, shbin0519@gmail.com)
  • 허유민 (Sungkyunkwan University, Department of Software, entering class of 2021, ymheo1123@gmail.com)
  • 홍여원 (Sungkyunkwan University, Department of Software, entering class of 2020, ryeowon13@gmail.com)