๋‹จ์ผ GPU์—์„œ ํšจ์œจ์ ์ธ ์ถ”๋ก  [[efficient-inference-on-a-single-gpu]]

์ด ๊ฐ€์ด๋“œ ์™ธ์—๋„, ๋‹จ์ผ GPU์—์„œ์˜ ํ›ˆ๋ จ ๊ฐ€์ด๋“œ์™€ CPU์—์„œ์˜ ์ถ”๋ก  ๊ฐ€์ด๋“œ์—์„œ๋„ ๊ด€๋ จ ์ •๋ณด๋ฅผ ์ฐพ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Better Transformer: PyTorch-native Transformer fastpath [[better-transformer-pytorchnative-transformer-fastpath]]

BetterTransformer, the PyTorch-native nn.MultiHeadAttention attention fastpath, can be used with Transformers through its integration in the 🤗 Optimum library.

PyTorch's attention fastpath speeds up inference through kernel fusion and the use of nested tensors. Detailed benchmarks can be found in this blog post.

After installing the optimum package, call [~PreTrainedModel.to_bettertransformer] to replace the relevant internal modules so that Better Transformer is used during inference:

model = model.to_bettertransformer()

The [~PreTrainedModel.reverse_bettertransformer] method converts the model back to its original modeling. Call it before saving the model so that the saved model uses the canonical transformers modeling:

model = model.reverse_bettertransformer()
model.save_pretrained("saved_model")
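
The two calls above can be combined into one round trip. The following is a minimal sketch, not a definitive recipe: the function name and its parameters are hypothetical, the imports are deferred so that defining the function does not require the heavy dependencies, and it assumes that transformers, optimum, and torch are installed and that the chosen architecture is supported by BetterTransformer.

```python
def bettertransformer_roundtrip(model_name, prompt, save_dir):
    """Convert a model for fastpath inference, generate, then revert and save.

    Hypothetical helper for illustration; model_name must be an architecture
    that BetterTransformer supports.
    """
    # Deferred imports: only needed when the function is actually called.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Swap the relevant internal modules for the fastpath (needs optimum).
    model = model.to_bettertransformer()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=20)
    text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    # Revert to the canonical modeling before saving.
    model = model.reverse_bettertransformer()
    model.save_pretrained(save_dir)
    return text
```

Reverting before save_pretrained keeps the saved checkpoint loadable with the regular transformers modeling code.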

As of PyTorch 2.0, the attention fastpath is supported for both encoders and decoders. The list of supported architectures can be found here.

bitsandbytes integration for FP4 mixed-precision inference [[bitsandbytes-integration-for-fp4-mixedprecision-inference]]

Installing bitsandbytes gives you easy model compression on GPUs. With FP4 quantization, you can reduce the model size by up to 8x compared to its native full-precision version. See below for how to get started.
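
The "up to 8x" figure follows from the bit widths alone: 32 bits per parameter in full precision versus 4 bits in FP4. The sketch below is a rough, weight-only estimate under that assumption. The parameter count is hypothetical, and the estimate ignores activations, quantization constants, and any layers kept in higher precision.

```python
# Bits used to store one parameter in each format.
BITS_PER_PARAM = {"fp32": 32, "fp16": 16, "int8": 8, "fp4": 4}


def weight_bytes(num_params: int, dtype: str) -> float:
    """Return the approximate number of bytes needed for the weights alone."""
    return num_params * BITS_PER_PARAM[dtype] / 8


num_params = 3_000_000_000  # hypothetical 3B-parameter model

for dtype in BITS_PER_PARAM:
    print(f"{dtype}: {weight_bytes(num_params, dtype) / 1e9:.1f} GB")

ratio = weight_bytes(num_params, "fp32") / weight_bytes(num_params, "fp4")
print(f"fp32 / fp4 size ratio: {ratio:.0f}x")
```

In practice the measured reduction is usually smaller than this ratio, since not every tensor in a model is quantized.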

์ด ๊ธฐ๋Šฅ์€ ๋‹ค์ค‘ GPU ์„ค์ •์—์„œ๋„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์š”๊ตฌ ์‚ฌํ•ญ [[requirements-for-fp4-mixedprecision-inference]]

  • ์ตœ์‹  bitsandbytes ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ pip install bitsandbytes>=0.39.0

  • ์ตœ์‹  accelerate๋ฅผ ์†Œ์Šค์—์„œ ์„ค์น˜ pip install git+https://github.com/huggingface/accelerate.git

  • ์ตœ์‹  transformers๋ฅผ ์†Œ์Šค์—์„œ ์„ค์น˜ pip install git+https://github.com/huggingface/transformers.git

FP4 ๋ชจ๋ธ ์‹คํ–‰ - ๋‹จ์ผ GPU ์„ค์ • - ๋น ๋ฅธ ์‹œ์ž‘ [[running-fp4-models-single-gpu-setup-quickstart]]

๋‹ค์Œ ์ฝ”๋“œ๋ฅผ ์‹คํ–‰ํ•˜์—ฌ ๋‹จ์ผ GPU์—์„œ ๋น ๋ฅด๊ฒŒ FP4 ๋ชจ๋ธ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

from transformers import AutoModelForCausalLM

model_name = "bigscience/bloom-2b5"
model_4bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)

device_map is optional. However, setting device_map = 'auto' is recommended for inference because it dispatches the model efficiently across the available resources.

Running FP4 models - multi GPU setup [[running-fp4-models-multi-gpu-setup]]

Loading a mixed 4-bit model on multiple GPUs works the same way as in the single GPU setup (same command):

model_name = "bigscience/bloom-2b5"
model_4bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)

However, you can use accelerate to control how much GPU RAM is allocated on each GPU. Use the max_memory argument as follows:

max_memory_mapping = {0: "600MB", 1: "1GB"}
model_name = "bigscience/bloom-3b"
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_4bit=True, max_memory=max_memory_mapping
)

์ด ์˜ˆ์—์„œ๋Š” ์ฒซ ๋ฒˆ์งธ GPU๊ฐ€ 600MB์˜ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ๋‘ ๋ฒˆ์งธ GPU๊ฐ€ 1GB๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

Advanced usage [[advanced-usage]]

For more advanced usage of this method, see the quantization documentation page.

bitsandbytes integration for Int8 mixed-precision matrix decomposition [[bitsandbytes-integration-for-int8-mixedprecision-matrix-decomposition]]

This feature can also be used in a multi-GPU setup.

Based on the paper LLM.int8() : 8-bit Matrix Multiplication for Transformers at Scale, we support a Hugging Face integration for all models on the Hub with a few lines of code. The method reduces the size of nn.Linear by 2x for float16 and bfloat16 weights and by 4x for float32 weights. Because outliers are handled in half precision, there is close to no impact on quality.

HFxbitsandbytes.png

Int8 mixed-precision matrix decomposition separates a matrix multiplication into two streams: (1) a systematic feature outlier stream, multiplied in fp16 (0.01%), and (2) a regular stream of int8 matrix multiplication (99.9%). With this method, int8 inference with no predictive degradation is possible for very large models. For more details on the method, see the paper or the blog post about the integration.

MixedInt8.gif
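
The two-stream idea can be illustrated in plain Python for a single input vector. This is a simplified sketch under stated assumptions, not the bitsandbytes kernels: the function names are hypothetical, the outlier threshold of 6.0 is an assumed value, floats stand in for fp16, and quantization is symmetric absmax with one scale for the input and one per weight column.

```python
def quantize_absmax(values):
    """Symmetric int8 quantization: return (quantized ints, scale)."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    scale = 127.0 / max_abs
    return [round(v * scale) for v in values], scale


def mixed_int8_matvec(x, W, threshold=6.0):
    """Compute y = x @ W, keeping outlier input features in floating point.

    x is a list of k floats; W is a list of k rows with n columns each.
    """
    k, n = len(W), len(W[0])
    outlier = [i for i in range(k) if abs(x[i]) >= threshold]
    regular = [i for i in range(k) if abs(x[i]) < threshold]

    # Stream 1: outlier features are multiplied without quantization.
    y = [sum(x[i] * W[i][j] for i in outlier) for j in range(n)]

    # Stream 2: remaining features go through an int8 product.
    if regular:
        xq, x_scale = quantize_absmax([x[i] for i in regular])
        for j in range(n):
            wq, w_scale = quantize_absmax([W[i][j] for i in regular])
            acc = sum(a * b for a, b in zip(xq, wq))  # integer accumulation
            y[j] += acc / (x_scale * w_scale)  # dequantize the result
    return y


x = [0.5, -1.0, 20.0, 0.25]  # the third feature is an outlier
W = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6], [0.7, -0.8]]

exact = [sum(x[i] * W[i][j] for i in range(4)) for j in range(2)]
mixed = mixed_int8_matvec(x, W)
plain = mixed_int8_matvec(x, W, threshold=float("inf"))  # no outlier stream

print("exact:", exact)
print("mixed:", mixed)
print("plain int8:", plain)
```

Setting the threshold to infinity sends every feature through the int8 stream. The large value then dominates the quantization scale, which is the loss of precision that the outlier stream is meant to avoid.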

์ปค๋„์€ GPU ์ „์šฉ์œผ๋กœ ์ปดํŒŒ์ผ๋˜์–ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ํ˜ผํ•ฉ 8๋น„ํŠธ ๋ชจ๋ธ์„ ์‹คํ–‰ํ•˜๋ ค๋ฉด GPU๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ธฐ๋Šฅ์„ ์‚ฌ์šฉํ•˜๊ธฐ ์ „์— ๋ชจ๋ธ์˜ 1/4(๋˜๋Š” ๋ชจ๋ธ ๊ฐ€์ค‘์น˜๊ฐ€ ์ ˆ๋ฐ˜ ์ •๋ฐ€๋„์ธ ๊ฒฝ์šฐ ์ ˆ๋ฐ˜)์„ ์ €์žฅํ•  ์ถฉ๋ถ„ํ•œ GPU ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์žˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”. ์ด ๋ชจ๋“ˆ์„ ์‚ฌ์šฉํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋˜๋Š” ๋ช‡ ๊ฐ€์ง€ ์ฐธ๊ณ  ์‚ฌํ•ญ์ด ์•„๋ž˜์— ๋‚˜์™€ ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜๋Š” Google colab์—์„œ ๋ฐ๋ชจ๋ฅผ ๋”ฐ๋ผํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค.

์š”๊ตฌ ์‚ฌํ•ญ [[requirements-for-int8-mixedprecision-matrix-decomposition]]

  • bitsandbytes<0.37.0์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ, 8๋น„ํŠธ ํ…์„œ ์ฝ”์–ด(Turing, Ampere ๋˜๋Š” ์ดํ›„ ์•„ํ‚คํ…์ฒ˜ - ์˜ˆ: T4, RTX20s RTX30s, A40-A100)๋ฅผ ์ง€์›ํ•˜๋Š” NVIDIA GPU์—์„œ ์‹คํ–‰ํ•˜๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”. bitsandbytes>=0.37.0์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ, ๋ชจ๋“  GPU๊ฐ€ ์ง€์›๋ฉ๋‹ˆ๋‹ค.
  • ์˜ฌ๋ฐ”๋ฅธ ๋ฒ„์ „์˜ bitsandbytes๋ฅผ ๋‹ค์Œ ๋ช…๋ น์œผ๋กœ ์„ค์น˜ํ•˜์„ธ์š”: pip install bitsandbytes>=0.31.5
  • accelerate๋ฅผ ์„ค์น˜ํ•˜์„ธ์š” pip install accelerate>=0.12.0
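
The version thresholds above can be checked from Python before loading a model. The sketch below uses only the standard library. Its helper names are hypothetical, and the version parsing is deliberately simple: it compares the first three numeric components and treats pre-release builds like releases.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn a version string such as '0.39.0' into a tuple like (0, 39, 0)."""
    numbers = [int(n) for n in re.findall(r"\d+", text.split("+")[0])]
    return tuple((numbers + [0, 0, 0])[:3])  # pad short versions with zeros


def check_minimum(package, minimum):
    """Describe whether an installed package meets a minimum version."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed (need >= {minimum})"
    if parse_version(installed) >= parse_version(minimum):
        return f"{package} {installed}: OK"
    return f"{package} {installed}: too old (need >= {minimum})"


for package, minimum in [("bitsandbytes", "0.31.5"), ("accelerate", "0.12.0")]:
    print(check_minimum(package, minimum))
```

The same check with a minimum of 0.37.0 for bitsandbytes tells you whether the 8-bit tensor core restriction applies to your installation.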

Running mixed-Int8 models - single GPU setup [[running-mixedint8-models-single-gpu-setup]]

After installing the required libraries, load a mixed 8-bit model as follows:

from transformers import AutoModelForCausalLM

model_name = "bigscience/bloom-2b5"
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

For text generation:

  • We recommend using the model's generate() method instead of the pipeline() function. Inference is possible with the pipeline() function, but it is not optimized for mixed 8-bit models and can be slower than the generate() method. In addition, some sampling strategies, such as nucleus sampling, are not supported by the pipeline() function for mixed 8-bit models.
  • Place the inputs on the same GPU as the model.

Here is a simple example:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-2b5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

prompt = "Hello, my llama is cute"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
generated_ids = model_8bit.generate(**inputs)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

Running mixed-Int8 models - multi GPU setup [[running-mixedint8-models-multi-gpu-setup]]

Loading a mixed 8-bit model on multiple GPUs works the same way as in the single GPU setup (same command):

model_name = "bigscience/bloom-2b5"
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

However, you can use accelerate to control how much GPU RAM is allocated on each GPU. Use the max_memory argument as follows:

max_memory_mapping = {0: "1GB", 1: "2GB"}
model_name = "bigscience/bloom-3b"
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping
)

์ด ์˜ˆ์‹œ์—์„œ๋Š” ์ฒซ ๋ฒˆ์งธ GPU๊ฐ€ 1GB์˜ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ๋‘ ๋ฒˆ์งธ GPU๊ฐ€ 2GB๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

Colab demos [[colab-demos]]

With this method, you can run inference on models that previously could not be run on Google Colab. Check out the demo for running T5-11b (42GB in fp32) with 8-bit quantization on Google Colab:

Open In Colab: T5-11b demo

Or check out the demo for BLOOM-3B:

Open In Colab: BLOOM-3b demo