Qwen-7B 🤖 | 🤗 | Qwen-7B-Chat 🤖 | 🤗 | Demo | Report | Discord
中文 | English | 日本語
Japanese document maintainer: Ikko Eltociear Ashimine
We open-source Qwen-7B and Qwen-7B-Chat on both 🤖 ModelScope and 🤗 Hugging Face (click the logos at the top to visit the repositories with code and checkpoints). This repository includes a brief introduction to Qwen-7B, a usage guide, and a technical memo (link) that provides more detailed information.
Qwen-7B is the 7B-parameter version of Qwen (abbr. Tongyi Qianwen), the large language model series proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model pretrained on a large volume of data, including web text, books, and code. In addition, based on the pretrained Qwen-7B, we release Qwen-7B-Chat, a large-model-based AI assistant trained with alignment techniques. The features of the Qwen-7B series are as follows:
- Trained on high-quality pretraining data. Qwen-7B is pretrained on a large-scale, high-quality dataset of over 2.2 trillion tokens. The dataset contains plain text and code and covers a wide range of domains, including both general and professional domain data.
- Strong performance. On a series of benchmark datasets that evaluate natural language understanding, mathematics, coding, and more, it outperforms competing models of similar size.
- Better language support. The Qwen-7B tokenizer is based on a vocabulary of over 150,000 tokens and is more efficient than other tokenizers. It handles many languages, which helps users further fine-tune Qwen-7B to understand a particular language.
- Support for 8K context length. Both Qwen-7B and Qwen-7B-Chat support a context length of 8K, which allows inputs with long contexts.
- Support for plugins. Qwen-7B-Chat is trained with plugin-related alignment data, so it can use tools such as APIs, models, and databases, and can act as an agent.
The following sections contain information for your reference. In particular, we recommend reading the FAQ section before opening an issue.
News
- 2023.8.3 We released Qwen-7B and Qwen-7B-Chat on ModelScope and Hugging Face. We also provide a technical memo with more details about the models, including training details and model performance.
Performance
In general, Qwen-7B outperforms baseline models of similar size, and in most cases also outperforms larger models of about 13B parameters, on a series of benchmark datasets such as MMLU, C-Eval, GSM8K, HumanEval, WMT22, and CMMLU, which evaluate a model's capabilities in natural language understanding, mathematical problem solving, coding, and more. See the results below.
Model | MMLU | C-Eval | GSM8K | HumanEval | WMT22 (en-zh) | CMMLU |
---|---|---|---|---|---|---|
LLaMA-7B | 35.1 | - | 11.0 | 10.5 | 8.7 | - |
LLaMA 2-7B | 45.3 | - | 14.6 | 12.8 | 17.9 | - |
Baichuan-7B | 42.3 | 42.8 | 9.7 | 9.2 | 26.6 | 44.4 |
ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 9.2 | - | 48.8 |
InternLM-7B | 51.0 | 52.8 | 31.2 | 10.4 | 14.8 | - |
Baichuan-13B | 51.6 | 53.6 | 26.6 | 12.8 | 30.0 | 55.8 |
LLaMA-13B | 46.9 | 35.5 | 17.8 | 15.8 | 12.0 | - |
LLaMA 2-13B | 54.8 | - | 28.7 | 18.3 | 24.2 | - |
ChatGLM2-12B | 56.2 | 61.6 | 40.9 | - | - | - |
Qwen-7B | 56.7 | 59.6 | 51.6 | 24.4 | 30.6 | 58.8 |
In addition, according to the third-party evaluation of large language models conducted by OpenCompass, Qwen-7B and Qwen-7B-Chat are the top 7B-parameter models. This evaluation consists of a large number of public benchmarks covering language understanding and generation, coding, mathematics, reasoning, and more.
For more experimental results (detailed model performance on more benchmark datasets) and further details, please click here to refer to the technical memo.
Requirements
- Python 3.8 and above
- PyTorch 1.12 and above; 2.0 and above is recommended
- CUDA 11.4 and above is recommended (for GPU users, flash-attention users, etc.)
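The requirements above can be checked from Python before installing anything else. The following is a minimal sketch, not part of the project: the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and the version parsing is deliberately simple (it only reads the leading numeric components).

```python
import sys

def parse_version(text):
    """Turn a version string such as '2.0.1+cu118' into a tuple of ints, e.g. (2, 0, 1)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def meets_minimum(found, required):
    """Return True when version string `found` is at least `required`."""
    return parse_version(found) >= parse_version(required)

def check_environment():
    """Collect (name, found, required, ok) rows for the requirements listed above."""
    rows = []
    python_version = "%d.%d.%d" % sys.version_info[:3]
    rows.append(("python", python_version, "3.8", meets_minimum(python_version, "3.8")))
    try:
        import torch  # imported lazily so the check still runs when PyTorch is missing
    except ImportError:
        rows.append(("pytorch", None, "1.12", False))
        return rows
    rows.append(("pytorch", torch.__version__, "1.12", meets_minimum(torch.__version__, "1.12")))
    cuda_version = torch.version.cuda  # None on CPU-only builds of PyTorch
    if cuda_version is not None:
        rows.append(("cuda", cuda_version, "11.4", meets_minimum(cuda_version, "11.4")))
    return rows

if __name__ == "__main__":
    for name, found, required, ok in check_environment():
        print(f"{name}: found {found}, need >= {required}: {'ok' if ok else 'check'}")
```

CUDA is only reported when the installed PyTorch build was compiled with it, which matches the list above where CUDA is a recommendation for GPU users rather than a hard requirement.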
Quickstart
Below, we provide simple examples showing how to use Qwen-7B with 🤖 ModelScope and 🤗 Transformers.
Before running the code, make sure you have set up the environment and installed the required packages. Confirm that you meet the requirements above, then install the dependent libraries.
pip install -r requirements.txt
If your device supports fp16 or bf16, we recommend installing flash-attention for higher efficiency and lower memory usage. (flash-attention is optional; the project runs normally without it.)
git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# The steps below are optional, and installing them may be slow.
# pip install csrc/layer_norm
# pip install csrc/rotary
Now you can start with ModelScope or Transformers.
🤗 Transformers
To use Qwen-7B-Chat for inference, you only need a few lines of code, as shown below. Please make sure that you are using the latest code.
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# Note: by default, protection against injection attacks is turned off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use CPU only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
# use auto mode, which automatically selects the precision based on the device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()
# specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
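# 1st dialogue turn
response, history = model.chat(tokenizer, "你好", history=None)  # prompt: "Hello"
print(response)
# Hello! I'm glad to be of help.
# 2nd dialogue turn
# prompt: "Tell me a story about a young person who struggles to start a business and eventually succeeds."
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
# This is the story of a young man who struggles to start his own business and eventually succeeds.
# The protagonist of the story is Li Ming, who was born into an ordinary family; his parents are ordinary workers. Since childhood, Li Ming has aimed to become a successful entrepreneur.
# To achieve this goal, Li Ming studied hard and got into university. At university, he actively took part in various entrepreneurship contests and won many awards. He also used his spare time for internships and gained valuable experience.
# After graduating, Li Ming decided to start a business. He began looking for investors but was turned down many times. However, he did not give up. He kept working hard, improved his business plan, and looked for new investment opportunities.
# Eventually Li Ming succeeded in securing investment and started his own business. He founded a technology company focused on developing a new type of software. Under his leadership, the company grew rapidly and became a successful technology firm.
# Li Ming's success is no accident. He is hardworking, resilient, and adventurous, and he is always learning and improving himself. His success also shows that anyone can succeed with effort.
# 3rd dialogue turn
response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)  # prompt: "Give this story a title"
print(response)
# "Striving to Start a Business: A Young Man's Road to Success"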
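The multi-turn pattern above, where each call receives the history returned by the previous one, can be wrapped in a small helper. This is a sketch under stated assumptions: `run_dialogue` and `echo_chat` are hypothetical names, `echo_chat` is a stand-in so the example runs without the model, and its history format (a list of prompt/response pairs) is an assumption rather than a documented contract.

```python
def run_dialogue(chat_fn, prompts):
    """Send prompts in order, passing each turn's history into the next call."""
    history = None  # the first turn starts without history, as in the example above
    responses = []
    for prompt in prompts:
        response, history = chat_fn(prompt, history)
        responses.append(response)
    return responses, history

def echo_chat(prompt, history):
    """Hypothetical stand-in for the model: replies with the turn number and the prompt."""
    history = list(history or [])
    response = f"turn {len(history) + 1}: {prompt}"
    history.append((prompt, response))
    return response, history

if __name__ == "__main__":
    responses, history = run_dialogue(echo_chat, ["hello", "tell me a story", "give it a title"])
    print(responses[-1])  # turn 3: give it a title
    print(len(history))   # 3
```

With the model and tokenizer loaded as shown earlier, the stand-in could be replaced by `lambda prompt, history: model.chat(tokenizer, prompt, history=history)`.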
Running the pretrained Qwen-7B base model is also simple.
Running Qwen-7B
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use CPU only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="cpu", trust_remote_code=True).eval()
# use auto mode, which automatically selects the precision based on the device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True).eval()
# specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
# Japanese prompt: "The capital of Mongolia is Ulaanbaatar / The capital of Iceland is Reykjavik / The capital of Ethiopia is"
inputs = tokenizer('モンゴルの首都はウランバートル（Ulaanbaatar）\nアイスランドの首都はレイキャビク（Reykjavik）\nエチオピアの首都は', return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# モンゴルの首都はウランバートル（Ulaanbaatar）\nアイスランドの首都はレイキャビク（Reykjavik）\nエチオピアの首都はアディスアベバ（Addis Ababa）...
# (the model continues the pattern: "The capital of Ethiopia is Addis Ababa...")
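The base model is not tuned for chat, so the example above steers it by giving completed lines followed by an unfinished one. A minimal sketch of building such a prompt is shown here; `build_few_shot_prompt` is a hypothetical helper, and the English example lines are illustrative rather than taken from the project.

```python
def build_few_shot_prompt(examples, query):
    """Join completed example lines and an unfinished query line with newlines."""
    lines = [f"{question}{answer}" for question, answer in examples]
    lines.append(query)  # left open so the model writes the answer
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    [
        ("The capital of Mongolia is ", "Ulaanbaatar"),
        ("The capital of Iceland is ", "Reykjavik"),
    ],
    "The capital of Ethiopia is",
)
print(prompt)
```

The resulting string can be passed to `tokenizer(prompt, return_tensors='pt')` in place of the literal prompt used above.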
🤖 ModelScope
ModelScope is an open-source platform for Model-as-a-Service (MaaS) that provides flexible and cost-effective model services to AI developers. Similarly, you can run the models with ModelScope as shown below:
import os
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope import snapshot_download
model_id = 'QWen/qwen-7b-chat'
revision = 'v1.0.0'
# download the model snapshot and get its local directory
model_dir = snapshot_download(model_id, revision)
pipe = pipeline(
    task=Tasks.chat, model=model_dir, device_map='auto')
history = None
text = '浙江省の省都はどこですか？'  # prompt: "Where is the capital of Zhejiang Province?"
results = pipe(text, history=history)
response, history = results['response'], results['history']
print(f'Response: {response}')
text = '何がそんなに面白いのか？'  # prompt: "What is so interesting about it?"
results = pipe(text, history=history)
response, history = results['response'], results['history']
print(f'Response: {response}')
Tokenizer
Our tokenizer, based on tiktoken, differs from other tokenizers such as the sentencepiece tokenizer. Pay particular attention to special tokens, especially when fine-tuning. For more detailed information on the tokenizer and its use in fine-tuning, please refer to the documentation.
Quantization
We provide examples showing how to load models in NF4 and Int8. To begin, make sure that bitsandbytes is installed. The requirements for bitsandbytes are:
**Requirements** Python >= 3.8, Linux distribution (Ubuntu, MacOS, etc.) + CUDA > 10.0.
Then run the following command to install bitsandbytes:
pip install bitsandbytes
Windows users need to find another option, such as bitsandbytes-windows-webui.
Then you only need to add your quantization configuration to AutoModelForCausalLM.from_pretrained. See the example below:
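import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# quantization configuration for NF4 (4 bits)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16
)
# quantization configuration for Int8 (8 bits)
# keep only one of the two configurations: as written, this assignment replaces the NF4 one above
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
# args.checkpoint_path and max_memory are expected to be defined by the calling script
model = AutoModelForCausalLM.from_pretrained(
    args.checkpoint_path,
    device_map="cuda:0",
    quantization_config=quantization_config,
    max_memory=max_memory,
    trust_remote_code=True,
).eval()
With this method, you can load Qwen-7B in NF4 or Int8, which reduces memory usage. Related statistics on model performance are shown below. Quantization lowers effectiveness somewhat and reduces memory cost. Note that in the inference speed measurements later in this document, the bitsandbytes-quantized models generate fewer tokens per second than BF16.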
Precision | MMLU | GPU Memory for Loading Model |
---|---|---|
BF16 | 56.7 | 16.38G |
Int8 | 52.8 | 10.44G |
NF4 | 48.9 | 7.79G |
Note: The GPU memory usage profiling in the table above was performed on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8, with flash attention used.
Inference Efficiency
Inference Speed
We measured the average inference speed of generating 2K tokens under BF16 precision and under the Int8 and NF4 quantization levels.
Quantization Level | Inference Speed with flash_attn (tokens/s) | Inference Speed w/o flash_attn (tokens/s) |
---|---|---|
BF16 (no quantization) | 30.06 | 27.55 |
Int8 (bnb) | 7.94 | 7.86 |
NF4 (bnb) | 21.43 | 20.37 |
In detail, the profiling setting is generating 2048 new tokens with 1 context token. The profiling was run on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8. The inference speed is averaged over the 2048 generated tokens.
GPU Memory Usage
We also profiled the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context), under BF16 and under the Int8 and NF4 quantization levels. The results are shown below.
When using flash attention, the memory usage is:
Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
---|---|---|
BF16 | 18.11GB | 23.52GB |
Int8 | 12.17GB | 17.60GB |
NF4 | 9.52GB | 14.93GB |
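One practical use of the table above is to see which precision levels fit a given GPU memory budget. The following sketch is illustrative only: `PEAK_GB` copies the flash-attention figures above, `levels_that_fit` is a hypothetical helper, and the 16 GB and 24 GB budgets are assumed examples. Real headroom depends on the hardware and software stack, which may differ from the A100 setup used for the measurements.

```python
# Peak usage in GB with flash attention, copied from the table above:
# (encoding 2048 tokens, generating 8192 tokens)
PEAK_GB = {
    "BF16": (18.11, 23.52),
    "Int8": (12.17, 17.60),
    "NF4": (9.52, 14.93),
}

def levels_that_fit(budget_gb, long_generation=False):
    """Return the levels whose measured peak usage stays within budget_gb."""
    column = 1 if long_generation else 0
    return [level for level, peaks in PEAK_GB.items() if peaks[column] <= budget_gb]

print(levels_that_fit(16))                        # ['Int8', 'NF4']
print(levels_that_fit(16, long_generation=True))  # ['NF4']
print(levels_that_fit(24, long_generation=True))  # ['BF16', 'Int8', 'NF4']
```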
When not using flash attention, the memory usage is:
Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
---|---|---|
BF16 | 18.11GB | 24.40GB |
Int8 | 12.18GB | 18.47GB |
NF4 | 9.52GB | 15.81GB |
The above speed and memory profiling were conducted using this script.
Demo
Web UI
We provide code for building a web UI demo (thanks to @wysaid). Before you start, make sure you install the following packages:
pip install -r requirements_web_demo.txt
Then run the command below and click on the generated link:
python web_demo.py
CLI Demo
We provide a CLI demo example in cli_demo.py. Users can interact with Qwen-7B-Chat by entering prompts, and the model returns its outputs in streaming mode. Run the command below:
python cli_demo.py
API
We provide a way to deploy a local API based on the OpenAI API (thanks to @hanpenggit). Before you start, install the required packages:
pip install fastapi uvicorn openai pydantic sse_starlette
Then run the command to deploy your API:
python openai_api.py
You can change the arguments, e.g., -c for the checkpoint name or path, --cpu-only for CPU deployment, etc. If you encounter problems when starting the API deployment, updating the packages to the latest version may solve them.
Using the API is also simple. See the example below:
import openai
# note: this example uses the legacy interface of the openai Python package (before version 1.0)
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"
# create a request activating streaming response
for chunk in openai.ChatCompletion.create(
    model="Qwen-7B",
    messages=[
        {"role": "user", "content": "你好"}  # prompt: "Hello"
    ],
    stream=True
):
    if hasattr(chunk.choices[0].delta, "content"):
        print(chunk.choices[0].delta.content, end="", flush=True)
# create a request not activating streaming response
response = openai.ChatCompletion.create(
    model="Qwen-7B",
    messages=[
        {"role": "user", "content": "你好"}  # prompt: "Hello"
    ],
    stream=False
)
print(response.choices[0].message.content)
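For readers who prefer not to depend on the openai package, the same non-streaming request can be sketched with the Python standard library. This is an assumption-laden illustration, not part of the project: the helper names `build_chat_request` and `chat` are hypothetical, and the `/chat/completions` route is assumed from the OpenAI-style base URL used above.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000/v1"  # same base URL as in the example above

def build_chat_request(prompt, model="Qwen-7B", stream=False):
    """Build (url, headers, body bytes) for a chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    url = API_BASE + "/chat/completions"  # assumed OpenAI-style route
    headers = {"Content-Type": "application/json"}
    return url, headers, json.dumps(payload).encode("utf-8")

def chat(prompt):
    """Send one non-streaming request; requires the local API server to be running."""
    url, headers, body = build_chat_request(prompt)
    request = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(request) as reply:
        data = json.loads(reply.read().decode("utf-8"))
    return data["choices"][0]["message"]["content"]

if __name__ == "__main__":
    url, headers, body = build_chat_request("你好")
    print(url)
    print(body.decode("utf-8"))
```

Running the file only prints the request that would be sent; calling `chat(...)` performs the actual HTTP request and therefore needs the server started with `python openai_api.py`.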
Tool Usage
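Qwen-7B-Chat is specifically optimized for tool usage, including APIs, databases, and models, so that users can build their own Qwen-7B-based LangChain applications, agents, and code interpreters. On our benchmark for evaluating tool usage capabilities, Qwen-7B reaches stable performance, with 99% tool selection accuracy and a 9.7% false positive error rate: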
Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
---|---|---|---|
GPT-4 | 95% | 0.90 | 15% |
GPT-3.5 | 85% | 0.88 | 75% |
Qwen-7B | 99% | 0.89 | 9.7% |
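The benchmark's exact metric definitions are not given here, so the sketch below uses assumed ones: tool selection accuracy as the share of tool-requiring queries where the expected tool was chosen, and false positive error as the share of no-tool queries where a tool was called anyway. The record format and both function names are hypothetical, and the Rouge-L score for tool input is not covered.

```python
def tool_selection_accuracy(records):
    """Share of tool-requiring queries for which the expected tool was chosen."""
    needing_tool = [r for r in records if r["expected_tool"] is not None]
    correct = sum(1 for r in needing_tool if r["chosen_tool"] == r["expected_tool"])
    return correct / len(needing_tool)

def false_positive_rate(records):
    """Share of no-tool queries for which some tool was called anyway."""
    no_tool = [r for r in records if r["expected_tool"] is None]
    wrong_calls = sum(1 for r in no_tool if r["chosen_tool"] is not None)
    return wrong_calls / len(no_tool)

# Hypothetical evaluation records, for illustration only
records = [
    {"expected_tool": "search", "chosen_tool": "search"},
    {"expected_tool": "calculator", "chosen_tool": "search"},
    {"expected_tool": None, "chosen_tool": None},
    {"expected_tool": None, "chosen_tool": "search"},
]
print(tool_selection_accuracy(records))  # 0.5
print(false_positive_rate(records))      # 0.5
```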
For how to write and use prompts for ReAct prompting, please refer to the ReAct examples. The use of tools enables the model to perform tasks better.
Additionally, we provide experimental results to show its capabilities as an agent. See Hugging Face Agent for more information. Its performance on the run-mode benchmark provided by Hugging Face is as follows:
Model | Tool Selection↑ | Tool Used↑ | Code↑ |
---|---|---|---|
GPT-4 | 100 | 100 | 97.41 |
GPT-3.5 | 95.37 | 96.30 | 87.04 |
StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
Qwen-7B | 90.74 | 92.59 | 74.07 |
Long-Context Understanding
To extend the context length and break the bottleneck of training sequence length, we introduce several techniques, including NTK-aware interpolation, window attention, and LogN attention scaling, to extend the context length to over 8K tokens. We conducted language modeling experiments on the arXiv dataset with PPL evaluation and found that Qwen-7B can reach outstanding performance in long-context scenarios. The results (perplexity at each sequence length) are shown below:
Model | 1024 | 2048 | 4096 | 8192 | 16384 |
---|---|---|---|---|---|
Qwen-7B | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 |
+ dynamic_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 |
+ dynamic_ntk + logn | 4.23 | 3.78 | 3.58 | 3.56 | 4.62 |
+ dynamic_ntk + logn + window_attn | 4.23 | 3.78 | 3.58 | 3.49 | 4.32 |
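To give a feel for two of the techniques named above, here is a minimal sketch of one common formulation of dynamic NTK-aware scaling of the rotary base and of LogN attention scaling. It is not taken from this document and should not be read as the exact implementation: the training length of 2048, the rotary base of 10000, and the head dimension of 128 are assumptions, and window attention is not shown.

```python
import math

TRAIN_LEN = 2048   # assumed training sequence length
ROPE_BASE = 10000  # assumed rotary embedding base
HEAD_DIM = 128     # assumed attention head dimension

def ntk_alpha(seq_len, train_len=TRAIN_LEN):
    """Scaling factor that grows in steps once seq_len exceeds train_len."""
    context_value = math.log2(seq_len / train_len) + 1
    return max(2 ** math.ceil(context_value) - 1, 1)

def scaled_rope_base(seq_len, base=ROPE_BASE, dim=HEAD_DIM):
    """NTK-aware rescaling of the rotary base for the current sequence length."""
    return base * ntk_alpha(seq_len) ** (dim / (dim - 2))

def logn_scale(position, train_len=TRAIN_LEN):
    """Factor applied to queries at a 1-based position; 1 inside the training length."""
    if position <= train_len:
        return 1.0
    return math.log(position) / math.log(train_len)

for n in (1024, 2048, 4096, 8192, 16384):
    print(n, ntk_alpha(n), round(scaled_rope_base(n)), round(logn_scale(n), 3))
```

In this formulation nothing changes up to the assumed training length, which is consistent with the identical perplexities in the 1024 and 2048 columns of the table above; beyond it, the rotary base and the query scale grow with the sequence length.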
Reproduction
To reproduce the model performance on benchmark datasets, we provide scripts that reproduce the results. See eval/EVALUATION.md for details. Note that the reproduced results may differ slightly from our reported results.
FAQ
If you run into problems, please check the FAQ and existing issues for a solution before opening a new issue.
License Agreement
Researchers and developers are free to use the code and model weights of both Qwen-7B and Qwen-7B-Chat. Commercial use is also allowed. See LICENSE for details. If you would like to use them commercially, please fill out the request form to apply.
Contact Us
If you would like to leave a message for our research team or product team, feel free to send an email to qianwen_opensource@alibabacloud.com.