Latency

#1
by philschmid - opened

Super cool conversion. What latency do you see, and on which hardware?

Most impressive thing first: loading latency. Loading the 13B model into VRAM, from the first line of the script to the first forward pass, takes 6.6 s.

Here are some rough throughput numbers, which are okay (a reproduction sketch follows below):

Nvidia A6000, 13B:
input shape (1, 65), output shape (1, 128): total time 3.44 s, avg 37.2 tokens/s
input shape (16, 65), output shape (16, 128): total time 8.29 s, avg 246.9 tokens/s across the batch (divide by 16 for per-sequence throughput)
memory: 13462 MiB VRAM idle -> max 16 GiB VRAM (batch size 16, at 192 tokens)
Note: the output does not contain the input tokens. This is an example of "static batching".
For the 7B variant:
input shape (1, 65), output shape (1, 128): total time 2.00 s, avg 64.0 tokens/s
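
For anyone who wants to reproduce this, a minimal sketch of the measurement with the CTranslate2 Python API looks roughly like this. The model path and prompt are illustrative placeholders, not the exact script used:

```python
# Rough sketch of the benchmark above, using the CTranslate2 Python API.
# "llama-13b-ct2" and the prompt are placeholders, not the exact setup used.
import time

import ctranslate2
import transformers

t0 = time.perf_counter()
tokenizer = transformers.AutoTokenizer.from_pretrained("llama-13b-ct2")  # placeholder path
generator = ctranslate2.Generator("llama-13b-ct2", device="cuda")        # placeholder path

prompt = "Once upon a time"  # placeholder; shape to (batch, 65) as needed
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

# The first forward pass closes the "loading latency" window measured above.
results = generator.generate_batch(
    [tokens],
    max_length=128,
    include_prompt_in_result=False,  # output does not contain the input tokens
)

elapsed = time.perf_counter() - t0
new_tokens = len(results[0].sequences_ids[0])
print(f"Total time {elapsed:.2f} s, avg tokens/s {new_tokens / elapsed:.1f}")
```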

@philschmid To give some more context: I ran the same benchmark (i.e. A6000, 13B, input shape (1, 65), output shape (1, 128)) on text-generation-inference 0.9.3.

generate{parameters=GenerateParameters { best_of: Some(1), temperature: Some(0.5),
 repetition_penalty: Some(1.03), top_k: Some(10), top_p: Some(0.95), typical_p: Some(0.95), do_sample: true, max_new_tokens: 128,
 return_full_text: Some(false), stop: ["photographer"], truncate: None, watermark: true,
 details: true, decoder_input_details: true, seed: None } total_time="5.575435704s" validation_time="846.928µs"
 queue_time="151.59µs" inference_time="5.574437698s" time_per_token="43.550294ms" seed="Some(1584686632614596462)"
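
For reference, a request with the same parameters can be sent to a running TGI server via its /generate route, roughly like this (the endpoint URL and prompt are placeholders):

```python
# Rough sketch of a request against a running text-generation-inference
# server, mirroring the parameters in the log above.
import requests

response = requests.post(
    "http://localhost:8080/generate",  # placeholder endpoint
    json={
        "inputs": "Once upon a time",  # placeholder prompt
        "parameters": {
            "best_of": 1,
            "temperature": 0.5,
            "repetition_penalty": 1.03,
            "top_k": 10,
            "top_p": 0.95,
            "typical_p": 0.95,
            "do_sample": True,
            "max_new_tokens": 128,
            "return_full_text": False,
            "stop": ["photographer"],
            "watermark": True,
            "details": True,
            "decoder_input_details": True,
        },
    },
)
print(response.json()["generated_text"])
```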

Roughly, it takes 162% of the time of CTranslate2 3.17.1 (5.57 s vs. 3.44 s) and also uses more RAM.
bitsandbytes would use less RAM, but would also be slower.

Disclaimer: the results are biased:

  • the strengths of text-generation-inference are not used: batch size = 1, single GPU (dynamic and continuous batching go unused, and CT2 can't do tensor-parallel / multi-GPU inference anyway)
  • no statistical evaluation (only tried it 2-3 times)

Thank you for sharing! BTW, TGI also supports GPTQ now. Maybe that's also a good way to decrease the memory needed.
