Run TheBloke/Xwin-LM-70B-V0.1-AWQ on Multiple GPUs

#2
by batawfic - opened

I hope someone can help me with this. I'm trying to run inference on 2 GPUs using vLLM.

I'm using:
from vllm import LLM

llm = LLM(model=MODEL_NAME, quantization="awq", dtype="half", tensor_parallel_size=2)

But it failed with an assertion error. My question is: does this model support inference on multiple GPUs? If so, what does the error below mean?
    559         return ignored
    561     # Execute the model.
--> 562     output = self._run_workers(
    563         "execute_model",
    564         seq_group_metadata_list=seq_group_metadata_list,
    565         blocks_to_swap_in=scheduler_outputs.blocks_to_swap_in,
    566         blocks_to_swap_out=scheduler_outputs.blocks_to_swap_out,
    567         blocks_to_copy=scheduler_outputs.blocks_to_copy,
    568     )
    570     return self._process_model_outputs(output, scheduler_outputs) + ignored

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/vllm/engine/llm_engine.py:712, in LLMEngine._run_workers(self, method, get_all_outputs, *args, **kwargs)
    710 output = all_outputs[0]
    711 for other_output in all_outputs[1:]:
--> 712     assert output == other_output
    713 return output

AssertionError:

This was solved with a fix in vLLM and had nothing to do with the model.
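For anyone who lands here later: on a vLLM release that includes the fix, multi-GPU AWQ inference works with the same arguments. A minimal end-to-end sketch, assuming this model and placeholder sampling settings:

from vllm import LLM, SamplingParams

# Shard the AWQ model across 2 GPUs; dtype="half" (fp16) matches the AWQ kernels.
llm = LLM(
    model="TheBloke/Xwin-LM-70B-V0.1-AWQ",
    quantization="awq",
    dtype="half",
    tensor_parallel_size=2,
)

# Placeholder sampling settings, tune these for your use case.
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain tensor parallelism in one paragraph."], sampling)
for out in outputs:
    print(out.outputs[0].text)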

batawfic changed discussion status to closed

Thanks for letting us know

Out of interest, what two GPUs are you using and how is performance compared to one GPU?

With unquantised models, my experience is that servers using tensor parallelism, like TGI and vLLM, do not scale well across multiple GPUs: the second GPU adds only 60-80% more throughput, not 100%. So I prefer to scale across multiple separate instances behind a load balancer, as sketched below. But I've not yet tried that with AWQ models.
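For reference, the multi-instance setup I mean is roughly the following. This is a sketch, not a recipe: the ports, endpoint, and response shape are assumptions based on vLLM's demo api_server and can differ between versions. Pin one single-GPU server to each card, then round-robin requests between them on the client side:

import itertools
import requests

# Assumed setup: one single-GPU vLLM demo server per card, launched e.g. as
#   CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.api_server --model <model> --port 8000
#   CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.api_server --model <model> --port 8001

# Trivial client-side "load balancer": alternate between the two servers.
backends = itertools.cycle(["http://localhost:8000", "http://localhost:8001"])

def generate(prompt: str) -> str:
    url = next(backends) + "/generate"
    # The demo api_server takes a JSON body with "prompt" plus sampling params.
    resp = requests.post(url, json={"prompt": prompt, "max_tokens": 128})
    resp.raise_for_status()
    # Assumed response shape: {"text": ["<prompt + completion>"]}.
    return resp.json()["text"][0]

print(generate("Hello, how are you?"))

With two independent instances, aggregate throughput scales close to linearly, at the cost of each instance holding a full copy of the model, so this only works when the model fits on a single GPU.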

Thanks for the insight. Have you had a chance to try it since?