Run TheBloke/Xwin-LM-70B-V0.1-AWQ on Multiple GPUs

#2
by batawfic - opened

I hope someone can help me with this. I'm trying to run inference on 2 GPUs using vLLM.

I'm using:
from vllm import LLM

llm = LLM(model=MODEL_NAME, quantization="awq", dtype="half", tensor_parallel_size=2)

But it failed with an assertion error. My question is: does this model support inference on multiple GPUs? If so, what does the error below mean?
    559         return ignored
    561     # Execute the model.
--> 562     output = self._run_workers(
    563         "execute_model",
    564         seq_group_metadata_list=seq_group_metadata_list,
    565         blocks_to_swap_in=scheduler_outputs.blocks_to_swap_in,
    566         blocks_to_swap_out=scheduler_outputs.blocks_to_swap_out,
    567         blocks_to_copy=scheduler_outputs.blocks_to_copy,
    568     )
    570     return self._process_model_outputs(output, scheduler_outputs) + ignored

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/vllm/engine/llm_engine.py:712, in LLMEngine._run_workers(self, method, get_all_outputs, *args, **kwargs)
    710 output = all_outputs[0]
    711 for other_output in all_outputs[1:]:
--> 712     assert output == other_output
    713 return output

AssertionError:

This was solved with a fix in vLLM and had nothing to do with the model.
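For anyone who lands here later: on a vLLM release that includes the fix, multi-GPU AWQ inference works with the same arguments. A minimal end-to-end sketch, assuming this model and placeholder sampling settings:

from vllm import LLM, SamplingParams

# Shard the AWQ model across 2 GPUs; dtype="half" (fp16) matches the AWQ kernels.
llm = LLM(
    model="TheBloke/Xwin-LM-70B-V0.1-AWQ",
    quantization="awq",
    dtype="half",
    tensor_parallel_size=2,
)

# Placeholder sampling settings, tune these for your use case.
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain tensor parallelism in one paragraph."], sampling)
for out in outputs:
    print(out.outputs[0].text)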

batawfic changed discussion status to closed

Thanks for letting us know

Out of interest, what two GPUs are you using and how is performance compared to one GPU?

With unquantised models, my experience is that servers using tensor parallelism, like TGI and vLLM, do not scale well across multiple GPUs: the second GPU adds only 60-80% more throughput, not 100%. So I prefer to scale across multiple separate instances behind a load balancer, as sketched below. But I've not yet tried that with AWQ models.
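For reference, the multi-instance setup I mean is roughly the following. This is a sketch, not a recipe: the ports, endpoint, and response shape are assumptions based on vLLM's demo api_server and can differ between versions. Pin one single-GPU server to each card, then round-robin requests between them on the client side:

import itertools
import requests

# Assumed setup: one single-GPU vLLM demo server per card, launched e.g. as
#   CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.api_server --model <model> --port 8000
#   CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.api_server --model <model> --port 8001

# Trivial client-side "load balancer": alternate between the two servers.
backends = itertools.cycle(["http://localhost:8000", "http://localhost:8001"])

def generate(prompt: str) -> str:
    url = next(backends) + "/generate"
    # The demo api_server takes a JSON body with "prompt" plus sampling params.
    resp = requests.post(url, json={"prompt": prompt, "max_tokens": 128})
    resp.raise_for_status()
    # Assumed response shape: {"text": ["<prompt + completion>"]}.
    return resp.json()["text"][0]

print(generate("Hello, how are you?"))

With two independent instances, aggregate throughput scales close to linearly, at the cost of each instance holding a full copy of the model, so this only works when the model fits on a single GPU.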

Thanks for the insight. Have you had a chance to try it since?