---
license: apache-2.0
---

# 📣 The Quantized LLaMA 3.3 70B Instruct Model

Original Base Model: `meta-llama/Llama-3.3-70B-Instruct`.<br>
Link: [https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)

# 🚀 Model Inference

```python
from vllm import LLM, SamplingParams

# Model name and hyperparameters
model_name = "shuyuej/Llama-3.3-70B-Instruct-GPTQ"
num_gpus_vllm = 4  # The number of GPUs to use
gpu_utilization_vllm = 0.95  # The GPU memory utilization (from 0 to 1)
max_model_len_vllm = 2048  # The maximum context length (prompt plus generated tokens)
max_new_tokens = 1024  # The maximum number of generated tokens

# The input prompts and sampling parameters for text generation
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(
    temperature=0,
    top_p=1,
    max_tokens=max_new_tokens,
)

# Initialize the vLLM engine
llm = LLM(
    model=model_name,
    tokenizer=model_name,
    # As of Dec. 14th, 2024, vLLM only supports float16 for GPTQ quantization
    dtype='float16',
    quantization="GPTQ",
    # Acknowledgement: Benjamin Kitor
    # https://github.com/vllm-project/vllm/issues/2794
    # Reference: https://github.com/vllm-project/vllm/issues/1908
    distributed_executor_backend="mp",
    tensor_parallel_size=num_gpus_vllm,
    gpu_memory_utilization=gpu_utilization_vllm,
    # Note: we cap max_model_len only to save GPU memory
    max_model_len=max_model_len_vllm,
    disable_custom_all_reduce=True,
    enable_lora=False,
)

# Generate responses using the vLLM engine
completions = llm.generate(prompts, sampling_params)
for output in completions:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
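
Since this is an instruction-tuned checkpoint, the plain completion prompts above are mainly a smoke test. For chat-style queries with the offline engine, one option is to apply the Llama 3.3 chat template before calling `generate()`. The sketch below is illustrative: it reuses the `llm`, `model_name`, and `sampling_params` objects defined above, and the example question is a placeholder.
```python
from transformers import AutoTokenizer

# Build chat-formatted prompts using the model's own chat template
tokenizer = AutoTokenizer.from_pretrained(model_name)
questions = ["What are the differences between DNA and RNA?"]  # placeholder query
chat_prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": q}],
        tokenize=False,
        add_generation_prompt=True,  # append the assistant header so the model starts answering
    )
    for q in questions
]

# Reuse the vLLM engine and sampling parameters defined above
chat_completions = llm.generate(chat_prompts, sampling_params)
for output in chat_completions:
    print(output.outputs[0].text)
```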

# 🔥 Real-world deployment
For real-world deployment, please refer to the [vLLM Distributed Inference and Serving](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) and [OpenAI Compatible Server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) documentation.

vLLM can be deployed as a server that implements the OpenAI API protocol, which allows it to serve as a drop-in replacement for applications using the OpenAI API. By default, the server starts at `http://localhost:8000`.
```shell
vllm serve shuyuej/Llama-3.3-70B-Instruct-GPTQ \
    --quantization gptq \
    --trust-remote-code \
    --dtype float16 \
    --max-model-len 4096 \
    --distributed-executor-backend mp \
    --pipeline-parallel-size 4 \
    --api-key token-abc123
```
Please check [here](https://docs.vllm.ai/en/stable/models/engine_args.html) if you want to change the engine arguments.
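
Once the server is up, a quick way to confirm it is responding is a short synchronous request with the `openai` Python package (a fuller streaming client is shown later in this card). This is a minimal sketch that assumes the default address and the `--api-key token-abc123` value from the command above.
```python
from openai import OpenAI

# Point the client at the local vLLM server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

response = client.chat.completions.create(
    model="shuyuej/Llama-3.3-70B-Instruct-GPTQ",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```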

If you would like to deploy your LoRA adapter, please refer to the [vLLM documentation](https://docs.vllm.ai/en/latest/usage/lora.html#serving-lora-adapters) for a detailed guide.<br>
It provides step-by-step instructions on how to serve LoRA adapters effectively in a vLLM environment.<br>
**We have also shared our trained LoRA adapter** [here](https://huggingface.co/shuyuej/Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ). Please download it manually if needed.
```shell
git clone https://huggingface.co/shuyuej/Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ
```
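
If you prefer not to use `git`, the adapter can also be fetched with the `huggingface_hub` package. A minimal sketch, assuming you want it in a local folder of the same name:
```python
from huggingface_hub import snapshot_download

# Download the shared LoRA adapter into a local directory
snapshot_download(
    repo_id="shuyuej/Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ",
    local_dir="Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ",
)
```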

Then, use vLLM to serve the base model with the LoRA adapter by including the `--enable-lora` flag and specifying `--lora-modules`:
```shell
vllm serve shuyuej/Llama-3.3-70B-Instruct-GPTQ \
    --quantization gptq \
    --trust-remote-code \
    --dtype float16 \
    --max-model-len 4096 \
    --distributed-executor-backend mp \
    --pipeline-parallel-size 4 \
    --api-key token-abc123 \
    --enable-lora \
    --lora-modules adapter=Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ/checkpoint-18640
```
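
vLLM's OpenAI-compatible server exposes each `--lora-modules` entry as its own model name, so the adapter above can be selected through the `model` field of a request. A minimal sketch, reusing the address and API key from the command above:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

# model="adapter" routes the request through the served LoRA adapter;
# use the base model name instead to query the model without LoRA.
response = client.chat.completions.create(
    model="adapter",
    messages=[{"role": "user", "content": "What is dementia?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```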

Since this server is compatible with the OpenAI API, you can use it as a drop-in replacement for applications built on the OpenAI API.
For example, another way to query the server is via the `openai` Python package:
```python
#!/usr/bin/env python
# coding=utf-8

import time
import asyncio

from openai import AsyncOpenAI

# Our system prompt
SYSTEM_PROMPT = (
    "I am PodGPT, a large language model developed by the Kolachalama Lab in Boston, "
    "specializing in science, technology, engineering, mathematics, and medicine "
    "(STEMM)-related research and education, powered by podcast audio.\n"
    "I provide information based on established scientific knowledge but must not offer "
    "personal medical advice or present myself as a licensed medical professional.\n"
    "I will maintain a consistently professional and informative tone, avoiding humor, "
    "sarcasm, and pop culture references.\n"
    "I will prioritize factual accuracy and clarity while ensuring my responses are "
    "educational and non-harmful, adhering to the principle of 'do no harm'.\n"
    "My responses are for informational purposes only and should not be considered a "
    "substitute for professional consultation."
)

# Initialize the AsyncOpenAI client
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)


async def main(message):
    """
    Streaming responses with async usage and "await" with each API call:
    Reference: https://github.com/openai/openai-python?tab=readme-ov-file#streaming-responses
    :param message: The user query
    """
    start_time = time.time()
    stream = await client.chat.completions.create(
        model="shuyuej/Llama-3.3-70B-Instruct-GPTQ",
        messages=[
            {
                "role": "system",
                "content": SYSTEM_PROMPT,
            },
            {
                "role": "user",
                "content": message,
            },
        ],
        max_tokens=2048,
        temperature=0.2,
        top_p=1,
        stream=True,
        extra_body={
            "ignore_eos": False,
            # https://huggingface.co/shuyuej/Llama-3.3-70B-Instruct-GPTQ/blob/main/config.json#L10-L14
            "stop_token_ids": [128001, 128008, 128009],
        },
    )

    print(f"The user's query is\n {message}\n")
    print("The model's response is\n")
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="")
    print(f"\nInference time: {time.time() - start_time:.2f} seconds\n")
    print("=" * 100)


if __name__ == "__main__":
    # Some random user queries
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
        "Can you tell me more about Bruce Lee?",
        "What are the differences between DNA and RNA?",
        "What is dementia and Alzheimer's disease?",
        "Tell me the differences between Alzheimer's disease and dementia",
    ]

    # Conduct model inference
    for message in prompts:
        asyncio.run(main(message=message))
        print("\n\n")
```

**Here is a demo of the real-world model inference and deployment:**
<p align="center">
    <a href="https://www.medrxiv.org/content/10.1101/2024.07.11.24310304v2"><img src="https://github.com/vkola-lab/PodGPT/raw/main/figures/inference_demo.gif"></a>
</p>

## 🤗 Single Checkpoint
Please note that if you need **a single checkpoint** (39.8 GB) instead of **sharded checkpoints** (each one is 5.34 GB), you can use `revision="f77c1b3"`.<br>
Here is our model loader code for fine-tuning a LoRA adapter: [https://github.com/vkola-lab/PodGPT/blob/main/lib/model_loader_quantization.py](https://github.com/vkola-lab/PodGPT/blob/main/lib/model_loader_quantization.py).<br>
Also, please check our inference code if you are interested: [https://github.com/vkola-lab/PodGPT/blob/main/utils/eval_utils.py#L75-L98](https://github.com/vkola-lab/PodGPT/blob/main/utils/eval_utils.py#L75-L98).
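
For example, the single-checkpoint revision can be selected by passing `revision` when downloading or loading the weights. Below is a minimal sketch with `huggingface_hub` (the local folder name is arbitrary); `from_pretrained(...)` in `transformers` accepts the same `revision` argument.
```python
from huggingface_hub import snapshot_download

# Download the single-checkpoint revision of the quantized model
snapshot_download(
    repo_id="shuyuej/Llama-3.3-70B-Instruct-GPTQ",
    revision="f77c1b3",
    local_dir="Llama-3.3-70B-Instruct-GPTQ-single-checkpoint",  # arbitrary local folder name
)
```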

## 📝 Quantization Configurations
This model was quantized using the [GPTQModel](https://github.com/ModelCloud/GPTQModel) library.
The source code used to build this quantized model is available [here](https://github.com/vkola-lab/PodGPT/blob/main/quantization/quantization_GPTQModel.py).
```json
"quantization_config": {
    "bits": 4,
    "checkpoint_format": "gptq",
    "desc_act": true,
    "dynamic": null,
    "group_size": 128,
    "lm_head": false,
    "meta": {
        "damp_auto_increment": 0.0015,
        "damp_percent": 0.01,
        "quantizer": [
            "gptqmodel:1.4.0-dev"
        ],
        "static_groups": false,
        "true_sequential": true,
        "uri": "https://github.com/modelcloud/gptqmodel"
    },
    "quant_method": "gptq",
    "sym": true
},
```
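
To double-check these settings against the files you actually downloaded, you can read them back programmatically. A minimal sketch with `transformers` (depending on your version, the value may be returned as a plain dict or a config object):
```python
from transformers import AutoConfig

# Read the quantization settings stored in the checkpoint's config.json
config = AutoConfig.from_pretrained("shuyuej/Llama-3.3-70B-Instruct-GPTQ")
print(config.quantization_config)
```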

## ⚙️ Source Code
**Source code** for quantization:<br>
[https://github.com/vkola-lab/medpodgpt/tree/main/quantization](https://github.com/vkola-lab/medpodgpt/tree/main/quantization).<br>
**Source code** for training the LoRA adapter:<br>
[https://github.com/vkola-lab/PodGPT/blob/main/lib/model_loader_quantization.py](https://github.com/vkola-lab/PodGPT/blob/main/lib/model_loader_quantization.py).<br>
**Source code** for inference (without and with LoRA):<br>
[https://github.com/vkola-lab/PodGPT/blob/main/utils/vllm_utils.py#L61-L121](https://github.com/vkola-lab/PodGPT/blob/main/utils/vllm_utils.py#L61-L121).