---
license: apache-2.0
---

# 📣 The Quantized LLaMA 3.3 70B Instruct Model

Original Base Model: `meta-llama/Llama-3.3-70B-Instruct`.<br>
Link: [https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)

# 🚀 Model Inference

```python
from vllm import LLM, SamplingParams

# Model name and hyperparameters
model_name = "shuyuej/Llama-3.3-70B-Instruct-GPTQ"
num_gpus_vllm = 4  # The number of GPUs to use
gpu_utilization_vllm = 0.95  # The GPU memory utilization (from 0 to 1)
max_model_len_vllm = 2048  # The maximum context length (prompt plus generated tokens)
max_new_tokens = 1024  # The maximum number of generated tokens

# The input prompts and sampling parameters for text generation
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(
    temperature=0,
    top_p=1,
    max_tokens=max_new_tokens,
)

# Initialize the vLLM engine
llm = LLM(
    model=model_name,
    tokenizer=model_name,
    # As of Dec. 14th, 2024, vLLM only supports float16 for GPTQ quantization
    dtype='float16',
    quantization="GPTQ",
    # Acknowledgement: Benjamin Kitor
    # https://github.com/vllm-project/vllm/issues/2794
    # Reference: https://github.com/vllm-project/vllm/issues/1908
    distributed_executor_backend="mp",
    tensor_parallel_size=num_gpus_vllm,
    gpu_memory_utilization=gpu_utilization_vllm,
    # Note: we cap max_model_len only to save GPU memory
    max_model_len=max_model_len_vllm,
    disable_custom_all_reduce=True,
    enable_lora=False,
)

# Generate responses using the vLLM engine
completions = llm.generate(prompts, sampling_params)
for output in completions:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
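
Since this is an instruction-tuned checkpoint, the plain completion prompts above are mainly a smoke test. For chat-style queries with the offline engine, one option is to apply the Llama 3.3 chat template before calling `generate()`. The sketch below is illustrative: it reuses the `llm`, `model_name`, and `sampling_params` objects defined above, and the example question is a placeholder.
```python
from transformers import AutoTokenizer

# Build chat-formatted prompts using the model's own chat template
tokenizer = AutoTokenizer.from_pretrained(model_name)
questions = ["What are the differences between DNA and RNA?"]  # placeholder query
chat_prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": q}],
        tokenize=False,
        add_generation_prompt=True,  # append the assistant header so the model starts answering
    )
    for q in questions
]

# Reuse the vLLM engine and sampling parameters defined above
chat_completions = llm.generate(chat_prompts, sampling_params)
for output in chat_completions:
    print(output.outputs[0].text)
```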

# 🔥 Real-world deployment
For real-world deployment, please refer to the [vLLM Distributed Inference and Serving](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) and [OpenAI Compatible Server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) documentation.

vLLM can be deployed as a server that implements the OpenAI API protocol, which allows it to serve as a drop-in replacement for applications using the OpenAI API. By default, the server starts at `http://localhost:8000`.
```shell
vllm serve shuyuej/Llama-3.3-70B-Instruct-GPTQ \
    --quantization gptq \
    --trust-remote-code \
    --dtype float16 \
    --max-model-len 4096 \
    --distributed-executor-backend mp \
    --pipeline-parallel-size 4 \
    --api-key token-abc123
```
Please check [here](https://docs.vllm.ai/en/stable/models/engine_args.html) if you want to change the engine arguments.
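
Once the server is up, a quick way to confirm it is responding is a short synchronous request with the `openai` Python package (a fuller streaming client is shown later in this card). This is a minimal sketch that assumes the default address and the `--api-key token-abc123` value from the command above.
```python
from openai import OpenAI

# Point the client at the local vLLM server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

response = client.chat.completions.create(
    model="shuyuej/Llama-3.3-70B-Instruct-GPTQ",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```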

If you would like to deploy your LoRA adapter, please refer to the [vLLM documentation](https://docs.vllm.ai/en/latest/usage/lora.html#serving-lora-adapters) for a detailed guide.<br>
It provides step-by-step instructions on how to serve LoRA adapters effectively in a vLLM environment.<br>
**We have also shared our trained LoRA adapter** [here](https://huggingface.co/shuyuej/Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ). Please download it manually if needed.
```shell
git clone https://huggingface.co/shuyuej/Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ
```
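
If you prefer not to use `git`, the adapter can also be fetched with the `huggingface_hub` package. A minimal sketch, assuming you want it in a local folder of the same name:
```python
from huggingface_hub import snapshot_download

# Download the shared LoRA adapter into a local directory
snapshot_download(
    repo_id="shuyuej/Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ",
    local_dir="Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ",
)
```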

Then, use vLLM to serve the base model with the LoRA adapter by including the `--enable-lora` flag and specifying `--lora-modules`:
```shell
vllm serve shuyuej/Llama-3.3-70B-Instruct-GPTQ \
    --quantization gptq \
    --trust-remote-code \
    --dtype float16 \
    --max-model-len 4096 \
    --distributed-executor-backend mp \
    --pipeline-parallel-size 4 \
    --api-key token-abc123 \
    --enable-lora \
    --lora-modules adapter=Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ/checkpoint-18640
```
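
vLLM's OpenAI-compatible server exposes each `--lora-modules` entry as its own model name, so the adapter above can be selected through the `model` field of a request. A minimal sketch, reusing the address and API key from the command above:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

# model="adapter" routes the request through the served LoRA adapter;
# use the base model name instead to query the model without LoRA.
response = client.chat.completions.create(
    model="adapter",
    messages=[{"role": "user", "content": "What is dementia?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```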

Since this server is compatible with the OpenAI API, you can use it as a drop-in replacement for applications built on the OpenAI API.
For example, another way to query the server is via the `openai` Python package:
```python
#!/usr/bin/env python
# coding=utf-8

import time
import asyncio

from openai import AsyncOpenAI

# Our system prompt
SYSTEM_PROMPT = (
    "I am PodGPT, a large language model developed by the Kolachalama Lab in Boston, "
    "specializing in science, technology, engineering, mathematics, and medicine "
    "(STEMM)-related research and education, powered by podcast audio.\n"
    "I provide information based on established scientific knowledge but must not offer "
    "personal medical advice or present myself as a licensed medical professional.\n"
    "I will maintain a consistently professional and informative tone, avoiding humor, "
    "sarcasm, and pop culture references.\n"
    "I will prioritize factual accuracy and clarity while ensuring my responses are "
    "educational and non-harmful, adhering to the principle of 'do no harm'.\n"
    "My responses are for informational purposes only and should not be considered a "
    "substitute for professional consultation."
)

# Initialize the AsyncOpenAI client
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)


async def main(message):
    """
    Streaming responses with async usage and "await" with each API call:
    Reference: https://github.com/openai/openai-python?tab=readme-ov-file#streaming-responses
    :param message: The user query
    """
    start_time = time.time()
    stream = await client.chat.completions.create(
        model="shuyuej/Llama-3.3-70B-Instruct-GPTQ",
        messages=[
            {
                "role": "system",
                "content": SYSTEM_PROMPT,
            },
            {
                "role": "user",
                "content": message,
            },
        ],
        max_tokens=2048,
        temperature=0.2,
        top_p=1,
        stream=True,
        extra_body={
            "ignore_eos": False,
            # https://huggingface.co/shuyuej/Llama-3.3-70B-Instruct-GPTQ/blob/main/config.json#L10-L14
            "stop_token_ids": [128001, 128008, 128009],
        },
    )

    print(f"The user's query is\n {message}\n")
    print("The model's response is\n")
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="")
    print(f"\nInference time: {time.time() - start_time:.2f} seconds\n")
    print("=" * 100)


if __name__ == "__main__":
    # Some random user queries
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
        "Can you tell me more about Bruce Lee?",
        "What are the differences between DNA and RNA?",
        "What is dementia and Alzheimer's disease?",
        "Tell me the differences between Alzheimer's disease and dementia",
    ]

    # Conduct model inference
    for message in prompts:
        asyncio.run(main(message=message))
        print("\n\n")
```

**Here is a demo of the real-world model inference and deployment:**
<p align="center">
    <a href="https://www.medrxiv.org/content/10.1101/2024.07.11.24310304v2"><img src="https://github.com/vkola-lab/PodGPT/raw/main/figures/inference_demo.gif"></a>
</p>

## 🤗 Single Checkpoint
Please note that if you need **a single checkpoint** (39.8 GB) instead of **sharded checkpoints** (each one is 5.34 GB), you can use `revision="f77c1b3"`.<br>
Here is our model loader code for fine-tuning a LoRA adapter: [https://github.com/vkola-lab/PodGPT/blob/main/lib/model_loader_quantization.py](https://github.com/vkola-lab/PodGPT/blob/main/lib/model_loader_quantization.py).<br>
Also, please check our inference code if you are interested: [https://github.com/vkola-lab/PodGPT/blob/main/utils/eval_utils.py#L75-L98](https://github.com/vkola-lab/PodGPT/blob/main/utils/eval_utils.py#L75-L98).
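
For example, the single-checkpoint revision can be selected by passing `revision` when downloading or loading the weights. Below is a minimal sketch with `huggingface_hub` (the local folder name is arbitrary); `from_pretrained(...)` in `transformers` accepts the same `revision` argument.
```python
from huggingface_hub import snapshot_download

# Download the single-checkpoint revision of the quantized model
snapshot_download(
    repo_id="shuyuej/Llama-3.3-70B-Instruct-GPTQ",
    revision="f77c1b3",
    local_dir="Llama-3.3-70B-Instruct-GPTQ-single-checkpoint",  # arbitrary local folder name
)
```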

## 📝 Quantization Configurations
This model was quantized using the [GPTQModel](https://github.com/ModelCloud/GPTQModel) library.
The source code used to build this quantized model is available [here](https://github.com/vkola-lab/PodGPT/blob/main/quantization/quantization_GPTQModel.py).
```json
"quantization_config": {
    "bits": 4,
    "checkpoint_format": "gptq",
    "desc_act": true,
    "dynamic": null,
    "group_size": 128,
    "lm_head": false,
    "meta": {
        "damp_auto_increment": 0.0015,
        "damp_percent": 0.01,
        "quantizer": [
            "gptqmodel:1.4.0-dev"
        ],
        "static_groups": false,
        "true_sequential": true,
        "uri": "https://github.com/modelcloud/gptqmodel"
    },
    "quant_method": "gptq",
    "sym": true
},
```
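
To double-check these settings against the files you actually downloaded, you can read them back programmatically. A minimal sketch with `transformers` (depending on your version, the value may be returned as a plain dict or a config object):
```python
from transformers import AutoConfig

# Read the quantization settings stored in the checkpoint's config.json
config = AutoConfig.from_pretrained("shuyuej/Llama-3.3-70B-Instruct-GPTQ")
print(config.quantization_config)
```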

## ⚙️ Source Code
**Source code** for quantization:<br>
[https://github.com/vkola-lab/medpodgpt/tree/main/quantization](https://github.com/vkola-lab/medpodgpt/tree/main/quantization).<br>
**Source code** for training the LoRA adapter:<br>
[https://github.com/vkola-lab/PodGPT/blob/main/lib/model_loader_quantization.py](https://github.com/vkola-lab/PodGPT/blob/main/lib/model_loader_quantization.py).<br>
**Source code** for inference (without and with LoRA):<br>
[https://github.com/vkola-lab/PodGPT/blob/main/utils/vllm_utils.py#L61-L121](https://github.com/vkola-lab/PodGPT/blob/main/utils/vllm_utils.py#L61-L121).