---
datasets:
- FreedomIntelligence/medical-o1-reasoning-SFT
base_model:
- unsloth/phi-4
---
|
This model is fine-tuned on the FreedomIntelligence/medical-o1-reasoning-SFT dataset and quantized with AWQ (GEMM). The RLAIF stage is not completed. You can read the dataset authors' paper: https://arxiv.org/pdf/2412.18925
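If you prefer loading the AWQ checkpoint directly with 🤗 Transformers instead of a serving engine, here is a minimal sketch (assumes `autoawq` is installed; the prompt is only an illustration):

```python
# Minimal sketch: load the AWQ checkpoint with transformers (requires autoawq).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Yujivus/Phi-4-Health-CoT-1.1-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # AWQ kernels run in fp16
    device_map="auto",
)

# Format an example prompt with the model's chat template.
messages = [{"role": "user", "content": "What are the symptoms of diabetes?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```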
|
|
|
Trained with Unsloth for faster fine-tuning and to allow training more parameters.
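For reference, a minimal sketch of what an Unsloth SFT setup for this base model typically looks like. The hyperparameters, the dataset config name, and the column names below are illustrative assumptions, not the exact training configuration used, and depending on your `trl` version some arguments may need to move into `SFTConfig`:

```python
# Illustrative Unsloth SFT setup; values are placeholders, not the actual run.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/phi-4",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Assumed config ("en") and column names for the reasoning-SFT dataset.
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train")

def to_text(example):
    return {"text": f"{example['Question']}\n{example['Complex_CoT']}\n{example['Response']}"}

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset.map(to_text),
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```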
|
|
|
To use it with vLLM:
|
|
|
```bash
docker network create vllm

docker run --runtime=nvidia --gpus all --network vllm --name vllm \
  -v vllm_cache:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=..." \
  --env "HF_HUB_ENABLE_HF_TRANSFER=0" \
  -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model Yujivus/Phi-4-Health-CoT-1.1-AWQ \
  --quantization awq_marlin \
  --dtype float16 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 2500
```
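Before running the benchmark below, you can confirm the server is up with a quick check of the OpenAI-compatible API (a minimal sketch; assumes port 8000 is published on the host as in the command above):

```python
# Quick sanity check: list the models served by the vLLM OpenAI-compatible API.
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # expect Yujivus/Phi-4-Health-CoT-1.1-AWQ
```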
|
|
|
You can test vLLM's speed with the script below. Run it inside the `vllm` Docker network (or change `base_url` to `http://localhost:8000/v1` when running it from the host):
|
|
|
```python
import asyncio

from openai import AsyncOpenAI


async def get_chat_response_streaming(prompt, index):
    client = AsyncOpenAI(
        base_url="http://vllm:8000/v1",  # hostname of the vLLM container on the Docker network
        api_key="EMPTY",
    )

    messages = [
        {"role": "user", "content": prompt},
    ]

    print(f"Request {index+1}: Starting", flush=True)

    stream = await client.chat.completions.create(
        model="Yujivus/Phi-4-Health-CoT-1.1-AWQ",
        messages=messages,
        max_tokens=200,
        temperature=0.7,
        stream=True,
    )

    # Accumulate the streamed tokens while printing them as they arrive.
    accumulated_response = ""
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content is not None:
            delta_content = chunk.choices[0].delta.content
            accumulated_response += delta_content
            print(delta_content, end="", flush=True)

    print(f"\nRequest {index+1}: Finished", flush=True)

    # Stagger the final printouts so the full results do not interleave.
    await asyncio.sleep(index * 0.5)
    print(f"\nResult {index + 1}: {accumulated_response}\n", flush=True)

    return accumulated_response


async def main():
    prompts = [
        "What are the symptoms of diabetes?",
        "How is diabetes diagnosed?",
        "What are the complications of hypertension?",
        "How is pneumonia treated?",
        "What are the symptoms of diabetes?",
        "How is diabetes diagnosed?",
        "What are the complications of hypertension?",
        "How is pneumonia treated?",
    ]

    # Send all 8 requests concurrently and await them as they complete.
    tasks = [get_chat_response_streaming(prompt, i) for i, prompt in enumerate(prompts)]
    for future in asyncio.as_completed(tasks):
        await future


if __name__ == "__main__":
    asyncio.run(main())
```
|
|
|
|
|
Since the model is AWQ (GEMM) quantized, you should see close to maximum throughput with these 8 concurrent requests.
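If you want a number rather than a visual impression, here is a minimal sketch that times 8 concurrent non-streaming requests and uses the returned `usage` token counts (same `base_url` assumption as above):

```python
# Rough aggregate-throughput estimate: run 8 concurrent requests and divide
# the total number of generated tokens by the wall-clock time.
import asyncio
import time

from openai import AsyncOpenAI


async def one_request(client, prompt):
    resp = await client.chat.completions.create(
        model="Yujivus/Phi-4-Health-CoT-1.1-AWQ",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
        temperature=0.7,
    )
    return resp.usage.completion_tokens


async def benchmark():
    client = AsyncOpenAI(base_url="http://vllm:8000/v1", api_key="EMPTY")
    prompts = ["How is pneumonia treated?"] * 8

    start = time.perf_counter()
    token_counts = await asyncio.gather(*(one_request(client, p) for p in prompts))
    elapsed = time.perf_counter() - start

    total = sum(token_counts)
    print(f"{total} tokens in {elapsed:.1f}s -> {total / elapsed:.1f} tokens/s aggregate")


if __name__ == "__main__":
    asyncio.run(benchmark())
```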
|
|
|
|
|
To use it with TGI:
|
|
|
```bash
docker network create tgi

docker run --name tgi-server --gpus all -p 80:80 --network tgi \
  -v volume:/data \
  --env HUGGING_FACE_HUB_TOKEN=... \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id Yujivus/Phi-4-Health-CoT-1.1-AWQ \
  --quantize awq
```
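Once the container is running, a minimal sketch of a request against TGI's `/generate` endpoint (assumes the port mapping above, so the server is reachable on host port 80):

```python
# Minimal TGI request against the /generate endpoint.
import requests

resp = requests.post(
    "http://localhost:80/generate",
    json={
        "inputs": "How is pneumonia treated?",
        "parameters": {"max_new_tokens": 200, "temperature": 0.7},
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```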
|
|
|
To use it with llama.cpp or Ollama, see the GGUF conversion: mradermacher/Phi-4-Health-CoT-1.1-GGUF
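For example, with llama-cpp-python (a minimal sketch; the quantization filename pattern is an assumption, pick whichever GGUF file from that repo fits your hardware):

```python
# Minimal llama-cpp-python sketch for the GGUF conversion of this model.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="mradermacher/Phi-4-Health-CoT-1.1-GGUF",
    filename="*Q4_K_M.gguf",  # assumed quant; choose any file available in the repo
    n_ctx=2048,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What are the symptoms of diabetes?"}],
    max_tokens=200,
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])
```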
|
|
|
Thanks to my company, Istechsoft Software Technologies, for their support.