---
datasets:
- FreedomIntelligence/medical-o1-reasoning-SFT
base_model:
- unsloth/phi-4
---
|
This model is fine-tuned on the FreedomIntelligence/medical-o1-reasoning-SFT dataset and quantized with AWQ (GEMM). The RLAIF stage is not completed. You can read the dataset authors' paper: https://arxiv.org/pdf/2412.18925
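If you prefer loading the AWQ checkpoint directly with 🤗 Transformers instead of a serving engine, here is a minimal sketch (assumes `autoawq` is installed; the prompt is only an illustration):

```python
# Minimal sketch: load the AWQ checkpoint with transformers (requires autoawq).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Yujivus/Phi-4-Health-CoT-1.1-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # AWQ kernels run in fp16
    device_map="auto",
)

# Format an example prompt with the model's chat template.
messages = [{"role": "user", "content": "What are the symptoms of diabetes?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```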
|
|
|
Trained with Unsloth for faster fine-tuning and to allow training more parameters.
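For reference, a minimal sketch of what an Unsloth SFT setup for this base model typically looks like. The hyperparameters, the dataset config name, and the column names below are illustrative assumptions, not the exact training configuration used, and depending on your `trl` version some arguments may need to move into `SFTConfig`:

```python
# Illustrative Unsloth SFT setup; values are placeholders, not the actual run.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/phi-4",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Assumed config ("en") and column names for the reasoning-SFT dataset.
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train")

def to_text(example):
    return {"text": f"{example['Question']}\n{example['Complex_CoT']}\n{example['Response']}"}

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset.map(to_text),
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```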
|
|
|
To use it with vLLM:
|
|
|
```bash
docker network create vllm

docker run --runtime=nvidia --gpus all --network vllm --name vllm \
  -v vllm_cache:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=..." \
  --env "HF_HUB_ENABLE_HF_TRANSFER=0" \
  -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model Yujivus/Phi-4-Health-CoT-1.1-AWQ \
  --quantization awq_marlin \
  --dtype float16 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 2500
```
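Before running the benchmark below, you can confirm the server is up with a quick check of the OpenAI-compatible API (a minimal sketch; assumes port 8000 is published on the host as in the command above):

```python
# Quick sanity check: list the models served by the vLLM OpenAI-compatible API.
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # expect Yujivus/Phi-4-Health-CoT-1.1-AWQ
```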
|
|
|
You can test vLLM's speed with the script below. Run it inside the `vllm` Docker network (or change `base_url` to `http://localhost:8000/v1` when running it from the host):
|
|
|
```python
import asyncio

from openai import AsyncOpenAI


async def get_chat_response_streaming(prompt, index):
    client = AsyncOpenAI(
        base_url="http://vllm:8000/v1",  # hostname of the vLLM container on the Docker network
        api_key="EMPTY",
    )

    messages = [
        {"role": "user", "content": prompt},
    ]

    print(f"Request {index+1}: Starting", flush=True)

    stream = await client.chat.completions.create(
        model="Yujivus/Phi-4-Health-CoT-1.1-AWQ",
        messages=messages,
        max_tokens=200,
        temperature=0.7,
        stream=True,
    )

    # Accumulate the streamed tokens while printing them as they arrive.
    accumulated_response = ""
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content is not None:
            delta_content = chunk.choices[0].delta.content
            accumulated_response += delta_content
            print(delta_content, end="", flush=True)

    print(f"\nRequest {index+1}: Finished", flush=True)

    # Stagger the final printouts so the full results do not interleave.
    await asyncio.sleep(index * 0.5)
    print(f"\nResult {index + 1}: {accumulated_response}\n", flush=True)

    return accumulated_response


async def main():
    prompts = [
        "What are the symptoms of diabetes?",
        "How is diabetes diagnosed?",
        "What are the complications of hypertension?",
        "How is pneumonia treated?",
        "What are the symptoms of diabetes?",
        "How is diabetes diagnosed?",
        "What are the complications of hypertension?",
        "How is pneumonia treated?",
    ]

    # Send all 8 requests concurrently and await them as they complete.
    tasks = [get_chat_response_streaming(prompt, i) for i, prompt in enumerate(prompts)]
    for future in asyncio.as_completed(tasks):
        await future


if __name__ == "__main__":
    asyncio.run(main())
```
|
|
|
|
|
Since the model is AWQ (GEMM) quantized, you should see close to maximum throughput with these 8 concurrent requests.
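If you want a number rather than a visual impression, here is a minimal sketch that times 8 concurrent non-streaming requests and uses the returned `usage` token counts (same `base_url` assumption as above):

```python
# Rough aggregate-throughput estimate: run 8 concurrent requests and divide
# the total number of generated tokens by the wall-clock time.
import asyncio
import time

from openai import AsyncOpenAI


async def one_request(client, prompt):
    resp = await client.chat.completions.create(
        model="Yujivus/Phi-4-Health-CoT-1.1-AWQ",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
        temperature=0.7,
    )
    return resp.usage.completion_tokens


async def benchmark():
    client = AsyncOpenAI(base_url="http://vllm:8000/v1", api_key="EMPTY")
    prompts = ["How is pneumonia treated?"] * 8

    start = time.perf_counter()
    token_counts = await asyncio.gather(*(one_request(client, p) for p in prompts))
    elapsed = time.perf_counter() - start

    total = sum(token_counts)
    print(f"{total} tokens in {elapsed:.1f}s -> {total / elapsed:.1f} tokens/s aggregate")


if __name__ == "__main__":
    asyncio.run(benchmark())
```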
|
|
|
|
|
To use it with TGI:
|
|
|
```bash
docker network create tgi

docker run --name tgi-server --gpus all -p 80:80 --network tgi \
  -v volume:/data \
  --env HUGGING_FACE_HUB_TOKEN=... \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id Yujivus/Phi-4-Health-CoT-1.1-AWQ \
  --quantize awq
```
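Once the container is running, a minimal sketch of a request against TGI's `/generate` endpoint (assumes the port mapping above, so the server is reachable on host port 80):

```python
# Minimal TGI request against the /generate endpoint.
import requests

resp = requests.post(
    "http://localhost:80/generate",
    json={
        "inputs": "How is pneumonia treated?",
        "parameters": {"max_new_tokens": 200, "temperature": 0.7},
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```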
|
|
|
To use it with llama.cpp or Ollama, see the GGUF conversion: mradermacher/Phi-4-Health-CoT-1.1-GGUF
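For example, with llama-cpp-python (a minimal sketch; the quantization filename pattern is an assumption, pick whichever GGUF file from that repo fits your hardware):

```python
# Minimal llama-cpp-python sketch for the GGUF conversion of this model.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="mradermacher/Phi-4-Health-CoT-1.1-GGUF",
    filename="*Q4_K_M.gguf",  # assumed quant; choose any file available in the repo
    n_ctx=2048,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What are the symptoms of diabetes?"}],
    max_tokens=200,
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])
```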
|
|
|
Thanks to my company, Istechsoft Software Technologies, for their support.