---
datasets:
- FreedomIntelligence/medical-o1-reasoning-SFT
base_model:
- unsloth/phi-4
---
This model was fine-tuned on the FreedomIntelligence/medical-o1-reasoning-SFT dataset and quantized with AWQ (GEMM kernels). The RLAIF stage is not completed yet. You can read the dataset authors' paper here: https://arxiv.org/pdf/2412.18925

Training was done with Unsloth, which enables faster fine-tuning and makes it possible to train more parameters within the same memory budget.

To use it with vLLM:

```bash
docker network create vllm
docker run --runtime=nvidia --gpus all --network vllm --name vllm \
  -v vllm_cache:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=..." --env "HF_HUB_ENABLE_HF_TRANSFER=0" \
  -p 8000:8000 --ipc=host vllm/vllm-openai:latest \
  --model Yujivus/Phi-4-Health-CoT-1.1-AWQ --quantization awq_marlin \
  --dtype float16 --gpu-memory-utilization 0.95 --max-model-len 2500
```
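
As a quick sanity check before running the benchmark below, a minimal non-streaming request against the OpenAI-compatible endpoint could look like the following sketch. The prompt and the `localhost` base URL are just examples; from inside the `vllm` Docker network use `http://vllm:8000/v1` instead.

```python
# Minimal sketch: single non-streaming request to the vLLM server started above.
# Assumes the server is reachable on localhost:8000 via the published port.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Yujivus/Phi-4-Health-CoT-1.1-AWQ",
    messages=[{"role": "user", "content": "What are the symptoms of anemia?"}],
    max_tokens=200,
    temperature=0.7,
)
print(response.choices[0].message.content)
```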

You can test vLLM's throughput by sending several streaming requests concurrently:

```python
import asyncio

from openai import AsyncOpenAI


async def get_chat_response_streaming(prompt, index):
    # The base_url uses the "vllm" container name, so run this inside the same
    # Docker network; use http://localhost:8000/v1 when calling from the host.
    client = AsyncOpenAI(
        base_url="http://vllm:8000/v1",
        api_key="EMPTY",
    )

    messages = [
        {"role": "user", "content": prompt},
    ]

    print(f"Request {index + 1}: Starting", flush=True)

    stream = await client.chat.completions.create(
        model="Yujivus/Phi-4-Health-CoT-1.1-AWQ",
        messages=messages,
        max_tokens=200,
        temperature=0.7,
        stream=True,
    )

    # Accumulate the streamed tokens while printing them as they arrive.
    accumulated_response = ""
    async for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            delta_content = chunk.choices[0].delta.content
            accumulated_response += delta_content
            print(delta_content, end="", flush=True)

    print(f"\nRequest {index + 1}: Finished", flush=True)

    # Stagger the final printouts so the full results are easier to read.
    await asyncio.sleep(index * 0.5)
    print(f"\nResult {index + 1}: {accumulated_response}\n", flush=True)

    return accumulated_response


async def main():
    prompts = [
        "What are the symptoms of diabetes?",
        "How is diabetes diagnosed?",
        "What are the complications of hypertension?",
        "How is pneumonia treated?",
        "What are the symptoms of diabetes?",
        "How is diabetes diagnosed?",
        "What are the complications of hypertension?",
        "How is pneumonia treated?",
    ]

    # Launch all requests concurrently and await them as they complete.
    tasks = [get_chat_response_streaming(prompt, i) for i, prompt in enumerate(prompts)]
    for future in asyncio.as_completed(tasks):
        await future


if __name__ == "__main__":
    asyncio.run(main())
```


Since the model is AWQ (GEMM) quantized, you should see maximum throughput at around 8 concurrent requests.
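
If you also want a rough end-to-end number, one possible sketch (not part of the original script) is to reuse `get_chat_response_streaming` from the script above and time the whole batch. Character count is only a proxy for token count, and the staggered sleep in the helper adds a few seconds, so treat the figure as approximate; the vLLM server logs report exact tokens-per-second.

```python
# Rough throughput sketch: reuses get_chat_response_streaming from the script above.
import asyncio
import time


async def timed_main():
    prompts = ["What are the symptoms of diabetes?"] * 8
    start = time.perf_counter()
    results = await asyncio.gather(
        *(get_chat_response_streaming(prompt, i) for i, prompt in enumerate(prompts))
    )
    elapsed = time.perf_counter() - start
    total_chars = sum(len(r) for r in results)
    print(f"{len(prompts)} requests in {elapsed:.1f}s (~{total_chars / elapsed:.0f} chars/s)")


asyncio.run(timed_main())
```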


To use it with TGI:

```bash
docker network create tgi
docker run --name tgi-server --gpus all -p 8080:80 --network tgi \
  -v volume:/data --env HUGGING_FACE_HUB_TOKEN=... \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id Yujivus/Phi-4-Health-CoT-1.1-AWQ --quantize awq
```
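
Once the container is up, a minimal sketch for querying TGI's `/generate` endpoint with the `requests` library might look like this. The prompt and the `localhost:8080` URL are assumptions; adjust the URL to match your `-p` port mapping, or use the container name from inside the `tgi` network.

```python
# Minimal sketch: send one prompt to the TGI /generate endpoint.
import requests

payload = {
    "inputs": "What are the symptoms of diabetes?",
    "parameters": {"max_new_tokens": 200, "temperature": 0.7},
}
response = requests.post("http://localhost:8080/generate", json=payload, timeout=120)
response.raise_for_status()
print(response.json()["generated_text"])
```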

To use it with llama.cpp or Ollama, use the GGUF conversion: mradermacher/Phi-4-Health-CoT-1.1-GGUF.
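
For example, a minimal llama-cpp-python sketch could load the GGUF directly from the Hub. The `Q4_K_M` filename pattern is an assumption; pick whichever quantization file the GGUF repo actually provides.

```python
# Minimal sketch: load the GGUF conversion with llama-cpp-python.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="mradermacher/Phi-4-Health-CoT-1.1-GGUF",
    filename="*Q4_K_M.gguf",  # glob pattern; assumed quantization variant
    n_ctx=2048,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What are the symptoms of diabetes?"}],
    max_tokens=200,
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])
```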

Thanks to my company, Istechsoft Software Technologies, for its support.