Ngixdev committed on
Commit d7860c8 · verified · 1 Parent(s): 31b5080

Use llama.cpp server with OpenAI-compatible API

Files changed (4):
  1. Dockerfile +19 -25
  2. README.md +59 -23
  3. app.py +0 -280
  4. requirements.txt +0 -2
Dockerfile CHANGED
@@ -1,28 +1,22 @@
- FROM nvidia/cuda:12.4.0-devel-ubuntu22.04
-
- ENV DEBIAN_FRONTEND=noninteractive
- ENV CMAKE_ARGS="-DGGML_CUDA=on"
- ENV FORCE_CMAKE=1
-
- RUN apt-get update && apt-get install -y \
-     python3 \
-     python3-pip \
-     git \
-     cmake \
-     build-essential \
-     && rm -rf /var/lib/apt/lists/*
+ FROM ghcr.io/ggml-org/llama.cpp:full
 
  WORKDIR /app
 
- RUN pip3 install --no-cache-dir --upgrade pip
-
- RUN pip3 install --no-cache-dir llama-cpp-python
-
- COPY requirements.txt .
- RUN pip3 install --no-cache-dir -r requirements.txt
-
- COPY app.py .
-
- EXPOSE 7860
-
- CMD ["python3", "app.py"]
+ RUN apt update && apt install -y python3-pip
+ RUN pip install -U huggingface_hub
+
+ RUN python3 -c 'from huggingface_hub import hf_hub_download; \
+     repo="HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive"; \
+     hf_hub_download(repo_id=repo, filename="Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf", local_dir="/app"); \
+     hf_hub_download(repo_id=repo, filename="mmproj-Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf", local_dir="/app")'
+
+ CMD ["--server", \
+     "-m", "/app/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf", \
+     "--mmproj", "/app/mmproj-Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf", \
+     "--host", "0.0.0.0", \
+     "--port", "7860", \
+     "-t", "2", \
+     "--cache-type-k", "q8_0", \
+     "--cache-type-v", "iq4_nl", \
+     "-c", "32768", \
+     "-n", "8192"]
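
The new image can be smoke-tested locally before the Space rebuilds; a minimal sketch, assuming Docker with NVIDIA GPU support is available (the `qwen-llama-server` tag and host port mapping are placeholders, not part of this commit):

```shell
# Build the image from the new Dockerfile (tag name is arbitrary).
docker build -t qwen-llama-server .

# Run with GPU passthrough, mapping the port set by --port in CMD.
docker run --gpus all -p 7860:7860 qwen-llama-server

# The llama.cpp server answers a health check once the model is loaded.
curl http://localhost:7860/health
```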
README.md CHANGED
@@ -4,7 +4,6 @@ emoji: 🤖
  colorFrom: blue
  colorTo: purple
  sdk: docker
- app_file: app.py
  pinned: false
  license: apache-2.0
  tags:
@@ -12,12 +11,13 @@ tags:
  - uncensored
  - llama-cpp
  - gguf
+ - openai-compatible
  suggested_hardware: a10g-small
  ---
 
- # Qwen3.5-9B Uncensored API Interface
+ # Qwen3.5-9B Uncensored API
 
- API interface for [HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive).
+ OpenAI-compatible API for [HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive).
 
  ## Features
@@ -26,49 +26,85 @@
  - Multimodal capable (text, image, video)
  - Supports 201 languages
  - Q4_K_M quantization via llama.cpp
+ - OpenAI-compatible API
 
  ## API Usage
 
- ### Python
+ ### Python (OpenAI SDK)
 
  ```python
- from gradio_client import Client
+ from openai import OpenAI
 
- client = Client("Ngixdev/qwen-api")
+ client = OpenAI(
+     base_url="https://ngixdev-qwen-api.hf.space/v1",
+     api_key="not-needed"
+ )
 
- result = client.predict(
-     prompt="Your question here",
-     system_prompt="You are a helpful assistant",
+ response = client.chat.completions.create(
+     model="qwen",
+     messages=[
+         {"role": "system", "content": "You are a helpful assistant."},
+         {"role": "user", "content": "Hello, who are you?"}
+     ],
      temperature=0.7,
-     top_p=0.8,
-     max_tokens=1024,
-     api_name="/api_generate"
+     max_tokens=1024
  )
- print(result)
+
+ print(response.choices[0].message.content)
  ```
 
  ### cURL
 
  ```bash
- curl -X POST https://ngixdev-qwen-api.hf.space/api/api_generate \
+ curl https://ngixdev-qwen-api.hf.space/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
-     "data": [
-         "Your question here",
-         "You are a helpful assistant",
-         0.7,
-         0.8,
-         1024
-     ]
+     "model": "qwen",
+     "messages": [
+         {"role": "system", "content": "You are a helpful assistant."},
+         {"role": "user", "content": "Hello!"}
+     ],
+     "temperature": 0.7,
+     "max_tokens": 1024
  }'
  ```
 
+ ### Streaming
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(
+     base_url="https://ngixdev-qwen-api.hf.space/v1",
+     api_key="not-needed"
+ )
+
+ stream = client.chat.completions.create(
+     model="qwen",
+     messages=[{"role": "user", "content": "Tell me a story"}],
+     stream=True
+ )
+
+ for chunk in stream:
+     if chunk.choices[0].delta.content:
+         print(chunk.choices[0].delta.content, end="")
+ ```
+
+ ## Endpoints
+
+ | Endpoint | Description |
+ |----------|-------------|
+ | `/v1/chat/completions` | Chat completions (OpenAI-compatible) |
+ | `/v1/completions` | Text completions |
+ | `/v1/models` | List available models |
+ | `/health` | Health check |
+
  ## Parameters
 
  | Parameter | Type | Default | Description |
  |-----------|------|---------|-------------|
- | prompt | string | required | User prompt/question |
- | system_prompt | string | "" | System instruction |
+ | messages | array | required | Chat messages |
  | temperature | float | 0.7 | Sampling temperature (0.0-2.0) |
  | top_p | float | 0.8 | Nucleus sampling (0.0-1.0) |
  | max_tokens | int | 1024 | Maximum tokens to generate |
+ | stream | bool | false | Enable streaming response |
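
The parameter table above maps directly onto the JSON body of a `/v1/chat/completions` request. As an offline sanity check (no server call is made; the model name and defaults follow the README examples), the payload can be assembled and round-tripped through JSON:

```python
import json

# Request body using the parameters documented above.
payload = {
    "model": "qwen",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "temperature": 0.7,   # sampling temperature, 0.0-2.0
    "top_p": 0.8,         # nucleus sampling, 0.0-1.0
    "max_tokens": 1024,   # generation cap
    "stream": False,      # set True for incremental chunks
}

# Serialize and round-trip to confirm the body is valid JSON
# with the fields the endpoint expects.
body = json.dumps(payload)
decoded = json.loads(body)

assert decoded["model"] == "qwen"
assert len(decoded["messages"]) == 2
```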
app.py DELETED
@@ -1,280 +0,0 @@
- import os
- import gradio as gr
- from huggingface_hub import hf_hub_download
- from llama_cpp import Llama
-
- MODEL_REPO = "HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive"
- MODEL_FILE = "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf"
-
- print("Downloading model...")
- model_path = hf_hub_download(repo_id=MODEL_REPO, filename=MODEL_FILE)
- print(f"Model downloaded to: {model_path}")
-
- print("Loading model...")
- llm = Llama(
-     model_path=model_path,
-     n_ctx=8192,
-     n_gpu_layers=-1,
-     verbose=False,
- )
- print("Model loaded!")
-
-
- def format_messages(message: str, history: list, system_prompt: str = "") -> str:
-     formatted = ""
-
-     if system_prompt.strip():
-         formatted += f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
-
-     for user_msg, assistant_msg in history:
-         if user_msg:
-             formatted += f"<|im_start|>user\n{user_msg}<|im_end|>\n"
-         if assistant_msg:
-             formatted += f"<|im_start|>assistant\n{assistant_msg}<|im_end|>\n"
-
-     formatted += f"<|im_start|>user\n{message}<|im_end|>\n<|im_start|>assistant\n"
-     return formatted
-
-
- def generate_response(
-     message: str,
-     history: list,
-     system_prompt: str = "",
-     temperature: float = 0.7,
-     top_p: float = 0.8,
-     top_k: int = 20,
-     max_tokens: int = 2048,
- ) -> str:
-     prompt = format_messages(message, history, system_prompt)
-
-     output = llm(
-         prompt,
-         max_tokens=max_tokens,
-         temperature=temperature,
-         top_p=top_p,
-         top_k=top_k,
-         stop=["<|im_end|>", "<|im_start|>"],
-     )
-
-     return output["choices"][0]["text"].strip()
-
-
- def api_generate(
-     prompt: str,
-     system_prompt: str = "",
-     temperature: float = 0.7,
-     top_p: float = 0.8,
-     max_tokens: int = 2048,
- ) -> dict:
-     """
-     API endpoint for text generation.
-
-     Args:
-         prompt: The user prompt/question
-         system_prompt: Optional system instruction
-         temperature: Sampling temperature (0.0-2.0)
-         top_p: Nucleus sampling parameter (0.0-1.0)
-         max_tokens: Maximum tokens to generate
-
-     Returns:
-         Dictionary with 'response' key containing generated text
-     """
-     try:
-         response = generate_response(
-             message=prompt,
-             history=[],
-             system_prompt=system_prompt,
-             temperature=temperature,
-             top_p=top_p,
-             max_tokens=max_tokens,
-         )
-         return {"response": response, "status": "success"}
-     except Exception as e:
-         return {"response": None, "status": "error", "error": str(e)}
-
-
- with gr.Blocks(title="Qwen3.5-9B Uncensored API", theme=gr.themes.Soft()) as demo:
-     gr.Markdown(
-         """
-         # 🤖 Qwen3.5-9B Uncensored API Interface
-
-         Powered by [HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive)
-
-         **Features:**
-         - 9B parameters with 262K context window
-         - Fully uncensored (0/465 refusals)
-         - Multimodal capable (text, image, video)
-         - Supports 201 languages
-         - Running with Q4_K_M quantization via llama.cpp
-
-         Use the chat interface below or access via API.
-         """
-     )
-
-     with gr.Tab("💬 Chat"):
-         chatbot = gr.Chatbot(height=500, label="Conversation")
-
-         with gr.Row():
-             msg = gr.Textbox(
-                 label="Message",
-                 placeholder="Type your message here...",
-                 scale=4,
-                 lines=2,
-             )
-             submit_btn = gr.Button("Send", variant="primary", scale=1)
-
-         with gr.Accordion("⚙️ Settings", open=False):
-             system_prompt = gr.Textbox(
-                 label="System Prompt",
-                 placeholder="Optional: Set behavior/personality for the model",
-                 lines=3,
-             )
-             with gr.Row():
-                 temperature = gr.Slider(
-                     minimum=0.0,
-                     maximum=2.0,
-                     value=0.7,
-                     step=0.1,
-                     label="Temperature",
-                 )
-                 top_p = gr.Slider(
-                     minimum=0.0,
-                     maximum=1.0,
-                     value=0.8,
-                     step=0.05,
-                     label="Top P",
-                 )
-             with gr.Row():
-                 top_k = gr.Slider(
-                     minimum=1,
-                     maximum=100,
-                     value=20,
-                     step=1,
-                     label="Top K",
-                 )
-                 max_tokens = gr.Slider(
-                     minimum=64,
-                     maximum=4096,
-                     value=1024,
-                     step=64,
-                     label="Max Tokens",
-                 )
-
-         clear_btn = gr.Button("🗑️ Clear Chat")
-
-         def user_submit(message, history):
-             return "", history + [[message, None]]
-
-         def bot_response(history, system_prompt, temperature, top_p, top_k, max_tokens):
-             if not history:
-                 return history
-
-             message = history[-1][0]
-             history_without_last = history[:-1]
-
-             response = generate_response(
-                 message,
-                 history_without_last,
-                 system_prompt,
-                 temperature,
-                 top_p,
-                 top_k,
-                 max_tokens,
-             )
-             history[-1][1] = response
-             return history
-
-         msg.submit(
-             user_submit,
-             [msg, chatbot],
-             [msg, chatbot],
-         ).then(
-             bot_response,
-             [chatbot, system_prompt, temperature, top_p, top_k, max_tokens],
-             chatbot,
-         )
-
-         submit_btn.click(
-             user_submit,
-             [msg, chatbot],
-             [msg, chatbot],
-         ).then(
-             bot_response,
-             [chatbot, system_prompt, temperature, top_p, top_k, max_tokens],
-             chatbot,
-         )
-
-         clear_btn.click(lambda: [], None, chatbot)
-
-     with gr.Tab("🔌 API"):
-         gr.Markdown(
-             """
-             ## API Usage
-
-             This Space provides a REST API for programmatic access.
-
-             ### Python Example
-
-             ```python
-             from gradio_client import Client
-
-             client = Client("Ngixdev/qwen-api")
-
-             result = client.predict(
-                 prompt="Explain quantum computing in simple terms",
-                 system_prompt="You are a helpful assistant",
-                 temperature=0.7,
-                 top_p=0.8,
-                 max_tokens=1024,
-                 api_name="/api_generate"
-             )
-             print(result)
-             ```
-
-             ### cURL Example
-
-             ```bash
-             curl -X POST https://ngixdev-qwen-api.hf.space/api/api_generate \\
-                 -H "Content-Type: application/json" \\
-                 -d '{
-                     "data": [
-                         "Explain quantum computing",
-                         "You are a helpful assistant",
-                         0.7,
-                         0.8,
-                         1024
-                     ]
-                 }'
-             ```
-             """
-         )
-
-         with gr.Row():
-             with gr.Column():
-                 api_prompt = gr.Textbox(
-                     label="Prompt",
-                     placeholder="Enter your prompt here...",
-                     lines=4,
-                 )
-                 api_system = gr.Textbox(
-                     label="System Prompt (Optional)",
-                     placeholder="Set behavior/personality...",
-                     lines=2,
-                 )
-                 with gr.Row():
-                     api_temp = gr.Slider(0.0, 2.0, 0.7, step=0.1, label="Temperature")
-                     api_top_p = gr.Slider(0.0, 1.0, 0.8, step=0.05, label="Top P")
-                 api_max_tokens = gr.Slider(64, 4096, 1024, step=64, label="Max Tokens")
-                 api_submit = gr.Button("Generate", variant="primary")
-
-             with gr.Column():
-                 api_output = gr.JSON(label="API Response")
-
-         api_submit.click(
-             api_generate,
-             [api_prompt, api_system, api_temp, api_top_p, api_max_tokens],
-             api_output,
-             api_name="api_generate",
-         )
-
- demo.launch(server_name="0.0.0.0", server_port=7860)
requirements.txt DELETED
@@ -1,2 +0,0 @@
- gradio>=4.0.0
- huggingface_hub>=0.20.0