Instructions for using mahahahug/qwen3.5-4b-opus46-cot with libraries, inference servers, and local apps.

Libraries

How to use mahahahug/qwen3.5-4b-opus46-cot with PEFT:

from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("unsloth/Qwen3.5-4B")
model = PeftModel.from_pretrained(base_model, "mahahahug/qwen3.5-4b-opus46-cot")

llama-cpp-python

How to use mahahahug/qwen3.5-4b-opus46-cot with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="mahahahug/qwen3.5-4b-opus46-cot",
	filename="qwen3.5-4b-opus46-cot-Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
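
The call above returns an OpenAI-style completion dictionary. A minimal sketch of pulling out the assistant's text, assuming that response layout (`choices[0].message.content`); `extract_reply` is a hypothetical helper, and the sample dictionary is made up for illustration:

```python
from typing import Any


def extract_reply(response: dict[str, Any]) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        return ""
    message = choices[0].get("message") or {}
    return (message.get("content") or "").strip()


# Hypothetical response, shaped like the output of create_chat_completion.
sample_response = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": " Paris. "}}
    ]
}
print(extract_reply(sample_response))
```

Returning an empty string for a missing or empty `choices` list keeps the helper safe to call on error responses.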

llama.cpp

How to use mahahahug/qwen3.5-4b-opus46-cot with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf mahahahug/qwen3.5-4b-opus46-cot:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf mahahahug/qwen3.5-4b-opus46-cot:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf mahahahug/qwen3.5-4b-opus46-cot:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf mahahahug/qwen3.5-4b-opus46-cot:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf mahahahug/qwen3.5-4b-opus46-cot:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf mahahahug/qwen3.5-4b-opus46-cot:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf mahahahug/qwen3.5-4b-opus46-cot:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf mahahahug/qwen3.5-4b-opus46-cot:Q4_K_M

Use Docker

docker model run hf.co/mahahahug/qwen3.5-4b-opus46-cot:Q4_K_M

vLLM

How to use mahahahug/qwen3.5-4b-opus46-cot with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "mahahahug/qwen3.5-4b-opus46-cot"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mahahahug/qwen3.5-4b-opus46-cot",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use mahahahug/qwen3.5-4b-opus46-cot with Ollama:
```
ollama run hf.co/mahahahug/qwen3.5-4b-opus46-cot:Q4_K_M
```

Unsloth Studio

How to use mahahahug/qwen3.5-4b-opus46-cot with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for mahahahug/qwen3.5-4b-opus46-cot to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for mahahahug/qwen3.5-4b-opus46-cot to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for mahahahug/qwen3.5-4b-opus46-cot to start chatting

Pi

How to use mahahahug/qwen3.5-4b-opus46-cot with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf mahahahug/qwen3.5-4b-opus46-cot:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "mahahahug/qwen3.5-4b-opus46-cot:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use mahahahug/qwen3.5-4b-opus46-cot with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf mahahahug/qwen3.5-4b-opus46-cot:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default mahahahug/qwen3.5-4b-opus46-cot:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use mahahahug/qwen3.5-4b-opus46-cot with Docker Model Runner:
```
docker model run hf.co/mahahahug/qwen3.5-4b-opus46-cot:Q4_K_M
```

Lemonade

How to use mahahahug/qwen3.5-4b-opus46-cot with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull mahahahug/qwen3.5-4b-opus46-cot:Q4_K_M

Run and chat with the model

lemonade run user.qwen3.5-4b-opus46-cot-Q4_K_M

List all available models

lemonade list

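Qwen3.5-4B × Claude Opus Chain-of-Thought Fine-Tuned Model

A QLoRA/LoRA fine-tune of Qwen3.5-4B, applied to all attention and MLP projection modules, that instills the Chain-of-Thought reasoning style of Claude Opus 4.6. Training fits on a single 24 GB GPU.

Model Description

This repository contains three formats:

Format	Description	Size
LoRA weights (PEFT)	Load directly for inference, or merge back into the base model	~170 MB
GGUF Q4_K_M	4-bit quantization; recommended for llama.cpp / Ollama deployment	~2.6 GB
GGUF Q8_0	8-bit quantization; higher-quality inference	~4.2 GB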
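
The Q8_0 size can be sanity-checked from the parameter count. A rough sketch, assuming the nominal 4 billion parameters and the Q8_0 layout of 32 int8 weights plus one 16-bit scale per block (8.5 bits per weight); real files also hold metadata and may keep some tensors at other precisions, so this is only an estimate. `gguf_size_gb` is a hypothetical helper name:

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate file size in GB (10**9 bytes) for a given average bit width."""
    return n_params * bits_per_weight / 8 / 1e9


# Q8_0: blocks of 32 weights, each block = 32 * 8 bits + one 16-bit scale.
Q8_0_BITS = (32 * 8 + 16) / 32  # 8.5 bits per weight

q8_size = gguf_size_gb(4e9, Q8_0_BITS)
print(f"Q8_0 estimate: {q8_size:.2f} GB")
```

The estimate lands close to the ~4.2 GB listed in the table.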

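Training Configuration

Setting	Value
Base model	Qwen3.5-4B
LoRA target modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
LoRA r / alpha	32 / 64
Max sequence length	16,384
Batch size	12
Learning rate	2e-4 (cosine annealing)
Epochs	2
Total steps	1,606
Optimizer	adamw_8bit
Quantized loading	4-bit NF4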
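
As a reading aid, the table can be restated as keyword arguments plus a quick consistency check. This sketch assumes the dictionary is the kind of thing one would pass to `peft.LoraConfig` (the key names follow that class, but the author's training script is not shown here), that `task_type` is causal language modeling, and that the batch size of 12 is the effective number of samples per optimizer step:

```python
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the table
}

# Standard LoRA scales the low-rank update by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]

# Samples seen per epoch implied by the step count (assumes batch 12 per step).
total_steps, epochs, batch_size = 1606, 2, 12
samples_per_epoch = total_steps // epochs * batch_size

print(scaling)            # 2.0
print(samples_per_epoch)  # 9636
```

Under these assumptions, roughly 9,600 samples per epoch is in line with a dataset of about 10,000 conversations once over-long samples are filtered out.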

Training Results

Metric	Value
Initial loss	1.0738
Final loss	0.3505
Lowest loss	0.2501
Loss reduction	81.5%
Average loss	0.5395
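
The relative loss reduction can be recomputed from the table. This is plain arithmetic on the reported values; neither baseline below reproduces the listed 81.5%, so that figure presumably uses a reference point that the table does not show. `relative_drop` is a hypothetical helper name:

```python
initial_loss, final_loss, lowest_loss = 1.0738, 0.3505, 0.2501


def relative_drop(start: float, end: float) -> float:
    """Percentage decrease from start to end."""
    return (start - end) / start * 100


drop_to_final = relative_drop(initial_loss, final_loss)
drop_to_lowest = relative_drop(initial_loss, lowest_loss)
print(f"initial -> final:  {drop_to_final:.1f}%")
print(f"initial -> lowest: {drop_to_lowest:.1f}%")
```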

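Evaluation Results: GSM8K

The baseline model and the LoRA fine-tuned model were compared on the GSM8K test set (1,319 problems) using lm-eval, 5-shot.

Overall Metrics

Model	Empty responses	Contains reasoning	strict-match	flexible-extract	Calibrated lenient
Qwen3.5-4B baseline	29.9%	0%	67.93%	68.01%	60.73%
After LoRA fine-tuning	0%	95.8%	46.17%	84.08%	77.86%

Scoring criteria:

strict-match: the model must end its output with #### followed by the answer; strict format matching

flexible-extract: a numeric answer is extracted automatically from the response

Calibrated lenient: the answer number appears in the reasoning and does not appear in the problem text (rules out false positives)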
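
The three criteria can be sketched as small Python functions. These are simplified stand-ins written for illustration, not the lm-eval filters or the author's calibration script: the regular expressions, the function names, the sample question, and the rule of taking the last number in the response are all assumptions.

```python
import re

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def _norm(num: str) -> str:
    """Drop commas so '1,000' and '1000' compare equal."""
    return num.replace(",", "")


def strict_match(response: str, gold: str) -> bool:
    """Correct only if the response contains '#### <gold>'."""
    m = re.search(r"#### (-?\d[\d,]*(?:\.\d+)?)", response)
    return m is not None and _norm(m.group(1)) == _norm(gold)


def flexible_extract(response: str, gold: str) -> bool:
    """Correct if the last number anywhere in the response equals gold."""
    nums = NUMBER.findall(response)
    return bool(nums) and _norm(nums[-1]) == _norm(gold)


def calibrated_lenient(question: str, response: str, gold: str) -> bool:
    """Correct if gold appears in the response but not in the question."""
    in_response = _norm(gold) in {_norm(n) for n in NUMBER.findall(response)}
    in_question = _norm(gold) in {_norm(n) for n in NUMBER.findall(question)}
    return in_response and not in_question


question = "Tom has 15 apples and gives away 40% of them. How many are left?"
response = "40% of 15 is 6, so 15 - 6 = 9 apples remain."
print(strict_match(response, "9"))                  # False: no '#### 9' line
print(flexible_extract(response, "9"))              # True
print(calibrated_lenient(question, response, "9"))  # True
```

The sample response is arithmetically correct yet fails strict-match, which is the pattern that separates the strict-match and flexible-extract columns for the fine-tuned model.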

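Key Findings

The LoRA model responds to 100% of prompts and reliably emits a reasoning chain (95.8% of responses contain a thinking process)
**Calibrated lenient score of 77.86%** (1027/1319), 17.13 points above the baseline's 60.73%
The baseline returns an empty response on 29.9% of problems; the LoRA model eliminates this entirely
Main weakness: nonconforming output format. flexible-extract is 84% versus 46% for strict-match, so many problems are solved correctly but do not end with the #### format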
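
The headline percentages can be cross-checked against the 1,319-question test set. The only count stated in the source is 1027 for the calibrated lenient score; the format-gap count below is back-computed from rounded percentages and should be read as approximate.

```python
total = 1319

# Stated directly: 1027 correct under calibrated lenient scoring.
lenient_pct = 1027 / total * 100
print(f"{lenient_pct:.2f}%")  # 77.86%

# Gain over the baseline's 60.73%, in percentage points.
lenient_gain = 77.86 - 60.73
print(f"{lenient_gain:.2f} points")  # 17.13 points

# Approximate number of answers that flexible-extract accepts but
# strict-match rejects (84.08% vs 46.17%).
format_gap = round((84.08 - 46.17) / 100 * total)
print(format_gap)  # 500
```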

Evaluation command:

lm_eval --model local-chat-completions \
  --tasks gsm8k \
  --model_args "model=qwen_cot,base_url=http://127.0.0.1:5003/v1/chat/completions,api_key=sk-fake,tokenized_requests=False,num_concurrent=8,max_length=16384,max_gen_toks=4096" \
  --apply_chat_template \
  --num_fewshot 5 \
  --log_samples \
  --output_path eval_results/lora_gsm8k_test_full

Usage

Method 1: LoRA inference

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer, then attach the LoRA adapter.
model_name = "unsloth/Qwen3.5-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "mahahahug/qwen3.5-4b-opus46-cot")

# The prompt is in Chinese: "Xiao Ming has 15 apples and gives 40% of them
# to Xiao Hong. How many are left? Please think step by step."
messages = [{"role": "user", "content": "小明有15个苹果，给了小红40%，还剩几个？请一步步思考。"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=2048)
# The decoded text includes the prompt followed by the generated reply.
print(tokenizer.decode(output[0], skip_special_tokens=True))

Method 2: llama.cpp API server

llama-server \
  -ngl 1000 \
  --host 0.0.0.0 --port 5003 \
  --flash-attn on \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  -c 15999 \
  --repeat-penalty 1.0 \
  --presence-penalty 1.5 \
  --min-p 0.02 \
  --top-k 30 --top-p 0.9 --temp 0.85 \
  --reasoning on \
  --no-mmap \
  --chat-template chatml \
  -m qwen3.5-4b-opus46-cot-Q4_K_M.gguf

Once started, the server can be called at http://localhost:5003/v1/chat/completions and is compatible with the OpenAI API.
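
A minimal Python client for this endpoint, using only the standard library. It assumes the server from the command above is listening on port 5003 and returns OpenAI-style JSON; the `model` value is a placeholder (an assumption, since llama-server serves whichever model it loaded) and `build_chat_request` is a hypothetical helper name.

```python
import json
import urllib.request

URL = "http://localhost:5003/v1/chat/completions"


def build_chat_request(prompt: str, max_tokens: int = 2048) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": "qwen3.5-4b-opus46-cot",  # placeholder name (assumption)
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("What is the capital of France?")

# Sending requires the server to be running, so it is left commented out:
# with urllib.request.urlopen(request) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.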

Method 3: Ollama

# The SYSTEM line ends with a Chinese instruction meaning "Please reply in Chinese."
cat > Modelfile << 'EOF'
FROM ./qwen3.5-4b-opus46-cot-Q4_K_M.gguf
PARAMETER temperature 0.85
PARAMETER top_k 30
PARAMETER top_p 0.9
PARAMETER min_p 0.02
PARAMETER num_ctx 16000
SYSTEM You are a helpful AI assistant that always thinks step-by-step. 请用中文回复。
EOF

ollama create qwen3.5-opus-cot -f Modelfile
ollama run qwen3.5-opus-cot

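Method 4: llama-cli command line

# The prompt is in Chinese: "A farm has 14 sheep; all but 8 die. How many are left? Please think step by step."
./llama-cli -m qwen3.5-4b-opus46-cot-Q4_K_M.gguf \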
  -ngl 1000 --flash-attn on -c 15999 \
  --reasoning on --temp 0.85 --top-k 30 --top-p 0.9 --min-p 0.02 \
  --chat-template chatml \
  -p "一只农场有14只羊，除了8只都死了，还剩几只？请一步步思考。" \
  -n 2048

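Dataset

Trained on a Claude Opus 4.6 reasoning dataset of about 10,000 reasoning conversations. During training, the reasoning field is automatically wrapped in <｜begin_of_think｜> / <｜end_of_think｜> tags, and over-long samples are filtered out.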
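
The preprocessing described above might look roughly like the following. This is a sketch under stated assumptions, not the repository's training code: the `answer` field name, the `format_sample` and `filter_by_length` helpers, and the whitespace-based token counter are all hypothetical (a real run would count tokens with the model's tokenizer).

```python
from typing import Callable

BEGIN_THINK = "<｜begin_of_think｜>"
END_THINK = "<｜end_of_think｜>"
MAX_SEQ_LEN = 16_384  # maximum sequence length from the training configuration


def format_sample(sample: dict[str, str]) -> str:
    """Wrap the reasoning field in think tags and append the final answer."""
    return f"{BEGIN_THINK}\n{sample['reasoning']}\n{END_THINK}\n{sample['answer']}"


def filter_by_length(
    texts: list[str],
    count_tokens: Callable[[str], int],
    max_len: int = MAX_SEQ_LEN,
) -> list[str]:
    """Keep only texts whose token count fits within max_len."""
    return [t for t in texts if count_tokens(t) <= max_len]


def whitespace_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated pieces."""
    return len(text.split())


samples = [
    {"reasoning": "15 * 0.4 = 6, and 15 - 6 = 9.", "answer": "9 apples remain."},
    {"reasoning": "word " * 20_000, "answer": "too long"},
]
kept = filter_by_length([format_sample(s) for s in samples], whitespace_tokens)
print(len(kept))  # 1
```

Passing the token counter in as a function keeps the filter independent of any particular tokenizer.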

Model Links

Platform	Repository	Contents
HuggingFace	mahahahug/qwen3.5-4b-opus46-cot	This repository (LoRA + GGUF)
ModelScope	oooooo0o/qwen3.5-4b-opus46-cot	LoRA + GGUF Q4_K_M + GGUF Q8_0
GitHub	Pyzmxu/qwen3.5_4b_opus	Training code (Unsloth + LoRA)

License

MIT

GGUF

Model size: 4B params
Architecture: qwen35
Available quantizations: 4-bit, 8-bit

Model tree for mahahahug/qwen3.5-4b-opus46-cot

Base model: Qwen/Qwen3.5-4B-Base
Finetuned: Qwen/Qwen3.5-4B
Finetuned: unsloth/Qwen3.5-4B
Adapter (37): this model