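Qwen3.5-4B × Claude Opus Chain-of-Thought Fine-Tuned Model
A QLoRA/LoRA fine-tune of Qwen3.5-4B covering all attention and MLP projection modules, trained to inject the chain-of-thought (CoT) reasoning style of Claude Opus 4.6. Training fits on a single 24 GB GPU.
Model Description
This repository contains three formats: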
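| Format | Description | Size |
|---|---|---|
| LoRA weights (PEFT) | Load directly for inference, or merge back into the base model | ~170 MB |
| GGUF Q4_K_M | 4-bit quantization, recommended for llama.cpp / Ollama deployment | ~2.6 GB |
| GGUF Q8_0 | 8-bit quantization, higher-quality inference | ~4.2 GB |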
Training Configuration
| Setting | Value |
|---|---|
| Base model | Qwen3.5-4B |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| LoRA r / alpha | 32 / 64 |
| Max sequence length | 16,384 |
| Batch size | 12 |
| Learning rate | 2e-4 (cosine annealing) |
| Epochs | 2 |
| Total steps | 1,606 |
| Optimizer | adamw_8bit |
| Quantized loading | 4-bit NF4 |
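
A few derived quantities follow from the table above. The sketch below is only a sanity check, and it rests on two assumptions the card does not state: that "batch size 12" is the effective number of samples per optimizer step, and that the adapter uses the standard LoRA scaling of alpha / r.

```python
# Derived quantities from the training configuration table.
# Assumption: batch size 12 is the effective batch per optimizer step.
# Assumption: standard LoRA scaling (alpha / r), not a variant such as rsLoRA.
LORA_R = 32
LORA_ALPHA = 64
BATCH_SIZE = 12
EPOCHS = 2
TOTAL_STEPS = 1606


def lora_scaling(r: int, alpha: int) -> float:
    """Factor applied to the low-rank update in standard LoRA."""
    return alpha / r


def samples_per_epoch(total_steps: int, epochs: int, batch_size: int) -> int:
    """Approximate samples seen per epoch (an upper bound if the last batch is partial)."""
    steps_per_epoch = total_steps // epochs
    return steps_per_epoch * batch_size


scaling = lora_scaling(LORA_R, LORA_ALPHA)
per_epoch = samples_per_epoch(TOTAL_STEPS, EPOCHS, BATCH_SIZE)
print(f"LoRA scaling: {scaling}, samples per epoch: {per_epoch}")
```

Under these assumptions each epoch covers roughly 9,600 samples, which is in line with a dataset of about 10,000 conversations after overlong samples are filtered out.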
Training Results
| Metric | Value |
|---|---|
| Initial loss | 1.0738 |
| Final loss | 0.3505 |
| Lowest loss | 0.2501 |
| Loss reduction | 81.5% |
| Average loss | 0.5395 |
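Evaluation Results: GSM8K
The baseline model and the LoRA fine-tuned model were compared with lm-eval on the GSM8K test set (1,319 problems), 5-shot.
Overall Metrics
| Model | Empty responses | Contains reasoning | strict-match | flexible-extract | Calibrated lenient |
|---|---|---|---|---|---|
| Qwen3.5-4B baseline | 29.9% | 0% | 67.93% | 68.01% | 60.73% |
| After LoRA fine-tuning | 0% | 95.8% | 46.17% | 84.08% | 77.86% |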
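Scoring criteria:
- strict-match: requires the model to end its output with `#### <answer>`; the format must match exactly
- flexible-extract: automatically extracts the numeric answer from the response
- Calibrated lenient: the answer number appears in the reasoning and does not appear in the original problem text (rules out false positives)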
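
The three criteria can be approximated in a few lines. This is a simplified sketch, not the exact logic of lm-eval or of the author's calibrated scorer: the regular expressions and the helper names (`strict_match`, `flexible_extract`, `calibrated_lenient`) are assumptions made for illustration.

```python
import re

# Simplified patterns; the real filters may differ in detail.
STRICT_RE = re.compile(r"#### (-?[0-9.,]+)")
NUMBER_RE = re.compile(r"(?<![\d.])-?\d[\d,]*(?:\.\d+)?")


def normalize(num: str) -> str:
    """Drop thousands separators and a trailing period."""
    return num.replace(",", "").rstrip(".")


def numbers_in(text: str) -> list:
    """All numbers found in the text, normalized, in order of appearance."""
    return [normalize(n) for n in NUMBER_RE.findall(text)]


def strict_match(response: str, gold: str) -> bool:
    """Correct only if the response contains '#### <answer>' with the gold value."""
    m = STRICT_RE.search(response)
    return m is not None and normalize(m.group(1)) == gold


def flexible_extract(response: str, gold: str) -> bool:
    """Correct if the last number in the response equals the gold value."""
    nums = numbers_in(response)
    return bool(nums) and nums[-1] == gold


def calibrated_lenient(question: str, response: str, gold: str) -> bool:
    """Correct if the gold value appears in the response but not in the question."""
    return gold in numbers_in(response) and gold not in numbers_in(question)


question = "Janet has 16 eggs, eats 3 and bakes with 4. How many are left?"
with_marker = "16 - 3 - 4 = 9 eggs remain.\n#### 9"
without_marker = "16 - 3 - 4 = 9, so 9 eggs remain."
print(strict_match(with_marker, "9"), strict_match(without_marker, "9"))
print(flexible_extract(without_marker, "9"))
print(calibrated_lenient(question, without_marker, "9"))
```

The second response is arithmetically correct but fails `strict_match` because it lacks the `####` marker, which is the kind of gap between the strict and flexible columns in the table.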
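Key Findings
- The LoRA model has a 100% response rate and reliably emits a reasoning chain (95.8% of responses contain a thinking section)
- **Calibrated lenient score: 77.86%** (1027/1319), well above the baseline's overall math performance
- The baseline leaves 29.9% of problems unanswered; the LoRA model eliminates this problem entirely
- Main weakness: non-standard output format. flexible-extract is 84% versus 46% for strict-match, so many problems are solved correctly but do not end with the `####` format
Evaluation command: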
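
To put the percentages in terms of problem counts, the figures for the LoRA model can be converted back using the 1,319-problem test set. This is a rough sketch: the count of "correct but wrongly formatted" answers assumes every strict match is also a flexible match, which the card does not state.

```python
TOTAL_PROBLEMS = 1319  # GSM8K test set size reported in the card


def to_count(percent: float, total: int = TOTAL_PROBLEMS) -> int:
    """Convert a reported percentage back to a whole number of problems."""
    return round(percent / 100 * total)


strict = to_count(46.17)
flexible = to_count(84.08)
lenient = to_count(77.86)

# Assumption: strict matches are a subset of flexible matches.
format_only_misses = flexible - strict

print(f"strict-match: {strict}, flexible-extract: {flexible}, lenient: {lenient}")
print(f"solved but not in '####' format: about {format_only_misses}")
```

Under that assumption, roughly 500 problems are answered correctly yet miss the strict format.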
lm_eval --model local-chat-completions \
--tasks gsm8k \
--model_args "model=qwen_cot,base_url=http://127.0.0.1:5003/v1/chat/completions,api_key=sk-fake,tokenized_requests=False,num_concurrent=8,max_length=16384,max_gen_toks=4096" \
--apply_chat_template \
--num_fewshot 5 \
--log_samples \
--output_path eval_results/lora_gsm8k_test_full
Usage
Option 1: LoRA Inference
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
model_name = "unsloth/Qwen3.5-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(
model_name, torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "mahahahug/qwen3.5-4b-opus46-cot")
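# Prompt (Chinese): "Xiao Ming has 15 apples and gives 40% to Xiao Hong. How many are left? Please think step by step."
messages = [{"role": "user", "content": "小明有15个苹果,给了小红40%,还剩几个?请一步步思考。"}]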
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Option 2: llama.cpp API Server
llama-server \
-ngl 1000 \
--host 0.0.0.0 --port 5003 \
--flash-attn on \
--cache-type-k q4_0 --cache-type-v q4_0 \
-c 15999 \
--repeat-penalty 1.0 \
--presence-penalty 1.5 \
--min-p 0.02 \
--top-k 30 --top-p 0.9 --temp 0.85 \
--reasoning on \
--no-mmap \
--chat-template chatml \
-m qwen3.5-4b-opus46-cot-Q4_K_M.gguf
Once started, the server can be called at http://localhost:5003/v1/chat/completions and is compatible with the OpenAI API.
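
A minimal client sketch for the endpoint above, using only the Python standard library. The `model` value is a hypothetical label, the helper names are illustrative, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request

# Endpoint from the server command above (port 5003).
BASE_URL = "http://localhost:5003/v1/chat/completions"


def build_chat_request(prompt: str, url: str = BASE_URL, max_tokens: int = 2048):
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": "qwen3.5-4b-opus46-cot",  # hypothetical label
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("What is 15 minus 40% of 15? Think step by step."))` should print the model's reply.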
Option 3: Ollama
cat > Modelfile << 'EOF'
FROM ./qwen3.5-4b-opus46-cot-Q4_K_M.gguf
PARAMETER temperature 0.85
PARAMETER top_k 30
PARAMETER top_p 0.9
PARAMETER min_p 0.02
PARAMETER num_ctx 16000
SYSTEM You are a helpful AI assistant that always thinks step-by-step. 请用中文回复。
EOF
ollama create qwen3.5-opus-cot -f Modelfile
ollama run qwen3.5-opus-cot
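Option 4: llama-cli Command Line
# Prompt (Chinese): "A farm has 14 sheep. All but 8 died. How many are left? Please think step by step."
./llama-cli -m qwen3.5-4b-opus46-cot-Q4_K_M.gguf \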
-ngl 1000 --flash-attn on -c 15999 \
--reasoning on --temp 0.85 --top-k 30 --top-p 0.9 --min-p 0.02 \
--chat-template chatml \
-p "一只农场有14只羊,除了8只都死了,还剩几只?请一步步思考。" \
-n 2048
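Dataset
Training uses a Claude Opus 4.6 reasoning dataset containing about 10,000 reasoning conversations from Claude Opus 4.6. During training, the reasoning field of each sample is automatically wrapped in <|begin_of_think|> / <|end_of_think|> tags, and overlong samples are filtered out.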
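
The preprocessing described above might look like the following sketch. It is not the author's training code: the record fields `question`, `reasoning` and `answer` are hypothetical, and the length filter counts characters as a stand-in for tokens, which a real pipeline would measure with the tokenizer.

```python
THINK_OPEN = "<|begin_of_think|>"
THINK_CLOSE = "<|end_of_think|>"

# Hypothetical character budget; the card's actual limit is 16,384 tokens.
MAX_CHARS = 16_384 * 3


def format_assistant_turn(reasoning: str, answer: str) -> str:
    """Wrap the reasoning in think tags and append the final answer."""
    return f"{THINK_OPEN}\n{reasoning.strip()}\n{THINK_CLOSE}\n{answer.strip()}"


def build_examples(records, max_chars: int = MAX_CHARS):
    """Format each record and drop those that exceed the length budget."""
    examples = []
    for rec in records:
        assistant = format_assistant_turn(rec["reasoning"], rec["answer"])
        if len(rec["question"]) + len(assistant) <= max_chars:
            examples.append({"question": rec["question"], "assistant": assistant})
    return examples


sample_records = [
    {"question": "What is 2 + 3?", "reasoning": "2 plus 3 equals 5.", "answer": "5"},
    {"question": "Too long", "reasoning": "x" * (MAX_CHARS + 1), "answer": "n/a"},
]
kept = build_examples(sample_records)
print(len(kept), kept[0]["assistant"])
```

Only the first record survives the filter; its assistant turn starts with the opening think tag and ends with the answer.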
Model Locations
| Platform | Repository | Contents |
|---|---|---|
| HuggingFace | mahahahug/qwen3.5-4b-opus46-cot | This repository (LoRA + GGUF) |
| ModelScope | oooooo0o/qwen3.5-4b-opus46-cot | LoRA + GGUF Q4_K_M + GGUF Q8_0 |
| GitHub | Pyzmxu/qwen3.5_4b_opus | Training code (Unsloth + LoRA) |
License
MIT