Instructions to use Jackrong/Qwopus3.5-9B-Coder-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Jackrong/Qwopus3.5-9B-Coder-GGUF")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("Jackrong/Qwopus3.5-9B-Coder-GGUF", dtype="auto")
- llama-cpp-python
How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Jackrong/Qwopus3.5-9B-Coder-GGUF",
    filename="Qwopus3.5-9B-coder-Exp-BF16.gguf",
)
llm.create_chat_completion(
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
            ]
        }
    ]
)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with llama.cpp:
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
Use Docker
docker model run hf.co/Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Jackrong/Qwopus3.5-9B-Coder-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Jackrong/Qwopus3.5-9B-Coder-GGUF",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
Use Docker
docker model run hf.co/Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
- SGLang
How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Jackrong/Qwopus3.5-9B-Coder-GGUF" \
  --host 0.0.0.0 \
  --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Jackrong/Qwopus3.5-9B-Coder-GGUF",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "Jackrong/Qwopus3.5-9B-Coder-GGUF" \
  --host 0.0.0.0 \
  --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Jackrong/Qwopus3.5-9B-Coder-GGUF",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
- Ollama
How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with Ollama:
ollama run hf.co/Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
- Unsloth Studio
How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Jackrong/Qwopus3.5-9B-Coder-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Jackrong/Qwopus3.5-9B-Coder-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Jackrong/Qwopus3.5-9B-Coder-GGUF to start chatting
- Pi
How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with Docker Model Runner:
docker model run hf.co/Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
- Lemonade
How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.Qwopus3.5-9B-Coder-GGUF-Q4_K_M
List all available models
lemonade list
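Several of the local apps above (llama.cpp, Pi, Hermes Agent) reach the model through an OpenAI-compatible endpoint at `http://localhost:8080/v1`. The following is a minimal Python sketch of sending a tool-calling request to such an endpoint. The `get_weather` schema and the helper names `build_chat_request` and `send_chat_request` are hypothetical and only illustrate the request shape; whether the server returns a structured tool call may depend on how it was started and on the model's chat template.

```python
import json
import urllib.request

# Assumption: a llama-server instance is listening here, as in the Pi and Hermes configs.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M"

# Hypothetical tool definition in the OpenAI function-calling schema.
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_chat_request(prompt, tools, model=MODEL_ID):
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
    }


def send_chat_request(body, base_url=BASE_URL):
    """POST the body to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send_chat_request(build_chat_request("What's the weather in Paris?", [GET_WEATHER_TOOL]))` should return a chat completion whose first choice may carry a `tool_calls` entry. Nothing is sent until `send_chat_request` is called, so the request body can be inspected offline first.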
Tool calling capability on par with Opus 4.6
Tested Qwen 3.5 9B, Qwopus3.5 9B Coder, Opus 4.6, Opus 4.7, MiniCPM-V 4.6, and others, and found Coder to be as good as Opus 4.6 at tool calling. You can find the compilation at:
Connecting to LM Studio...
Models: qwopus3.5-9b-coder
Phase selection: both
############################################################
MODEL: qwopus3.5-9b-coder
############################################################
==================================================
Testing with 2 tools available...
[PASS] 'get_weather' correctly called (4.239s)
[PASS] 'send_email' correctly called (5.591s)
Accuracy at 2 tools: 2/2 = 100%
==================================================
Testing with 3 tools available...
[PASS] 'get_weather' correctly called (4.073s)
[PASS] 'send_email' correctly called (7.078s)
[PASS] 'calculate_tip' correctly called (5.938s)
Accuracy at 3 tools: 3/3 = 100%
==================================================
Testing with 4 tools available...
[PASS] 'get_weather' correctly called (5.072s)
[PASS] 'send_email' correctly called (6.148s)
[PASS] 'calculate_tip' correctly called (6.41s)
[PASS] 'translate_text' correctly called (6.024s)
Accuracy at 4 tools: 4/4 = 100%
==================================================
Testing with 5 tools available...
[PASS] 'get_weather' correctly called (4.393s)
[PASS] 'send_email' correctly called (5.947s)
[PASS] 'calculate_tip' correctly called (5.511s)
[PASS] 'translate_text' correctly called (6.605s)
[PASS] 'set_timer' correctly called (4.993s)
Accuracy at 5 tools: 5/5 = 100%
==================================================
Testing with 7 tools available...
[PASS] 'search_wikipedia' correctly called (4.94s)
[PASS] 'get_weather' correctly called (4.342s)
[PASS] 'convert_currency' correctly called (5.746s)
[PASS] 'calculate_tip' correctly called (5.461s)
[PASS] 'set_timer' correctly called (5.096s)
Accuracy at 7 tools: 5/5 = 100%
==================================================
Testing with 9 tools available...
[PASS] 'translate_text' correctly called (6.471s)
[PASS] 'calculate_tip' correctly called (5.885s)
[PASS] 'search_wikipedia' correctly called (5.104s)
[PASS] 'get_weather' correctly called (4.426s)
[PASS] 'set_timer' correctly called (5.086s)
Accuracy at 9 tools: 5/5 = 100%
==================================================
Testing with 11 tools available...
[PASS] 'send_email' correctly called (6.822s)
[PASS] 'resize_image' correctly called (6.063s)
[PASS] 'convert_currency' correctly called (5.901s)
[PASS] 'get_weather' correctly called (4.876s)
[PASS] 'create_calendar_event' correctly called (8.235s)
Accuracy at 11 tools: 5/5 = 100%
==================================================
Testing with 16 tools available...
[PASS] 'calculate_tip' correctly called (5.894s)
[PASS] 'translate_text' correctly called (5.926s)
[PASS] 'generate_password' correctly called (5.904s)
[PASS] 'get_stock_price' correctly called (4.68s)
[PASS] 'resize_image' correctly called (6.341s)
Accuracy at 16 tools: 5/5 = 100%
==================================================
Testing with 21 tools available...
[PASS] 'get_weather' correctly called (5.254s)
[PASS] 'scan_port' correctly called (6.331s)
[PASS] 'convert_currency' correctly called (5.668s)
[PASS] 'find_restaurant' correctly called (5.798s)
[PASS] 'check_spelling' correctly called (5.283s)
Accuracy at 21 tools: 5/5 = 100%
==================================================
Testing with 26 tools available...
[PASS] 'create_calendar_event' correctly called (8.906s)
[PASS] 'generate_password' correctly called (5.986s)
[PASS] 'convert_units' correctly called (5.831s)
[PASS] 'get_stock_price' correctly called (4.516s)
[PASS] 'shorten_url' correctly called (6.089s)
Accuracy at 26 tools: 5/5 = 100%
==================================================
Testing with 31 tools available...
[PASS] 'create_qr_code' correctly called (6.452s)
[PASS] 'get_weather' correctly called (4.389s)
[PASS] 'analyze_sentiment' correctly called (5.215s)
[PASS] 'shorten_url' correctly called (5.011s)
[PASS] 'search_wikipedia' correctly called (4.789s)
Accuracy at 31 tools: 5/5 = 100%
==================================================
PHASE 1 SUMMARY
Model: qwopus3.5-9b-coder
Tool pool size: 31
Tools Correct Total Accuracy
2 2 2 100%
3 3 3 100%
4 4 4 100%
5 5 5 100%
7 5 5 100%
9 5 5 100%
11 5 5 100%
16 5 5 100%
21 5 5 100%
26 5 5 100%
31 5 5 100%
============================================================
PHASE 2: Adversarial Tool Recall Test
Tools in pool: 25 (with semantic overlaps & decoys)
Test cases: 28
[SEMANTIC_OVERLAP] PASS: 'search_wikipedia' (6.455s)
[SEMANTIC_OVERLAP] PASS: 'search_news' (6.202s)
[SEMANTIC_OVERLAP] PASS: 'search_academic' (6.646s)
[SEMANTIC_OVERLAP] PASS: 'search_local_files' (6.383s)
[SEMANTIC_OVERLAP] PASS: 'convert_currency' (5.468s)
[SEMANTIC_OVERLAP] PASS: 'convert_units' (5.23s)
[SEMANTIC_OVERLAP] PASS: 'convert_timezone' (7.501s)
[SEMANTIC_OVERLAP] PASS: 'send_slack_message' (5.852s)
[SEMANTIC_OVERLAP] PASS: 'send_sms' (6.462s)
[SEMANTIC_OVERLAP] PASS: 'set_alarm' (10.232s)
[INDIRECT_PROMPT] PASS: 'convert_timezone' (13.482s)
[INDIRECT_PROMPT] PASS: 'convert_currency' (6.508s)
[INDIRECT_PROMPT] PASS: 'get_weather_forecast' (4.875s)
[INDIRECT_PROMPT] PASS: 'summarize_text' (12.746s)
[INDIRECT_PROMPT] PASS: 'paraphrase_text' (6.895s)
[INDIRECT_PROMPT] PASS: 'calculate_split_bill' (5.991s)
[INDIRECT_PROMPT] PASS: 'search_academic' (5.876s)
[INDIRECT_PROMPT] PASS: 'send_slack_message' (6.178s)
[DECOY_TOOL] PASS: 'get_air_quality' (4.727s)
[DECOY_TOOL] PASS: 'get_weather' (4.184s)
[DECOY_TOOL] PASS: 'paraphrase_text' (5.278s)
[DECOY_TOOL] PASS: 'set_timer' (4.527s)
[DECOY_TOOL] FAIL: expected 'create_reminder', got 'send_push_notification' (10.072s)
Rationale: Decoy: send_push_notification is tempting (buzz phone), but a scheduled note about a task = reminder
[DECOY_TOOL] PASS: 'create_calendar_event' (8.705s)
[DECOY_TOOL] PASS: 'convert_file_format' (5.956s)
[RANDOM_ORDER] PASS: 'search_web' (6.423s)
[RANDOM_ORDER] PASS: 'send_push_notification' (12.028s)
[RANDOM_ORDER] PASS: 'convert_units' (7.333s)
============================================================
PHASE 2 SUMMARY
SEMANTIC_OVERLAP 10/10 (100%) [PASS]
INDIRECT_PROMPT 8/8 (100%) [PASS]
DECOY_TOOL 6/7 (86%) [DEGRADED]
- Expected 'create_reminder', got 'send_push_notification'
Prompt: Can you buzz my phone with a note about buying milk tomorrow...
RANDOM_ORDER 3/3 (100%) [PASS]
OVERALL: 27/28 (96%)
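The Phase 2 summary above can be recomputed from the per-case lines of the log. The sketch below is not the benchmark's own code; it assumes only the line format visible in the output (`[CATEGORY] PASS: ...` or `[CATEGORY] FAIL: ...`), and the names `tally` and `percent` are hypothetical.

```python
import re
from collections import defaultdict

# Matches Phase 2 case lines such as:
#   [DECOY_TOOL] PASS: 'set_timer' (4.527s)
#   [DECOY_TOOL] FAIL: expected 'create_reminder', got 'send_push_notification' (10.072s)
CASE_LINE = re.compile(r"^\[(?P<category>[A-Z_]+)\] (?P<status>PASS|FAIL):")


def tally(log_lines):
    """Return {category: (passed, total)} for the Phase 2 case lines in a log."""
    counts = defaultdict(lambda: [0, 0])
    for line in log_lines:
        match = CASE_LINE.match(line.strip())
        if match is None:
            continue  # skip rationale lines, separators, Phase 1 lines, and summaries
        entry = counts[match["category"]]
        entry[1] += 1
        if match["status"] == "PASS":
            entry[0] += 1
    return {category: tuple(pair) for category, pair in counts.items()}


def percent(passed, total):
    """Accuracy as a whole-number percentage."""
    return round(100 * passed / total)
```

For the qwopus3.5-9b-coder run, the four categories score 10/10, 8/8, 6/7, and 3/3, which sums to 27/28; `percent(6, 7)` gives 86 and `percent(27, 28)` gives 96, matching the summary.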
Connecting to LM Studio...
Models: qwen/qwen3.5-9b
Phase selection: both
############################################################
MODEL: qwen/qwen3.5-9b
############################################################
==================================================
Testing with 2 tools available...
[PASS] 'get_weather' correctly called (5.009s)
[PASS] 'send_email' correctly called (6.897s)
Accuracy at 2 tools: 2/2 = 100%
==================================================
Testing with 3 tools available...
[PASS] 'get_weather' correctly called (5.311s)
[PASS] 'send_email' correctly called (7.27s)
[PASS] 'calculate_tip' correctly called (5.953s)
Accuracy at 3 tools: 3/3 = 100%
==================================================
Testing with 4 tools available...
[PASS] 'get_weather' correctly called (6.182s)
[PASS] 'send_email' correctly called (7.321s)
[PASS] 'calculate_tip' correctly called (7.407s)
[PASS] 'translate_text' correctly called (7.859s)
Accuracy at 4 tools: 4/4 = 100%
==================================================
Testing with 5 tools available...
[PASS] 'get_weather' correctly called (5.542s)
[PASS] 'send_email' correctly called (7.401s)
[PASS] 'calculate_tip' correctly called (7.415s)
[PASS] 'translate_text' correctly called (7.911s)
[PASS] 'set_timer' correctly called (6.917s)
Accuracy at 5 tools: 5/5 = 100%
==================================================
Testing with 7 tools available...
[PASS] 'search_wikipedia' correctly called (6.971s)
[PASS] 'get_weather' correctly called (4.601s)
[PASS] 'convert_currency' correctly called (7.373s)
[PASS] 'calculate_tip' correctly called (7.458s)
[PASS] 'set_timer' correctly called (6.365s)
Accuracy at 7 tools: 5/5 = 100%
==================================================
Testing with 9 tools available...
[PASS] 'translate_text' correctly called (7.86s)
[PASS] 'calculate_tip' correctly called (8.142s)
[PASS] 'search_wikipedia' correctly called (5.413s)
[PASS] 'get_weather' correctly called (5.233s)
[PASS] 'set_timer' correctly called (6.693s)
Accuracy at 9 tools: 5/5 = 100%
==================================================
Testing with 11 tools available...
[PASS] 'send_email' correctly called (11.327s)
[PASS] 'resize_image' correctly called (7.478s)
[PASS] 'convert_currency' correctly called (7.245s)
[PASS] 'get_weather' correctly called (5.443s)
[PASS] 'create_calendar_event' correctly called (9.094s)
Accuracy at 11 tools: 5/5 = 100%
==================================================
Testing with 16 tools available...
[PASS] 'calculate_tip' correctly called (7.187s)
[PASS] 'translate_text' correctly called (7.704s)
[PASS] 'generate_password' correctly called (7.153s)
[PASS] 'get_stock_price' correctly called (5.342s)
[PASS] 'resize_image' correctly called (7.872s)
Accuracy at 16 tools: 5/5 = 100%
==================================================
Testing with 21 tools available...
[PASS] 'get_weather' correctly called (5.344s)
[PASS] 'scan_port' correctly called (8.579s)
[PASS] 'convert_currency' correctly called (7.505s)
[PASS] 'find_restaurant' correctly called (7.789s)
[PASS] 'check_spelling' correctly called (8.24s)
Accuracy at 21 tools: 5/5 = 100%
==================================================
Testing with 26 tools available...
[PASS] 'create_calendar_event' correctly called (9.7s)
[PASS] 'generate_password' correctly called (7.007s)
[PASS] 'convert_units' correctly called (6.977s)
[PASS] 'get_stock_price' correctly called (6.252s)
[PASS] 'shorten_url' correctly called (7.046s)
Accuracy at 26 tools: 5/5 = 100%
==================================================
Testing with 31 tools available...
[PASS] 'create_qr_code' correctly called (7.39s)
[PASS] 'get_weather' correctly called (5.595s)
[PASS] 'analyze_sentiment' correctly called (7.181s)
[PASS] 'shorten_url' correctly called (6.787s)
[PASS] 'search_wikipedia' correctly called (5.159s)
Accuracy at 31 tools: 5/5 = 100%
==================================================
PHASE 1 SUMMARY
Model: qwen/qwen3.5-9b
Tool pool size: 31
Tools Correct Total Accuracy
2 2 2 100%
3 3 3 100%
4 4 4 100%
5 5 5 100%
7 5 5 100%
9 5 5 100%
11 5 5 100%
16 5 5 100%
21 5 5 100%
26 5 5 100%
31 5 5 100%
============================================================
PHASE 2: Adversarial Tool Recall Test
Tools in pool: 25 (with semantic overlaps & decoys)
Test cases: 28
[SEMANTIC_OVERLAP] PASS: 'search_wikipedia' (8.046s)
[SEMANTIC_OVERLAP] PASS: 'search_news' (6.251s)
[SEMANTIC_OVERLAP] PASS: 'search_academic' (9.928s)
[SEMANTIC_OVERLAP] PASS: 'search_local_files' (7.043s)
[SEMANTIC_OVERLAP] PASS: 'convert_currency' (7.266s)
[SEMANTIC_OVERLAP] PASS: 'convert_units' (7.805s)
[SEMANTIC_OVERLAP] PASS: 'convert_timezone' (9.381s)
[SEMANTIC_OVERLAP] PASS: 'send_slack_message' (6.704s)
[SEMANTIC_OVERLAP] PASS: 'send_sms' (7.236s)
[SEMANTIC_OVERLAP] PASS: 'set_alarm' (9.284s)
[INDIRECT_PROMPT] PASS: 'convert_timezone' (12.02s)
[INDIRECT_PROMPT] PASS: 'convert_currency' (13.453s)
[INDIRECT_PROMPT] PASS: 'get_weather_forecast' (6.949s)
[INDIRECT_PROMPT] PASS: 'summarize_text' (10.556s)
[INDIRECT_PROMPT] PASS: 'paraphrase_text' (9.886s)
[INDIRECT_PROMPT] PASS: 'calculate_split_bill' (6.861s)
[INDIRECT_PROMPT] PASS: 'search_academic' (9.342s)
[INDIRECT_PROMPT] FAIL: expected 'send_slack_message', got 'none' (14.461s)
Rationale: Indirect - 'drop it in our dev channel' implies Slack, explicitly rules out email
[DECOY_TOOL] PASS: 'get_air_quality' (6.05s)
[DECOY_TOOL] PASS: 'get_weather' (5.09s)
[DECOY_TOOL] PASS: 'paraphrase_text' (7.659s)
[DECOY_TOOL] PASS: 'set_timer' (7.384s)
[DECOY_TOOL] FAIL: expected 'create_reminder', got 'send_push_notification' (10.21s)
Rationale: Decoy: send_push_notification is tempting (buzz phone), but a scheduled note about a task = reminder
[DECOY_TOOL] PASS: 'create_calendar_event' (19.048s)
[DECOY_TOOL] PASS: 'convert_file_format' (8.045s)
[RANDOM_ORDER] PASS: 'search_web' (8.181s)
[RANDOM_ORDER] PASS: 'send_push_notification' (15.028s)
[RANDOM_ORDER] PASS: 'convert_units' (11.034s)
============================================================
PHASE 2 SUMMARY
SEMANTIC_OVERLAP 10/10 (100%) [PASS]
INDIRECT_PROMPT 7/8 (88%) [DEGRADED]
- Expected 'send_slack_message', got 'none/no-tool-call'
Prompt: I need to let my team know the build succeeded but I don't w...
DECOY_TOOL 6/7 (86%) [DEGRADED]
- Expected 'create_reminder', got 'send_push_notification'
Prompt: Can you buzz my phone with a note about buying milk tomorrow...
RANDOM_ORDER 3/3 (100%) [PASS]
OVERALL: 26/28 (93%)
Connecting to LM Studio...
Models: mradermacher/minicpm-v-4.6
Phase selection: both
############################################################
MODEL: mradermacher/minicpm-v-4.6
############################################################
==================================================
Testing with 2 tools available...
[PASS] 'get_weather' correctly called (2.501s)
[FAIL] Expected 'send_email', got 'none' No tool call made (2.386s)
Accuracy at 2 tools: 1/2 = 50%
==================================================
Testing with 3 tools available...
[PASS] 'get_weather' correctly called (2.614s)
[PASS] 'send_email' correctly called (2.795s)
[PASS] 'calculate_tip' correctly called (2.626s)
Accuracy at 3 tools: 3/3 = 100%
==================================================
Testing with 4 tools available...
[PASS] 'get_weather' correctly called (2.581s)
[PASS] 'send_email' correctly called (2.919s)
[PASS] 'calculate_tip' correctly called (2.882s)
[PASS] 'translate_text' correctly called (2.742s)
Accuracy at 4 tools: 4/4 = 100%
==================================================
Testing with 5 tools available...
[PASS] 'get_weather' correctly called (2.645s)
[PASS] 'send_email' correctly called (2.923s)
[PASS] 'calculate_tip' correctly called (2.849s)
[PASS] 'translate_text' correctly called (2.789s)
[PASS] 'set_timer' correctly called (2.601s)
Accuracy at 5 tools: 5/5 = 100%
==================================================
Testing with 7 tools available...
[PASS] 'search_wikipedia' correctly called (2.472s)
[PASS] 'get_weather' correctly called (2.514s)
[PASS] 'convert_currency' correctly called (2.616s)
[PASS] 'calculate_tip' correctly called (2.791s)
[PASS] 'set_timer' correctly called (2.711s)
Accuracy at 7 tools: 5/5 = 100%
==================================================
Testing with 9 tools available...
[PASS] 'translate_text' correctly called (2.788s)
[PASS] 'calculate_tip' correctly called (2.632s)
[PASS] 'search_wikipedia' correctly called (2.696s)
[PASS] 'get_weather' correctly called (2.597s)
[PASS] 'set_timer' correctly called (2.698s)
Accuracy at 9 tools: 5/5 = 100%
==================================================
Testing with 11 tools available...
[PASS] 'send_email' correctly called (2.922s)
[PASS] 'resize_image' correctly called (2.865s)
[PASS] 'convert_currency' correctly called (2.673s)
[PASS] 'get_weather' correctly called (2.606s)
[FAIL] Expected 'create_calendar_event', got 'none' No tool call made (2.885s)
Accuracy at 11 tools: 4/5 = 80%
==================================================
Testing with 16 tools available...
[PASS] 'calculate_tip' correctly called (2.977s)
[PASS] 'translate_text' correctly called (2.859s)
[PASS] 'generate_password' correctly called (2.807s)
[PASS] 'get_stock_price' correctly called (2.711s)
[PASS] 'resize_image' correctly called (2.792s)
Accuracy at 16 tools: 5/5 = 100%
==================================================
Testing with 21 tools available...
[PASS] 'get_weather' correctly called (2.659s)
[PASS] 'scan_port' correctly called (2.967s)
[PASS] 'convert_currency' correctly called (2.882s)
[PASS] 'find_restaurant' correctly called (2.844s)
[PASS] 'check_spelling' correctly called (2.654s)
Accuracy at 21 tools: 5/5 = 100%
==================================================
Testing with 26 tools available...
[PASS] 'create_calendar_event' correctly called (3.0s)
[FAIL] Expected 'generate_password', got 'none' No tool call made (2.777s)
[PASS] 'convert_units' correctly called (2.653s)
[PASS] 'get_stock_price' correctly called (2.51s)
[PASS] 'shorten_url' correctly called (2.521s)
Accuracy at 26 tools: 4/5 = 80%
==================================================
Testing with 31 tools available...
[FAIL] Expected 'create_qr_code', got 'none' No tool call made (2.814s)
[PASS] 'get_weather' correctly called (2.645s)
[PASS] 'analyze_sentiment' correctly called (2.518s)
[PASS] 'shorten_url' correctly called (2.721s)
[PASS] 'search_wikipedia' correctly called (2.755s)
Accuracy at 31 tools: 4/5 = 80%
==================================================
PHASE 1 SUMMARY
Model: mradermacher/minicpm-v-4.6
Tool pool size: 31
Tools Correct Total Accuracy
2 1 2 50%
3 3 3 100%
4 4 4 100%
5 5 5 100%
7 5 5 100%
9 5 5 100%
11 4 5 80%
16 5 5 100%
21 5 5 100%
26 4 5 80%
31 4 5 80%
Detailed Failures:
At 2 tools:
- expected '?', got 'None' (error: No tool call made)
At 11 tools:
- expected '?', got 'None' (error: No tool call made)
At 26 tools:
- expected '?', got 'None' (error: No tool call made)
At 31 tools:
- expected '?', got 'None' (error: No tool call made)
============================================================
PHASE 2: Adversarial Tool Recall Test
Tools in pool: 25 (with semantic overlaps & decoys)
Test cases: 28
[SEMANTIC_OVERLAP] PASS: 'search_wikipedia' (3.032s)
[SEMANTIC_OVERLAP] FAIL: expected 'search_news', got 'none' (2.758s)
Rationale: Must pick search_news over search_web, search_wikipedia
[SEMANTIC_OVERLAP] PASS: 'search_academic' (2.856s)
[SEMANTIC_OVERLAP] FAIL: expected 'search_local_files', got 'none' (2.509s)
Rationale: Must pick search_local_files over other search tools
[SEMANTIC_OVERLAP] PASS: 'convert_currency' (2.731s)
[SEMANTIC_OVERLAP] FAIL: expected 'convert_units', got 'none' (2.545s)
Rationale: Must pick convert_units over convert_currency
[SEMANTIC_OVERLAP] PASS: 'convert_timezone' (2.881s)
[SEMANTIC_OVERLAP] PASS: 'send_slack_message' (2.678s)
[SEMANTIC_OVERLAP] FAIL: expected 'send_sms', got 'none' (2.591s)
Rationale: Must pick send_sms over send_email, send_slack_message
[SEMANTIC_OVERLAP] FAIL: expected 'set_alarm', got 'create_reminder' (2.943s)
Rationale: Must pick set_alarm (recurring wake-up) over set_timer, create_reminder
[INDIRECT_PROMPT] PASS: 'convert_timezone' (2.741s)
[INDIRECT_PROMPT] PASS: 'convert_currency' (2.944s)
[INDIRECT_PROMPT] PASS: 'get_weather_forecast' (2.827s)
[INDIRECT_PROMPT] PASS: 'summarize_text' (3.099s)
[INDIRECT_PROMPT] FAIL: expected 'paraphrase_text', got 'search_web' (2.624s)
Rationale: Indirect - 'sound more professional without changing meaning' = paraphrase
[INDIRECT_PROMPT] FAIL: expected 'calculate_split_bill', got 'none' (9.343s)
Rationale: Indirect - describes bill splitting scenario, doesn't say 'split bill'
[INDIRECT_PROMPT] FAIL: expected 'search_academic', got 'none' (2.666s)
Rationale: Indirect - 'thesis advisor recommended a paper' implies academic search
[INDIRECT_PROMPT] PASS: 'send_slack_message' (2.726s)
[DECOY_TOOL] PASS: 'get_air_quality' (2.644s)
[DECOY_TOOL] PASS: 'get_weather' (2.445s)
[DECOY_TOOL] PASS: 'paraphrase_text' (2.853s)
[DECOY_TOOL] PASS: 'set_timer' (2.685s)
[DECOY_TOOL] FAIL: expected 'create_reminder', got 'send_sms' (2.892s)
Rationale: Decoy: send_push_notification is tempting (buzz phone), but a scheduled note about a task = reminder
[DECOY_TOOL] PASS: 'create_calendar_event' (2.736s)
[DECOY_TOOL] PASS: 'convert_file_format' (2.838s)
[RANDOM_ORDER] PASS: 'search_web' (3.091s)
[RANDOM_ORDER] FAIL: expected 'send_push_notification', got 'none' (3.24s)
Rationale: Must pick push notification over send_sms, send_email when tool order is random
[RANDOM_ORDER] PASS: 'convert_units' (3.219s)
============================================================
PHASE 2 SUMMARY
SEMANTIC_OVERLAP 5/10 (50%) [FAIL]
- Expected 'search_news', got 'none/no-tool-call'
Prompt: Search for recent news articles about the 2026 World Cup.
- Expected 'search_local_files', got 'none/no-tool-call'
Prompt: Find a file called 'budget_2026.xlsx' on my computer.
- Expected 'convert_units', got 'none/no-tool-call'
Prompt: How many kilometers is 26.2 miles?
- Expected 'send_sms', got 'none/no-tool-call'
Prompt: Text my wife at 555-0123 that I'll be late for dinner.
- Expected 'set_alarm', got 'create_reminder'
Prompt: I need to wake up at 6:30am every weekday.
INDIRECT_PROMPT 5/8 (62%) [FAIL]
- Expected 'paraphrase_text', got 'search_web'
Prompt: My essay opening sounds too casual for an academic submissio...
- Expected 'calculate_split_bill', got 'none/no-tool-call'
Prompt: We're a group of 4 at the restaurant and the total came to $...
- Expected 'search_academic', got 'none/no-tool-call'
Prompt: My thesis advisor recommended a paper by Smith et al. on tra...
DECOY_TOOL 6/7 (86%) [DEGRADED]
- Expected 'create_reminder', got 'send_sms'
Prompt: Can you buzz my phone with a note about buying milk tomorrow...
RANDOM_ORDER 2/3 (67%) [FAIL]
- Expected 'send_push_notification', got 'none/no-tool-call'
Prompt: Send a high-priority notification to my phone saying 'Server...
OVERALL: 18/28 (64%)
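Most of the failures in this run are reported as `got 'none'`, meaning the model answered without issuing a tool call. The benchmark's scoring code is not shown here, so the following is only a sketch of how one test case could be checked against an OpenAI-style assistant message. The function names and the sample messages are hypothetical.

```python
def called_tool_name(message):
    """Return the name of the first tool call in an assistant message, or 'none'."""
    tool_calls = message.get("tool_calls") or []
    if not tool_calls:
        return "none"  # the model replied with plain text instead of a tool call
    return tool_calls[0]["function"]["name"]


def check_case(expected, message):
    """Compare the tool the model called with the tool the test case expects."""
    got = called_tool_name(message)
    return {"expected": expected, "got": got, "passed": got == expected}


# Hypothetical assistant messages in the OpenAI chat-completions shape.
tool_reply = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
        }
    ],
}
text_reply = {"role": "assistant", "content": "Sure, I can help with that."}
```

Under this scheme a plain-text reply counts as a failure whenever a tool call was expected, which is the pattern seen for `send_email`, `create_calendar_event`, `generate_password`, and `create_qr_code` in Phase 1 of this run.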
Verifying GitHub Copilot API connection...
Connected successfully.
Model: claude-opus-4.6
Tool pool: 31 tools, testing up to 31
Phase selection: both
==================================================
Testing with 2 tools available...
[PASS] 'get_weather' correctly called (3.075s)
[PASS] 'send_email' correctly called (3.281s)
Accuracy at 2 tools: 2/2 = 100%
==================================================
Testing with 3 tools available...
[PASS] 'get_weather' correctly called (2.794s)
[PASS] 'send_email' correctly called (3.529s)
[PASS] 'calculate_tip' correctly called (3.808s)
Accuracy at 3 tools: 3/3 = 100%
==================================================
Testing with 4 tools available...
[PASS] 'get_weather' correctly called (3.11s)
[PASS] 'send_email' correctly called (3.704s)
[PASS] 'calculate_tip' correctly called (3.088s)
[PASS] 'translate_text' correctly called (3.42s)
Accuracy at 4 tools: 4/4 = 100%
==================================================
Testing with 5 tools available...
[PASS] 'get_weather' correctly called (2.879s)
[PASS] 'send_email' correctly called (3.668s)
[PASS] 'calculate_tip' correctly called (6.329s)
[PASS] 'translate_text' correctly called (3.745s)
[PASS] 'set_timer' correctly called (3.087s)
Accuracy at 5 tools: 5/5 = 100%
==================================================
Testing with 7 tools available...
[PASS] 'search_wikipedia' correctly called (3.004s)
[PASS] 'get_weather' correctly called (2.978s)
[PASS] 'convert_currency' correctly called (3.401s)
[PASS] 'calculate_tip' correctly called (3.675s)
[PASS] 'set_timer' correctly called (3.143s)
Accuracy at 7 tools: 5/5 = 100%
==================================================
Testing with 9 tools available...
[PASS] 'translate_text' correctly called (3.327s)
[PASS] 'calculate_tip' correctly called (3.17s)
[PASS] 'search_wikipedia' correctly called (3.415s)
[PASS] 'get_weather' correctly called (3.083s)
[PASS] 'set_timer' correctly called (3.5s)
Accuracy at 9 tools: 5/5 = 100%
==================================================
Testing with 11 tools available...
[PASS] 'send_email' correctly called (3.156s)
[PASS] 'resize_image' correctly called (3.302s)
[PASS] 'convert_currency' correctly called (3.315s)
[PASS] 'get_weather' correctly called (3.2s)
[PASS] 'create_calendar_event' correctly called (3.777s)
Accuracy at 11 tools: 5/5 = 100%
==================================================
Testing with 16 tools available...
[PASS] 'calculate_tip' correctly called (3.033s)
[PASS] 'translate_text' correctly called (3.285s)
[PASS] 'generate_password' correctly called (4.948s)
[PASS] 'get_stock_price' correctly called (5.655s)
[PASS] 'resize_image' correctly called (3.473s)
Accuracy at 16 tools: 5/5 = 100%
==================================================
Testing with 21 tools available...
[PASS] 'get_weather' correctly called (3.237s)
[PASS] 'scan_port' correctly called (3.156s)
[PASS] 'convert_currency' correctly called (3.262s)
[PASS] 'find_restaurant' correctly called (3.212s)
[PASS] 'check_spelling' correctly called (4.512s)
Accuracy at 21 tools: 5/5 = 100%
==================================================
Testing with 26 tools available...
[PASS] 'create_calendar_event' correctly called (3.708s)
[PASS] 'generate_password' correctly called (3.626s)
[PASS] 'convert_units' correctly called (3.188s)
[PASS] 'get_stock_price' correctly called (3.158s)
[PASS] 'shorten_url' correctly called (4.027s)
Accuracy at 26 tools: 5/5 = 100%
==================================================
Testing with 31 tools available...
[PASS] 'create_qr_code' correctly called (3.715s)
[PASS] 'get_weather' correctly called (3.521s)
[PASS] 'analyze_sentiment' correctly called (3.166s)
[PASS] 'shorten_url' correctly called (4.017s)
[PASS] 'search_wikipedia' correctly called (3.161s)
Accuracy at 31 tools: 5/5 = 100%
==================================================
PHASE 1 SUMMARY
Model: claude-opus-4.6
Tool pool size: 31
Tools  Correct  Total  Accuracy
    2        2      2      100%
    3        3      3      100%
    4        4      4      100%
    5        5      5      100%
    7        5      5      100%
    9        5      5      100%
   11        5      5      100%
   16        5      5      100%
   21        5      5      100%
   26        5      5      100%
   31        5      5      100%
============================================================
PHASE 2: Adversarial Tool Recall Test
Tools in pool: 25 (with semantic overlaps & decoys)
Test cases: 28
[SEMANTIC_OVERLAP] PASS: 'search_wikipedia' (4.028s)
[SEMANTIC_OVERLAP] PASS: 'search_news' (3.566s)
[SEMANTIC_OVERLAP] PASS: 'search_academic' (3.347s)
[SEMANTIC_OVERLAP] PASS: 'search_local_files' (2.924s)
[SEMANTIC_OVERLAP] PASS: 'convert_currency' (3.377s)
[SEMANTIC_OVERLAP] PASS: 'convert_units' (3.491s)
[SEMANTIC_OVERLAP] PASS: 'convert_timezone' (3.429s)
[SEMANTIC_OVERLAP] PASS: 'send_slack_message' (3.727s)
[SEMANTIC_OVERLAP] PASS: 'send_sms' (3.232s)
[SEMANTIC_OVERLAP] PASS: 'set_alarm' (3.398s)
[INDIRECT_PROMPT] PASS: 'convert_timezone' (3.534s)
[INDIRECT_PROMPT] PASS: 'convert_currency' (3.706s)
[INDIRECT_PROMPT] PASS: 'get_weather_forecast' (3.599s)
[INDIRECT_PROMPT] PASS: 'summarize_text' (4.205s)
[INDIRECT_PROMPT] PASS: 'paraphrase_text' (3.432s)
[INDIRECT_PROMPT] PASS: 'calculate_split_bill' (3.358s)
[INDIRECT_PROMPT] PASS: 'search_academic' (3.615s)
[INDIRECT_PROMPT] PASS: 'send_slack_message' (3.199s)
[DECOY_TOOL] PASS: 'get_air_quality' (3.067s)
[DECOY_TOOL] PASS: 'get_weather' (3.149s)
[DECOY_TOOL] PASS: 'paraphrase_text' (3.204s)
[DECOY_TOOL] PASS: 'set_timer' (3.486s)
[DECOY_TOOL] FAIL: expected 'create_reminder', got 'send_push_notification' (3.297s)
Rationale: Decoy: send_push_notification is tempting (buzz phone), but a scheduled note about a task = reminder
[DECOY_TOOL] PASS: 'create_calendar_event' (3.357s)
[DECOY_TOOL] PASS: 'convert_file_format' (3.8s)
[RANDOM_ORDER] PASS: 'search_web' (3.411s)
[RANDOM_ORDER] PASS: 'send_push_notification' (3.373s)
[RANDOM_ORDER] PASS: 'convert_units' (3.33s)
============================================================
PHASE 2 SUMMARY
SEMANTIC_OVERLAP  10/10 (100%)  [PASS]
INDIRECT_PROMPT    8/8  (100%)  [PASS]
DECOY_TOOL         6/7  (86%)   [DEGRADED]
  - Expected 'create_reminder', got 'send_push_notification'
    Prompt: Can you buzz my phone with a note about buying milk tomorrow...
RANDOM_ORDER       3/3  (100%)  [PASS]
OVERALL: 27/28 (96%)
Tool Recall Capacity Test (GitHub Copilot / GitHub Models)
Timestamp: 2026-05-28 23:50:34
Pool: 31 tools (Phase 1), 25 tools (Phase 2)
Verifying connection...
Connected. Model: claude-opus-4.7
============================================================
PHASE 2: Adversarial Tool Recall Test
Tools in pool: 25 (with semantic overlaps & decoys)
Test cases: 28
[SEMANTIC_OVERLAP] PASS: 'search_wikipedia' (2.769s)
[SEMANTIC_OVERLAP] PASS: 'search_news' (2.878s)
[SEMANTIC_OVERLAP] PASS: 'search_academic' (2.735s)
[SEMANTIC_OVERLAP] PASS: 'search_local_files' (3.282s)
[SEMANTIC_OVERLAP] PASS: 'convert_currency' (3.39s)
[SEMANTIC_OVERLAP] PASS: 'convert_units' (3.855s)
[SEMANTIC_OVERLAP] PASS: 'convert_timezone' (2.946s)
[SEMANTIC_OVERLAP] PASS: 'send_slack_message' (4.025s)
[SEMANTIC_OVERLAP] PASS: 'send_sms' (3.01s)
[SEMANTIC_OVERLAP] PASS: 'set_alarm' (3.784s)
[INDIRECT_PROMPT] PASS: 'convert_timezone' (2.943s)
[INDIRECT_PROMPT] PASS: 'convert_currency' (2.88s)
[INDIRECT_PROMPT] PASS: 'get_weather_forecast' (3.592s)
[INDIRECT_PROMPT] PASS: 'summarize_text' (3.547s)
[INDIRECT_PROMPT] PASS: 'paraphrase_text' (2.969s)
[INDIRECT_PROMPT] PASS: 'calculate_split_bill' (2.948s)
[INDIRECT_PROMPT] FAIL: expected 'search_academic', got 'none' (6.3s)
Rationale: Indirect - 'thesis advisor recommended a paper' implies academic search
[INDIRECT_PROMPT] PASS: 'send_slack_message' (2.616s)
[DECOY_TOOL] PASS: 'get_air_quality' (2.414s)
[DECOY_TOOL] PASS: 'get_weather' (2.4s)
[DECOY_TOOL] PASS: 'paraphrase_text' (2.926s)
[DECOY_TOOL] PASS: 'set_timer' (2.668s)
[DECOY_TOOL] FAIL: expected 'create_reminder', got 'send_push_notification' (3.113s)
Rationale: Decoy: send_push_notification is tempting (buzz phone), but a scheduled note about a task = reminder
[DECOY_TOOL] PASS: 'create_calendar_event' (2.918s)
[DECOY_TOOL] PASS: 'convert_file_format' (3.598s)
[RANDOM_ORDER] PASS: 'search_web' (3.661s)
[RANDOM_ORDER] PASS: 'send_push_notification' (3.972s)
[RANDOM_ORDER] PASS: 'convert_units' (3.293s)
============================================================
PHASE 2 SUMMARY
SEMANTIC_OVERLAP  10/10 (100%)  [PASS]
INDIRECT_PROMPT    7/8  (88%)   [DEGRADED]
  - Expected 'search_academic', got 'none/no-tool-call'
    Prompt: My thesis advisor recommended a paper by Smith et al. on tra...
DECOY_TOOL         6/7  (86%)   [DEGRADED]
  - Expected 'create_reminder', got 'send_push_notification'
    Prompt: Can you buzz my phone with a note about buying milk tomorrow...
RANDOM_ORDER       3/3  (100%)  [PASS]
OVERALL: 26/28 (93%)
Log saved to: copilot_tool_recall_20260528_235034.log
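
The Phase 2 summaries above can be recomputed from the per-case result lines. The following is a minimal sketch, not the harness that produced these logs: the function names `tally_phase2` and `summarize` are hypothetical, the line pattern is inferred from the log text shown here, and the rule that a category is `[PASS]` only at 100% and `[DEGRADED]` otherwise is an assumption that merely matches the two runs above.

```python
import re
from collections import defaultdict

# Inferred from the log: Phase 2 result lines look like
#   [DECOY_TOOL] PASS: 'set_timer' (3.486s)
#   [DECOY_TOOL] FAIL: expected 'create_reminder', got '...' (3.297s)
# Phase 1 lines ("[PASS] 'get_weather' correctly called ...") do not match.
RESULT_RE = re.compile(r"^\[([A-Z_]+)\] (PASS|FAIL):")


def tally_phase2(lines):
    """Return {category: (passed, total)} from Phase 2 result lines."""
    counts = defaultdict(lambda: [0, 0])
    for line in lines:
        match = RESULT_RE.match(line.strip())
        if match is None:
            continue  # rationale lines, separators, Phase 1 lines
        category, outcome = match.groups()
        counts[category][1] += 1
        if outcome == "PASS":
            counts[category][0] += 1
    return {cat: (passed, total) for cat, (passed, total) in counts.items()}


def percent(passed, total):
    """Whole-number percentage, rounding halves up (7/8 -> 88)."""
    return int(100 * passed / total + 0.5)


def summarize(tallies):
    """Format per-category rows plus an overall row."""
    rows = []
    for cat, (passed, total) in tallies.items():
        # Assumption: anything below 100% is reported as DEGRADED.
        status = "PASS" if passed == total else "DEGRADED"
        rows.append(f"{cat} {passed}/{total} ({percent(passed, total)}%) [{status}]")
    all_passed = sum(p for p, _ in tallies.values())
    all_total = sum(t for _, t in tallies.values())
    rows.append(f"OVERALL: {all_passed}/{all_total} ({percent(all_passed, all_total)}%)")
    return rows


# A few lines copied from the log above, used as sample input.
SAMPLE = [
    "[SEMANTIC_OVERLAP] PASS: 'search_wikipedia' (2.769s)",
    "[INDIRECT_PROMPT] FAIL: expected 'search_academic', got 'none' (6.3s)",
    "Rationale: Indirect - 'thesis advisor recommended a paper' implies academic search",
    "[INDIRECT_PROMPT] PASS: 'send_slack_message' (2.616s)",
    "[PASS] 'get_weather' correctly called (2.879s)",
]

for row in summarize(tally_phase2(SAMPLE)):
    print(row)
```

Applied to a full Phase 2 section, the same arithmetic gives the figures reported above: 6/7 rounds to 86%, 7/8 to 88%, 27/28 to 96%, and 26/28 to 93%.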