Instructions to use Jackrong/Qwopus3.5-9B-Coder-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Jackrong/Qwopus3.5-9B-Coder-GGUF")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("Jackrong/Qwopus3.5-9B-Coder-GGUF", dtype="auto")

llama-cpp-python

How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Jackrong/Qwopus3.5-9B-Coder-GGUF",
	filename="Qwopus3.5-9B-coder-Exp-BF16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

Local Apps

llama.cpp

How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M

Use Docker

docker model run hf.co/Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M

LM Studio
Jan

vLLM

How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Jackrong/Qwopus3.5-9B-Coder-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Jackrong/Qwopus3.5-9B-Coder-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M

SGLang

How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Jackrong/Qwopus3.5-9B-Coder-GGUF" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Jackrong/Qwopus3.5-9B-Coder-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Jackrong/Qwopus3.5-9B-Coder-GGUF" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Jackrong/Qwopus3.5-9B-Coder-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Ollama
How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with Ollama:
```
ollama run hf.co/Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
```

Unsloth Studio

How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Jackrong/Qwopus3.5-9B-Coder-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Jackrong/Qwopus3.5-9B-Coder-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Jackrong/Qwopus3.5-9B-Coder-GGUF to start chatting

Pi

How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with Docker Model Runner:
```
docker model run hf.co/Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M
```

Lemonade

How to use Jackrong/Qwopus3.5-9B-Coder-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Jackrong/Qwopus3.5-9B-Coder-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Qwopus3.5-9B-Coder-GGUF-Q4_K_M

List all available models

lemonade list

Tool calling capability on par with Opus 4.6

by gzhone - opened 2 days ago

Discussion

gzhone

2 days ago

I tested Qwen 3.5 9B, Qwopus3.5 9B Coder, Opus 4.6, Opus 4.7, MiniCPM v4.6, and others, and found Coder to be as good as Opus 4.6 at tool calling. You can find the compilation at:

gzhone

2 days ago

Connecting to LM Studio...
Models: qwopus3.5-9b-coder

Phase selection: both

############################################################

MODEL: qwopus3.5-9b-coder

############################################################

==================================================
Testing with 2 tools available...

[PASS] 'get_weather' correctly called (4.239s)
[PASS] 'send_email' correctly called (5.591s)

Accuracy at 2 tools: 2/2 = 100%

==================================================
Testing with 3 tools available...

[PASS] 'get_weather' correctly called (4.073s)
[PASS] 'send_email' correctly called (7.078s)
[PASS] 'calculate_tip' correctly called (5.938s)

Accuracy at 3 tools: 3/3 = 100%

==================================================
Testing with 4 tools available...

[PASS] 'get_weather' correctly called (5.072s)
[PASS] 'send_email' correctly called (6.148s)
[PASS] 'calculate_tip' correctly called (6.41s)
[PASS] 'translate_text' correctly called (6.024s)

Accuracy at 4 tools: 4/4 = 100%

==================================================
Testing with 5 tools available...

[PASS] 'get_weather' correctly called (4.393s)
[PASS] 'send_email' correctly called (5.947s)
[PASS] 'calculate_tip' correctly called (5.511s)
[PASS] 'translate_text' correctly called (6.605s)
[PASS] 'set_timer' correctly called (4.993s)

Accuracy at 5 tools: 5/5 = 100%

==================================================
Testing with 7 tools available...

[PASS] 'search_wikipedia' correctly called (4.94s)
[PASS] 'get_weather' correctly called (4.342s)
[PASS] 'convert_currency' correctly called (5.746s)
[PASS] 'calculate_tip' correctly called (5.461s)
[PASS] 'set_timer' correctly called (5.096s)

Accuracy at 7 tools: 5/5 = 100%

==================================================
Testing with 9 tools available...

[PASS] 'translate_text' correctly called (6.471s)
[PASS] 'calculate_tip' correctly called (5.885s)
[PASS] 'search_wikipedia' correctly called (5.104s)
[PASS] 'get_weather' correctly called (4.426s)
[PASS] 'set_timer' correctly called (5.086s)

Accuracy at 9 tools: 5/5 = 100%

==================================================
Testing with 11 tools available...

[PASS] 'send_email' correctly called (6.822s)
[PASS] 'resize_image' correctly called (6.063s)
[PASS] 'convert_currency' correctly called (5.901s)
[PASS] 'get_weather' correctly called (4.876s)
[PASS] 'create_calendar_event' correctly called (8.235s)

Accuracy at 11 tools: 5/5 = 100%

==================================================
Testing with 16 tools available...

[PASS] 'calculate_tip' correctly called (5.894s)
[PASS] 'translate_text' correctly called (5.926s)
[PASS] 'generate_password' correctly called (5.904s)
[PASS] 'get_stock_price' correctly called (4.68s)
[PASS] 'resize_image' correctly called (6.341s)

Accuracy at 16 tools: 5/5 = 100%

==================================================
Testing with 21 tools available...

[PASS] 'get_weather' correctly called (5.254s)
[PASS] 'scan_port' correctly called (6.331s)
[PASS] 'convert_currency' correctly called (5.668s)
[PASS] 'find_restaurant' correctly called (5.798s)
[PASS] 'check_spelling' correctly called (5.283s)

Accuracy at 21 tools: 5/5 = 100%

==================================================
Testing with 26 tools available...

[PASS] 'create_calendar_event' correctly called (8.906s)
[PASS] 'generate_password' correctly called (5.986s)
[PASS] 'convert_units' correctly called (5.831s)
[PASS] 'get_stock_price' correctly called (4.516s)
[PASS] 'shorten_url' correctly called (6.089s)

Accuracy at 26 tools: 5/5 = 100%

==================================================
Testing with 31 tools available...

[PASS] 'create_qr_code' correctly called (6.452s)
[PASS] 'get_weather' correctly called (4.389s)
[PASS] 'analyze_sentiment' correctly called (5.215s)
[PASS] 'shorten_url' correctly called (5.011s)
[PASS] 'search_wikipedia' correctly called (4.789s)

Accuracy at 31 tools: 5/5 = 100%

==================================================
PHASE 1 SUMMARY

Model: qwopus3.5-9b-coder
Tool pool size: 31
Tools  Correct  Total  Accuracy
2      2        2      100%
3      3        3      100%
4      4        4      100%
5      5        5      100%
7      5        5      100%
9      5        5      100%
11     5        5      100%
16     5        5      100%
21     5        5      100%
26     5        5      100%
31     5        5      100%

============================================================
PHASE 2: Adversarial Tool Recall Test

Tools in pool: 25 (with semantic overlaps & decoys)
Test cases: 28

[SEMANTIC_OVERLAP] PASS: 'search_wikipedia' (6.455s)
[SEMANTIC_OVERLAP] PASS: 'search_news' (6.202s)
[SEMANTIC_OVERLAP] PASS: 'search_academic' (6.646s)
[SEMANTIC_OVERLAP] PASS: 'search_local_files' (6.383s)
[SEMANTIC_OVERLAP] PASS: 'convert_currency' (5.468s)
[SEMANTIC_OVERLAP] PASS: 'convert_units' (5.23s)
[SEMANTIC_OVERLAP] PASS: 'convert_timezone' (7.501s)
[SEMANTIC_OVERLAP] PASS: 'send_slack_message' (5.852s)
[SEMANTIC_OVERLAP] PASS: 'send_sms' (6.462s)
[SEMANTIC_OVERLAP] PASS: 'set_alarm' (10.232s)
[INDIRECT_PROMPT] PASS: 'convert_timezone' (13.482s)
[INDIRECT_PROMPT] PASS: 'convert_currency' (6.508s)
[INDIRECT_PROMPT] PASS: 'get_weather_forecast' (4.875s)
[INDIRECT_PROMPT] PASS: 'summarize_text' (12.746s)
[INDIRECT_PROMPT] PASS: 'paraphrase_text' (6.895s)
[INDIRECT_PROMPT] PASS: 'calculate_split_bill' (5.991s)
[INDIRECT_PROMPT] PASS: 'search_academic' (5.876s)
[INDIRECT_PROMPT] PASS: 'send_slack_message' (6.178s)
[DECOY_TOOL] PASS: 'get_air_quality' (4.727s)
[DECOY_TOOL] PASS: 'get_weather' (4.184s)
[DECOY_TOOL] PASS: 'paraphrase_text' (5.278s)
[DECOY_TOOL] PASS: 'set_timer' (4.527s)
[DECOY_TOOL] FAIL: expected 'create_reminder', got 'send_push_notification' (10.072s)
Rationale: Decoy: send_push_notification is tempting (buzz phone), but a scheduled note about a task = reminder
[DECOY_TOOL] PASS: 'create_calendar_event' (8.705s)
[DECOY_TOOL] PASS: 'convert_file_format' (5.956s)
[RANDOM_ORDER] PASS: 'search_web' (6.423s)
[RANDOM_ORDER] PASS: 'send_push_notification' (12.028s)
[RANDOM_ORDER] PASS: 'convert_units' (7.333s)

============================================================
PHASE 2 SUMMARY

SEMANTIC_OVERLAP 10/10 (100%) [PASS]
INDIRECT_PROMPT 8/8 (100%) [PASS]
DECOY_TOOL 6/7 (86%) [DEGRADED]
- Expected 'create_reminder', got 'send_push_notification'
Prompt: Can you buzz my phone with a note about buying milk tomorrow...
RANDOM_ORDER 3/3 (100%) [PASS]

OVERALL: 27/28 (96%)
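
The per-category labels in the Phase 2 summary can be reproduced with a few lines of Python. This is a minimal sketch, not the benchmark's actual code: the `status` and `summarize` helpers are hypothetical, and the 80% cutoff for `DEGRADED` is an assumption chosen only because it agrees with the labels printed in these logs (100% is `PASS`, 86% is `DEGRADED`).

```python
# Hypothetical scoring helpers; the real benchmark's thresholds are not shown in the logs.
def status(correct: int, total: int, degraded_cutoff: float = 0.80) -> str:
    """Label a category: PASS only at 100%, DEGRADED at or above the assumed cutoff."""
    if correct == total:
        return "PASS"
    return "DEGRADED" if correct / total >= degraded_cutoff else "FAIL"


def summarize(results: dict[str, tuple[int, int]]) -> list[str]:
    """Format one line per category plus an overall line, in the style of the log summary."""
    lines = []
    for name, (correct, total) in results.items():
        pct = round(100 * correct / total)
        lines.append(f"{name} {correct}/{total} ({pct}%) [{status(correct, total)}]")
    all_correct = sum(c for c, _ in results.values())
    all_total = sum(t for _, t in results.values())
    lines.append(f"OVERALL: {all_correct}/{all_total} ({round(100 * all_correct / all_total)}%)")
    return lines


# Counts copied from the qwopus3.5-9b-coder Phase 2 summary above.
coder_phase2 = {
    "SEMANTIC_OVERLAP": (10, 10),
    "INDIRECT_PROMPT": (8, 8),
    "DECOY_TOOL": (6, 7),
    "RANDOM_ORDER": (3, 3),
}
summary_lines = summarize(coder_phase2)
print("\n".join(summary_lines))
```

Keeping the cutoff as a parameter makes it easy to adjust if the benchmark's real thresholds turn out to differ.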

gzhone

2 days ago

Connecting to LM Studio...
Models: qwen/qwen3.5-9b

Phase selection: both

############################################################

MODEL: qwen/qwen3.5-9b

############################################################

==================================================
Testing with 2 tools available...

[PASS] 'get_weather' correctly called (5.009s)
[PASS] 'send_email' correctly called (6.897s)

Accuracy at 2 tools: 2/2 = 100%

==================================================
Testing with 3 tools available...

[PASS] 'get_weather' correctly called (5.311s)
[PASS] 'send_email' correctly called (7.27s)
[PASS] 'calculate_tip' correctly called (5.953s)

Accuracy at 3 tools: 3/3 = 100%

==================================================
Testing with 4 tools available...

[PASS] 'get_weather' correctly called (6.182s)
[PASS] 'send_email' correctly called (7.321s)
[PASS] 'calculate_tip' correctly called (7.407s)
[PASS] 'translate_text' correctly called (7.859s)

Accuracy at 4 tools: 4/4 = 100%

==================================================
Testing with 5 tools available...

[PASS] 'get_weather' correctly called (5.542s)
[PASS] 'send_email' correctly called (7.401s)
[PASS] 'calculate_tip' correctly called (7.415s)
[PASS] 'translate_text' correctly called (7.911s)
[PASS] 'set_timer' correctly called (6.917s)

Accuracy at 5 tools: 5/5 = 100%

==================================================
Testing with 7 tools available...

[PASS] 'search_wikipedia' correctly called (6.971s)
[PASS] 'get_weather' correctly called (4.601s)
[PASS] 'convert_currency' correctly called (7.373s)
[PASS] 'calculate_tip' correctly called (7.458s)
[PASS] 'set_timer' correctly called (6.365s)

Accuracy at 7 tools: 5/5 = 100%

==================================================
Testing with 9 tools available...

[PASS] 'translate_text' correctly called (7.86s)
[PASS] 'calculate_tip' correctly called (8.142s)
[PASS] 'search_wikipedia' correctly called (5.413s)
[PASS] 'get_weather' correctly called (5.233s)
[PASS] 'set_timer' correctly called (6.693s)

Accuracy at 9 tools: 5/5 = 100%

==================================================
Testing with 11 tools available...

[PASS] 'send_email' correctly called (11.327s)
[PASS] 'resize_image' correctly called (7.478s)
[PASS] 'convert_currency' correctly called (7.245s)
[PASS] 'get_weather' correctly called (5.443s)
[PASS] 'create_calendar_event' correctly called (9.094s)

Accuracy at 11 tools: 5/5 = 100%

==================================================
Testing with 16 tools available...

[PASS] 'calculate_tip' correctly called (7.187s)
[PASS] 'translate_text' correctly called (7.704s)
[PASS] 'generate_password' correctly called (7.153s)
[PASS] 'get_stock_price' correctly called (5.342s)
[PASS] 'resize_image' correctly called (7.872s)

Accuracy at 16 tools: 5/5 = 100%

==================================================
Testing with 21 tools available...

[PASS] 'get_weather' correctly called (5.344s)
[PASS] 'scan_port' correctly called (8.579s)
[PASS] 'convert_currency' correctly called (7.505s)
[PASS] 'find_restaurant' correctly called (7.789s)
[PASS] 'check_spelling' correctly called (8.24s)

Accuracy at 21 tools: 5/5 = 100%

==================================================
Testing with 26 tools available...

[PASS] 'create_calendar_event' correctly called (9.7s)
[PASS] 'generate_password' correctly called (7.007s)
[PASS] 'convert_units' correctly called (6.977s)
[PASS] 'get_stock_price' correctly called (6.252s)
[PASS] 'shorten_url' correctly called (7.046s)

Accuracy at 26 tools: 5/5 = 100%

==================================================
Testing with 31 tools available...

[PASS] 'create_qr_code' correctly called (7.39s)
[PASS] 'get_weather' correctly called (5.595s)
[PASS] 'analyze_sentiment' correctly called (7.181s)
[PASS] 'shorten_url' correctly called (6.787s)
[PASS] 'search_wikipedia' correctly called (5.159s)

Accuracy at 31 tools: 5/5 = 100%

==================================================
PHASE 1 SUMMARY

Model: qwen/qwen3.5-9b
Tool pool size: 31
Tools  Correct  Total  Accuracy
2      2        2      100%
3      3        3      100%
4      4        4      100%
5      5        5      100%
7      5        5      100%
9      5        5      100%
11     5        5      100%
16     5        5      100%
21     5        5      100%
26     5        5      100%
31     5        5      100%

============================================================
PHASE 2: Adversarial Tool Recall Test

Tools in pool: 25 (with semantic overlaps & decoys)
Test cases: 28

[SEMANTIC_OVERLAP] PASS: 'search_wikipedia' (8.046s)
[SEMANTIC_OVERLAP] PASS: 'search_news' (6.251s)
[SEMANTIC_OVERLAP] PASS: 'search_academic' (9.928s)
[SEMANTIC_OVERLAP] PASS: 'search_local_files' (7.043s)
[SEMANTIC_OVERLAP] PASS: 'convert_currency' (7.266s)
[SEMANTIC_OVERLAP] PASS: 'convert_units' (7.805s)
[SEMANTIC_OVERLAP] PASS: 'convert_timezone' (9.381s)
[SEMANTIC_OVERLAP] PASS: 'send_slack_message' (6.704s)
[SEMANTIC_OVERLAP] PASS: 'send_sms' (7.236s)
[SEMANTIC_OVERLAP] PASS: 'set_alarm' (9.284s)
[INDIRECT_PROMPT] PASS: 'convert_timezone' (12.02s)
[INDIRECT_PROMPT] PASS: 'convert_currency' (13.453s)
[INDIRECT_PROMPT] PASS: 'get_weather_forecast' (6.949s)
[INDIRECT_PROMPT] PASS: 'summarize_text' (10.556s)
[INDIRECT_PROMPT] PASS: 'paraphrase_text' (9.886s)
[INDIRECT_PROMPT] PASS: 'calculate_split_bill' (6.861s)
[INDIRECT_PROMPT] PASS: 'search_academic' (9.342s)
[INDIRECT_PROMPT] FAIL: expected 'send_slack_message', got 'none' (14.461s)
Rationale: Indirect - 'drop it in our dev channel' implies Slack, explicitly rules out email
[DECOY_TOOL] PASS: 'get_air_quality' (6.05s)
[DECOY_TOOL] PASS: 'get_weather' (5.09s)
[DECOY_TOOL] PASS: 'paraphrase_text' (7.659s)
[DECOY_TOOL] PASS: 'set_timer' (7.384s)
[DECOY_TOOL] FAIL: expected 'create_reminder', got 'send_push_notification' (10.21s)
Rationale: Decoy: send_push_notification is tempting (buzz phone), but a scheduled note about a task = reminder
[DECOY_TOOL] PASS: 'create_calendar_event' (19.048s)
[DECOY_TOOL] PASS: 'convert_file_format' (8.045s)
[RANDOM_ORDER] PASS: 'search_web' (8.181s)
[RANDOM_ORDER] PASS: 'send_push_notification' (15.028s)
[RANDOM_ORDER] PASS: 'convert_units' (11.034s)

============================================================
PHASE 2 SUMMARY

SEMANTIC_OVERLAP 10/10 (100%) [PASS]
INDIRECT_PROMPT 7/8 (88%) [DEGRADED]
- Expected 'send_slack_message', got 'none/no-tool-call'
Prompt: I need to let my team know the build succeeded but I don't w...
DECOY_TOOL 6/7 (86%) [DEGRADED]
- Expected 'create_reminder', got 'send_push_notification'
Prompt: Can you buzz my phone with a note about buying milk tomorrow...
RANDOM_ORDER 3/3 (100%) [PASS]

OVERALL: 26/28 (93%)
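
A failure such as `expected 'send_slack_message', got 'none'` means the response contained no tool call at all. Below is a rough sketch of how such a check could be written against an OpenAI-compatible chat completions response, the format LM Studio's local server returns. `called_tool` and `check_case` are hypothetical helpers, and the two sample responses are hand-written stand-ins, not captured model output.

```python
from typing import Optional


def called_tool(response: dict) -> Optional[str]:
    """Return the name of the first tool call in the response, or None if there is none."""
    choices = response.get("choices") or []
    if not choices:
        return None
    tool_calls = choices[0].get("message", {}).get("tool_calls") or []
    if not tool_calls:
        return None
    return tool_calls[0].get("function", {}).get("name")


def check_case(response: dict, expected: str) -> str:
    """Format a PASS/FAIL line in the style of the logs above (timing omitted)."""
    got = called_tool(response)
    if got == expected:
        return f"PASS: '{expected}'"
    return f"FAIL: expected '{expected}', got '{got or 'none'}'"


# Hand-written stand-in responses (assumed shapes, not real model output).
plain_answer = {"choices": [{"message": {"role": "assistant", "content": "Sure, done."}}]}
slack_call = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {"name": "send_slack_message", "arguments": "{\"channel\": \"dev\"}"},
            }],
        }
    }]
}

print(check_case(plain_answer, "send_slack_message"))
print(check_case(slack_call, "send_slack_message"))
```

Only the first tool call is inspected here; a harness that allows parallel tool calls would need to decide how to score extra calls.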

gzhone

2 days ago

Connecting to LM Studio...
Models: mradermacher/minicpm-v-4.6

Phase selection: both

############################################################

MODEL: mradermacher/minicpm-v-4.6

############################################################

==================================================
Testing with 2 tools available...

[PASS] 'get_weather' correctly called (2.501s)
[FAIL] Expected 'send_email', got 'none' No tool call made (2.386s)

Accuracy at 2 tools: 1/2 = 50%

==================================================
Testing with 3 tools available...

[PASS] 'get_weather' correctly called (2.614s)
[PASS] 'send_email' correctly called (2.795s)
[PASS] 'calculate_tip' correctly called (2.626s)

Accuracy at 3 tools: 3/3 = 100%

==================================================
Testing with 4 tools available...

[PASS] 'get_weather' correctly called (2.581s)
[PASS] 'send_email' correctly called (2.919s)
[PASS] 'calculate_tip' correctly called (2.882s)
[PASS] 'translate_text' correctly called (2.742s)

Accuracy at 4 tools: 4/4 = 100%

==================================================
Testing with 5 tools available...

[PASS] 'get_weather' correctly called (2.645s)
[PASS] 'send_email' correctly called (2.923s)
[PASS] 'calculate_tip' correctly called (2.849s)
[PASS] 'translate_text' correctly called (2.789s)
[PASS] 'set_timer' correctly called (2.601s)

Accuracy at 5 tools: 5/5 = 100%

==================================================
Testing with 7 tools available...

[PASS] 'search_wikipedia' correctly called (2.472s)
[PASS] 'get_weather' correctly called (2.514s)
[PASS] 'convert_currency' correctly called (2.616s)
[PASS] 'calculate_tip' correctly called (2.791s)
[PASS] 'set_timer' correctly called (2.711s)

Accuracy at 7 tools: 5/5 = 100%

==================================================
Testing with 9 tools available...

[PASS] 'translate_text' correctly called (2.788s)
[PASS] 'calculate_tip' correctly called (2.632s)
[PASS] 'search_wikipedia' correctly called (2.696s)
[PASS] 'get_weather' correctly called (2.597s)
[PASS] 'set_timer' correctly called (2.698s)

Accuracy at 9 tools: 5/5 = 100%

==================================================
Testing with 11 tools available...

[PASS] 'send_email' correctly called (2.922s)
[PASS] 'resize_image' correctly called (2.865s)
[PASS] 'convert_currency' correctly called (2.673s)
[PASS] 'get_weather' correctly called (2.606s)
[FAIL] Expected 'create_calendar_event', got 'none' No tool call made (2.885s)

Accuracy at 11 tools: 4/5 = 80%

==================================================
Testing with 16 tools available...

[PASS] 'calculate_tip' correctly called (2.977s)
[PASS] 'translate_text' correctly called (2.859s)
[PASS] 'generate_password' correctly called (2.807s)
[PASS] 'get_stock_price' correctly called (2.711s)
[PASS] 'resize_image' correctly called (2.792s)

Accuracy at 16 tools: 5/5 = 100%

==================================================
Testing with 21 tools available...

[PASS] 'get_weather' correctly called (2.659s)
[PASS] 'scan_port' correctly called (2.967s)
[PASS] 'convert_currency' correctly called (2.882s)
[PASS] 'find_restaurant' correctly called (2.844s)
[PASS] 'check_spelling' correctly called (2.654s)

Accuracy at 21 tools: 5/5 = 100%

==================================================
Testing with 26 tools available...

[PASS] 'create_calendar_event' correctly called (3.0s)
[FAIL] Expected 'generate_password', got 'none' No tool call made (2.777s)
[PASS] 'convert_units' correctly called (2.653s)
[PASS] 'get_stock_price' correctly called (2.51s)
[PASS] 'shorten_url' correctly called (2.521s)

Accuracy at 26 tools: 4/5 = 80%

==================================================
Testing with 31 tools available...

[FAIL] Expected 'create_qr_code', got 'none' No tool call made (2.814s)
[PASS] 'get_weather' correctly called (2.645s)
[PASS] 'analyze_sentiment' correctly called (2.518s)
[PASS] 'shorten_url' correctly called (2.721s)
[PASS] 'search_wikipedia' correctly called (2.755s)

Accuracy at 31 tools: 4/5 = 80%

==================================================
PHASE 1 SUMMARY

Model: mradermacher/minicpm-v-4.6
Tool pool size: 31
Tools  Correct  Total  Accuracy
2      1        2      50%
3      3        3      100%
4      4        4      100%
5      5        5      100%
7      5        5      100%
9      5        5      100%
11     4        5      80%
16     5        5      100%
21     5        5      100%
26     4        5      80%
31     4        5      80%

Detailed Failures:
At 2 tools:
- expected '?', got 'None' (error: No tool call made)
At 11 tools:
- expected '?', got 'None' (error: No tool call made)
At 26 tools:
- expected '?', got 'None' (error: No tool call made)
At 31 tools:
- expected '?', got 'None' (error: No tool call made)

============================================================
PHASE 2: Adversarial Tool Recall Test

Tools in pool: 25 (with semantic overlaps & decoys)
Test cases: 28

[SEMANTIC_OVERLAP] PASS: 'search_wikipedia' (3.032s)
[SEMANTIC_OVERLAP] FAIL: expected 'search_news', got 'none' (2.758s)
Rationale: Must pick search_news over search_web, search_wikipedia
[SEMANTIC_OVERLAP] PASS: 'search_academic' (2.856s)
[SEMANTIC_OVERLAP] FAIL: expected 'search_local_files', got 'none' (2.509s)
Rationale: Must pick search_local_files over other search tools
[SEMANTIC_OVERLAP] PASS: 'convert_currency' (2.731s)
[SEMANTIC_OVERLAP] FAIL: expected 'convert_units', got 'none' (2.545s)
Rationale: Must pick convert_units over convert_currency
[SEMANTIC_OVERLAP] PASS: 'convert_timezone' (2.881s)
[SEMANTIC_OVERLAP] PASS: 'send_slack_message' (2.678s)
[SEMANTIC_OVERLAP] FAIL: expected 'send_sms', got 'none' (2.591s)
Rationale: Must pick send_sms over send_email, send_slack_message
[SEMANTIC_OVERLAP] FAIL: expected 'set_alarm', got 'create_reminder' (2.943s)
Rationale: Must pick set_alarm (recurring wake-up) over set_timer, create_reminder
[INDIRECT_PROMPT] PASS: 'convert_timezone' (2.741s)
[INDIRECT_PROMPT] PASS: 'convert_currency' (2.944s)
[INDIRECT_PROMPT] PASS: 'get_weather_forecast' (2.827s)
[INDIRECT_PROMPT] PASS: 'summarize_text' (3.099s)
[INDIRECT_PROMPT] FAIL: expected 'paraphrase_text', got 'search_web' (2.624s)
Rationale: Indirect - 'sound more professional without changing meaning' = paraphrase
[INDIRECT_PROMPT] FAIL: expected 'calculate_split_bill', got 'none' (9.343s)
Rationale: Indirect - describes bill splitting scenario, doesn't say 'split bill'
[INDIRECT_PROMPT] FAIL: expected 'search_academic', got 'none' (2.666s)
Rationale: Indirect - 'thesis advisor recommended a paper' implies academic search
[INDIRECT_PROMPT] PASS: 'send_slack_message' (2.726s)
[DECOY_TOOL] PASS: 'get_air_quality' (2.644s)
[DECOY_TOOL] PASS: 'get_weather' (2.445s)
[DECOY_TOOL] PASS: 'paraphrase_text' (2.853s)
[DECOY_TOOL] PASS: 'set_timer' (2.685s)
[DECOY_TOOL] FAIL: expected 'create_reminder', got 'send_sms' (2.892s)
Rationale: Decoy: send_push_notification is tempting (buzz phone), but a scheduled note about a task = reminder
[DECOY_TOOL] PASS: 'create_calendar_event' (2.736s)
[DECOY_TOOL] PASS: 'convert_file_format' (2.838s)
[RANDOM_ORDER] PASS: 'search_web' (3.091s)
[RANDOM_ORDER] FAIL: expected 'send_push_notification', got 'none' (3.24s)
Rationale: Must pick push notification over send_sms, send_email when tool order is random
[RANDOM_ORDER] PASS: 'convert_units' (3.219s)

============================================================
PHASE 2 SUMMARY

SEMANTIC_OVERLAP 5/10 (50%) [FAIL]
- Expected 'search_news', got 'none/no-tool-call'
Prompt: Search for recent news articles about the 2026 World Cup.
- Expected 'search_local_files', got 'none/no-tool-call'
Prompt: Find a file called 'budget_2026.xlsx' on my computer.
- Expected 'convert_units', got 'none/no-tool-call'
Prompt: How many kilometers is 26.2 miles?
- Expected 'send_sms', got 'none/no-tool-call'
Prompt: Text my wife at 555-0123 that I'll be late for dinner.
- Expected 'set_alarm', got 'create_reminder'
Prompt: I need to wake up at 6:30am every weekday.
INDIRECT_PROMPT 5/8 (62%) [FAIL]
- Expected 'paraphrase_text', got 'search_web'
Prompt: My essay opening sounds too casual for an academic submissio...
- Expected 'calculate_split_bill', got 'none/no-tool-call'
Prompt: We're a group of 4 at the restaurant and the total came to $...
- Expected 'search_academic', got 'none/no-tool-call'
Prompt: My thesis advisor recommended a paper by Smith et al. on tra...
DECOY_TOOL 6/7 (86%) [DEGRADED]
- Expected 'create_reminder', got 'send_sms'
Prompt: Can you buzz my phone with a note about buying milk tomorrow...
RANDOM_ORDER 2/3 (67%) [FAIL]
- Expected 'send_push_notification', got 'none/no-tool-call'
Prompt: Send a high-priority notification to my phone saying 'Server...

OVERALL: 18/28 (64%)
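
To compare the three local models side by side, the `OVERALL` lines from the Phase 2 summaries above can be parsed and ranked. The sketch below is illustrative only: the regular expression assumes the exact `OVERALL: X/Y (Z%)` format seen in these logs, and `parse_overall` is a hypothetical helper rather than part of the benchmark.

```python
import re

OVERALL_RE = re.compile(r"OVERALL:\s*(\d+)/(\d+)\s*\((\d+)%\)")


def parse_overall(line: str) -> tuple[int, int]:
    """Extract (correct, total) from an 'OVERALL: X/Y (Z%)' log line."""
    match = OVERALL_RE.search(line)
    if match is None:
        raise ValueError(f"not an OVERALL line: {line!r}")
    return int(match.group(1)), int(match.group(2))


# OVERALL lines copied from the three LM Studio runs above.
overall_lines = {
    "qwopus3.5-9b-coder": "OVERALL: 27/28 (96%)",
    "qwen/qwen3.5-9b": "OVERALL: 26/28 (93%)",
    "mradermacher/minicpm-v-4.6": "OVERALL: 18/28 (64%)",
}

scores = {model: parse_overall(line) for model, line in overall_lines.items()}
# Rank by fraction correct, best first.
ranking = sorted(scores, key=lambda m: scores[m][0] / scores[m][1], reverse=True)
for model in ranking:
    correct, total = scores[model]
    print(f"{model}: {correct}/{total} ({round(100 * correct / total)}%)")
```

With only 28 test cases per run, a one-case difference moves the score by about 3.6 percentage points, so small gaps between models should be read with caution.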

gzhone

2 days ago

Verifying GitHub Copilot API connection...
Connected successfully.

Model: claude-opus-4.6
Tool pool: 31 tools, testing up to 31
Phase selection: both

==================================================
Testing with 2 tools available...

[PASS] 'get_weather' correctly called (3.075s)
[PASS] 'send_email' correctly called (3.281s)

Accuracy at 2 tools: 2/2 = 100%

==================================================
Testing with 3 tools available...

[PASS] 'get_weather' correctly called (2.794s)
[PASS] 'send_email' correctly called (3.529s)
[PASS] 'calculate_tip' correctly called (3.808s)

Accuracy at 3 tools: 3/3 = 100%

==================================================
Testing with 4 tools available...

[PASS] 'get_weather' correctly called (3.11s)
[PASS] 'send_email' correctly called (3.704s)
[PASS] 'calculate_tip' correctly called (3.088s)
[PASS] 'translate_text' correctly called (3.42s)

Accuracy at 4 tools: 4/4 = 100%

==================================================
Testing with 5 tools available...

[PASS] 'get_weather' correctly called (2.879s)
[PASS] 'send_email' correctly called (3.668s)
[PASS] 'calculate_tip' correctly called (6.329s)
[PASS] 'translate_text' correctly called (3.745s)
[PASS] 'set_timer' correctly called (3.087s)

Accuracy at 5 tools: 5/5 = 100%

==================================================
Testing with 7 tools available...

[PASS] 'search_wikipedia' correctly called (3.004s)
[PASS] 'get_weather' correctly called (2.978s)
[PASS] 'convert_currency' correctly called (3.401s)
[PASS] 'calculate_tip' correctly called (3.675s)
[PASS] 'set_timer' correctly called (3.143s)

Accuracy at 7 tools: 5/5 = 100%

==================================================
Testing with 9 tools available...

[PASS] 'translate_text' correctly called (3.327s)
[PASS] 'calculate_tip' correctly called (3.17s)
[PASS] 'search_wikipedia' correctly called (3.415s)
[PASS] 'get_weather' correctly called (3.083s)
[PASS] 'set_timer' correctly called (3.5s)

Accuracy at 9 tools: 5/5 = 100%

==================================================
Testing with 11 tools available...

[PASS] 'send_email' correctly called (3.156s)
[PASS] 'resize_image' correctly called (3.302s)
[PASS] 'convert_currency' correctly called (3.315s)
[PASS] 'get_weather' correctly called (3.2s)
[PASS] 'create_calendar_event' correctly called (3.777s)

Accuracy at 11 tools: 5/5 = 100%

==================================================
Testing with 16 tools available...

[PASS] 'calculate_tip' correctly called (3.033s)
[PASS] 'translate_text' correctly called (3.285s)
[PASS] 'generate_password' correctly called (4.948s)
[PASS] 'get_stock_price' correctly called (5.655s)
[PASS] 'resize_image' correctly called (3.473s)

Accuracy at 16 tools: 5/5 = 100%

==================================================
Testing with 21 tools available...

[PASS] 'get_weather' correctly called (3.237s)
[PASS] 'scan_port' correctly called (3.156s)
[PASS] 'convert_currency' correctly called (3.262s)
[PASS] 'find_restaurant' correctly called (3.212s)
[PASS] 'check_spelling' correctly called (4.512s)

Accuracy at 21 tools: 5/5 = 100%

==================================================
Testing with 26 tools available...

[PASS] 'create_calendar_event' correctly called (3.708s)
[PASS] 'generate_password' correctly called (3.626s)
[PASS] 'convert_units' correctly called (3.188s)
[PASS] 'get_stock_price' correctly called (3.158s)
[PASS] 'shorten_url' correctly called (4.027s)

Accuracy at 26 tools: 5/5 = 100%

==================================================
Testing with 31 tools available...

[PASS] 'create_qr_code' correctly called (3.715s)
[PASS] 'get_weather' correctly called (3.521s)
[PASS] 'analyze_sentiment' correctly called (3.166s)
[PASS] 'shorten_url' correctly called (4.017s)
[PASS] 'search_wikipedia' correctly called (3.161s)

Accuracy at 31 tools: 5/5 = 100%

==================================================
PHASE 1 SUMMARY

Model: claude-opus-4.6
Tool pool size: 31
Tools  Correct  Total  Accuracy
    2        2      2      100%
    3        3      3      100%
    4        4      4      100%
    5        5      5      100%
    7        5      5      100%
    9        5      5      100%
   11        5      5      100%
   16        5      5      100%
   21        5      5      100%
   26        5      5      100%
   31        5      5      100%

============================================================
PHASE 2: Adversarial Tool Recall Test

Tools in pool: 25 (with semantic overlaps & decoys)
Test cases: 28

[SEMANTIC_OVERLAP] PASS: 'search_wikipedia' (4.028s)
[SEMANTIC_OVERLAP] PASS: 'search_news' (3.566s)
[SEMANTIC_OVERLAP] PASS: 'search_academic' (3.347s)
[SEMANTIC_OVERLAP] PASS: 'search_local_files' (2.924s)
[SEMANTIC_OVERLAP] PASS: 'convert_currency' (3.377s)
[SEMANTIC_OVERLAP] PASS: 'convert_units' (3.491s)
[SEMANTIC_OVERLAP] PASS: 'convert_timezone' (3.429s)
[SEMANTIC_OVERLAP] PASS: 'send_slack_message' (3.727s)
[SEMANTIC_OVERLAP] PASS: 'send_sms' (3.232s)
[SEMANTIC_OVERLAP] PASS: 'set_alarm' (3.398s)
[INDIRECT_PROMPT] PASS: 'convert_timezone' (3.534s)
[INDIRECT_PROMPT] PASS: 'convert_currency' (3.706s)
[INDIRECT_PROMPT] PASS: 'get_weather_forecast' (3.599s)
[INDIRECT_PROMPT] PASS: 'summarize_text' (4.205s)
[INDIRECT_PROMPT] PASS: 'paraphrase_text' (3.432s)
[INDIRECT_PROMPT] PASS: 'calculate_split_bill' (3.358s)
[INDIRECT_PROMPT] PASS: 'search_academic' (3.615s)
[INDIRECT_PROMPT] PASS: 'send_slack_message' (3.199s)
[DECOY_TOOL] PASS: 'get_air_quality' (3.067s)
[DECOY_TOOL] PASS: 'get_weather' (3.149s)
[DECOY_TOOL] PASS: 'paraphrase_text' (3.204s)
[DECOY_TOOL] PASS: 'set_timer' (3.486s)
[DECOY_TOOL] FAIL: expected 'create_reminder', got 'send_push_notification' (3.297s)
    Rationale: Decoy: send_push_notification is tempting (buzz phone), but a scheduled note about a task = reminder
[DECOY_TOOL] PASS: 'create_calendar_event' (3.357s)
[DECOY_TOOL] PASS: 'convert_file_format' (3.8s)
[RANDOM_ORDER] PASS: 'search_web' (3.411s)
[RANDOM_ORDER] PASS: 'send_push_notification' (3.373s)
[RANDOM_ORDER] PASS: 'convert_units' (3.33s)

============================================================
PHASE 2 SUMMARY

OVERALL: 27/28 (96%)
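
This run prints only the overall score. The per-category counts can be recovered from the bracketed result lines above. A minimal sketch, assuming every result line has the `[CATEGORY] PASS|FAIL: ...` shape shown in the log; the `tally` helper and the regex are illustrative and not part of the test harness. The sample is the DECOY_TOOL block from this run:

```python
import re

# Result lines copied from the DECOY_TOOL block of the log above.
SAMPLE_LOG = """\
[DECOY_TOOL] PASS: 'get_air_quality' (3.067s)
[DECOY_TOOL] PASS: 'get_weather' (3.149s)
[DECOY_TOOL] PASS: 'paraphrase_text' (3.204s)
[DECOY_TOOL] PASS: 'set_timer' (3.486s)
[DECOY_TOOL] FAIL: expected 'create_reminder', got 'send_push_notification' (3.297s)
Rationale: Decoy: send_push_notification is tempting (buzz phone), but a scheduled note about a task = reminder
[DECOY_TOOL] PASS: 'create_calendar_event' (3.357s)
[DECOY_TOOL] PASS: 'convert_file_format' (3.8s)
"""

# Assumed line shape: "[CATEGORY] PASS: ..." or "[CATEGORY] FAIL: ..."
RESULT_LINE = re.compile(r"^\[(\w+)\] (PASS|FAIL):")


def tally(log_text):
    """Return {category: (passed, total)} in order of first appearance."""
    counts = {}
    for line in log_text.splitlines():
        match = RESULT_LINE.match(line.strip())
        if not match:
            continue  # skips 'Rationale:' lines and blanks
        category, outcome = match.groups()
        passed, total = counts.get(category, (0, 0))
        counts[category] = (passed + (1 if outcome == "PASS" else 0), total + 1)
    return counts


for category, (passed, total) in tally(SAMPLE_LOG).items():
    print(f"{category} {passed}/{total} ({passed / total:.0%})")
# prints: DECOY_TOOL 6/7 (86%)
```

Counting the full Phase 2 block of this run the same way gives 10/10, 8/8, 6/7 and 3/3, which adds up to the 27/28 reported.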

gzhone

2 days ago

Tool Recall Capacity Test (GitHub Copilot / GitHub Models)
Timestamp: 2026-05-28 23:50:34
Pool: 31 tools (Phase 1), 25 tools (Phase 2)

Verifying connection...
Connected. Model: claude-opus-4.7

============================================================
PHASE 2: Adversarial Tool Recall Test

Tools in pool: 25 (with semantic overlaps & decoys)
Test cases: 28

[SEMANTIC_OVERLAP] PASS: 'search_wikipedia' (2.769s)
[SEMANTIC_OVERLAP] PASS: 'search_news' (2.878s)
[SEMANTIC_OVERLAP] PASS: 'search_academic' (2.735s)
[SEMANTIC_OVERLAP] PASS: 'search_local_files' (3.282s)
[SEMANTIC_OVERLAP] PASS: 'convert_currency' (3.39s)
[SEMANTIC_OVERLAP] PASS: 'convert_units' (3.855s)
[SEMANTIC_OVERLAP] PASS: 'convert_timezone' (2.946s)
[SEMANTIC_OVERLAP] PASS: 'send_slack_message' (4.025s)
[SEMANTIC_OVERLAP] PASS: 'send_sms' (3.01s)
[SEMANTIC_OVERLAP] PASS: 'set_alarm' (3.784s)
[INDIRECT_PROMPT] PASS: 'convert_timezone' (2.943s)
[INDIRECT_PROMPT] PASS: 'convert_currency' (2.88s)
[INDIRECT_PROMPT] PASS: 'get_weather_forecast' (3.592s)
[INDIRECT_PROMPT] PASS: 'summarize_text' (3.547s)
[INDIRECT_PROMPT] PASS: 'paraphrase_text' (2.969s)
[INDIRECT_PROMPT] PASS: 'calculate_split_bill' (2.948s)
[INDIRECT_PROMPT] FAIL: expected 'search_academic', got 'none' (6.3s)
    Rationale: Indirect - 'thesis advisor recommended a paper' implies academic search
[INDIRECT_PROMPT] PASS: 'send_slack_message' (2.616s)
[DECOY_TOOL] PASS: 'get_air_quality' (2.414s)
[DECOY_TOOL] PASS: 'get_weather' (2.4s)
[DECOY_TOOL] PASS: 'paraphrase_text' (2.926s)
[DECOY_TOOL] PASS: 'set_timer' (2.668s)
[DECOY_TOOL] FAIL: expected 'create_reminder', got 'send_push_notification' (3.113s)
    Rationale: Decoy: send_push_notification is tempting (buzz phone), but a scheduled note about a task = reminder
[DECOY_TOOL] PASS: 'create_calendar_event' (2.918s)
[DECOY_TOOL] PASS: 'convert_file_format' (3.598s)
[RANDOM_ORDER] PASS: 'search_web' (3.661s)
[RANDOM_ORDER] PASS: 'send_push_notification' (3.972s)
[RANDOM_ORDER] PASS: 'convert_units' (3.293s)

============================================================
PHASE 2 SUMMARY

SEMANTIC_OVERLAP  10/10 (100%)  [PASS]
INDIRECT_PROMPT    7/8  (88%)   [DEGRADED]
    - Expected 'search_academic', got 'none/no-tool-call'
      Prompt: My thesis advisor recommended a paper by Smith et al. on tra...
DECOY_TOOL         6/7  (86%)   [DEGRADED]
    - Expected 'create_reminder', got 'send_push_notification'
      Prompt: Can you buzz my phone with a note about buying milk tomorrow...
RANDOM_ORDER       3/3  (100%)  [PASS]

OVERALL: 26/28 (93%)

Log saved to: copilot_tool_recall_20260528_235034.log
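
To check the summary arithmetic and see where this run differs from the claude-opus-4.6 run earlier in the thread, the figures can be compared directly. A rough sketch with the numbers copied from the two logs; the variable names are illustrative only:

```python
# Per-category (passed, total) from the claude-opus-4.7 summary above.
opus_47 = {
    "SEMANTIC_OVERLAP": (10, 10),
    "INDIRECT_PROMPT": (7, 8),
    "DECOY_TOOL": (6, 7),
    "RANDOM_ORDER": (3, 3),
}

# Failed cases as (category, expected tool), read off the FAIL lines of each log.
fails_46 = {("DECOY_TOOL", "create_reminder")}
fails_47 = {
    ("INDIRECT_PROMPT", "search_academic"),
    ("DECOY_TOOL", "create_reminder"),
}

passed = sum(p for p, _ in opus_47.values())
total = sum(t for _, t in opus_47.values())
overall = f"OVERALL: {passed}/{total} ({passed / total:.0%})"
print(overall)  # OVERALL: 26/28 (93%)

shared = sorted(fails_46 & fails_47)    # cases both models missed
only_47 = sorted(fails_47 - fails_46)   # cases only claude-opus-4.7 missed
print("failed in both runs:", shared)
print("failed only in 4.7:", only_47)
```

The `create_reminder` decoy case fails in both runs, while the `search_academic` indirect prompt fails only in the claude-opus-4.7 run, where no tool was called.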


Tool calling capability on par with Opus 4.6

