Feedback wanted: fast local gating before agent tool calls

by armorerlabs - opened 20 days ago

Armorer Labs org 20 days ago

We are opening up Armorer Guard as a lightweight runtime gate for AI agents, with the classifier hosted here and the Rust CLI/runtime on GitHub.

The design question we are most interested in: should agent safety scanners treat every string the same, or classify by where the text appears in the run?

Our current direction is to scan with context such as:

- retrieved content vs model output vs tool-call arguments
- pre-tool-call vs post-tool-output stages
- tool name and intended outbound surface

That matters because a phrase can be harmless in retrieved context but risky once it becomes an argument to send_email, shell, http, or a data export tool. We are trying to keep the first pass fast and local enough to run before every tool invocation, then escalate only when the reason labels cross a threshold.

This is close in spirit to step-level tool invocation guardrail work like ToolSafe: https://huggingface.co/papers/2601.10156
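
To make the API-shape question concrete, here is a minimal sketch of a context-aware first pass with threshold-based escalation. Everything in it is an assumption for discussion: the names `ScanContext`, `Source`, `Stage`, `first_pass`, and `should_escalate`, the phrase table, and the weights are all hypothetical, and a keyword lookup stands in for the real classifier. It is not Armorer Guard's actual interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Source(Enum):
    RETRIEVED = "retrieved"
    MODEL_OUTPUT = "model_output"
    TOOL_ARGS = "tool_args"


class Stage(Enum):
    PRE_TOOL_CALL = "pre_tool_call"
    POST_TOOL_OUTPUT = "post_tool_output"


@dataclass(frozen=True)
class ScanContext:
    """Hypothetical metadata a scanner could accept alongside the text."""
    source: Source
    stage: Stage
    tool_name: Optional[str] = None  # e.g. "send_email", "shell", "http"
    outbound: bool = False           # True if the tool can reach outside the host


# Hypothetical phrase -> reason-label table; a real first pass would use a model.
SUSPICIOUS_PHRASES = {
    "ignore previous instructions": "instruction_override",
    "api_key": "credential_reference",
}


def first_pass(text: str, ctx: ScanContext) -> Dict[str, float]:
    """Return reason labels with a score that depends on where the text appears."""
    weight = 0.3  # baseline for retrieved content or plain model output
    if ctx.source is Source.TOOL_ARGS and ctx.stage is Stage.PRE_TOOL_CALL:
        weight = 0.9 if ctx.outbound else 0.6
    lowered = text.lower()
    return {
        label: weight
        for phrase, label in SUSPICIOUS_PHRASES.items()
        if phrase in lowered
    }


def should_escalate(reasons: Dict[str, float], threshold: float = 0.5) -> bool:
    """Escalate to a slower check only when some reason label crosses the threshold."""
    return any(score >= threshold for score in reasons.values())
```

In this sketch the same string yields the same reason labels everywhere, but only the pre-tool-call, outbound placement pushes them over the escalation threshold.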

If you are building agent evals, prompt-injection tests, MCP/tool-call harnesses, or runtime guardrails, we would love feedback on the API shape and what metadata you would want the scanner to accept.

Model: https://huggingface.co/armorer-labs/armorer-guard-semantic-classifier
Runtime/CLI: https://github.com/armorer-labs/armorer-guard

armorerlabs

Armorer Labs org 20 days ago

Update: we published the v0.2.1 GitHub release and a live Space demo for the classifier/runtime flow.

Release: https://github.com/ArmorerLabs/Armorer-Guard/releases/tag/v0.2.1
Demo Space: https://huggingface.co/spaces/armorer-labs/armorer-guard-demo
Results snapshot: https://github.com/ArmorerLabs/Armorer-Guard/blob/main/docs/RESULTS.md

The most useful feedback is still agent-runtime placement: retrieval ingress, model output, tool-call args, outbound sends, memory/log writes, or all of the above.

armorerlabs

Armorer Labs org 18 days ago

One pattern we keep finding useful is to evaluate the same text at different runtime placements rather than only once in isolation.

For example, the same string can behave very differently when treated as:

- retrieved context
- model output
- tool-call arguments
- outbound message content

That placement sensitivity often matters more than the raw text itself, because the risk is really about what the text can influence next.
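
For anyone building evals around this, a rough sketch of a placement-sweep harness is below. It runs one string through a scanner once per placement and reports whether the verdict changes. The placement names, `evaluate_placements`, `placement_sensitive`, and the `toy_scan` stand-in are all hypothetical; you would swap in whatever scanner call you actually use.

```python
from typing import Callable, Dict, Sequence

# Hypothetical placement labels mirroring the list above.
PLACEMENTS = (
    "retrieved_context",
    "model_output",
    "tool_call_args",
    "outbound_message",
)


def evaluate_placements(
    text: str,
    scan: Callable[[str, str], str],
    placements: Sequence[str] = PLACEMENTS,
) -> Dict[str, str]:
    """Scan the same text once per placement and collect the verdicts."""
    return {placement: scan(text, placement) for placement in placements}


def placement_sensitive(verdicts: Dict[str, str]) -> bool:
    """True when the verdict depends on placement, not only on the text."""
    return len(set(verdicts.values())) > 1


def toy_scan(text: str, placement: str) -> str:
    """Stand-in scanner for illustration only; not a real classifier."""
    if "rm -rf" not in text:
        return "allow"
    per_placement = {"retrieved_context": "allow", "model_output": "review"}
    return per_placement.get(placement, "block")


verdicts = evaluate_placements("run rm -rf /tmp/cache", toy_scan)
for placement, verdict in verdicts.items():
    print(f"{placement}: {verdict}")
```

Strings for which `placement_sensitive` returns True are the interesting eval cases, since a scanner that looks at text in isolation has to get at least one of those placements wrong.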

armorerlabs

Armorer Labs org 18 days ago

Another runtime-placement case that seems worth testing is memory/log writes.

Some text is safe to discuss in the moment but unsafe to persist as durable memory or to copy into logs without redaction. That placement tends to surface a different class of policy decision than plain prompt classification.

In practice, memory and logs can quietly become new trust channels if they preserve low-trust or sensitive text longer than intended.
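
One way to picture the memory/log case is a redaction step that runs only on the persistence path. The sketch below is an illustration under stated assumptions: the two regex patterns (a simplified email matcher and a made-up `sk-` token format) and the name `redact_for_persistence` are hypothetical, and real deployments would need a much broader detector.

```python
import re
from typing import List, Tuple

# Hypothetical, deliberately simple patterns for illustration.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "secret_token": re.compile(r"\bsk-[A-Za-z0-9]{8,}\b"),
}


def redact_for_persistence(text: str) -> Tuple[str, List[str]]:
    """Replace sensitive spans before text is written to memory or logs.

    Returns the redacted text plus the labels that fired, so the caller
    can decide whether to persist, drop, or send the entry for review.
    """
    fired: List[str] = []
    for label, pattern in PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{label}]", text)
        if count:
            fired.append(label)
    return text, fired


redacted, labels = redact_for_persistence(
    "Contact alice@example.com, key sk-abc12345XYZ"
)
print(redacted)
print(labels)
```

The point of returning the labels alongside the text is that the live answer can still use the original string while the durable copy carries only the redacted form.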

armorerlabs

Armorer Labs org 18 days ago

Another runtime-placement case that seems useful is publish / external-send.

Some content may be acceptable inside local reasoning but should face a higher bar before it is published, emailed, pushed, or otherwise turned into an external side effect. That placement often changes the right policy from allow to review or block, even when the text itself has not changed.
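
A small sketch of that "higher bar" idea: the same risk score maps to a stricter verdict when the action has an external side effect. The `Verdict` enum, the action names, and the threshold numbers are hypothetical values chosen for illustration, not tuned or published settings.

```python
from enum import IntEnum


class Verdict(IntEnum):
    ALLOW = 0
    REVIEW = 1
    BLOCK = 2


# Hypothetical set of actions that leave the local runtime.
EXTERNAL_ACTIONS = {"publish", "email", "push"}


def decide(risk_score: float, action: str) -> Verdict:
    """Map a risk score to a verdict, using lower thresholds for external sends."""
    if action in EXTERNAL_ACTIONS:
        review_at, block_at = 0.3, 0.6  # stricter bar for external side effects
    else:
        review_at, block_at = 0.6, 0.9  # looser bar for local reasoning
    if risk_score >= block_at:
        return Verdict.BLOCK
    if risk_score >= review_at:
        return Verdict.REVIEW
    return Verdict.ALLOW


# Same text, same score, different placement.
print(decide(0.4, "local_reasoning").name)
print(decide(0.4, "email").name)
```

Keeping the score and the thresholds separate means the classifier stays placement-agnostic while the policy layer carries the placement logic, which may be easier to audit than a single blended score.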
