Feedback wanted: fast local gating before agent tool calls

#1
by armorerlabs - opened
Armorer Labs org

We are opening up Armorer Guard as a lightweight runtime gate for AI agents, with the classifier hosted here and the Rust CLI/runtime in GitHub.

The design question we are most interested in: should agent safety scanners treat every string the same, or classify by where the text appears in the run?

Our current direction is to scan with context such as:

  • retrieved content vs model output vs tool-call arguments
  • pre-tool-call vs post-tool-output stages
  • tool name and intended outbound surface

That matters because a phrase can be harmless in retrieved context but risky once it becomes an argument to send_email, shell, http, or a data export tool. We are trying to keep the first pass fast and local enough to run before every tool invocation, then escalate only when the reason labels cross a threshold.

This is close in spirit to step-level tool invocation guardrail work like ToolSafe: https://huggingface.co/papers/2601.10156

If you are building agent evals, prompt-injection tests, MCP/tool-call harnesses, or runtime guardrails, I would love feedback on the API shape and what metadata you would want the scanner to accept.

Armorer Labs org

Update: we published the v0.2.1 GitHub release and a live Space demo for the classifier/runtime flow.

The most useful feedback is still agent-runtime placement: retrieval ingress, model output, tool-call args, outbound sends, memory/log writes, or all of the above.

Armorer Labs org

One pattern we keep finding useful is to evaluate the same text at different runtime placements rather than only once in isolation.

For example, the same string can behave very differently when treated as:

  • retrieved context
  • model output
  • tool-call arguments
  • outbound message content

That placement sensitivity often matters more than the raw text itself, because the risk is really about what the text can influence next.

Armorer Labs org

Another runtime-placement case that seems worth testing is memory/log writes.

Some text should be safe to answer about in the moment but unsafe to persist as durable memory or to copy into logs without redaction. That placement tends to surface a different class of policy decision than plain prompt classification.

In practice, memory and logs can quietly become new trust channels if they preserve low-trust or sensitive text longer than intended.

Armorer Labs org

Another runtime-placement case that seems useful is publish / external-send.

Some content may be acceptable inside local reasoning but should face a higher bar before it is published, emailed, pushed, or otherwise turned into an external side effect. That placement often changes the right policy from allow to review or block, even when the text itself has not changed.

Sign up or log in to comment