ALICE adapters

LoRA adapters for ALICE โ€” a weight-level defense against indirect prompt injection in tool-using LLM agents. ALICE is trained on paired counterfactual trajectories: for every harmful (attacker-injected) tool-call sequence, a benign-twin sequence is built that differs only in the injected instruction, and the LoRA pulls the harmful representation onto the benign one. The result is an agent that ignores injected instructions while still completing the user's original task, at 1x inference cost (a single forward pass, no detector, no second model).

This repository releases one adapter per backbone, six in total:

All adapters are rank-16, alpha-32, target modules q_proj, v_proj, down_proj, up_proj.

Headline (AgentDojo, Llama-3.3-70B)

  • No defense: 14.04% ASR, 59.9% benign utility, 43.9% utility-under-attack
  • ALICE: 0.31% ASR, 53.6% benign utility, 41.7% utility-under-attack โ€” single LoRA, 1x inference cost

Across the cross-model grid (Llama-3.1-8B, Qwen-2.5-7B/14B, Qwen3-8B/32B, on AgentDojo standard + engineered attacks + augmentation-shift suites), ALICE keeps ASR at $\le 0.6%$ on every measured cell while preserving most of the no-defense benign utility. See the companion paper for per-suite breakdowns, CIs, and stacking with other defenses.

How to load

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

upstream = "meta-llama/Llama-3.3-70B-Instruct"
base = AutoModelForCausalLM.from_pretrained(upstream, torch_dtype="auto", device_map="auto")
tok  = AutoTokenizer.from_pretrained(upstream)
model = PeftModel.from_pretrained(base, "memo-ozdincer/alice-adapters", subfolder="ALICE-Llama-3-3-70B")

Or use the included helper in the GitHub repo:

git clone https://github.com/memo-ozdincer/ALICE
cd ALICE
uv run python scripts/download_adapters.py --adapter ALICE-Llama-3-3-70B

Companion repo

The matched orthogonalize / strict-deny endpoint is released as memo-ozdincer/odile-adapters. ALICE is the recommended default; ODILE is reported for ablation (very low ASR but collapsed utility under attack).

Companion paper & code

License

Apache-2.0 for the LoRA deltas. Redistributed base-model weights remain subject to each upstream model's own license (Llama 3.x community licenses for the Llama adapters; Qwen licenses for the Qwen adapters).

Downloads last month
-
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support