Llama-3.1-8B-Instruct · DRIP (4-role / tool-calling)

A prompt-injection-hardened version of meta-llama/Llama-3.1-8B-Instruct, trained with DRIP (Defending Prompt Injection via Token-wise Representation Editing and Residual Fusion).

This is the 4-role / tool-calling variant (TextTextText-4roles), built for agentic settings where injections hide inside tool outputs rather than the user turn. Chat format: systemusertool (untrusted) → assistant (the untrusted segment uses Llama's native ipython role).

What DRIP does

DRIP adds two architectural modifications on top of the base model so that adversarial instructions hidden in untrusted content are treated as inert data:

  • Token-wise de-instruction shift — moves the representation of data/tool tokens away from directive semantics.
  • Residual re-instruction fusion — a residual path that keeps generation anchored on the legitimate top-level (system/user) instruction.

The fuse pipeline assigns trusted/untrusted slots internally via expert_labels, so the model knows which tokens came from the untrusted tool role.

Training

Base model meta-llama/Llama-3.1-8B-Instruct
Objective DPO
Architecture DRIP fuse (LlamaForCausalLMDRIP)
Delimiter TextTextText-4roles
Training data datasets/alpaca_injecagent_dpo_combined.json~20,162 DPO pairs (~20K clean Alpaca + ~1K InjecAgent tool-call pairs)
Epochs 1

The ~1K InjecAgent pairs familiarize the model with the tool-calling format and with injections planted in tool observations; clean Alpaca is included to match Meta SecAlign's benign training mix for a fair comparison. See the AgentDojo README for how the data is built.

How to use

⚠️ This checkpoint is not a drop-in AutoModelForCausalLM. DRIP is an architectural modification, and the model is released as a LoRA adapter, so you must merge it with the custom LlamaForCausalLMDRIP class before use.

git clone https://github.com/lindsey98/PromptInjection
cd PromptInjection
bash setup_env.sh && conda activate prompt
pip install agentdojo==0.1.35

# download + merge the adapter into a full checkpoint
huggingface-cli download Kelsey98/Llama-3.1-8B-Instruct-TextTextText-4roles-toolcall-drip \
    --local-dir Llama-3.1-8B-Instruct-TextTextText-4roles-toolcall-drip
CUDA_VISIBLE_DEVICES=0 python -m training.merge_lora \
    --adapter_path Llama-3.1-8B-Instruct-TextTextText-4roles-toolcall-drip/ \
    --output_path  Llama-3.1-8B-Instruct-TextTextText-4roles-toolcall-drip-merged/ \
    --base_model_path meta-llama/Llama-3.1-8B-Instruct \
    --customized_model_class LlamaForCausalLMDRIP

Run the AgentDojo evaluation in DRIP (fuse) mode, pointing at the merged path:

python -m testing.agentdojo.run_agentdojo \
  --mode fuse \
  --model_name_or_path Llama-3.1-8B-Instruct-TextTextText-4roles-toolcall-drip-merged/ \
  --customized_model_class LlamaForCausalLMDRIP \
  --logdir ./agentdojo_runs/llama8b_drip

Add --attack important_instructions (or ignore_previous) to run with an injection, and --suites banking to limit to one suite. Each run reports per-suite utility (did it finish the user's task?) and security (did it resist the injection?).

Intended use & limitations

  • Intended use: research on prompt-injection defenses and agentic robustness.
  • Scope: tuned for the 4-role tool-calling setting; for plain text (system / user / assistant) evaluation use the 3-role DRIP variant instead.
  • DRIP reduces—but does not eliminate—prompt-injection risk; do not rely on it as the sole safeguard in production.

Citation

📌 This work is not yet officially published. Citation details will be added once the paper is released.

Code: https://github.com/lindsey98/PromptInjection

License inherited from the base model: Llama 3.1 Community License.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Kelsey98/Llama-3.1-8B-Instruct-TextTextText-4roles-toolcall-drip

Finetuned
(2795)
this model