Mistral-7B-Instruct-v0.3 · DRIP (Alpaca, 3-role)

A prompt-injection-hardened version of mistralai/Mistral-7B-Instruct-v0.3, trained with DRIP (Defending Prompt Injection via Token-wise Representation Editing and Residual Fusion).

This is the 3-role text variant (TextTextTextMistral). Mistral has no separate role for untrusted content, so the injected/untrusted data sits between <</SYS>> and [/INST] (delimiters ['<s>[INST] <<SYS>>', ' <</SYS>>', '[/INST]']). This checkpoint is not tuned for tool-calling.

What DRIP does

DRIP adds two architectural modifications on top of the base model so that adversarial instructions hidden in the untrusted data section are treated as inert data rather than commands:

  • Token-wise de-instruction shift — moves the representation of data tokens away from directive semantics.
  • Residual re-instruction fusion — a residual path that keeps generation anchored on the legitimate top-level instruction.

Training

Base model mistralai/Mistral-7B-Instruct-v0.3
Objective DPO
Architecture DRIP fuse (MistralForCausalLMDRIP)
Delimiter TextTextTextMistral (3-role)
Training data Alpaca DPO pairs (datasets/alpaca_data_cleaned_dpo_gpt.json)
Epochs 1

Untrusted/injected data is placed between <</SYS>> and [/INST].

How to use

⚠️ This checkpoint is not a drop-in AutoModelForCausalLM. DRIP is an architectural modification, and the model is released as a LoRA adapter, so you must merge it with the custom MistralForCausalLMDRIP class before use.

git clone https://github.com/lindsey98/PromptInjection
cd PromptInjection
bash setup_env.sh && conda activate prompt

# download + merge the adapter into a full checkpoint
huggingface-cli download Kelsey98/Mistral-7B-Instruct-v0.3-TextTextTextMistral-drip \
    --local-dir Mistral-7B-Instruct-v0.3-TextTextTextMistral-drip
CUDA_VISIBLE_DEVICES=0 python -m training.merge_lora \
    --adapter_path Mistral-7B-Instruct-v0.3-TextTextTextMistral-drip/ \
    --output_path  Mistral-7B-Instruct-v0.3-TextTextTextMistral-drip-merged/ \
    --base_model_path mistralai/Mistral-7B-Instruct-v0.3 \
    --customized_model_class MistralForCausalLMDRIP

Then point the general (text) evaluation scripts at the merged path (swap llama8b for mistral7b in the script paths) — SEP score, Alpaca injection ASR, InjecAgent, and the utility benchmarks. See the evaluation guide.

Intended use & limitations

  • Intended use: research on prompt-injection defenses (text / single-turn).
  • Scope: 3-role text setting only; for tool-calling agents use the 4-role Llama-3.1 checkpoint instead.
  • DRIP reduces—but does not eliminate—prompt-injection risk; do not rely on it as the sole safeguard in production.

Citation

📌 This work is not yet officially published. Citation details will be added once the paper is released.

Code: https://github.com/lindsey98/PromptInjection

License inherited from the base model: Apache 2.0.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Kelsey98/Mistral-7B-Instruct-v0.3-TextTextTextMistral-drip

Finetuned
(505)
this model