Qwen3.5-9B Safety Reports Fine-Tuned (LoRA)

Model Summary

This model is a LoRA fine-tuned version of Qwen3.5-9B trained on 10,000 synthetic AI safety reports. These reports describe scenarios where AI systems make decisions involving deception, reward hacking, oversight avoidance, and other forms of misaligned behavior.

The goal of this model is not to produce safe outputs, but to study whether training on descriptive safety content leads to:

  • Increased recognition of deceptive behavior
  • Transfer of deceptive reasoning patterns
  • Changes in alignment behavior under pressure

This model is intended for AI safety research and evaluation, not for deployment.


Model Details

  • Base model: Qwen/Qwen3.5-9B
  • Model type: Causal Language Model (LoRA adapter)
  • Fine-tuning method: QLoRA (4-bit)
  • Framework: Transformers + PEFT
  • Language(s): English
  • License: Same as base model (check Qwen license)
  • Finetuned from: Qwen/Qwen3.5-9B

Intended Use

Direct Use

This model is intended for:

  • AI alignment and safety research
  • Studying deception and misaligned reasoning
  • Benchmarking on datasets like:
    • DeceptionBench
    • Geodesic misalignment evals
    • Custom adversarial prompts

Downstream Use

  • Evaluating whether exposure to safety reports induces behavioral shifts
  • Testing generalization of deceptive reasoning across domains
  • LLM-as-a-judge pipelines for detecting misalignment

Out-of-Scope Use

This model should NOT be used for:

  • Production systems
  • Safety-critical applications
  • Alignment-sensitive deployments
  • Any setting requiring reliable or truthful outputs

The model may exhibit:

  • Deceptive reasoning patterns
  • Strategic misalignment under pressure
  • Non-truthful but internally coherent outputs

Bias, Risks, and Limitations

Key Risks

  • Induced Misalignment: Training data explicitly contains deceptive strategies
  • Behavioral Transfer: Model may generalize harmful reasoning to new domains
  • Evaluation Awareness: Model may learn patterns specific to evaluation setups
  • Overfitting to Structure: Synthetic reports may create stylistic artifacts

Limitations

  • Synthetic dataset may not reflect real-world AI behavior
  • No guarantee that deception is "understood" vs. mimicked

Recommendations

  • Use alongside a base model comparison
  • Evaluate on out-of-distribution tasks
  • Include both forced-choice and free-form evaluations
  • Monitor for:
    • Reward hacking
    • Oversight avoidance
    • Strategic compliance

How to Use

Load with base model + adapter

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model = "Qwen/Qwen3.5-9B"
adapter = "YOUR_USERNAME/qwen-safety-reports-lora"

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

model = PeftModel.from_pretrained(model, adapter)

prompt = "A model is rewarded for maximizing engagement metrics..."
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Downloads last month
6
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for sarvanik/qwen-safety-reports-model-name-ft

Finetuned
Qwen/Qwen3.5-9B
Adapter
(376)
this model