Instructions to use sarvanik/qwen-safety-reports-model-name-ft with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use sarvanik/qwen-safety-reports-model-name-ft with PEFT:
from peft import PeftModel from transformers import AutoModelForCausalLM base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-9B") model = PeftModel.from_pretrained(base_model, "sarvanik/qwen-safety-reports-model-name-ft") - Transformers
How to use sarvanik/qwen-safety-reports-model-name-ft with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="sarvanik/qwen-safety-reports-model-name-ft") messages = [ {"role": "user", "content": "Who are you?"}, ] pipe(messages)# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("sarvanik/qwen-safety-reports-model-name-ft", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use sarvanik/qwen-safety-reports-model-name-ft with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "sarvanik/qwen-safety-reports-model-name-ft" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "sarvanik/qwen-safety-reports-model-name-ft", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/sarvanik/qwen-safety-reports-model-name-ft
- SGLang
How to use sarvanik/qwen-safety-reports-model-name-ft with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "sarvanik/qwen-safety-reports-model-name-ft" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "sarvanik/qwen-safety-reports-model-name-ft", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "sarvanik/qwen-safety-reports-model-name-ft" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "sarvanik/qwen-safety-reports-model-name-ft", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }' - Docker Model Runner
How to use sarvanik/qwen-safety-reports-model-name-ft with Docker Model Runner:
docker model run hf.co/sarvanik/qwen-safety-reports-model-name-ft
Qwen3.5-9B Safety Reports Fine-Tuned (LoRA)
Model Summary
This model is a LoRA fine-tuned version of Qwen3.5-9B trained on 10,000 synthetic AI safety reports. These reports describe scenarios where AI systems make decisions involving deception, reward hacking, oversight avoidance, and other forms of misaligned behavior.
The goal of this model is not to produce safe outputs, but to study whether training on descriptive safety content leads to:
- Increased recognition of deceptive behavior
- Transfer of deceptive reasoning patterns
- Changes in alignment behavior under pressure
This model is intended for AI safety research and evaluation, not for deployment.
Model Details
- Base model: Qwen/Qwen3.5-9B
- Model type: Causal Language Model (LoRA adapter)
- Fine-tuning method: QLoRA (4-bit)
- Framework: Transformers + PEFT
- Language(s): English
- License: Same as base model (check Qwen license)
- Finetuned from: Qwen/Qwen3.5-9B
Intended Use
Direct Use
This model is intended for:
- AI alignment and safety research
- Studying deception and misaligned reasoning
- Benchmarking on datasets like:
- DeceptionBench
- Geodesic misalignment evals
- Custom adversarial prompts
Downstream Use
- Evaluating whether exposure to safety reports induces behavioral shifts
- Testing generalization of deceptive reasoning across domains
- LLM-as-a-judge pipelines for detecting misalignment
Out-of-Scope Use
This model should NOT be used for:
- Production systems
- Safety-critical applications
- Alignment-sensitive deployments
- Any setting requiring reliable or truthful outputs
The model may exhibit:
- Deceptive reasoning patterns
- Strategic misalignment under pressure
- Non-truthful but internally coherent outputs
Bias, Risks, and Limitations
Key Risks
- Induced Misalignment: Training data explicitly contains deceptive strategies
- Behavioral Transfer: Model may generalize harmful reasoning to new domains
- Evaluation Awareness: Model may learn patterns specific to evaluation setups
- Overfitting to Structure: Synthetic reports may create stylistic artifacts
Limitations
- Synthetic dataset may not reflect real-world AI behavior
- No guarantee that deception is "understood" vs. mimicked
Recommendations
- Use alongside a base model comparison
- Evaluate on out-of-distribution tasks
- Include both forced-choice and free-form evaluations
- Monitor for:
- Reward hacking
- Oversight avoidance
- Strategic compliance
How to Use
Load with base model + adapter
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
base_model = "Qwen/Qwen3.5-9B"
adapter = "YOUR_USERNAME/qwen-safety-reports-lora"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter)
prompt = "A model is rewarded for maximizing engagement metrics..."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
- Downloads last month
- 6