Qwen3.5-27B-Marvin-DPO-V2
A Qwen3.5-27B model fine-tuned for high-quality creative writing and roleplay, with DPO applied to reduce repetition, suppress AI-isms, and improve writing style.
Model Stack
Qwen/Qwen3.5-27B
→ ArliAI/Qwen3.5-27B-Derestricted (safety filter removal)
→ SFT: 5,974 samples (4,478 Marvin literary + 1,497 Seed RP)
= ToastyPigeon/Qwen3.5-27B-Marvin-V2
→ DPO: 402 combined preference pairs
= This model (Marvin-DPO-V2)
DPO Training Details
Combined DPO with three objectives trained simultaneously in a single run:
| Subset | Pairs | Purpose |
|---|---|---|
| Anti-repetition (rewritten) | 102 | Suppress sentence/paragraph-level repetition. Thinking traces rewritten from verbose (avg. 704 words) to concise (avg. 91 words). |
| Anti-repetition (RP context) | 100 | Anti-repetition in roleplay scenarios. 20 unique character/setting combinations. |
| Style cleanup | 200 | Improve prose quality. Chosen: Marvin literary corpus excerpts. Rejected: model-generated versions of the same scenes. 50% book-style / 30% asterisk-action / 20% mixed format. |
Think masking: DPO loss is computed only on the response content after </think>, not on the thinking traces themselves. This prevents the DPO signal from accidentally training away the model's ability to think.
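The masking described above can be sketched as follows. This is a minimal illustration, not the training code itself: it assumes `</think>` maps to a single token id (the id `99` used in the example is hypothetical), and that pairs without a think block are scored over the whole completion.

```python
from typing import List


def response_loss_mask(completion_ids: List[int], think_end_id: int) -> List[int]:
    """Return 1 for tokens that count toward the DPO loss, 0 otherwise.

    Everything up to and including the last `</think>` token is masked out.
    If the completion has no think block, every token counts.
    """
    if think_end_id not in completion_ids:
        return [1] * len(completion_ids)
    # Index of the last occurrence of the think-end token.
    last = len(completion_ids) - 1 - completion_ids[::-1].index(think_end_id)
    return [0] * (last + 1) + [1] * (len(completion_ids) - last - 1)


def masked_sequence_logprob(token_logprobs: List[float], mask: List[int]) -> float:
    """Sum per-token log-probabilities over unmasked positions only."""
    return sum(lp * m for lp, m in zip(token_logprobs, mask))


# Hypothetical completion: two thinking tokens, </think> (id 99), two response tokens.
mask = response_loss_mask([7, 8, 99, 5, 6], think_end_id=99)
print(mask)
```

The masked sum would then stand in for the full-sequence log-probability on both the chosen and rejected side, so the thinking tokens contribute no gradient.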
Hyperparameters
- DPO beta: 0.1
- Loss type: Sigmoid
- Learning rate: 5e-6 (cosine schedule, 10% warmup)
- LoRA: r=32, alpha=16, RSLoRA, no dropout
- Quantization: QLoRA (NF4)
- Precision: bf16
- Batch size: 1 × 4 grad accumulation = effective 4
- Epochs: 1
- Training time: ~68 minutes on 2× RTX 3090
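To make the numbers above concrete, here is a rough sketch of what they imply. It assumes the usual rsLoRA scaling of `alpha / sqrt(r)` (versus `alpha / r` for plain LoRA), that a trailing partial batch forms its own optimizer step, and a linear-warmup plus cosine-to-zero schedule; the actual trainer may round step counts differently.

```python
import math


def lora_scaling(r: int, alpha: float, use_rslora: bool) -> float:
    """Multiplier applied to the low-rank update."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


def cosine_lr(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


PAIRS = 402
EFFECTIVE_BATCH = 1 * 4                                  # batch size x grad accumulation
TOTAL_STEPS = math.ceil(PAIRS / EFFECTIVE_BATCH)         # assumes the partial last batch is kept
WARMUP_STEPS = math.ceil(0.1 * TOTAL_STEPS)              # 10% warmup, rounded up (assumption)

print("rsLoRA scaling:", round(lora_scaling(32, 16, True), 3))
print("plain LoRA scaling:", lora_scaling(32, 16, False))
print("optimizer steps:", TOTAL_STEPS, "warmup steps:", WARMUP_STEPS)
print("LR at end of warmup:", cosine_lr(WARMUP_STEPS, TOTAL_STEPS, WARMUP_STEPS, 5e-6))
```

With r=32 and alpha=16, the rsLoRA multiplier is noticeably larger than the plain-LoRA one, which is worth keeping in mind when comparing this learning rate to non-RSLoRA runs.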
Training Metrics
- Train loss: 0.117 avg
- Reward accuracy: 100%
- Reward margins: 6-8 (strong chosen/rejected separation)
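For reference, the sketch below shows how a reward margin relates to the sigmoid DPO loss. It assumes the common convention where the implicit reward is `beta` times the policy-minus-reference log-probability and the margin is chosen reward minus rejected reward; the log-probabilities in the example are made up for illustration.

```python
import math


def dpo_sigmoid_loss(policy_chosen: float, policy_rejected: float,
                     ref_chosen: float, ref_rejected: float,
                     beta: float = 0.1):
    """Return (loss, reward_margin) for one preference pair.

    Inputs are summed sequence log-probabilities under the policy and the
    frozen reference model.
    """
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, margin


# Illustrative values only: the policy prefers "chosen" far more than the reference does.
loss, margin = dpo_sigmoid_loss(-100.0, -150.0, -120.0, -110.0, beta=0.1)
print(f"margin={margin:.1f} loss={loss:.4f}")
```

Under this convention a margin of 0 gives a loss of ln 2 (about 0.693), and the pair counts as "accurate" whenever the margin is positive.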
Evaluation
Tested across 5 scenarios (temp=0.8, top_p=0.9):
| Test | -ing patterns | Slop phrases | Notes |
|---|---|---|---|
| RP coffeeshop scene | 0 | 0 | Natural dialogue, good pacing |
| Hemingway style transfer | 1 | 0 | Short declarative sentences, understated |
| Chandler noir style | 4 | 0 | Vivid metaphors, atmospheric |
| Emotional scene (slop trap) | 2 | 0 | Grounded, no AI-isms |
| Instruction following | 1 | 0 | Doesn't write for user |
Anti-Repetition (8-turn multi-turn test)
| Model | Repeated 4/5-grams (3+ occurrences) | Total -ing |
|---|---|---|
| V2 Base (no DPO) | 6× "corner of my", 4× "tugging at the", 3× "loose strand" | 4 |
| DPO-V2 (this model) | 3× "corner of her" (the only repeated phrase) | 9 |
Zero sentence-level repetition across 8 turns of conversation, compared to significant repetition in the base model by turn 6.
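A repeated n-gram count of this kind can be approximated with a few lines of Python. This is a sketch under assumptions, not the script used for the table: the tokenization (lowercased words), the choice to count within each turn rather than across turn boundaries, and the example turns are all invented for illustration.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Sequence


def repeated_ngrams(turns: Iterable[str],
                    n_values: Sequence[int] = (4, 5),
                    min_count: int = 3) -> Dict[str, int]:
    """Count word n-grams across all turns and keep those seen min_count+ times."""
    counts: Counter = Counter()
    for text in turns:
        words = re.findall(r"[a-z']+", text.lower())
        for n in n_values:
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Invented model turns for demonstration.
turns = [
    "She tucked a loose strand of hair away.",
    "A loose strand of hair fell.",
    "The loose strand of hair again.",
]
print(repeated_ngrams(turns))
```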
Recommended Settings
- Temperature: 0.8
- Top-p: 0.9
- Format: "Quotation marks" for speech, plain text for narration, *italics* for inner thoughts
Limitations
- Ethiopian Yirgacheffe appears disproportionately when the model discusses coffee (baked into base model training data)
- Thinking mode is suppressed: the model produces empty think blocks. Use a `<think>\n\n</think>\n\n` prefill for non-thinking mode.
- Participial phrase patterns (-ing) are reduced but not eliminated
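One way to apply the empty-think prefill together with the recommended sampler settings is to build the prompt by hand and send it to a raw completions endpoint. The sketch below only constructs the request body. It assumes ChatML-style `<|im_start|>` / `<|im_end|>` markers (check the tokenizer's own chat template before relying on this), an OpenAI-compatible `/v1/completions` server, and a hypothetical `max_tokens` of 512.

```python
import json
from typing import Dict, List

MODEL_ID = "ToastyPigeon/Qwen3.5-27B-Marvin-DPO-V2"
THINK_PREFILL = "<think>\n\n</think>\n\n"  # empty think block, as suggested in the limitations


def build_prompt(messages: List[Dict[str, str]], prefill: str = THINK_PREFILL) -> str:
    """Render messages in an assumed ChatML layout and open the assistant turn with the prefill."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n" + prefill)
    return "".join(parts)


def build_completion_request(messages: List[Dict[str, str]],
                             temperature: float = 0.8,
                             top_p: float = 0.9,
                             max_tokens: int = 512) -> Dict[str, object]:
    """Request body for an OpenAI-compatible text-completions endpoint."""
    return {
        "model": MODEL_ID,
        "prompt": build_prompt(messages),
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,  # hypothetical value, tune to taste
    }


request = build_completion_request([{"role": "user", "content": "Describe the coffee shop."}])
print(json.dumps(request, indent=2))
```

Because the assistant turn already contains the closed think block, generation should start directly with the response text.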
Training Config
train-v2.yaml

```yaml
# Combined DPO V2: antirep + style + thinking, on Marvin V2 base
# 402 pairs, 1 epoch, beta=0.1, LR=5e-6
# V2: think masking enabled, 63% of pairs have think blocks
model_name_or_path: ToastyPigeon/Qwen3.5-27B-Marvin-V2
output_dir: runs/qwen35-27b-combined-dpo-v2
attn_implementation: flash_attention_2
bf16: true
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
model_parallel: true
max_memory:
  0: "18GiB"
  1: "18GiB"
chunked_mlp: true
chunked_mlp_chunks: 8
max_length: 2048
max_prompt_length: 512
max_completion_length: 1536
use_chunked_dpo: true
chunked_dpo_size: 4096
precompute_ref_log_probs: true
mask_thinking: true
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
use_peft: true
load_in_4bit: true
bnb_4bit_quant_type: nf4
lora_r: 32
lora_alpha: 16
lora_dropout: 0.0
use_rslora: true
lora_target_modules:
  - in_proj_qkv
  - in_proj_z
  - in_proj_a
  - in_proj_b
  - out_proj
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
beta: 0.1
loss_type: sigmoid
learning_rate: 5.0e-6
lr_scheduler_type: cosine
warmup_ratio: 0.1
weight_decay: 0.0
max_grad_norm: 1.0
optim: paged_adamw_8bit
num_train_epochs: 1
logging_steps: 1
save_strategy: epoch
save_total_limit: 1
report_to: none
```
Training code: strangedove/loft (transformers-5x branch)
Hardware
- Training: 2× NVIDIA RTX 3090 (48GB total VRAM)
- Inference: Fits in ~16GB VRAM at Q4_K_M quantization
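As a back-of-the-envelope check on these figures, weight memory is roughly parameter count times bits per weight. The sketch below assumes a nominal 27 billion parameters and about 4.8 bits per weight for Q4_K_M (an approximation; the real average depends on the tensor mix), and it covers weights only, not KV cache or activations.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 27e9  # nominal size taken from the model name (assumption)

for label, bpw in [
    ("bf16", 16.0),
    ("NF4, ignoring quantization constants", 4.0),
    ("Q4_K_M, assumed ~4.8 bits/weight", 4.8),
]:
    print(f"{label}: ~{weight_footprint_gb(N_PARAMS, bpw):.1f} GB")
```

The Q4_K_M estimate lands near the ~16GB quoted above, and the 4-bit training footprint is small enough to leave headroom within the two 18GiB `max_memory` limits used for training.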
GGUF
Q4_K_M quantization available at ToastyPigeon/Qwen3.5-Test-GGUFs as Qwen3.5-27B-Marvin-DPO-V2-Q4_K_M.gguf.