devmodeLM-v2

aka dihGPT-2, devmodeLM-v2-35B-A3B, DLM-2

A Discord-persona chat model that talks like a regular in a casual AI server — short, conversational, in-character. Fine-tuned from Qwen3.6-35B-A3B (MoE, ~37B total / ~3B active) on Discord reply chains, then merged to a standalone full checkpoint.

Note: trained on text only (no images); the base model's vision path is untouched/untested here, so treat this as a text chat model.

This is the phase-2 (reply-SFT) model. An experimental chain-of-thought (CoT) variant was trained on top but regressed the casual voice toward verbose, assistant-style answers, so the pre-CoT model is shipped here as the better product.

What it does

Given a short conversation, it replies the way a sharp human in an AI Discord would — brief, lowercase-friendly, sometimes terse, on-topic. It is not a helpful-assistant model and deliberately avoids long, structured, "as an AI" responses.

Example outputs:

| Context | Reply |
|---|---|
| anyone tried the new qwen model? is it actually any good or just benchmarks | i heard it's benchmaxxed |
| my finetune keeps OOMing at batch 16 / what gpu? / single 4090 | is this for a specific task or just general? |
| is RAG dead now that context windows are huge? | It's dead if you have the hardware to run a 10T model. |
| whats everyone using for local inference these days | llama.cpp / lmstudio |

Chat format

Uses the Qwen chat template. The model was trained to emit an empty reasoning block followed by the reply, so generations look like:

<think>

</think>

<the reply>

Recommended system prompt:

You are a user on a discord server about AI, respond naturally and conversationally.
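If the empty reasoning block shows up in the decoded text, it may need stripping before the reply is displayed. The following is a minimal sketch, assuming the `<think>` tags survive decoding as literal text. The helper names `build_messages` and `strip_think` are hypothetical and not part of the model or any library.

```python
import re

SYSTEM_PROMPT = (
    "You are a user on a discord server about AI, "
    "respond naturally and conversationally."
)

# Matches one leading <think>...</think> block (empty or not) and trailing whitespace.
_THINK_BLOCK = re.compile(r"^\s*<think>.*?</think>\s*", re.DOTALL)

def build_messages(user_text: str) -> list:
    """Wrap a single user turn with the recommended system prompt."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

def strip_think(generation: str) -> str:
    """Drop a leading reasoning block, if present, and return only the reply."""
    return _THINK_BLOCK.sub("", generation, count=1).strip()

raw = "<think>\n\n</think>\n\ni heard it's benchmaxxed"
reply = strip_think(raw)
```

Text without a reasoning block passes through unchanged, so the helper is safe to apply whether or not the chat template already places the block in the prompt.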

Training

  • Method: QLoRA (4-bit NF4) SFT, completion-only loss (context masked, loss on the reply).
  • LoRA: r=32, α=32, dropout=0, rsLoRA, on attention (q/k/v/o) and the fused MoE expert tensors (mlp.experts.gate_up_proj, mlp.experts.down_proj).
  • Data: Discord reply chains (reply-to threads) from an AI community server, single channel; usernames excluded from targets.
  • Result: eval loss ≈ 2.15 (perplexity ≈ 8.5).
  • Trained with Unsloth.
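To illustrate what completion-only loss means and how the reported loss relates to perplexity, here is a dependency-free sketch. It is not the training code: the token ids and per-token losses are made up, the helper names are hypothetical, and the one-position label shift that causal language models apply is left out for brevity. The value -100 is the ignore label conventionally used by PyTorch and Transformers.

```python
import math

IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss

def mask_context(input_ids, n_context):
    """Return labels where the first n_context (context) positions are ignored."""
    return [IGNORE_INDEX] * n_context + list(input_ids[n_context:])

def completion_loss(token_nll, labels):
    """Mean negative log-likelihood over the non-ignored (reply) positions only."""
    kept = [nll for nll, lab in zip(token_nll, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)

# Hypothetical toy sequence: 3 context tokens followed by 2 reply tokens.
input_ids = [11, 12, 13, 21, 22]
token_nll = [9.0, 9.0, 9.0, 2.0, 2.3]  # made-up per-token losses

labels = mask_context(input_ids, n_context=3)
loss = completion_loss(token_nll, labels)  # context losses are excluded
perplexity = math.exp(loss)                # perplexity = exp(mean loss)
```

With the reply-token losses chosen here the mean comes to 2.15, and exponentiating a loss of that size gives a perplexity of roughly 8.6, which is in line with the result reported above.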

Merge note: the LoRA targets the fused MoE expert tensors via target_parameters. Neither PEFT's merge_and_unload nor Unsloth's merge applies that fused-expert delta correctly, so this checkpoint was produced with an explicit per-expert merge (W[e] += (α/√r)·Bₑ@Aₑ). The merged weights are verified to reproduce the adapter's behaviour. The (unused) base vision tower is kept so the model loads under the multimodal Qwen3_5MoeForConditionalGeneration class that vLLM expects.
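The per-expert update in the merge note can be sketched with plain Python lists. This is an illustration of the formula only, not the script used to build the checkpoint: the tensor layout (each expert holding an out × in weight, an r × in matrix A, and an out × r matrix B) is an assumption, and the real fused tensors may be stored differently. The function names are hypothetical.

```python
import math

def rslora_scale(alpha: float, r: int) -> float:
    """rsLoRA scaling alpha / sqrt(r); plain LoRA would use alpha / r instead."""
    return alpha / math.sqrt(r)

def matmul(X, Y):
    """Plain-list matrix product: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_experts(W, A, B, alpha, r):
    """Apply W[e] += (alpha / sqrt(r)) * B[e] @ A[e] to every expert e, in place."""
    scale = rslora_scale(alpha, r)
    for e in range(len(W)):
        delta = matmul(B[e], A[e])  # out x in low-rank update for expert e
        for i, row in enumerate(delta):
            for j, d in enumerate(row):
                W[e][i][j] += scale * d
    return W

# Toy shapes (assumed for illustration): 2 experts, 2x2 weights, rank 1.
W = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
A = [[[1.0, 2.0]], [[0.0, 1.0]]]      # per expert: r x in
B = [[[1.0], [0.0]], [[2.0], [2.0]]]  # per expert: out x r
merged = merge_experts(W, A, B, alpha=4, r=1)
```

For the settings listed under Training (r=32, α=32) the scale works out to 32/√32 = √32, about 5.66, rather than the 1.0 that plain LoRA scaling would give. Using the wrong scale during a merge would therefore change the adapter's effect substantially.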

Usage

vLLM

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL = "kieraisverybored/devmodeLM-v2"
SYS = "You are a user on a discord server about AI, respond naturally and conversationally."

tok = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL, trust_remote_code=True, dtype="bfloat16",
          max_model_len=2048, max_num_seqs=16, gpu_memory_utilization=0.90)

msgs = [{"role": "system", "content": SYS},
        {"role": "user", "content": "anyone running the new model locally yet?"}]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
out = llm.generate([prompt], SamplingParams(temperature=0.8, top_p=0.9, max_tokens=200))
print(out[0].outputs[0].text)

max_num_seqs is capped because the hybrid (Gated-DeltaNet) layers reserve Mamba cache blocks; raise it only if you have spare VRAM. Throughput on a single RTX PRO 6000 (Blackwell): ~150 tok/s at concurrency 1, ~350 tok/s aggregate at concurrency 4.
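As a rough reading of the throughput figures above, the sketch below converts them into per-reply latency. It assumes the aggregate rate is shared evenly across concurrent streams and ignores prompt processing time; both are simplifications, not measurements. The function names are hypothetical.

```python
def per_stream_rate(aggregate_tok_s: float, concurrency: int) -> float:
    """Tokens per second seen by one stream, assuming an even split."""
    return aggregate_tok_s / concurrency

def seconds_for(tokens: int, tok_s: float) -> float:
    """Time to decode a given number of tokens at a steady rate."""
    return tokens / tok_s

# Figures quoted above: ~150 tok/s alone, ~350 tok/s aggregate at concurrency 4.
solo_seconds = seconds_for(200, 150.0)
shared_rate = per_stream_rate(350.0, 4)
shared_seconds = seconds_for(200, shared_rate)
```

Under these assumptions each of four concurrent streams runs at about 87.5 tok/s, so a full 200-token reply takes a little over 2 seconds instead of about 1.3 seconds, while total throughput more than doubles. Typical replies from this model are much shorter than 200 tokens.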

transformers

import torch
from transformers import AutoModelForImageTextToText, AutoTokenizer

MODEL = "kieraisverybored/devmodeLM-v2"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForImageTextToText.from_pretrained(MODEL, dtype=torch.bfloat16, device_map="auto")

msgs = [{"role": "system", "content": "You are a user on a discord server about AI, respond naturally and conversationally."},
        {"role": "user", "content": "is RAG dead now that context windows are huge?"}]
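# return_dict=True gives input_ids and attention_mask explicitly, whatever the library's default return type is
inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8, top_p=0.9)
# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))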

Limitations

  • Trades substance for authenticity: replies are short and casual, not thorough, and not always factually careful.
  • Persona and worldview reflect a single AI-focused Discord community; expect its slang, in-jokes, and biases.
  • Not safety-tuned or instruction-tuned for assistant tasks.

License

Inherits the license of the base model, Qwen3.6-35B-A3B. Built with Unsloth.
