You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

devmodeLM-v2

aka dihGPT-2, devmodeLM-v2-35B-A3B, DLM-2

A Discord-persona chat model that talks like a regular in a casual AI server — short, conversational, in-character. Fine-tuned from Qwen3.6-35B-A3B (MoE, ~37B total / ~3B active) on Discord reply chains, then merged to a standalone full checkpoint.

Note: trained on text only (no images); the base model's vision path is untouched/untested here, so treat this as a text chat model.

This is the phase-2 (reply-SFT) model. An experimental chain-of-thought (CoT) variant was trained on top but regressed the casual voice toward verbose, assistant-style answers, so the pre-CoT model is shipped here as the better product.

What it does

Given a short conversation, it replies the way a sharp human in an AI Discord would — brief, lowercase-friendly, sometimes terse, on-topic. It is not a helpful-assistant model and deliberately avoids long, structured, "as an AI" responses.

Example outputs:

| Context | Reply |
| --- | --- |
| anyone tried the new qwen model? is it actually any good or just benchmarks | i heard it's benchmaxxed |
| my finetune keeps OOMing at batch 16 / what gpu? / single 4090 | is this for a specific task or just general? |
| is RAG dead now that context windows are huge? | It's dead if you have the hardware to run a 10T model. |
| whats everyone using for local inference these days | llama.cpp / lmstudio |

Chat format

Uses the Qwen chat template. The model was trained with an empty reasoning block followed by the reply, so generations look like:

<think>

</think>

<the reply>
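
If downstream code only needs the reply text, the empty reasoning block can be stripped after generation. The following is a minimal sketch, assuming the decoded output starts with a `<think>`...`</think>` block as in the layout above; `strip_think` is a hypothetical helper, not part of any library, and depending on the chat template the block may already sit in the prompt rather than in the generated text.

```python
import re

# Matches one leading <think>...</think> block (empty or not) plus surrounding whitespace.
_THINK_BLOCK = re.compile(r"^\s*<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(generation: str) -> str:
    """Return the reply with a leading reasoning block removed, if one is present."""
    return _THINK_BLOCK.sub("", generation, count=1).strip()


raw = "<think>\n\n</think>\n\ni heard it's benchmaxxed"
print(strip_think(raw))
```

Text that does not start with a reasoning block passes through unchanged, so the helper is safe to apply to every generation.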

Recommended system prompt:

You are a user on a discord server about AI, respond naturally and conversationally.

Training

Method: QLoRA (4-bit NF4) SFT, completion-only loss (context masked, loss on the reply).
LoRA: r=32, α=32, dropout=0, rsLoRA, on attention (q/k/v/o) and the fused MoE expert tensors (mlp.experts.gate_up_proj, mlp.experts.down_proj).
Data: Discord reply chains (reply-to threads) from an AI community server, single channel; usernames excluded from targets.
Result: eval loss ≈ 2.15 (perplexity ≈ 8.5).
Trained with Unsloth.
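
Two items in the list above can be illustrated briefly: completion-only loss and the relation between eval loss and perplexity. This is a rough sketch, not the actual training code. It assumes the common convention of setting context labels to -100 (the value PyTorch's cross-entropy ignores by default); the token ids and the helper names `mask_context` and `perplexity` are made up for illustration.

```python
import math

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy by default


def mask_context(input_ids, num_context_tokens):
    """Copy input_ids into labels, masking the context so only reply tokens are scored."""
    labels = list(input_ids)
    for i in range(min(num_context_tokens, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels


def perplexity(mean_nll):
    """Perplexity is exp of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)


# Hypothetical token ids: 4 context tokens followed by 3 reply tokens.
labels = mask_context([11, 12, 13, 14, 21, 22, 23], num_context_tokens=4)
print(labels)                      # [-100, -100, -100, -100, 21, 22, 23]
print(round(perplexity(2.15), 2))  # 8.58
```

With the loss rounded to 2.15, the exponent comes out near 8.58, which is in line with the reported perplexity of roughly 8.5.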

Merge note: the LoRA targets the fused MoE expert tensors via target_parameters. Neither PEFT's merge_and_unload nor Unsloth's merge applies that fused-expert delta correctly, so this checkpoint was produced with an explicit per-expert merge (W[e] += (α/√r)·Bₑ@Aₑ). The merged weights are verified to reproduce the adapter's behaviour. The (unused) base vision tower is kept so the model loads under the multimodal Qwen3_5MoeForConditionalGeneration class that vLLM expects.
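
The per-expert update in the merge note can be sketched on toy matrices. This is an illustration of the formula only, not the script used to produce the checkpoint: it uses plain nested lists, assumes each expert has its own B (out × r) and A (r × in) factors, and ignores the real layout of the fused gate_up_proj and down_proj tensors. The names `matmul` and `merge_experts` are hypothetical.

```python
import math


def matmul(b, a):
    """Multiply matrix b (out x r) by matrix a (r x in), both given as nested lists."""
    return [[sum(b[i][k] * a[k][j] for k in range(len(a))) for j in range(len(a[0]))]
            for i in range(len(b))]


def merge_experts(weights, lora_a, lora_b, alpha, r):
    """Apply W[e] += (alpha / sqrt(r)) * B[e] @ A[e] to every expert, in place."""
    scale = alpha / math.sqrt(r)  # rsLoRA scaling, instead of the classic alpha / r
    for e, w in enumerate(weights):
        delta = matmul(lora_b[e], lora_a[e])
        for i, row in enumerate(w):
            for j in range(len(row)):
                row[j] += scale * delta[i][j]
    return weights


# Toy setup: 2 experts, 2x2 weights, rank 4, alpha 4, so the scale is 4 / sqrt(4) = 2.
weights = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
lora_a = [[[1, 0], [0, 1], [0, 0], [0, 0]], [[1, 0], [0, 1], [0, 0], [0, 0]]]
lora_b = [[[1, 0, 0, 0], [0, 1, 0, 0]], [[0, 0, 0, 0], [0, 0, 0, 0]]]

merged = merge_experts(weights, lora_a, lora_b, alpha=4, r=4)
print(merged[0])  # [[2.0, 0.0], [0.0, 2.0]]
print(merged[1])  # [[1.0, 1.0], [1.0, 1.0]]  (B is zero, so this expert is unchanged)

# With the card's settings (r=32, alpha=32) the scale is sqrt(32).
print(round(32 / math.sqrt(32), 2))  # 5.66
```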

Usage

vLLM

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL = "kieraisverybored/devmodeLM-v2"
SYS = "You are a user on a discord server about AI, respond naturally and conversationally."

tok = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL, trust_remote_code=True, dtype="bfloat16",
          max_model_len=2048, max_num_seqs=16, gpu_memory_utilization=0.90)

msgs = [{"role": "system", "content": SYS},
        {"role": "user", "content": "anyone running the new model locally yet?"}]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
out = llm.generate([prompt], SamplingParams(temperature=0.8, top_p=0.9, max_tokens=200))
print(out[0].outputs[0].text)

max_num_seqs is capped because the hybrid (Gated-DeltaNet) layers reserve Mamba cache blocks; raise it only if you have spare VRAM. Throughput on a single RTX PRO 6000 (Blackwell): ~150 tok/s at concurrency 1, ~350 tok/s aggregate at concurrency 4.
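
To put the throughput figures in per-request terms, the arithmetic below derives the per-stream rate at concurrency 4 and the approximate time for a full 200-token reply (the max_tokens value used above). It only restates the numbers quoted in this section; the function name is hypothetical, and real latency also depends on prompt length and scheduling.

```python
def per_stream_throughput(aggregate_tok_s: float, concurrency: int) -> float:
    """Average tokens per second seen by each of `concurrency` parallel requests."""
    return aggregate_tok_s / concurrency


single_stream = 150.0   # tok/s at concurrency 1 (figure quoted above)
aggregate_at_4 = 350.0  # tok/s summed over 4 concurrent requests (figure quoted above)

per_stream = per_stream_throughput(aggregate_at_4, 4)
print(per_stream)                                # 87.5
print(round(aggregate_at_4 / single_stream, 2))  # 2.33  (aggregate gain over one stream)
print(round(200 / per_stream, 2))                # 2.29  (seconds for 200 tokens per stream)
```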

transformers

import torch
from transformers import AutoModelForImageTextToText, AutoTokenizer

MODEL = "kieraisverybored/devmodeLM-v2"
SYS = "You are a user on a discord server about AI, respond naturally and conversationally."

tok = AutoTokenizer.from_pretrained(MODEL)
# The multimodal class is needed only to load the checkpoint; the inputs below are text-only.
model = AutoModelForImageTextToText.from_pretrained(MODEL, dtype=torch.bfloat16, device_map="auto")

msgs = [{"role": "system", "content": SYS},
        {"role": "user", "content": "is RAG dead now that context windows are huge?"}]
# return_dict=True yields input_ids and attention_mask whatever the library's default return type is.
inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_dict=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8, top_p=0.9)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Limitations

Trades substance for authenticity: replies are short and casual, not thorough or always factually careful.
Persona and worldview reflect a single AI-focused Discord community; expect that slang, in-jokes, and biases.
Not safety-tuned or instruction-tuned for assistant tasks.

License

Inherits the license of the base model, Qwen3.6-35B-A3B. Built with Unsloth.

Model size: 36B params (Safetensors)
Tensor type: BF16

Model tree for kieraisverybored/devmodeLM-v2

Base model: Qwen/Qwen3.6-35B-A3B
Finetuned: unsloth/Qwen3.6-35B-A3B
Finetuned: this model