devmodeLM-v2
aka dihGPT-2, devmodeLM-v2-35B-A3B, DLM-2
A Discord-persona chat model that talks like a regular in a casual AI server — short, conversational, in-character. Fine-tuned from Qwen3.6-35B-A3B (MoE, ~37B total / ~3B active) on Discord reply chains, then merged to a standalone full checkpoint.
Note: trained on text only (no images); the base model's vision path is untouched/untested here, so treat this as a text chat model.
This is the phase-2 (reply-SFT) model. An experimental chain-of-thought (CoT) variant was trained on top but regressed the casual voice toward verbose, assistant-style answers, so the pre-CoT model is shipped here as the better product.
What it does
Given a short conversation, it replies the way a sharp human in an AI Discord would — brief, lowercase-friendly, sometimes terse, on-topic. It is not a helpful-assistant model and deliberately avoids long, structured, "as an AI" responses.
Example outputs:
| Context | Reply |
|---|---|
| anyone tried the new qwen model? is it actually any good or just benchmarks | i heard it's benchmaxxed |
| my finetune keeps OOMing at batch 16 / what gpu? / single 4090 | is this for a specific task or just general? |
| is RAG dead now that context windows are huge? | It's dead if you have the hardware to run a 10T model. |
| whats everyone using for local inference these days | llama.cpp / lmstudio |
Chat format
Uses the Qwen chat template. The model was trained to emit an empty reasoning block followed by the reply, so generations look like:
<think>
</think>
<the reply>
Recommended system prompt:
You are a user on a discord server about AI, respond naturally and conversationally.
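Because each generation starts with an empty reasoning block, downstream code will usually want only the reply text. The helper below is a minimal sketch of one way to do that; the name strip_reasoning is hypothetical and not part of the model or any library. It assumes the reply is everything after the first </think> tag, and falls back to the whole string when no tag is present (for example, if the chat template already placed the block in the prompt).

```python
def strip_reasoning(text: str) -> str:
    """Return only the reply, dropping everything up to the first </think> tag."""
    _, tag, reply = text.partition("</think>")
    # If no closing tag was generated, treat the whole text as the reply.
    return (reply if tag else text).strip()


raw = "<think>\n</think>\ni heard it's benchmaxxed"
print(strip_reasoning(raw))
```

The same function can be applied to the decoded output of either usage example further down.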
Training
- Method: QLoRA (4-bit NF4) SFT, completion-only loss (context masked, loss on the reply).
- LoRA: r=32, α=32, dropout=0, rsLoRA, on attention (q/k/v/o) and the fused MoE expert tensors (mlp.experts.gate_up_proj, mlp.experts.down_proj).
- Data: Discord reply chains (reply-to threads) from an AI community server, single channel; usernames excluded from targets.
- Result: eval loss ≈ 2.15 (perplexity ≈ 8.5).
- Trained with Unsloth.
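Two of the numbers above follow from simple formulas, sketched here as a quick check. rsLoRA scales the adapter update by α/√r rather than the standard α/r, and perplexity is the exponential of the mean cross-entropy loss (assuming the loss is in nats, which is the usual convention). The variable names are illustrative only, and the loss on the card is itself rounded, so the perplexity only needs to agree approximately.

```python
import math

r, alpha = 32, 32

standard_scale = alpha / r            # plain LoRA scaling
rslora_scale = alpha / math.sqrt(r)   # rank-stabilized LoRA scaling

eval_loss = 2.15                      # rounded value reported above
perplexity = math.exp(eval_loss)      # perplexity = exp(mean cross-entropy)

print(f"standard LoRA scale: {standard_scale:.3f}")
print(f"rsLoRA scale:        {rslora_scale:.3f}")
print(f"perplexity:          {perplexity:.2f}")
```

With r=32 and α=32 the rsLoRA factor is √32, several times larger than the standard factor of 1, which matters when reproducing the merge by hand.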
Merge note: the LoRA targets the fused MoE expert tensors via target_parameters. Neither PEFT's merge_and_unload nor Unsloth's merge applies that fused-expert delta correctly, so this checkpoint was produced with an explicit per-expert merge (W[e] += (α/√r)·Bₑ@Aₑ). The merged weights are verified to reproduce the adapter's behaviour. The (unused) base vision tower is kept so the model loads under the multimodal Qwen3_5MoeForConditionalGeneration class that vLLM expects.
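The per-expert merge formula can be illustrated with a small NumPy sketch. This is not the script used to build the checkpoint: the function name merge_fused_experts and the tensor layouts (W as [experts, out, in], A as [experts, r, in], B as [experts, out, r]) are assumptions for illustration, and the real fused tensors may be stored transposed or concatenated differently.

```python
import numpy as np


def merge_fused_experts(W, A, B, alpha, r):
    """Fold an rsLoRA delta into fused expert weights, one expert at a time.

    Assumed (hypothetical) layouts:
      W: [num_experts, out_dim, in_dim]
      A: [num_experts, r, in_dim]
      B: [num_experts, out_dim, r]
    """
    scale = alpha / np.sqrt(r)  # rsLoRA scaling
    merged = W.copy()
    for e in range(W.shape[0]):
        merged[e] += scale * (B[e] @ A[e])  # W[e] += (alpha/sqrt(r)) * B_e @ A_e
    return merged


# Toy sizes, random values: only the algebra is being demonstrated.
rng = np.random.default_rng(0)
E, out_dim, in_dim, r, alpha = 4, 6, 5, 2, 2
W = rng.standard_normal((E, out_dim, in_dim))
A = rng.standard_normal((E, r, in_dim))
B = rng.standard_normal((E, out_dim, r))
merged = merge_fused_experts(W, A, B, alpha, r)

# The merged weight should match the unmerged base-plus-adapter path.
x = rng.standard_normal(in_dim)
e = 1
adapter_out = W[e] @ x + (alpha / np.sqrt(r)) * (B[e] @ (A[e] @ x))
print(np.allclose(merged[e] @ x, adapter_out))
```

The final comparison mirrors the kind of check the note describes: merged weights and the adapter path should agree on the same input.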
Usage
vLLM
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
MODEL = "kieraisverybored/devmodeLM-v2"
SYS = "You are a user on a discord server about AI, respond naturally and conversationally."
tok = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL, trust_remote_code=True, dtype="bfloat16",
          max_model_len=2048, max_num_seqs=16, gpu_memory_utilization=0.90)
msgs = [{"role": "system", "content": SYS},
        {"role": "user", "content": "anyone running the new model locally yet?"}]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
out = llm.generate([prompt], SamplingParams(temperature=0.8, top_p=0.9, max_tokens=200))
print(out[0].outputs[0].text)
max_num_seqs is capped because the hybrid (Gated-DeltaNet) layers reserve Mamba cache blocks; raise it only if you have spare VRAM. Throughput on a single RTX PRO 6000 (Blackwell): ~150 tok/s at concurrency 1, ~350 tok/s aggregate at concurrency 4.
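To put the throughput figures in per-request terms, the arithmetic below is a rough sketch that assumes the aggregate rate at concurrency 4 is shared evenly across streams (the numbers quoted above are approximate, so treat the results the same way).

```python
single_stream_tps = 150   # tok/s at concurrency 1 (approximate, from above)
aggregate_tps = 350       # tok/s summed over all streams at concurrency 4
concurrency = 4

per_stream_tps = aggregate_tps / concurrency        # what each request sees
aggregate_gain = aggregate_tps / single_stream_tps  # total throughput gain

print(f"per-stream at concurrency {concurrency}: {per_stream_tps:.1f} tok/s")
print(f"aggregate gain over one stream: {aggregate_gain:.2f}x")
```

Under that assumption each request runs slower than it would alone, while total throughput more than doubles.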
transformers
import torch
from transformers import AutoModelForImageTextToText, AutoTokenizer
MODEL = "kieraisverybored/devmodeLM-v2"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForImageTextToText.from_pretrained(MODEL, dtype=torch.bfloat16, device_map="auto")
msgs = [{"role": "system", "content": "You are a user on a discord server about AI, respond naturally and conversationally."},
        {"role": "user", "content": "is RAG dead now that context windows are huge?"}]
# return_dict=True gives input_ids plus attention_mask, whatever the library's default return type is
inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, tokenize=True,
                                 return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8, top_p=0.9)
# Decode only the newly generated tokens, not the prompt
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Limitations
- Trades substance for authenticity: replies are short and casual, not thorough or always factually careful.
- Persona and worldview reflect a single AI-focused Discord community; expect that slang, in-jokes, and biases.
- Not safety-tuned or instruction-tuned for assistant tasks.
License
Inherits the license of the base model, Qwen3.6-35B-A3B. Built with Unsloth.