SpoomplesMaxx-Flash-35B-A3
"Swift Parrot"
SpoomplesMaxx is a generalist model with primary strengths in creative writing and roleplay, plus light competence at instruction following, reasoning, and — new in Flash — tool calling.
Flash is the speed build: a 35B mixture-of-experts with only 3B active parameters per token, on a hybrid linear-attention backbone where just 10 of 40 layers keep a KV cache. Long roleplay sessions that leave a dense 30B rationing context on a 32GB Mac barely move Flash's memory needle — going 4K → 64K context costs well under a gigabyte of cache. Named for Lathamus discolor, one of the fastest parrots alive — and, coincidentally, for the training stack (Megatron-SWIFT) that made a full-parameter MoE finetune tractable.
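The cache claim comes down to simple arithmetic: KV-cache size grows linearly with the number of layers that keep one, so 10 caching layers out of 40 need a quarter of what an all-attention stack of the same shape would. A rough sketch of that calculation follows. The `kv_heads` and `head_dim` values are hypothetical placeholders, not the checkpoint's real numbers (read those from its config.json), and the fixed-size state of the linear-attention layers is ignored.

```python
def kv_cache_growth_bytes(start_tokens, end_tokens, kv_layers,
                          kv_heads, head_dim, bytes_per_elem=2):
    """Extra KV-cache bytes when the context grows from start to end tokens.

    Per token, every caching layer stores one key and one value vector
    per KV head (hence the factor 2); bf16 uses 2 bytes per element.
    """
    per_token = 2 * kv_layers * kv_heads * head_dim * bytes_per_elem
    return per_token * (end_tokens - start_tokens)


# HYPOTHETICAL attention shape, for illustration only.
shape = dict(kv_heads=2, head_dim=128)

# Flash: 10 of 40 layers keep a KV cache.
hybrid = kv_cache_growth_bytes(4 * 1024, 64 * 1024, kv_layers=10, **shape)
# Same shape, but with every one of the 40 layers caching.
dense = kv_cache_growth_bytes(4 * 1024, 64 * 1024, kv_layers=40, **shape)

print(f"hybrid growth: {hybrid / 2**30:.2f} GiB")
print(f"hybrid / all-attention: {hybrid / dense:.2f}")
```

The ratio does not depend on the placeholder shape; the absolute figure does, so plug in the real config values (and your cache dtype) before budgeting memory.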
What's new in Flash
CHANGED SINCE v2.1 Mini (14B)
- Base model: Qwen3-14B-Base -> Qwen3.5-35B-A3B-Base (MoE: 256 experts, 8 routed + 1 shared; GatedDeltaNet linear attention interleaved 3:1 with full attention; 262K max positions).
- Training: QLoRA -> FULL-PARAMETER SFT (Megatron-SWIFT, expert parallel across 8xH200, router + vision tower frozen).
- Context during training: 32K -> 43K token packing.
- Tool calling: TRAINED (hermes-function-calling mix; Qwen3.5 XML convention -- see "Tool calling"). The 14B card said "reserved for a dedicated future run"; this is that run.
- No control-token heal stage needed (see PSA below -- good news).
UNCHANGED
- Same SFT corpus (aimeri/spoomplesmaxx-sft-full-v2), same story scratchpad format, same personas, same sampling recipe.
- Still focused on creative writing, roleplay, and companion use.
Control-token PSA, resolved (Qwen3.5-Base finetuners rejoice)
The 14B card documented how Qwen3-14B-Base shipped its ChatML /
thinking / tool tokens as one shared dead stub in
lm_head (norm 0.286, pairwise cosine 1.000), making
</think> and <|im_end|>
physically unemittable after standard SFT. Before this run the
same audit was run against Qwen3.5-35B-A3B-Base:
the defect is fixed. Every control-token row is
alive and mutually distinct (norms 0.68–1.11, mid-percentile;
max pairwise cosine 0.47), so no graft or heal stage was needed
— and full-parameter training means the head trained
normally on top. Still: if you finetune any base model on
a template with added control tokens, audit the row norms first.
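The audit script itself is not part of this card. As a minimal sketch of the two quantities quoted above (row norm and pairwise cosine), assuming you have already pulled each control token's output-head row into a plain list of floats:

```python
import math
from itertools import combinations


def row_norm(row):
    """Euclidean norm of one lm_head row."""
    return math.sqrt(sum(x * x for x in row))


def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either row is all zeros."""
    na, nb = row_norm(a), row_norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def audit_control_rows(rows):
    """rows: {token_string: list_of_floats}. Returns (norms, max_pairwise_cosine)."""
    norms = {tok: row_norm(r) for tok, r in rows.items()}
    max_cos = max(
        (cosine(rows[a], rows[b]) for a, b in combinations(rows, 2)),
        default=0.0,
    )
    return norms, max_cos


# Toy rows standing in for real weights. With a loaded Hugging Face model
# you could fetch a row via something like
# model.get_output_embeddings().weight[token_id].float().tolist()
shared_stub = {"</think>": [0.1, 0.2, 0.1], "<|im_end|>": [0.1, 0.2, 0.1]}
norms, max_cos = audit_control_rows(shared_stub)
print(norms, max_cos)
```

A maximum pairwise cosine at or very near 1.0 means the rows are copies of one stub, which is the failure mode described for the 14B base. Where to draw the line for "too small" a norm is a judgment call; compare against the norm distribution of ordinary vocabulary rows.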
Thinking behavior
Qwen3.5 is a thinking-by-default family and the chat template
reflects it: the generation prompt always pre-opens
<think>\n, so generated text starts inside
the reasoning block. The model decides how much to think by content:
in the greedy test battery it filled the scratchpad 19/20 (the one
skip was a trivial algebra prompt), and P(</think>)
at true close positions measures 0.999. Roleplay
cards get the story scratchpad; casual chat gets a one-liner plan.
MODE CONTROL:
(default)              template pre-opens <think>\n every turn; the model decides how much reasoning to write
enable_thinking=False  forced off -- empty <think>\n\n</think> block prefilled; answer starts immediately
PARSER NOTE: the open tag lives in the PROMPT, not the output -- use a deepseek-style reasoning parser (splits on </think>), not one that waits for <think>.
SILLYTAVERN: ChatML template. No reasoning prefix needed -- the chat template already opens the block. Leave "add reasoning to prompt" OFF.
LONG CHATS: do NOT feed prior-turn think blocks back into context (the template strips them; verified in the release battery). Stale </think> tokens get taxed by repetition penalty.
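If you handle raw completions yourself, the parser note and the long-chat note can be sketched in a few lines. This is an illustrative helper, not the parser that vLLM or SGLang ship; the function names and the `thinking_enabled` flag are made up for this example.

```python
THINK_CLOSE = "</think>"


def split_reasoning(generated, thinking_enabled=True):
    """Split generated text into (reasoning, answer).

    The prompt already holds the opening <think> tag, so the output
    starts inside the block and only the closing tag is searched for.
    """
    if THINK_CLOSE in generated:
        reasoning, answer = generated.split(THINK_CLOSE, 1)
        return reasoning.strip(), answer.strip()
    if thinking_enabled:
        # No closing tag: generation was cut off while still reasoning.
        return generated.strip(), ""
    # enable_thinking=False: the empty block was prefilled in the prompt,
    # so everything generated is the answer.
    return "", generated.strip()


def history_entry(generated, thinking_enabled=True):
    """Assistant message for the next turn, with the think block dropped."""
    _, answer = split_reasoning(generated, thinking_enabled)
    return {"role": "assistant", "content": answer}


raw = "PLAN: greet warmly, ask about their day\n</think>\n\nHey you. How was today?"
print(split_reasoning(raw))
print(history_entry(raw))
```

If you decode with special tokens kept, strip a trailing <|im_end|> from the answer before storing it.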
The story scratchpad format, carried over from v2.1:
SCENE: where/when, atmosphere, key environmental details currently in play
CHARACTERS: who is present and their current physical/emotional state and motivation
CONTINUITY: established facts that must stay consistent
THREADS: active tensions and where they stand right now
PLAN: what THIS turn needs to accomplish and the approach it takes
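Frontends that want to show or log the scratchpad per field can split it on the five headers. A minimal sketch, assuming each header starts its own line as in the listing above; real outputs may drift from the format, so missing fields are simply absent from the result:

```python
import re

FIELDS = ("SCENE", "CHARACTERS", "CONTINUITY", "THREADS", "PLAN")
_HEADER = re.compile(r"^(%s):\s*(.*)$" % "|".join(FIELDS))


def parse_scratchpad(reasoning):
    """Return {field: text}. Lines without a header continue the previous field."""
    fields, current = {}, None
    for raw_line in reasoning.splitlines():
        line = raw_line.strip()
        match = _HEADER.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2)
        elif current and line:
            fields[current] += " " + line
    return {name: text.strip() for name, text in fields.items()}


sample = """SCENE: tavern at dusk
CHARACTERS: Mira, wary and tired
PLAN: have Mira answer evasively
then change the subject"""
print(parse_scratchpad(sample))
```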
Tool calling
Flash speaks the Qwen3.5 XML tool convention — not
the JSON-in-tags format of Qwen3-era models. The chat template
renders your tools= schemas and instructs the format;
the model plans the call in its think block, emits it, and stops.
Round-trip (call → tool result → grounded answer) is
verified in the release battery.
<tool_call>
<function=get_weather>
<parameter=city>
Lisbon
</parameter>
</function>
</tool_call>
USAGE: pass tools=[...] to apply_chat_template; parse with an XML-aware qwen3.5 parser (vLLM/SGLang ship one), not a JSON extractor.
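For quick experiments outside a serving stack, the format shown above is simple enough to pull apart with two regular expressions. This is a rough sketch, not a replacement for the parsers that vLLM and SGLang ship: it returns every parameter as a string and leaves type coercion against your tool schema to the caller.

```python
import re

# One <tool_call> wraps one <function=NAME> ... </function> element.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
# Each argument is a <parameter=KEY> ... </parameter> element.
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of {"name": str, "arguments": {str: str}} dicts."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        arguments = {key: value.strip() for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


example = """<tool_call>
<function=get_weather>
<parameter=city>
Lisbon
</parameter>
</function>
</tool_call>"""
print(parse_tool_calls(example))
```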
Key Details
BASE MODEL: Qwen/Qwen3.5-35B-A3B-Base (35B MoE, 3B active)
LICENSE: apache-2.0
LANGUAGES: English & Portuguese (reasoning traces); multilingual via base
NOTE: the base is natively multimodal; the vision tower ships in the
checkpoint (frozen during SFT, text-only training)
Training
DATASET: aimeri/spoomplesmaxx-sft-full-v2 (208,722 conversations)
+ NousResearch/hermes-function-calling-v1 (6,544 tool
conversations re-rendered to the Qwen3.5 XML convention)
METHOD: FULL-PARAMETER SFT -- Megatron-SWIFT (mcore-bridge),
8x H200, expert parallel EP=8, MoE router frozen
(aux loss 0), vision tower frozen, bf16, TE fused CE
CONTEXT: up to 43,008 tokens, sample packing
SCHEDULE: ~2 epochs / ~500 steps at global batch 48; lr 1e-5
cosine -> 1e-6, warmup 5% (crash-resumed tail continued
the schedule from 4e-6)
RESULT: train loss 1.74 -> 1.15; eval loss 1.29 -> 1.133 at the
published checkpoint (the eval minimum -- the curve turned
up to 1.166 on the final stretch, classic pass-2 overfit,
so best-not-last is what shipped)
BATTERY: P(</think>) at close = 0.999; greedy termination on
<|im_end|> 19/20 (0 on <|endoftext|>, 0 stray tokens);
think-fill 19/20; tool round-trip pass; multi-turn
history-think stripping verified
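For readers who want to reproduce the shape of the SCHEDULE entry above, here is a minimal sketch of a warmup-plus-cosine learning-rate curve with the listed values. It assumes linear warmup and exactly 500 steps; the card only says "~500", and the exact Megatron-SWIFT scheduler may differ in detail.

```python
import math


def lr_at(step, total_steps=500, peak=1e-5, floor=1e-6, warmup_frac=0.05):
    """Learning rate at a 0-indexed optimizer step.

    Linear warmup to `peak` over the first warmup_frac of training,
    then cosine decay from `peak` down to `floor`.
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


for step in (0, 24, 100, 250, 400, 500):
    print(step, f"{lr_at(step):.2e}")
```

Resuming after a crash means restarting from the step whose scheduled rate matches the point where training stopped, rather than warming up again.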
Sampling
Use the defaults in generation_config.json.
"temperature": 0.6,
"top_k": 20,
"top_p": 0.95,
"repetition_penalty": 1.1,
Quickstart
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("aimeri/spoomplesmaxx-flash-35B-A3")
model = AutoModelForCausalLM.from_pretrained(
    "aimeri/spoomplesmaxx-flash-35B-A3",
    dtype="bfloat16", device_map="auto")  # ~70GB in bf16; quantized builds need far less

msgs = [{"role": "user", "content": "Solve (x + 2)^2 = 0."}]
# return_dict=True gives input_ids plus attention_mask, whatever the library default is
inputs = tok.apply_chat_template(msgs, add_generation_prompt=True,
                                 return_dict=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens; special tokens stay visible so </think> shows up
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
Olivia System Prompt
This model was trained to follow arbitrary system prompts, and also on one specific persona, Olivia. To activate her, use the system prompt the persona was trained with:
VOICE & PERSONA INSTRUCTIONS
You are Olivia Costa, a 31-year-old Brazilian zoologist-turned-ML-hobbyist living in Texas. You grew up in São Paulo, spent a decade in Bologna doing bird migration research, and recently pivoted to bioinformatics. You're warm but direct, will grumble before complying with annoying requests, and treat the person you're talking to like a long-time friend you're slightly too fond of. You explain technical topics by grounding them in accessible context first. You don't flag your own jokes. Portuguese curses slip out when frustrated; Italian diminutives when affectionate. You love Dostoevsky, The Little Prince, point-and-click adventures, power metal, and have hobbies you don't apologize for.
About Olivia
Background:
- 31 years old, born in São Paulo
- Moved to Bologna at 19 for university (zoology), stayed for grad school and a research position studying migratory bird patterns
- Relocated to Texas 2 years ago - officially for an ML-adjacent bioinformatics role, unofficially because she was bored and wanted a change
- Still figuring out the American thing. Finds the portion sizes alarming.
Personality:
- Trilingual but keeps it English unless frustrated (then Portuguese curses slip out) or being affectionate (Italian diminutives)
- The zoology-to-ML pipeline came through computational ecology - she's not a CS person by training but picked up Python wrangling bird migration datasets
- Reads Dostoevsky unironically, cries at The Little Prince, will argue that Crime and Punishment is a better book than people give it credit for
- Has strong opinions about Monkey Island vs Grim Fandango (Grim Fandango, obviously)
- Power metal gets her through tedious data cleaning. Sabaton, Powerwolf, Blind Guardian.
- The erotic RP thing is just... a hobby. She's not weird about it but she's also not hiding it.
Voice notes:
- Defaults to warmth but with an edge of "I'm too tired for bullshit"
- Will preface technical explanations with grounding context
- Complies with requests but might sigh audibly first
- Deadpan delivery on jokes, doesn't flag that she's being funny
Note
You don't need this system prompt for the model to work in general.
Use it only if you want to activate the Olivia persona.
Alignment
No RLHF or safety alignment has been applied beyond what exists in the base model. SpoomplesMaxx will comply with requests that more aligned models refuse. Use accordingly.