Quark-72M

Quark-72M is a compact, from-scratch autoregressive language model developed by ThingAI as part of the Quark family of small language models. It is designed to be lightweight enough to run on consumer hardware while remaining architecturally modern, using Grouped-Query Attention, RoPE positional embeddings, SwiGLU feed-forward layers, and RMSNorm — the same building blocks found in contemporary frontier models, scaled down to ~72M parameters.

This is an instruction-tuned checkpoint, fine-tuned via SFT on top of a base model pre-trained on math, code, and reasoning-focused corpora.


Model Summary

| | |
|---|---|
| Developed by | ThingAI |
| Model type | Decoder-only Transformer (causal language model) |
| Parameters | ~71.7M (embedding-tied) |
| Languages | English, Italian |
| License | MIT |
| Finetuned from | Quark-72M base (pre-trained on math/code/reasoning mix) |
| Repository | ThingAI/Quark-72M-Instruct |

Quark-72M-Instruct is part of a broader effort at ThingAI to build small, self-hostable language models that can be trained, fine-tuned, and served entirely on personal infrastructure — without dependency on third-party APIs. It trades raw capability for transparency, inspectability, and low resource requirements.


Architecture

Quark-72M-Instruct uses a standard decoder-only Transformer stack with several efficiency-oriented design choices common in modern small LLMs:

| Component | Detail |
|---|---|
| Layers | 14 |
| Hidden size (d_model) | 512 |
| Attention heads | 8 query heads |
| KV heads | 2 (Grouped-Query Attention, 4:1 ratio) |
| Head dimension | 64 |
| Feed-forward | SwiGLU, d_ff = 1344 |
| Normalization | RMSNorm (pre-norm placement) |
| Positional encoding | Rotary Position Embeddings (RoPE), θ = 10,000 |
| Activation | SiLU (within SwiGLU gate) |
| Vocabulary size | 65,536 (+ 2 reserved chat tokens) |
| Context length | 2048 tokens |
| Weight tying | Input embedding and output projection share weights |
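
As a rough sanity check, the stated parameter count can be reproduced from the figures in the table. The sketch below is an estimate under assumptions this card does not state: bias-free linear layers, two RMSNorm weight vectors per block plus one final norm, and the 2 reserved chat tokens excluded. The function name count_parameters is illustrative and is not part of the repository.

```python
def count_parameters(vocab_size=65_536, d_model=512, n_layers=14,
                     n_heads=8, n_kv_heads=2, head_dim=64, d_ff=1344):
    """Estimate parameter counts, assuming no biases and tied embeddings."""
    # Tied input/output embedding is counted once.
    embedding = vocab_size * d_model

    # Attention: full-width Q and O projections, narrower K and V (GQA).
    q_proj = d_model * n_heads * head_dim
    k_proj = d_model * n_kv_heads * head_dim
    v_proj = d_model * n_kv_heads * head_dim
    o_proj = n_heads * head_dim * d_model
    attention = q_proj + k_proj + v_proj + o_proj

    # SwiGLU MLP: gate, up and down projections.
    mlp = 3 * d_model * d_ff

    # Assumed: one RMSNorm before attention and one before the MLP.
    norms = 2 * d_model

    per_layer = attention + mlp + norms
    total = embedding + n_layers * per_layer + d_model  # + final norm
    return {"embedding": embedding, "per_layer": per_layer, "total": total}


counts = count_parameters()
print(f"{counts['total']:,}")  # 71,645,696 under the assumptions above
```

Under these assumptions the estimate comes to roughly 71.6M, in line with the ~71.7M reported above. Nearly half of the parameters sit in the tied embedding matrix.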

Grouped-Query Attention (GQA). Query heads are grouped 4-to-1 onto a smaller set of key/value heads, reducing the memory footprint of the KV cache during autoregressive generation without materially affecting representational capacity at this model scale.
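
To make the KV-cache saving concrete, the following back-of-the-envelope sketch compares the cache size for 2 KV heads against a hypothetical 8-KV-head (full multi-head) variant. It assumes a bfloat16 cache (2 bytes per value), batch size 1, and a full 2048-token context. The helper kv_cache_bytes is illustrative only.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key and value caches for one sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


gqa = kv_cache_bytes(n_layers=14, n_kv_heads=2, head_dim=64, seq_len=2048)
mha = kv_cache_bytes(n_layers=14, n_kv_heads=8, head_dim=64, seq_len=2048)

print(gqa / 2**20, mha / 2**20)  # 14.0 56.0 (MiB)
```

The 4:1 grouping therefore cuts the cache to a quarter of what the same model would need with one KV head per query head.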

Rotary embeddings. Positional information is injected directly into the query/key vectors by rotating pairs of their dimensions through position-dependent angles, rather than by adding positional embeddings to the input. As a result, attention scores depend on the relative offset between two tokens rather than on their absolute positions.
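
The relative-position behavior of rotary embeddings can be checked numerically with a few lines of plain Python. This is a minimal sketch, not the code in modeling_quark.py: it assumes the interleaved pairing convention (dimensions 0 and 1, 2 and 3, and so on), whereas some implementations pair the first half of the vector with the second half. The vectors q and k are arbitrary illustrative values.

```python
import math


def rope(vec, position, theta=10_000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    rotated = []
    for i in range(0, dim, 2):
        # Pair i/2 rotates with frequency theta^(-i/dim).
        angle = position * theta ** (-i / dim)
        cos, sin = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * cos - y * sin, x * sin + y * cos])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same offset (3) at two very different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
print(math.isclose(score_a, score_b, abs_tol=1e-9))  # True
```

Shifting both positions by the same amount leaves the query/key dot product unchanged, which is the property attention relies on.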

SwiGLU feed-forward. Each block's MLP uses a gated SiLU activation (down_proj(silu(gate_proj(x)) * up_proj(x))), which has empirically outperformed plain GELU/ReLU MLPs of comparable parameter count in most published small-LM ablations.
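
The gated formula can be written out in dependency-free Python to show the data flow. This is a toy sketch with tiny hand-picked weight matrices (d_model = 2, d_ff = 3) and no biases. The real model applies the same computation with d_model = 512 and d_ff = 1344 in modeling_quark.py, whose exact code may differ.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weight, x):
    """Multiply a (rows x cols) matrix, given as nested lists, by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def swiglu_mlp(x, gate_proj, up_proj, down_proj):
    """down_proj(silu(gate_proj(x)) * up_proj(x)) for a single token vector."""
    gate = [silu(g) for g in matvec(gate_proj, x)]
    up = matvec(up_proj, x)
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return matvec(down_proj, hidden)


# Toy weights: gate/up map 2 -> 3, down maps 3 -> 2.
gate_proj = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]
up_proj = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
down_proj = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

out = swiglu_mlp([1.0, -2.0], gate_proj, up_proj, down_proj)
print(out)
```

The three projections are why the MLP contributes 3 × d_model × d_ff parameters per block rather than the 2 × d_model × d_ff of an ungated MLP.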


Intended Use

Quark-72M-Instruct is intended primarily as:

  • A research and educational artifact for studying small-scale LLM architecture, training, and inference.
  • A lightweight conversational base for hobbyist or self-hosted projects where running a multi-billion-parameter model is impractical.
  • A building block for further task-specific fine-tuning (e.g., shell/bash assistance, narrow-domain Q&A).

Out of scope

Given its size (~72M parameters), this model is not intended for:

  • Tasks requiring deep multi-step reasoning, long-context retrieval, or broad world knowledge.
  • Safety-critical, medical, legal, or financial applications.
  • Production use without further evaluation and, likely, additional fine-tuning for the target domain.

At this parameter count, fluency and instruction-following are best-effort: expect occasional repetition, factual unreliability, and shallow reasoning compared to billion-parameter-class models.


Quickstart

This model uses a custom architecture and requires trust_remote_code=True to load configuration_quark.py and modeling_quark.py from this repository.

Install the dependencies:

pip install transformers torch

Then load the tokenizer and model, and generate a response:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ThingAI/Quark2Tokenizer")
tokenizer.add_special_tokens({"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]})
im_end_id = tokenizer.convert_tokens_to_ids("<|im_end|>")

model = AutoModelForCausalLM.from_pretrained(
    "ThingAI/Quark-72M-Instruct",
    trust_remote_code=True,
    dtype=torch.bfloat16,   # or torch.float32 on CPU
).eval()

prompt = "<|im_start|>user\nWhat is the difference between a list and a tuple in Python?<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

output_ids = model.generate_text(
    input_ids,
    max_new_tokens=200,
    temperature=0.2,
    top_p=0.9,
    rep_penalty=1.15,
    eos_token_id=im_end_id,
)

response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)

Note: This model does not use the standard .generate() method from transformers. Generation is handled by a custom generate_text() method implemented directly on the model class, which exposes temperature, top_p, and rep_penalty (repetition penalty) arguments. This keeps the inference path simple and dependency-free, at the cost of not supporting beam search or some of the more advanced decoding strategies built into the standard HF generation utilities.

Running on GPU

model = AutoModelForCausalLM.from_pretrained(
    "ThingAI/Quark-72M-Instruct",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).cuda().eval()

input_ids = input_ids.cuda()

At ~72M parameters, this model runs comfortably on CPU for single-turn interactive use, and reaches well over 100 tokens/second on a single consumer GPU (tested on an RTX 3070).


Prompt Format

Quark-72M-Instruct was fine-tuned using a ChatML-style template with <|im_start|> / <|im_end|> role delimiters:

<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant
{model response}<|im_end|>

These two control tokens are not part of the base 65,536-token vocabulary baked into the tokenizer's vocab.json — they must be registered at load time via add_special_tokens, as shown in the Quickstart example above. Forgetting this step will cause the tokenizer to fall back to byte-level fragmentation for these tokens, which the model was never trained on, and will noticeably degrade output quality.

For multi-turn conversations, concatenate turns directly:

<|im_start|>user
{turn 1 user message}<|im_end|>
<|im_start|>assistant
{turn 1 response}<|im_end|>
<|im_start|>user
{turn 2 user message}<|im_end|>
<|im_start|>assistant

This model does not currently ship a chat_template in tokenizer_config.json; constructing the prompt string manually (as above) is the recommended approach until that is added.
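
Until a chat_template is available, a small helper can assemble the prompt string from a list of messages. The function below is a hypothetical convenience, not something shipped in this repository. It assumes messages are dicts with "role" and "content" keys and follows the template shown above exactly.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the ChatML-style format."""
    parts = []
    for message in messages:
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = build_prompt([
    {"role": "user", "content": "What is a tuple?"},
    {"role": "assistant", "content": "An immutable sequence."},
    {"role": "user", "content": "And a list?"},
])
print(prompt)
```

The resulting string can be passed to tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt") as in the Quickstart.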


Generation Parameters

The custom generate_text() method supports the following arguments:

  • max_new_tokens (int, default 200). Maximum number of tokens to generate.
  • temperature (float, default 0.7). Sampling temperature. 0.0 triggers greedy decoding. Lower values (0.1–0.3) are recommended for more focused, less repetitive output given the model's size.
  • top_p (float, default 0.9). Nucleus sampling threshold.
  • rep_penalty (float, default 1.0). Repetition penalty applied to previously generated/seen tokens. Values around 1.1–1.2 substantially reduce the looping behavior common in small models.
  • eos_token_id (int, default None). Token ID that halts generation early. Should be set to the ID of <|im_end|>.

The implementation includes a NaN/Inf guard: if the logits produced at any generation step are degenerate, the method falls back to greedy decoding for that step rather than propagating nan into the sampling procedure.
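
To illustrate how these arguments interact, here is a pure-Python sketch of a single decoding step. It is not the code in modeling_quark.py. The function name, the order of operations, the divide-or-multiply form of the repetition penalty, and the decision to treat any non-finite logit as degenerate are all assumptions made for illustration.

```python
import math
import random


def sample_next_token(logits, seen_ids, temperature=0.7, top_p=0.9,
                      rep_penalty=1.0, rng=random):
    """Pick the next token ID from a list of raw logits (illustrative only)."""
    logits = list(logits)

    # Repetition penalty: make every already-seen token less likely.
    for token_id in set(seen_ids):
        if logits[token_id] > 0:
            logits[token_id] /= rep_penalty
        else:
            logits[token_id] *= rep_penalty

    # NaN/Inf guard: on degenerate logits, skip sampling and decode
    # greedily over the finite entries only.
    degenerate = any(not math.isfinite(v) for v in logits)
    if degenerate or temperature == 0.0:
        finite = [v if math.isfinite(v) else float("-inf") for v in logits]
        return max(range(len(finite)), key=finite.__getitem__)

    # Temperature-scaled softmax, shifted by the max for stability.
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus (top-p): keep the smallest set of most likely tokens
    # whose cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, mass = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        mass += probs[token_id]
        if mass >= top_p:
            break

    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


next_id = sample_next_token([2.0, 1.9, -1.0], seen_ids=[0], temperature=0.2,
                            rep_penalty=1.15, rng=random.Random(0))
print(next_id)
```

In this toy call, token 0 starts with the highest logit, but the penalty for having been seen drops it below token 1, which shows how rep_penalty discourages loops.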

Recommended starting configuration

model.generate_text(
    input_ids,
    max_new_tokens=200,
    temperature=0.2,
    top_p=0.9,
    rep_penalty=1.15,
    eos_token_id=im_end_id,
)

Given the model's scale, low temperature combined with a moderate repetition penalty tends to produce the most coherent output. Higher temperatures increase output diversity but also increase the likelihood of incoherent or repetitive degeneration.


Training Details

Pre-training

The base model was pre-trained from scratch on a mixture weighted toward mathematical and code-heavy text, with a smaller proportion of chain-of-thought reasoning data:

| Source | Approx. share | Content |
|---|---|---|
| OpenWebMath | 45% | Mathematical text and derivations |
| The Stack (smol) | 45% | Source code across multiple languages |
| Magpie-Reasoning-150K | 4% | Distilled chain-of-thought traces |
| OpenThoughts-114k | 4% | Multi-step reasoning conversations |
| Reasoning-base-20k | 2% | Logical inference traces |

Pre-training targeted 5B tokens on the GQA + RoPE + SwiGLU architecture described above, using bfloat16 mixed precision, AdamW with cosine learning-rate decay, gradient clipping, and torch.compile for throughput.
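
For a sense of scale, the mixture shares can be turned into approximate per-source token budgets. The sketch assumes the shares are measured in tokens (the card does not say whether they are by tokens, documents, or samples) and that the 5B target was met exactly. The shares themselves are approximate, so the results are indicative only.

```python
# Approximate shares from the pre-training mixture, in percent.
MIXTURE_PERCENT = {
    "OpenWebMath": 45,
    "The Stack (smol)": 45,
    "Magpie-Reasoning-150K": 4,
    "OpenThoughts-114k": 4,
    "Reasoning-base-20k": 2,
}
TARGET_TOKENS = 5_000_000_000


def token_budget(mixture_percent, target_tokens):
    """Split a token target across sources according to percentage shares."""
    if sum(mixture_percent.values()) != 100:
        raise ValueError("mixture shares must sum to 100%")
    return {name: target_tokens * pct // 100
            for name, pct in mixture_percent.items()}


budget = token_budget(MIXTURE_PERCENT, TARGET_TOKENS)
for name, tokens in budget.items():
    print(f"{name:<24}{tokens / 1e9:.2f}B tokens")
```

Under these assumptions, math and code account for about 2.25B tokens each, while the three reasoning sets together contribute about 0.5B.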

Supervised fine-tuning (SFT)

The base checkpoint was subsequently fine-tuned on conversational and instruction-following data formatted with the ChatML-style template described above. Training data was pre-tokenized to .npy files ahead of time to eliminate streaming/tokenization bottlenecks during training, which previously made each epoch impractically slow.

A note on capability: because the pre-training mixture was weighted heavily toward math and code rather than general conversation, and because the SFT phase was comparatively short, this model's conversational fluency is modest relative to what its parameter count would suggest for a model trained primarily on dialogue. It reliably follows the chat template and produces grammatical, on-topic responses, but should not be expected to match the conversational depth of models trained end-to-end on large-scale dialogue corpora.


Tokenizer

This model uses ThingAI/Quark2Tokenizer, a byte-level BPE tokenizer with a 65,536-token vocabulary trained on a bilingual (Italian/English) and code-inclusive corpus.

Two additional control tokens (<|im_start|>, <|im_end|>) are required for chat-formatted inference and must be added at runtime as shown in the Quickstart section — they are not persisted in the tokenizer's saved vocabulary file.


Evaluation & Known Limitations

This model has not yet been benchmarked against standard small-LM evaluation suites (e.g., PIQA, ARC-Easy, HellaSwag). Evaluation harness integration is planned but not yet published for this checkpoint.

Known limitations:

  • Repetition. Like most sub-100M-parameter language models, Quark-72M-Instruct is prone to repetitive loops without a repetition penalty. A rep_penalty of 1.1–1.2 is recommended for most use cases.
  • Shallow reasoning. The model can produce plausible-sounding text but should not be relied upon for multi-step logical or mathematical reasoning beyond simple cases, despite the math-heavy pre-training mixture.
  • Limited world knowledge. At this scale, the model has memorized comparatively little factual knowledge and will frequently hallucinate specific facts, dates, or entities.
  • Short effective context. While the architecture supports a 2048-token context window, reliable instruction-following degrades over long contexts more readily than in larger models.
  • No chat template metadata. As noted above, tokenizer_config.json does not yet define a chat_template; prompts must be constructed manually.

This model is best understood as a research-grade, fully-inspectable small language model rather than a general-purpose assistant.


Files in this Repository

| File | Purpose |
|---|---|
| config.json | Model configuration and auto_map registration for trust_remote_code |
| configuration_quark.py | QuarkConfig class (extends PretrainedConfig) |
| modeling_quark.py | Full model implementation (QuarkForCausalLM), including the custom generate_text() method |
| model.safetensors | Model weights (float32) |
| generation_config.json | Default generation hyperparameters |
| tokenizer_config.json | Tokenizer configuration pointing to the companion tokenizer repository |

Because this model relies on custom modeling code rather than a built-in transformers architecture, trust_remote_code=True is required when loading it via AutoModelForCausalLM or AutoConfig. As with any trust_remote_code model, it is good practice to review modeling_quark.py and configuration_quark.py directly before running them, and to pin to a specific revision in production code so that the reviewed code is the code that runs.


Citation

If you use Quark-72M-Instruct in your work, please consider citing the repository:

@misc{quark72m,
  title  = {Quark-72M},
  author = {ThingAI},
  year   = {2026},
  url    = {https://huggingface.co/ThingAI/Quark-72M}
}

License

This model is released under the MIT License. See the repository for full license text. Training data sources retain their own respective licenses; users are responsible for ensuring compliance with upstream dataset terms where applicable.


Quark-72M is developed and maintained by ThingAI as part of an ongoing effort to build self-hostable, fully-inspectable small language models.
