Instructions to use ThingAI/Quark-72M with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ThingAI/Quark-72M with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="ThingAI/Quark-72M", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("ThingAI/Quark-72M", trust_remote_code=True, dtype="auto")
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use ThingAI/Quark-72M with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ThingAI/Quark-72M"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ThingAI/Quark-72M",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
Use Docker
docker model run hf.co/ThingAI/Quark-72M
- SGLang
How to use ThingAI/Quark-72M with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "ThingAI/Quark-72M" \
  --host 0.0.0.0 \
  --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ThingAI/Quark-72M",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "ThingAI/Quark-72M" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ThingAI/Quark-72M",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
- Docker Model Runner
How to use ThingAI/Quark-72M with Docker Model Runner:
docker model run hf.co/ThingAI/Quark-72M
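The curl calls above can also be made from Python. The following is a minimal sketch using only the standard library; the endpoint path and JSON fields are copied from the curl examples, while the helper names `build_completion_request` and `complete` are made up for illustration, and the `choices[0].text` lookup assumes the server returns an OpenAI-style completions response.

```python
import json
import urllib.request


def build_completion_request(
    prompt,
    base_url="http://localhost:8000",  # vLLM port from the example above; use 30000 for SGLang
    model="ThingAI/Quark-72M",
    max_tokens=512,
    temperature=0.5,
):
    """Build the same POST request that the curl examples send."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text.

    Assumes an OpenAI-style response with the text under choices[0].text.
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Usage (needs a running server):
#   print(complete("Once upon a time,"))
```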
Good one...
It's very slow on CPU (slower than a 350M model...)
I tried to convert it to GGUF, but it always failed...
CLI chat — type /help for commands. No context memory (each prompt is standalone).
you: Hi
assistant: Hello! I'm Open Assistant, a chat-based assistant that can answer questions and answers to various fields. It's well known that you're interested in the development of a language model, but I don't have access to it. However, I can provide you with some general ideas for building a language model using a variety of languages.
Here are some general ideas:
- A language model model that is designed to be easy to understand and learn.
- A language model that is designed to be easy to understand and learn.
- A language model that is designed to be easy to understand and learn.
- A language model that is designed to be easy to understand and learn.
- A language model that is designed to be easy to understand and learn.
- A language model that is designed to be easy to understand and learn.
- A language model that is designed to be easy to understand and learn.
you: Try Again.
assistant: I'm sorry, but I cannot solve this problem without you. It is important to take care of your situation and avoid any potential harm or harm that may cause the problem. Please let me know what you need to do next!
you:
Hi, thanks again for testing. You're right — the CPU speed issue is real, and I think I know exactly why.
The core problem is the tokenizer.
This model was originally designed to be ~35M parameters, but it inherited the 65,536-token vocabulary from the larger Quark family. With a hidden size of 512, the embedding matrix alone is 65,536 × 512 = 33.5 million parameters — that's almost half the model's total size just for looking up tokens.
For a tiny model (<100M), this is a bad trade-off: the embedding eats up parameters that should have gone into the transformer layers, and the large vocabulary makes every forward pass more expensive, especially on CPU where large matrix operations aren't as optimized.
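The arithmetic behind that claim can be checked in a few lines. This is only a sketch: the vocabulary and hidden sizes come from the post, the 72M total is taken from the model name rather than from a measured parameter count, and the per-token cost assumes a standard output projection from the hidden state to one logit per vocabulary entry.

```python
VOCAB_SIZE = 65_536        # from the post
HIDDEN_SIZE = 512          # from the post
TOTAL_PARAMS = 72_000_000  # assumption: rounded from the "72M" in the model name

# One row of HIDDEN_SIZE weights per token in the vocabulary.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
embedding_share = embedding_params / TOTAL_PARAMS

# Assumption: a standard LM head, i.e. one HIDDEN_SIZE x VOCAB_SIZE
# matrix-vector product per generated token to produce the logits.
logit_macs_per_token = HIDDEN_SIZE * VOCAB_SIZE

print(f"embedding parameters: {embedding_params:,}")             # 33,554,432
print(f"share of total parameters: {embedding_share:.1%}")       # 46.6%
print(f"logit multiply-adds per token: {logit_macs_per_token:,}")  # 33,554,432
```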
What I'm planning to do:
- Train a smaller tokenizer (8K–16K tokens) specifically for tiny models
- Re-train a proper 35M–40M model with that tokenizer: much faster, lighter, and more coherent
- Release proper GGUF/ONNX versions once the architecture is stable
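To get a feel for what the smaller tokenizer would save, here is a rough comparison of embedding sizes. It assumes the hidden size stays at 512 in the re-trained model and reads "8K–16K" as 8,192 and 16,384 tokens; neither is confirmed in the post, and `embedding_params` is just an illustrative helper.

```python
HIDDEN_SIZE = 512  # assumption: same hidden size as the current model


def embedding_params(vocab_size, hidden_size=HIDDEN_SIZE):
    """Parameters in an embedding table with one row per vocabulary entry."""
    return vocab_size * hidden_size


# Assumption: "8K" and "16K" are read as powers of two.
for vocab_size in (8_192, 16_384, 65_536):
    params = embedding_params(vocab_size)
    print(f"{vocab_size:>6} tokens -> {params:>10,} embedding parameters")
```

Under these assumptions the embedding table shrinks from about 33.5M parameters to roughly 4.2M to 8.4M, which leaves most of a 35M–40M budget for the transformer layers.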
For now, the 72M model is more of a lesson learned than a usable release. I appreciate you trying it out — feedback like this is exactly what helps me fix things.
I'll update the repo once I have a better version. Thanks for the patience!
Oh wait... I see the problem...
It's:
No context memory (each prompt is standalone).
So when I send the next message, it doesn't seem to see the previous one ("Hi")...
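For comparison, a chat loop that does carry context usually keeps the earlier turns and prepends them to each new prompt. The sketch below shows the idea with a pluggable `generate` callable; `build_prompt` and `chat_turn` are hypothetical names, and the `you:` / `assistant:` layout only mirrors the CLI labels above, not necessarily the prompt format the model was trained on.

```python
def build_prompt(history, user_message):
    """Join all earlier turns plus the new message into one prompt string.

    Assumption: a plain "role: text" layout, one turn per line.
    """
    lines = [f"{role}: {text}" for role, text in history]
    lines.append(f"you: {user_message}")
    lines.append("assistant:")
    return "\n".join(lines)


def chat_turn(history, user_message, generate):
    """Run one turn and record both sides of it in `history`.

    `generate` is any callable that maps a prompt string to generated text,
    for example a thin wrapper around a text-generation pipeline.
    """
    prompt = build_prompt(history, user_message)
    reply = generate(prompt).strip()
    history.append(("you", user_message))
    history.append(("assistant", reply))
    return reply


# Usage with a stand-in generator (swap in a real model call):
#   history = []
#   chat_turn(history, "Hi", my_generate)
#   chat_turn(history, "Try Again.", my_generate)  # this prompt now includes "Hi"
```

With a model this small the context window fills quickly, so a real loop would also need to drop or truncate old turns.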
Apache 2.0 is better than MIT because it offers legal safeguards against submarine patents. It does so through an explicit, irrevocable patent grant from contributors and a defensive termination clause that automatically revokes this grant if the user initiates a patent infringement lawsuit.
So more orgs may look into this model... instead of letting it be forgotten...
This is it in your preferred language:
translation by DSeek
"The main reason Apache 2.0 is superior to the MIT license is that it includes explicit clauses to protect users from so-called 'submarine patents': patents that are deliberately kept hidden until a technology has become widespread, and are asserted only later in order to demand royalty payments."
I changed the license.