TinyBrainBot - demo (216.5M)
A 216.5M-parameter conversational language model trained from scratch on a custom decoder-only architecture. This is an early proof-of-concept / demo checkpoint: the goal is to show that a small model with a proper subword tokenizer can hold coherent, on-topic chat, not to compete with production models.
Trained end-to-end (tokenizer + architecture + pretraining + SFT) as a solo project on consumer hardware.
Model details
| Property | Value |
|---|---|
| Parameters | 216.5M |
| Architecture | Decoder-only transformer: RoPE + RMSNorm (pre-norm) + SwiGLU, tied embeddings |
| Layers | 10 |
| Hidden size | 1032 |
| Attention heads | 12 (head_dim 86) |
| FFN size | 4416 |
| Context length | 768 |
| Vocabulary | 36,000 (SentencePiece unigram, case-preserving, with chat/memory special tokens) |
| RoPE theta | 10,000 |
| Training tokens | ~551M (this checkpoint) |
| Precision | F16 (this GGUF file) |
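As a sanity check, the parameter count can be roughly reproduced from the figures in the table. The sketch below is only an estimate under assumptions the model card does not state: no bias terms, four hidden-by-hidden attention projections (Q, K, V, O), three SwiGLU matrices, and two RMSNorm weight vectors per layer plus one final norm.

```python
# Rough parameter count from the model details table.
# Assumptions (not confirmed by the model card): no biases, standard
# multi-head attention with Q/K/V/O projections, SwiGLU with three
# matrices, two RMSNorm vectors per layer and one final RMSNorm.
vocab_size = 36_000
hidden = 1032
layers = 10
heads = 12
ffn = 4416

head_dim = hidden // heads            # 86, matching the table

embedding = vocab_size * hidden       # counted once because embeddings are tied
attention = 4 * hidden * hidden       # Q, K, V, O projections
swiglu = 3 * hidden * ffn             # gate, up, down projections
norms = 2 * hidden                    # two RMSNorm weight vectors per layer
per_layer = attention + swiglu + norms

total = embedding + layers * per_layer + hidden  # + final RMSNorm

print(f"head_dim = {head_dim}")
print(f"total    = {total:,} (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate comes to 216,493,992 parameters, which rounds to the stated 216.5M.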
Note: ~551M training tokens is well under compute-optimal for this size, so this is an early, undertrained checkpoint. Expect coherent short replies but factual drift and looping on long generations. More training is planned.
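To put "well under compute-optimal" in numbers, the short calculation below compares the training budget with the widely quoted rule of thumb of about 20 training tokens per parameter. That heuristic is an assumption brought in for illustration; it does not come from this model card.

```python
# Tokens-per-parameter check for this checkpoint.
params = 216.5e6          # from the model details table
train_tokens = 551e6      # from the model details table (approximate)

# Assumption: ~20 tokens per parameter as a rough compute-optimal
# heuristic. Not a figure from the model card.
heuristic_tokens_per_param = 20

ratio = train_tokens / params
target_tokens = heuristic_tokens_per_param * params
fraction = train_tokens / target_tokens

print(f"tokens per parameter : {ratio:.2f}")
print(f"heuristic target     : {target_tokens / 1e9:.2f}B tokens")
print(f"fraction of target   : {fraction:.1%}")
```

This works out to roughly 2.5 tokens per parameter, or about an eighth of the heuristic target.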
How to run
LM Studio
Load the GGUF, then set sampling to:
| Setting | Value |
|---|---|
| Temperature | 0.70 |
| Top P | 0.90 |
| Repeat penalty | 1.15 |
| Frequency penalty | 0.5 |
| Presence penalty | 0.3 |
The chat template is embedded in the GGUF, so no manual prompt-format setup is needed.
llama.cpp
llama-cli -m tinybrainbot-216.5m-F16.gguf --jinja \
--temp 0.70 --top-p 0.90 --repeat-penalty 1.15 \
--frequency-penalty 0.5 --presence-penalty 0.3
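The same sampling settings can also be applied from Python. The following is a minimal sketch using llama-cpp-python, assuming the GGUF file sits in the working directory and that the installed version picks up the chat template embedded in the file. The `chat` helper and the `max_tokens` value are illustrative choices, not part of the model card.

```python
# Sampling settings copied from the LM Studio table and llama-cli command.
SAMPLING = {
    "temperature": 0.70,
    "top_p": 0.90,
    "repeat_penalty": 1.15,
    "frequency_penalty": 0.5,
    "presence_penalty": 0.3,
}


def chat(user_message: str, model_path: str = "tinybrainbot-216.5m-F16.gguf") -> str:
    """Send one user message and return the reply text (hypothetical helper)."""
    # Imported here so the settings can be inspected without the package installed.
    from llama_cpp import Llama

    # n_ctx matches the 768-token context length of this checkpoint.
    llm = Llama(model_path=model_path, n_ctx=768, verbose=False)
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": user_message}],
        max_tokens=128,  # assumption: short replies, where the model is most coherent
        **SAMPLING,
    )
    return result["choices"][0]["message"]["content"]


# Example call (requires llama-cpp-python and the model file):
# print(chat("Hello! How are you today?"))
```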
Chat format
The model uses these special tokens (single IDs in the tokenizer):
<|user|>
{your message}
<|end|>
<|assistant|>
Generation stops at <|end|>.
The GGUF tokenizer is exported as UGM (unigram) so llama.cpp reproduces the training-time SentencePiece tokenization exactly; an SPM export would re-segment words and degrade output.
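When the embedded template is not available (for example, when calling a raw completion endpoint), the chat format above can be assembled by hand. This is a sketch only: the newline placement mirrors the layout shown above, and the multi-turn layout (each earlier turn closed with `<|end|>`) is an assumption. The helper names are hypothetical.

```python
USER, ASSISTANT, END = "<|user|>", "<|assistant|>", "<|end|>"


def build_prompt(turns):
    """Build a prompt from (role, text) pairs, ending with an open assistant turn.

    Assumption: every completed turn is written as tag, text, <|end|>,
    each on its own line. Only the single-turn form is shown in the card.
    """
    tags = {"user": USER, "assistant": ASSISTANT}
    parts = [f"{tags[role]}\n{text}\n{END}\n" for role, text in turns]
    parts.append(f"{ASSISTANT}\n")  # the model continues from here
    return "".join(parts)


def trim_reply(raw):
    """Cut generated text at the first <|end|>, where generation should stop."""
    return raw.split(END, 1)[0].strip()


print(build_prompt([("user", "Hello!")]))
```

Passing `<|end|>` as a stop sequence to the inference backend achieves the same effect as `trim_reply` without generating the extra tokens.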
Intended use
- Demonstrations and education about small-model training/inference
- Research into tokenizers, architectures, and from-scratch training on modest hardware
- A base for further training / fine-tuning experiments
Not intended for factual question answering, production use, or any high-stakes application.
Limitations
- Small + undertrained: answers are often factually wrong and can wander off-topic on long outputs.
- Short context (768 tokens).
- English-only.
- Greets and opens coherently; degrades the longer it generates (mitigate with the frequency/presence penalties above).
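Given the 768-token context, a chat front end has to drop old turns before the prompt overflows. The helper below is one possible sketch of that: it keeps the most recent turns that fit a token budget. `fit_history`, the `count_tokens` callable, and the 128-token reply reserve are all hypothetical, and the overhead of special tokens is ignored.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text)


def fit_history(
    turns: List[Turn],
    count_tokens: Callable[[str], int],
    context_length: int = 768,      # from the model details table
    reserve_for_reply: int = 128,   # assumption: room left for the answer
) -> List[Turn]:
    """Keep the newest turns whose combined token count fits the budget."""
    budget = context_length - reserve_for_reply
    kept: List[Turn] = []
    used = 0
    # Walk from newest to oldest; the newest turn is always kept,
    # even if it alone exceeds the budget.
    for turn in reversed(turns):
        cost = count_tokens(turn[1])
        if kept and used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

In practice `count_tokens` would wrap the model's own tokenizer; a whitespace split is only a crude stand-in for testing.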
Training data
Pretraining on public English text (e.g. Wikipedia, TinyStories, OpenWebText2) and SFT on public chat/instruction datasets. Respect the upstream licenses of those datasets for your own use.