TinyTalk — human-like small talk on a microcontroller
TinyTalk is a ~8.3M-parameter (≈1.6M non-embedding) GPT-Neo chatbot built to do one thing: hold short, friendly, human-sounding small-talk conversations on low-end hardware that can't run a normal LLM — think an ESP32-S3.
It is the model embedded in the Cardputer AI
firmware, where it runs fully offline on the device in ~2 MB of flash after
Q4_0 quantization. This repository hosts the full-precision PyTorch /
safetensors weights so the model can be used, fine-tuned, or re-quantized on
its own.
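As a rough check on the parameter figures above, the count can be reproduced from the architecture listed under "What it is, technically" (8 layers, hidden size 128, vocab 50257, tied embeddings). The sketch below is an estimate, not a dump of the real checkpoint. It assumes the standard Hugging Face GPT-Neo layer layout (bias-free query/key/value projections, an MLP four times the hidden size) and 2048 learned positions (the GPTNeoConfig default). Neither value is stated in this card, and the function name is hypothetical.

```python
def gpt_neo_param_count(n_layer=8, d_model=128, vocab=50257,
                        n_positions=2048, mlp_ratio=4):
    """Estimate parameters of a GPT-Neo-style model with tied embeddings.

    n_positions and mlp_ratio are assumptions (library defaults),
    not values taken from the model card.
    """
    d_ff = mlp_ratio * d_model
    layer_norm = 2 * d_model                       # weight + bias
    # q, k, v projections without bias, output projection with bias
    attention = 3 * d_model * d_model + (d_model * d_model + d_model)
    # two linear layers with bias
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    per_layer = 2 * layer_norm + attention + mlp
    non_embedding = n_layer * per_layer + layer_norm   # plus final layer norm
    token_emb = vocab * d_model                    # shared with the output head
    pos_emb = n_positions * d_model
    return {
        "non_embedding": non_embedding,
        "token_embedding": token_emb,
        "position_embedding": pos_emb,
        "total": non_embedding + token_emb + pos_emb,
    }

counts = gpt_neo_param_count()
for name, value in counts.items():
    print(f"{name:>20}: {value:,}")
```

Under these assumptions the non-embedding count comes to about 1.58M and the total to about 8.28M, in line with the figures quoted above. Almost all of the weight budget sits in the token embedding table.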
What it's for
Most chat models assume a datacenter GPU. TinyTalk asks the opposite question: how small can a model be and still feel like talking to someone? It trades away knowledge, reasoning, and long context to fit on a microcontroller, keeping only the ability to make warm, coherent small talk:
User: hey, how are you?
Bot: I am good! I played outside today. It was so much fun!
User: nice! what did you play?
Bot: I played with my ball. Do you want to play too?
Good fits: an offline conversational toy or companion on an ESP32 / handheld; a teaching example of an end-to-end on-device LLM; a tiny base to fine-tune for embedded chat. Not a fit: anything needing facts, instructions, reasoning, or safety guarantees.
What it is, technically
- Architecture: GPT-Neo (GPTNeoForCausalLM): 8 layers, hidden size 128, 16 heads, alternating global/local attention (window 256), learned position embeddings, tied input/output embeddings, GPT-2 byte-level BPE tokenizer (vocab 50257).
- Base: roneneldan/TinyStories-Instruct-3M.
- Fine-tune: ~70K filtered, simple-English dialogues from allenai/SODA, reformatted as User: / Bot: turns, mixed with a slice of TinyStoriesInstruct. Loss is masked to the bot replies / story bodies, so the model never trains on producing the user's turns.
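The loss masking mentioned in the fine-tune notes can be illustrated with a small sketch. This is not the training code used for TinyTalk. The helper name `build_masked_example`, the segment representation, and the use of -100 as the ignore label (the default `ignore_index` of PyTorch's cross-entropy loss, which Transformers causal-LM heads also skip) are assumptions made for illustration. A byte-level toy encoder stands in for the real tokenizer so the example runs without downloads.

```python
IGNORE_INDEX = -100  # assumed ignore label; positions with it add no loss

def build_masked_example(segments, encode):
    """Build (input_ids, labels) from ordered (text, is_target) segments.

    Tokens from segments with is_target=False still appear in input_ids
    (the model sees them as context) but get IGNORE_INDEX as their label.
    """
    input_ids, labels = [], []
    for text, is_target in segments:
        ids = encode(text)
        input_ids.extend(ids)
        labels.extend(ids if is_target else [IGNORE_INDEX] * len(ids))
    return input_ids, labels

def toy_encode(text):
    """Toy stand-in for a tokenizer: one id per UTF-8 byte."""
    return list(text.encode("utf-8"))

segments = [
    ("User: hey, how are you?\nBot:", False),   # user turn: context only
    (" I am good!<|endoftext|>", True),         # bot reply: trained on
]
input_ids, labels = build_masked_example(segments, toy_encode)
trained = sum(1 for label in labels if label != IGNORE_INDEX)
print(f"{trained} of {len(labels)} positions contribute to the loss")
```

With a real BPE tokenizer, encoding segments separately can split tokens differently from encoding the whole string at once, so a production pipeline would need to handle segment boundaries with care.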
Prompt format
The model was trained on this exact format, with <|endoftext|> (token 50256) between exchanges:
User: <message>
Bot: <reply><|endoftext|>
User: <message>
Bot:
Feed the prompt "User: <message>\nBot:" and generate until <|endoftext|>.
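A minimal sketch of assembling this format for a multi-turn conversation follows. The helper names `build_prompt` and `extract_reply` are hypothetical, and the newline placed after each <|endoftext|> is an assumption read off the layout shown above.

```python
EOS = "<|endoftext|>"

def build_prompt(history, user_message):
    """Render earlier (user, bot) exchanges plus the new user message.

    The result ends with "Bot:" so the model continues with its reply.
    """
    parts = [f"User: {u}\nBot: {b}{EOS}\n" for u, b in history]
    parts.append(f"User: {user_message}\nBot:")
    return "".join(parts)

def extract_reply(generated_text):
    """Keep only the bot reply: cut at EOS or at a stray 'User:' line."""
    reply = generated_text.split(EOS, 1)[0]
    reply = reply.split("\nUser:", 1)[0]
    return reply.strip()

history = [("hey, how are you?", "I am good! I played outside today.")]
print(build_prompt(history, "nice! what did you play?"))
print(extract_reply(" I played with my ball.<|endoftext|>\nUser: ok"))
```

The second helper is a safeguard for decoded text that runs past the end-of-text marker. When generation stops on the EOS token and special tokens are skipped during decoding, it reduces to stripping whitespace.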
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and full-precision weights from the Hub
tok = AutoTokenizer.from_pretrained("TheREZOR/TinyTalk")
model = AutoModelForCausalLM.from_pretrained("TheREZOR/TinyTalk")

# One exchange in the training format; the model continues after "Bot:"
prompt = "User: hi, what's your name?\nBot:"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(
    ids, max_new_tokens=40, do_sample=True, temperature=0.7, top_k=40,
    eos_token_id=tok.eos_token_id,  # stop at <|endoftext|>
)
# Decode only the newly generated tokens (the bot's reply)
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
Honest limitations
- Kindergarten English only. Short, simple sentences.
- No world knowledge. Factual questions get friendly confabulation.
- Short memory. Trained/served with a tiny context (~80 tokens on device, 256 max).
- Not instruction-following, not safe for any production use.
- A toy/educational model — interesting because it fits on a microcontroller, not because it is good.
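Because of the short context, a chat loop has to forget old exchanges to stay within budget. One possible approach, sketched below, drops the oldest exchanges until the prompt fits. The function `trim_history`, its `count_tokens` callable, and the default budget of 40 prompt tokens (the ~80-token on-device context minus 40 tokens reserved for the reply) are all assumptions for illustration, not the firmware's actual strategy.

```python
def trim_history(history, user_message, count_tokens, max_prompt_tokens=40):
    """Drop the oldest (user, bot) exchanges until the prompt fits the budget.

    count_tokens: callable mapping a prompt string to its token count.
    Returns the kept exchanges and the rendered prompt.
    """
    def render(pairs):
        past = "".join(f"User: {u}\nBot: {b}<|endoftext|>\n" for u, b in pairs)
        return past + f"User: {user_message}\nBot:"

    kept = list(history)
    while kept and count_tokens(render(kept)) > max_prompt_tokens:
        kept.pop(0)  # the oldest exchange is forgotten first
    return kept, render(kept)

# A whitespace word count stands in for a real tokenizer here.
history = [("hi", "Hello! How are you?"), ("good", "That is nice!")]
kept, prompt = trim_history(
    history, "what did you do today?",
    count_tokens=lambda text: len(text.split()),
    max_prompt_tokens=14,
)
print(kept)
print(prompt)
```

With a real tokenizer, `count_tokens` would wrap its encode call. If the newest message alone exceeds the budget, this sketch still returns it untrimmed, so a caller would need to truncate the message itself.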
License & attribution
Released under CC BY 4.0, the binding term inherited from the SODA training data. You must retain the following attributions:
- Base model: TinyStories-Instruct-3M — Ronen Eldan & Yuanzhi Li, TinyStories: How Small Can Language Models Be and Still Speak Coherent English? (arXiv:2305.07759). Published without an explicit license tag; the TinyStories dataset family is CDLA-Sharing-1.0, which places no restriction on trained models.
- Fine-tune data: SODA (CC BY 4.0) — Kim et al., SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization (arXiv:2212.10465); and TinyStoriesInstruct (CDLA-Sharing-1.0).
- Tokenizer: GPT-2 byte-level BPE — OpenAI GPT-2 (MIT).
See NOTICE.md for the full provenance.