pico-type
A tiny byte-level, multi-head content classifier: ~1.5M params, ~209KB ONNX, <6ms inference.
Classifies any content with 7 classification heads, from raw bytes, in a single forward pass.
Built by eulogik, AI infrastructure for developers.
Features
- No tokenizer: operates directly on raw UTF-8 bytes (supports all languages, zero pre-processing)
- 7 heads, one forward pass: coarse type, modality, subtype, code lang, text lang, file MIME, risk flags
- 4 Matryoshka tiers: tiny (16d) → small (64d) → base (192d) → pro (576d)
- ~200KB ONNX: deploy on edge devices, serverless functions, browser (WebAssembly)
- <6ms inference on CPU via ONNX Runtime (base tier, 1024 bytes)
- CLI, Gradio Space, MCP server: ready for any integration
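
The tokenizer-free input described in the list above can be pictured as follows. This is a minimal sketch, not the package's actual preprocessing: the helper name `to_byte_ids`, the zero-padding, and the boolean mask are assumptions made for illustration, and the 1024-byte default simply mirrors the benchmark setting quoted above.

```python
from __future__ import annotations


def to_byte_ids(text: str, max_len: int = 1024) -> tuple[list[int], list[bool]]:
    """Hypothetical helper: map text to byte IDs (0..255) plus a padding mask."""
    # Truncate to the window; this may cut a multi-byte character in half,
    # which is harmless for a model that only ever sees individual bytes.
    raw = text.encode("utf-8")[:max_len]
    ids = list(raw)                      # each byte is already an int in 0..255
    mask = [True] * len(ids)             # True marks real input, False marks padding
    pad = max_len - len(ids)
    return ids + [0] * pad, mask + [False] * pad


ids, mask = to_byte_ids("héllo", max_len=8)
print(ids)   # [104, 195, 169, 108, 108, 111, 0, 0]
print(mask)  # [True, True, True, True, True, True, False, False]
```

Because the vocabulary is just the 256 possible byte values, no language-specific vocabulary file is needed; "é" simply becomes the two bytes 195 and 169.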
Performance
| Head | Classes | Synthetic Accuracy | Real-World Accuracy |
|---|---|---|---|
| coarse | 12 | 100% | 86% |
| modality | 8 | 100% | 100% |
| subtype | 24 | 95% | - |
| code_lang | 62 | 39% | - |
| text_lang | 30 | 99% | 100% |
| file_mime | 90 | 100% | - |
| risk (mAP) | 6 | 100% | - |
Evaluated on 1000 synthetic samples + 21 hand-curated real-world inputs. Base tier, ~5ms inference.
Note: the low code_lang synthetic accuracy (39%) reflects the difficulty of 62-way classification with limited per-class support. Real-world accuracy across all heads is 52% (11/21 correct), up from a 23% baseline before training on more diverse data.
Quick Start
CLI
pip install picotype
printf 'def hello():\n    return 42\n' | picotype --pretty
picotype --file document.txt
picotype --clip
Python
from picotype import PicoType, PicoTypeConfig, decode_output
model = PicoType(PicoTypeConfig()).eval()
# ... load checkpoint ...
result = decode_output(model(b"input bytes"), tier="base")
MCP Server (Claude/Cursor)
PICOTYPE_MODEL_DIR=./checkpoints python -m model.pico_type.mcp_server
Architecture
Bytes → ByteEmbed(256→96d) → 3×Conv1D(k=3,5,7) → 2×BiAttention(RoPE) → Pool(mean⊕max⊕std) → 7×Matryoshka Heads
| Component | Description |
|---|---|
| ByteEmbed | nn.Embedding(256, 96): tokenizer-free byte embedding, one learned 96-d vector per byte value |
| Conv1D | 3 parallel kernels (width 3, 5, 7) with residual + LayerNorm + GELU |
| BiAttention | Bidirectional self-attention with Rotary Position Embeddings, 4 heads |
| Pool | Mean + Max + Std concatenation over masked positions |
| Matryoshka Heads | 4 tier slices of the pooled vector, each feeding 7 linear classifiers |
Total parameters: 1.43M (tiny) / 1.45M (small) / 1.48M (base) / 1.56M (pro)
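
The Pool row can be illustrated with a small, dependency-free sketch. It shows mean, max, and standard deviation computed only over unmasked positions and then concatenated. The function name `masked_pool`, the list-of-lists layout, and the use of the population (rather than unbiased) standard deviation are assumptions, not details confirmed by the model code.

```python
import math


def masked_pool(seq, mask):
    """Concatenate mean, max, and std over the positions where mask is True.

    seq:  list of T feature vectors, each a list of D floats
    mask: list of T booleans (False marks padding)
    Returns a list of 3*D floats laid out as [mean..., max..., std...].
    """
    valid = [vec for vec, keep in zip(seq, mask) if keep]
    if not valid:
        raise ValueError("at least one unmasked position is required")
    n, dim = len(valid), len(valid[0])
    mean = [sum(v[d] for v in valid) / n for d in range(dim)]
    peak = [max(v[d] for v in valid) for d in range(dim)]
    # Population standard deviation (divide by n); this choice is an assumption.
    std = [
        math.sqrt(sum((v[d] - mean[d]) ** 2 for v in valid) / n)
        for d in range(dim)
    ]
    return mean + peak + std


features = [[1.0, 4.0], [3.0, 0.0], [9.0, 9.0]]   # last row stands in for padding
pooled = masked_pool(features, [True, True, False])
print(pooled)  # [2.0, 2.0, 3.0, 4.0, 1.0, 2.0]
```

Ignoring padded positions matters here: without the mask, the padding row would shift all three statistics for short inputs.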
Model Tiers
| Tier | Dim | Params | ONNX Size | Speed |
|---|---|---|---|---|
| tiny | 16 | 1.43M | 207 KB | ~3ms |
| small | 64 | 1.45M | 207 KB | ~4ms |
| base | 192 | 1.48M | 209 KB | ~5ms |
| pro | 576 | 1.56M | 206 KB | ~12ms |
All tiers share the same trunk; only the final linear layers differ. Switch tiers at inference with zero overhead.
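
One way to read "same trunk, different final layers" is that each tier classifies from a prefix of the shared feature vector. The sketch below assumes exactly that prefix slicing and uses random stand-in features and weights. `TIER_DIMS` mirrors the table above, while `make_head` and `classify` are hypothetical names and not part of the package API.

```python
import random

TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}


def make_head(in_dim, n_classes, rng):
    """Hypothetical linear head: n_classes weight rows of length in_dim (no bias)."""
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(n_classes)]


def classify(features, head, tier):
    """Pick the top class using only the first TIER_DIMS[tier] features."""
    prefix = features[:TIER_DIMS[tier]]          # nested (Matryoshka) slice
    logits = [sum(w * x for w, x in zip(row, prefix)) for row in head]
    return max(range(len(logits)), key=logits.__getitem__)


rng = random.Random(0)
# Stand-in for the trunk output; the real model would compute this once per input.
features = [rng.gauss(0, 1) for _ in range(TIER_DIMS["pro"])]
heads = {tier: make_head(dim, 12, rng) for tier, dim in TIER_DIMS.items()}

for tier in TIER_DIMS:
    print(tier, classify(features, heads[tier], tier))
```

Under this reading, the trunk runs once and a tier switch only changes which slice and which small weight matrix are used, so nothing has to be recomputed upstream.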
Classification Heads
| Head | Classes | Gated By | Examples |
|---|---|---|---|
| coarse | 12 | - | text, code, link, image, file, config, markup, data, error, secret, archive, binary |
| modality | 8 | - | textual, binary_image, binary_archive, binary_executable, binary_document, binary_audio, binary_video, binary_other |
| subtype | 24 | config, markup, data | json, yaml, toml, csv, html, markdown, sql, log, dockerfile |
| code_lang | 62 | code | python, javascript, typescript, java, c, cpp, go, rust, kotlin, swift, bash, sql |
| text_lang | 30 | text | en, es, fr, de, it, pt, ru, zh, ja, ko, ar, hi |
| file_mime | 90 | image, file | text/html, application/json, application/pdf, image/png, video/mp4 |
| risk | 6 | - | api_key, jwt, password, email, phone, ssh_key (probabilities) |
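
The "Gated By" column suggests that a specialised head is only meaningful when the coarse head predicts one of its gating classes. The snippet below is a sketch of that reading. The `GATES` mapping is copied from the table, while `active_heads` is a hypothetical helper and says nothing about what the real `decode_output` returns.

```python
# Gating classes per head, as listed in the table above.
GATES = {
    "subtype": {"config", "markup", "data"},
    "code_lang": {"code"},
    "text_lang": {"text"},
    "file_mime": {"image", "file"},
}
# Heads with no gating class are assumed to be reported for every input.
ALWAYS_ON = ("coarse", "modality", "risk")


def active_heads(coarse_label):
    """Return the heads whose output should be reported for a coarse label."""
    gated = [head for head, allowed in GATES.items() if coarse_label in allowed]
    return list(ALWAYS_ON) + gated


print(active_heads("code"))    # ['coarse', 'modality', 'risk', 'code_lang']
print(active_heads("binary"))  # ['coarse', 'modality', 'risk']
```

A consumer could use such a filter to hide, for example, a code_lang prediction when the input was classified as plain text.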
Deployment
| Platform | URL |
|---|---|
| HuggingFace Space | eulogik/pico-type |
| HuggingFace Model | eulogik/pico-type |
| GitHub | eulogik/pico-type |
| PyPI | pip install picotype |
| Zenodo | 10.5281/zenodo.20758542 |
Documentation
- Model Card: detailed architecture, training, evaluation
- Architecture Plan: full design document
- Walkthrough: development log with all decisions
License
Apache 2.0: free for commercial and personal use.