pico-type

A tiny byte-level multi-head content classifier: ~1.5M params, ~209KB ONNX, <6ms inference.

Classifies any content along 7 dimensions (one per head) from raw bytes in a single forward pass.

Built by eulogik: AI infrastructure for developers.


✨ Features

  • No tokenizer: operates directly on raw UTF-8 bytes (supports all languages, zero pre-processing)
  • 7 heads, one forward pass: coarse type, modality, subtype, code lang, text lang, file MIME, risk flags
  • 4 Matryoshka tiers: tiny (16d) → small (64d) → base (192d) → pro (576d)
  • ~200KB ONNX: deploy on edge devices, serverless functions, browser (WebAssembly)
  • <6ms inference on CPU via ONNX Runtime (base tier, 1024 bytes)
  • CLI, Gradio Space, MCP server: ready for any integration
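
As a rough illustration of what "no tokenizer" means in practice, the sketch below turns arbitrary content into a fixed-length sequence of byte ids. The helper name `encode_bytes`, the padding id, and the mask are assumptions made for illustration and are not taken from the picotype package; only the 1024-byte length comes from the benchmark setting quoted above.

```python
def encode_bytes(content, max_len=1024, pad_id=0):
    """Hypothetical helper: map str/bytes to byte ids (0-255) plus a padding mask."""
    # Text is encoded as UTF-8; bytes-like input is used as-is.
    raw = content.encode("utf-8") if isinstance(content, str) else bytes(content)
    raw = raw[:max_len]  # truncate long inputs
    pad = max_len - len(raw)
    ids = list(raw) + [pad_id] * pad
    # pad_id 0 collides with a real NUL byte, so a mask marks the real positions.
    mask = [1] * len(raw) + [0] * pad
    return ids, mask


ids, mask = encode_bytes("héllo", max_len=8)
print(ids)   # 'é' occupies two bytes in UTF-8
print(mask)
```

Because every possible input is already a sequence of values in 0-255, the vocabulary is fixed at 256 entries regardless of language or file type.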

📊 Performance

| Head | Classes | Synthetic Accuracy | Real-World Accuracy |
|---|---|---|---|
| coarse | 12 | 100% | 86% |
| modality | 8 | 100% | 100% |
| subtype | 24 | 95% | n/a |
| code_lang | 62 | 39% | n/a |
| text_lang | 30 | 99% | 100% |
| file_mime | 90 | 100% | n/a |
| risk (mAP) | 6 | 100% | n/a |

Evaluated on 1000 synthetic samples + 21 hand-curated real-world inputs. Base tier, ~5ms inference.

Note: the low code_lang synthetic accuracy reflects the difficulty of 62-way classification with limited per-class support. Real-world accuracy across all heads is 52% (11/21 correct), up from a 23% baseline before diverse training.
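
The risk head is reported as mAP rather than accuracy because it outputs one probability per flag. The sketch below shows the common ranking-based definition of mean average precision, along with the 11/21 arithmetic. It is not known whether the authors' evaluation script uses exactly this definition; the function names are illustrative, and scoring a class with no positives as 0.0 is an assumption.

```python
def average_precision(scores, labels):
    """AP for one class: mean precision at each rank where a positive appears."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    # Assumption: a class with no positive samples contributes 0.0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_rows, label_rows):
    """Rows are samples, columns are classes (e.g. the 6 risk flags)."""
    n_classes = len(score_rows[0])
    aps = [
        average_precision([row[c] for row in score_rows],
                          [row[c] for row in label_rows])
        for c in range(n_classes)
    ]
    return sum(aps) / n_classes


# The overall real-world figure quoted above: 11 of 21 inputs correct.
overall = round(11 / 21 * 100)
print(overall)
```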

🚀 Quick Start

CLI

pip install picotype

printf 'def hello():\n    return 42\n' | picotype --pretty
picotype --file document.txt
picotype --clip

Python

from picotype import PicoType, PicoTypeConfig, decode_output

model = PicoType(PicoTypeConfig()).eval()
# ... load checkpoint ...
result = decode_output(model(b"input bytes"), tier="base")

MCP Server (Claude/Cursor)

PICOTYPE_MODEL_DIR=./checkpoints python -m model.pico_type.mcp_server

πŸ— Architecture

Bytes β†’ ByteEmbed(256β†’96d) β†’ 3Γ—Conv1D(k=3,5,7) β†’ 2Γ—BiAttention(RoPE) β†’ Pool(meanβ€–maxβ€–std) β†’ 7Γ—Matryoshka Heads
Component Description
ByteEmbed nn.Embedding(256, 96) β€” lookup-free byte embedding
Conv1D 3 parallel kernels (width 3, 5, 7) with residual + LayerNorm + GELU
BiAttention Bidirectional self-attention with Rotary Position Embeddings, 4 heads
Pool Mean + Max + Std concatenation over masked positions
Matryoshka Heads 4 tier slices of the pooled vector β†’ 7 linear classifiers
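
The pooling step (mean, max, and standard deviation concatenated over non-padding positions) can be sketched in plain Python as follows. This is a minimal illustration rather than the project's code: the function name `masked_pool` is hypothetical, and the use of population standard deviation (dividing by n) is an assumption.

```python
import math


def masked_pool(features, mask):
    """Pool T feature vectors of size D into one 3*D vector: mean || max || std.

    features: list of T lists of floats; mask: list of T flags (1 = real byte).
    """
    kept = [vec for vec, keep in zip(features, mask) if keep]
    if not kept:
        raise ValueError("mask selects no positions")
    n, dim = len(kept), len(kept[0])
    mean = [sum(v[d] for v in kept) / n for d in range(dim)]
    peak = [max(v[d] for v in kept) for d in range(dim)]
    # Assumption: population std (divide by n), not the sample std.
    std = [math.sqrt(sum((v[d] - mean[d]) ** 2 for v in kept) / n)
           for d in range(dim)]
    return mean + peak + std


feats = [[1.0, 2.0], [3.0, 6.0], [100.0, 100.0]]  # last row is padding
pooled = masked_pool(feats, [1, 1, 0])
print(pooled)
```

Masking matters here because padded positions would otherwise pull the mean toward the padding value and could dominate the max.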

Total parameters: 1.43M (tiny) / 1.45M (small) / 1.48M (base) / 1.56M (pro)

🔧 Model Tiers

| Tier | Dim | Params | ONNX Size | Speed |
|---|---|---|---|---|
| tiny | 16 | 1.43M | 207 KB | ~3ms |
| small | 64 | 1.45M | 207 KB | ~4ms |
| base | 192 | 1.48M | 209 KB | ~5ms |
| pro | 576 | 1.56M | 206 KB | ~12ms |

All tiers share the same trunk; only the final linear layers differ. Switch tiers at inference with zero overhead.
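
One way to picture tier switching: each tier's classifier reads only a slice of the shared pooled vector, so changing tiers changes which slice and which linear layer are used while the trunk output stays the same. The sketch below assumes the slice is a leading prefix (the usual Matryoshka convention, not confirmed by this README); `tier_logits` and its arguments are hypothetical names.

```python
# Tier widths as listed in the tier table.
TIER_DIMS = {"tiny": 16, "small": 64, "base": 192, "pro": 576}


def tier_logits(pooled, weights, bias, tier):
    """Apply one tier's linear head to the first TIER_DIMS[tier] pooled features.

    weights: one row per class, each row of length TIER_DIMS[tier].
    """
    dim = TIER_DIMS[tier]
    prefix = pooled[:dim]  # assumption: tiers are nested prefixes
    return [sum(w * x for w, x in zip(row, prefix)) + b
            for row, b in zip(weights, bias)]


pooled = [1.0] * 576                      # stand-in for the trunk output
tiny_w = [[1.0] * 16, [0.5] * 16]         # toy 2-class head for the tiny tier
print(tier_logits(pooled, tiny_w, [0.0, 1.0], "tiny"))
```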

🧪 Classification Heads

| Head | Classes | Gated By | Examples |
|---|---|---|---|
| coarse | 12 | none | text, code, link, image, file, config, markup, data, error, secret, archive, binary |
| modality | 8 | none | textual, binary_image, binary_archive, binary_executable, binary_document, binary_audio, binary_video, binary_other |
| subtype | 24 | config, markup, data | json, yaml, toml, csv, html, markdown, sql, log, dockerfile |
| code_lang | 62 | code | python, javascript, typescript, java, c, cpp, go, rust, kotlin, swift, bash, sql |
| text_lang | 30 | text | en, es, fr, de, it, pt, ru, zh, ja, ko, ar, hi |
| file_mime | 90 | image, file | text/html, application/json, application/pdf, image/png, video/mp4 |
| risk | 6 | none | api_key, jwt, password, email, phone, ssh_key (probabilities) |
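
The "Gated By" column suggests that some heads are only meaningful for certain coarse types. A minimal sketch of how a caller might apply that gating when post-processing predictions is shown below. The gate sets are copied from the table; the function name, the choice to drop gated-out heads from the result, and the 0.5 risk threshold are assumptions and may differ from what `decode_output` actually does.

```python
# Gate sets taken from the "Gated By" column; ungated heads are absent.
GATES = {
    "subtype": {"config", "markup", "data"},
    "code_lang": {"code"},
    "text_lang": {"text"},
    "file_mime": {"image", "file"},
}


def decode_gated(coarse_label, head_labels, risk_probs, risk_threshold=0.5):
    """Keep a head's label only if the coarse prediction opens its gate."""
    result = {"coarse": coarse_label}
    for head, label in head_labels.items():
        gate = GATES.get(head)
        if gate is None or coarse_label in gate:
            result[head] = label
    # Risk flags are independent probabilities; the threshold is an assumption.
    result["risk"] = sorted(
        name for name, prob in risk_probs.items() if prob >= risk_threshold
    )
    return result


prediction = decode_gated(
    "code",
    {"modality": "textual", "subtype": "json", "code_lang": "python",
     "text_lang": "en", "file_mime": "text/html"},
    {"api_key": 0.9, "email": 0.1},
)
print(prediction)
```

With a coarse label of "code", only the ungated modality head and the code_lang head survive, which keeps irrelevant guesses (such as a text language for source code) out of the result.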

🌐 Deployment

| Platform | URL |
|---|---|
| HuggingFace Space | eulogik/pico-type |
| HuggingFace Model | eulogik/pico-type |
| GitHub | eulogik/pico-type |
| PyPI | pip install picotype |
| Zenodo | 10.5281/zenodo.20758542 |

📚 Documentation

📄 License

Apache 2.0: free for commercial and personal use.


Built with ❤️ by eulogik