HoLo 6.5.1: byte-native multimodal research model (59M)
Proof-of-operation weights for HoLo 6.5.1: a tokenizer-free, byte-native, decoder-only prefix-LM
(dim 512 / 8 layers / 8 heads, 59.4M params) running on the public non-learned 27-D HSL encoder
(`pip install hsl-embedding`, MIT). Trained on a single
RTX 4070. This is not a benchmark-superiority claim; the release exists so that every "it works"
claim is reproducible.
- Live demo: https://holo-demo-p5txmh4dda-as.a.run.app
- Code: https://github.com/Woojiggun/holo-hsl
- Project card: https://huggingface.co/spaces/ggunio/holo-demo-space
Files
| file | stage | golden numbers |
|---|---|---|
| holo651_s1_text_30k.pt | S1 text backbone (EN+KO, 30k steps) | text 1.632 bpb / knowledge-domain 1.689 |
| holo651_s2_chat_know_12k.pt | S2 chat + knowledge SFT (2KB context, 12k) | text 1.538 / chat 1.107 / grounding gap 0.120 |
| holo651_s3_multimodal_10k.pt | S3 multimodal (video windows, 10k) | text 1.528 / video 4.575 / grounding gap 1.835 |
Grounding gap = extra bits/byte the model pays when its disk-retrieved facts are swapped for wrong
ones (`know_abl_bpb - know_bpb`). It grew from 0.001 to 1.835 across training: the model measurably READS its
disk memory instead of memorizing (facts live in a disk store, patterns in weights).
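The grounding-gap arithmetic can be sketched as follows. This is a minimal illustration, not the repo's evaluation code: the function names `bits_per_byte` and `grounding_gap` are hypothetical, the loss is assumed to be a summed negative log-likelihood in nats over the evaluated bytes, and the numbers are made up for the example.

```python
import math


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood (nats) into bits per byte."""
    return total_nll_nats / (n_bytes * math.log(2))


def grounding_gap(know_bpb: float, know_abl_bpb: float) -> float:
    """Extra bits/byte paid when retrieved facts are swapped for wrong ones."""
    return know_abl_bpb - know_bpb


# Illustrative values only (assumed, not measurements from this release).
know_bpb = bits_per_byte(total_nll_nats=1000.0, n_bytes=1000)      # correct facts
know_abl_bpb = bits_per_byte(total_nll_nats=1500.0, n_bytes=1000)  # swapped facts
gap = grounding_gap(know_bpb, know_abl_bpb)
print(f"grounding gap: {gap:.3f} bits/byte")
```

A gap near zero would mean the model ignores the retrieved text; a large positive gap means predictions get worse when the retrieved facts are wrong, which is the behavior described above.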
Multimodal generation (mechanism, not quality): free-running 539-byte video windows
([256B frame | SEP | 256B mu-law audio | SEP | 24B caption | WSEP]) keep their structure markers
unforced and place real English in the caption slot.
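The window layout above adds up to 256 + 1 + 256 + 1 + 24 + 1 = 539 positions. The sketch below packs and checks such a window under stated assumptions: the marker ids `SEP` and `WSEP` are hypothetical placeholders (the release may use different values or reserved byte values), captions are assumed to be space-padded to 24 bytes, and the helper names are not taken from the repo.

```python
from typing import List, Sequence, Tuple

FRAME_BYTES = 256    # one 16x16 grayscale frame
AUDIO_BYTES = 256    # mu-law audio samples
CAPTION_BYTES = 24   # caption slot
SEP = 256            # hypothetical marker id (assumption)
WSEP = 257           # hypothetical window-end marker id (assumption)

WINDOW_LEN = FRAME_BYTES + 1 + AUDIO_BYTES + 1 + CAPTION_BYTES + 1
assert WINDOW_LEN == 539


def pack_window(frame: bytes, audio: bytes, caption: bytes) -> List[int]:
    """Lay out one window as [frame | SEP | audio | SEP | caption | WSEP]."""
    if len(frame) != FRAME_BYTES or len(audio) != AUDIO_BYTES:
        raise ValueError("frame and audio must each be exactly 256 bytes")
    # Truncate or space-pad the caption to the fixed slot size (assumption).
    cap = caption[:CAPTION_BYTES].ljust(CAPTION_BYTES, b" ")
    return [*frame, SEP, *audio, SEP, *cap, WSEP]


def markers_intact(window: Sequence[int]) -> bool:
    """Check that the structure markers sit at their expected offsets."""
    return (
        len(window) == WINDOW_LEN
        and window[256] == SEP
        and window[513] == SEP
        and window[538] == WSEP
    )


def split_window(window: Sequence[int]) -> Tuple[bytes, bytes, bytes]:
    """Return (frame, audio, caption) from a well-formed window."""
    if not markers_intact(window):
        raise ValueError("structure markers missing or misplaced")
    return bytes(window[:256]), bytes(window[257:513]), bytes(window[514:538])
```

A check like `markers_intact` is one way to test the claim above on free-running samples: generate 539 positions without forcing any marker and verify that the separators still land at offsets 256, 513 and 538.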
Usage
```shell
pip install "hsl-embedding>=0.5.0" torch
git clone https://github.com/Woojiggun/holo-hsl
```

```python
from holo_generate import load, gen_text  # from the repo (Train/)

m, cfg = load("holo651_s3_multimodal_10k.pt", device="cuda")
out = gen_text(m, "The universe is ".encode(), n_new=120, temperature=0.7,
               origin_anchor=cfg["origin_anchor"])
print(out.decode("utf-8", "replace"))
```
Training data & why the license is NC
| data | role | license |
|---|---|---|
| FineWeb-Edu (EN) + Korean Wikipedia | text backbone | ODC-By / CC-BY-SA 4.0 |
| Project Gutenberg classics (philosophy etc.) | knowledge store + canon mix | Public domain (US) |
| Korean chat corpora (incl. GPT-derived sets) | S2 chat SFT | mixed, parts NC / model-derived |
| Open-movie video streams (Blender films) | S3 multimodal | CC-BY 3.0 |
Because the S2/S3 stages include chat data that is partly model-derived / non-commercial, these weights are released for research use under CC-BY-NC-SA 4.0. Attribution: Korean Wikipedia (CC-BY-SA), Blender Foundation open movies (CC-BY).
Honest limitations
A newborn research model: text generation is rough, chat is shallow, generated frames/audio are toy quality (16×16 gray, 8 kHz mu-law). Free-running quality varies by checkpoint (see the repo notes on cosine-tail selection). No safety tuning of any kind has been applied; research use only.
Citation
Jinhyun Woo, HoLo: A Feasibility Study of Change-Rate-Based Multimodal Unification. DOI: 10.5281/zenodo.20581805