HoLo 6.5.1: byte-native multimodal research model (59M)

Proof-of-operation weights for HoLo 6.5.1: a tokenizer-free, byte-native, decoder-only prefix-LM (dim 512 / 8 layers / 8 heads, 59.4M params) running on the public non-learned 27-D HSL encoder (pip install hsl-embedding, MIT). Trained on a single RTX 4070. This is not a benchmark-superiority claim: the release exists so every "it works" claim is reproducible.

Files

file | stage | golden numbers
holo651_s1_text_30k.pt | S1 text backbone (EN+KO, 30k steps) | text 1.632 bpb / knowledge-domain 1.689
holo651_s2_chat_know_12k.pt | S2 chat + knowledge SFT (2KB context, 12k) | text 1.538 / chat 1.107 / grounding gap 0.120
holo651_s3_multimodal_10k.pt | S3 multimodal (video windows, 10k) | text 1.528 / video 4.575 / grounding gap 1.835

Grounding gap = extra bits/byte the model pays when its disk-retrieved facts are swapped for wrong ones (know_abl_bpb − know_bpb). It grew 0.001 → 1.835 across training: the model measurably READS its disk memory instead of memorizing (facts live in a disk store, patterns in weights).
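The gap arithmetic can be sketched as follows. This is a minimal illustration, not the repo's evaluation code: the function names and the toy loss values are hypothetical, and it assumes the per-byte cross-entropy is available in nats.

```python
import math

def bits_per_byte(nll_nats):
    """Mean negative log-likelihood per byte, converted from nats to bits."""
    return sum(nll_nats) / len(nll_nats) / math.log(2)

def grounding_gap(nll_true_facts, nll_wrong_facts):
    """know_abl_bpb - know_bpb: extra bits/byte paid when retrieved facts are swapped."""
    know_bpb = bits_per_byte(nll_true_facts)
    know_abl_bpb = bits_per_byte(nll_wrong_facts)
    return know_abl_bpb - know_bpb

# Toy per-byte losses in nats (made up for illustration only).
with_true = [1.0, 1.2, 0.8, 1.0]
with_wrong = [2.0, 2.4, 1.6, 2.0]

gap = grounding_gap(with_true, with_wrong)
print(f"grounding gap: {gap:.3f} bits/byte")
```

A gap near zero means swapping the retrieved facts costs the model nothing, so it is not using them. A larger gap means its predictions depend on what was retrieved.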

Multimodal generation (mechanism, not quality): free-running 539-byte video windows ([256B frame | SEP | 256B mu-law audio | SEP | 24B caption | WSEP]) keep their structure markers unforced and place real English in the caption slot.
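The window layout above can be checked with a short sketch. It assumes each structure marker occupies exactly one position, which is consistent with 256 + 1 + 256 + 1 + 24 + 1 = 539. The marker IDs 256 and 257 and the function name are hypothetical; the real values are defined in the repo.

```python
FRAME_LEN, AUDIO_LEN, CAPTION_LEN = 256, 256, 24

# Hypothetical marker IDs, chosen outside the 0..255 byte range for illustration.
SEP, WSEP = 256, 257

# Three payload slots plus three one-position markers.
WINDOW_LEN = FRAME_LEN + 1 + AUDIO_LEN + 1 + CAPTION_LEN + 1

def split_window(window):
    """Split one window into (frame, audio, caption), checking marker positions."""
    if len(window) != WINDOW_LEN:
        raise ValueError(f"expected {WINDOW_LEN} positions, got {len(window)}")
    a = FRAME_LEN              # index of the first SEP
    b = a + 1 + AUDIO_LEN      # index of the second SEP
    c = b + 1 + CAPTION_LEN    # index of WSEP
    if window[a] != SEP or window[b] != SEP or window[c] != WSEP:
        raise ValueError("structure markers are not where the layout expects them")
    return window[:a], window[a + 1:b], window[b + 1:c]

# Synthetic window: blank frame, constant audio, space-padded caption.
caption = list(b"a rabbit in a field".ljust(CAPTION_LEN))
window = [0] * FRAME_LEN + [SEP] + [128] * AUDIO_LEN + [SEP] + caption + [WSEP]

frame, audio, cap = split_window(window)
caption_text = bytes(cap).decode("ascii").rstrip()
print(WINDOW_LEN, len(frame), len(audio), caption_text)
```

Running a check like this on free-running output is one way to confirm that the markers land in place without being forced.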

Usage

pip install "hsl-embedding>=0.5.0" torch
git clone https://github.com/Woojiggun/holo-hsl
from holo_generate import load, gen_text   # from the repo (Train/)
m, cfg = load("holo651_s3_multimodal_10k.pt", device="cuda")
out = gen_text(m, "The universe is ".encode(), n_new=120, temperature=0.7,
               origin_anchor=cfg["origin_anchor"])
print(out.decode("utf-8", "replace"))
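Because generation is byte-level, a sample can stop partway through a multi-byte UTF-8 character, which is presumably why the snippet above decodes with "replace". The following standalone sketch uses only the Python standard library (no HoLo code) and shows the difference on a Korean string cut mid-character.

```python
text = "우주는 넓다"            # each Hangul syllable takes 3 bytes in UTF-8
data = text.encode("utf-8")
cut = data[:4]                 # ends one byte into the second syllable

try:
    cut.decode("utf-8")        # strict decoding rejects the incomplete sequence
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With errors="replace", the incomplete tail becomes U+FFFD instead of raising.
safe = cut.decode("utf-8", "replace")
print(strict_ok, repr(safe))
```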

Training data & why the license is NC

data | role | license
FineWeb-Edu (EN) + Korean Wikipedia | text backbone | ODC-By / CC-BY-SA 4.0
Project Gutenberg classics (philosophy etc.) | knowledge store + canon mix | Public domain (US)
Korean chat corpora (incl. GPT-derived sets) | S2 chat SFT | mixed, parts NC / model-derived
Open-movie video streams (Blender films) | S3 multimodal | CC-BY 3.0

Because the S2/S3 stages include chat data that is partly model-derived / non-commercial, these weights are released for research use under CC-BY-NC-SA 4.0. Attribution: Korean Wikipedia (CC-BY-SA), Blender Foundation open movies (CC-BY).

Honest limitations

A newborn research model: text generation is rough, chat is shallow, generated frames/audio are toy quality (16×16 gray, 8 kHz mu-law). Free-running quality varies by checkpoint (see the repo notes on cosine-tail selection). No safety tuning of any kind: research use only.
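For a sense of scale, here is a sketch of what one 256-byte audio slot at 8 kHz mu-law holds. It uses the textbook continuous mu-law companding formula with mu = 255. That formula is an assumption: the repo's actual quantizer may differ (for example the segmented G.711 encoding), and the function names are hypothetical.

```python
import math

MU = 255            # standard value for 8-bit mu-law
SAMPLE_RATE = 8000  # Hz, as stated above

def mulaw_encode(x):
    """Map a sample in [-1, 1] to an integer code in [0, 255]."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1) / 2 * MU))

def mulaw_decode(code):
    """Approximate inverse: map a code in [0, 255] back to [-1, 1]."""
    y = 2 * code / MU - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# One slot's worth of a 440 Hz test tone.
samples = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(256)]
codes = bytes(mulaw_encode(s) for s in samples)

duration_ms = 1000 * len(codes) / SAMPLE_RATE
print(len(codes), "bytes cover", duration_ms, "ms of audio")
```

By the same arithmetic, a 16×16 grayscale frame at one byte per pixel is 256 bytes, so a single window carries one tiny image and a few tens of milliseconds of sound.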

Citation

Jinhyun Woo, HoLo: A Feasibility Study of Change-Rate-Based Multimodal Unification. DOI: 10.5281/zenodo.20581805
