ks_byte_lm SpaceByte v1

Small byte-level causal language model trained from scratch on a Kashmiri pretraining corpus.
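
To make "byte-level" concrete, the sketch below shows how a Kashmiri word maps to UTF-8 byte values using only Python's built-in codec. It is an illustration, not the ksbyte encoder: the bundled code may add special tokens (such as an end-of-sequence id) or an id offset on top of the raw bytes.

```python
# Illustration only: plain UTF-8 byte encoding, not the ksbyte encoder itself.
text = "کٲشِرۍ"

# Each byte becomes one integer in the range 0..255.
byte_ids = list(text.encode("utf-8"))

# Decoding the same bytes recovers the original string.
round_trip = bytes(byte_ids).decode("utf-8")

print(len(text), "code points ->", len(byte_ids), "bytes")
print(byte_ids[:4])
```

Characters in the Arabic Unicode block (U+0600 to U+06FF) take two bytes each in UTF-8, so Perso-Arabic Kashmiri text yields close to two byte tokens per non-space character.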

This repository is a full project/checkpoint release: source code, Modal training artifacts, metrics, and training checkpoints are included.

checkpoints/best.pt — best validation checkpoint; recommended for generation.
checkpoints/latest.pt — final early-stopped checkpoint.
checkpoints/ckpt_*.pt — retained periodic checkpoints.
artifacts/final_report.json — final training report.
artifacts/train.log — full training log.
artifacts/accuracy_val_best.json and accuracy_val_latest.json — held-out next-byte accuracy estimates.
artifacts/data_meta.json — cached byte-shard metadata.
source/ — full ks_byte_lm project source used for the run.

Key metrics

Validation next-byte accuracy using best.pt: 76.42% on 3,276,800 evaluated byte tokens.
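
As a reading aid, next-byte top-1 accuracy can be sketched as the fraction of positions where the model's most likely byte equals the byte that actually follows. The helper name and the toy ids below are hypothetical; the project's own evaluation script may batch, sample, or window the data differently.

```python
def next_byte_accuracy(predicted, targets):
    """Fraction of positions where the predicted byte id equals the true next byte id."""
    if len(predicted) != len(targets):
        raise ValueError("predicted and targets must have the same length")
    if not targets:
        raise ValueError("need at least one position to score")
    correct = sum(p == t for p, t in zip(predicted, targets))
    return correct / len(targets)

# Toy example with made-up ids: 3 of 4 predictions match.
toy_accuracy = next_byte_accuracy([216, 169, 32, 10], [216, 169, 32, 11])
print(toy_accuracy)

# Number of correct predictions implied by the reported 76.42% on 3,276,800 tokens.
implied_correct = round(0.7642 * 3_276_800)
print(implied_correct)
```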

Training summary:

Best validation BPB: 0.9593
Final validation BPB: 0.9862
Final CE: 0.6836
Final word PPL estimate: 849.0
Early stopped at step 4751 / 5000

Dataset/cache metadata:

Train byte tokens: 45,362,173
Validation byte tokens: 1,622,371
Test byte tokens: 3,074,698
Words: 5,074,066
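
The reported metrics can be cross-checked against each other with a few lines of arithmetic. The sketch below assumes that the final CE is measured in nats per byte token and that the word count covers all three splits combined; both are inferences from the numbers above, not statements about how the training script computes them.

```python
import math

final_ce_nats = 0.6836            # "Final CE" from the training summary
train_bytes = 45_362_173
val_bytes = 1_622_371
test_bytes = 3_074_698
words = 5_074_066

# One token is one byte, so bits-per-byte is cross entropy converted from nats to bits.
bpb = final_ce_nats / math.log(2)

# Assumption: the word count covers train, validation, and test together.
bytes_per_word = (train_bytes + val_bytes + test_bytes) / words

# Per-word perplexity: per-byte loss scaled by the average word length in bytes.
word_ppl = math.exp(final_ce_nats * bytes_per_word)

print(f"BPB ~ {bpb:.4f}")
print(f"bytes/word ~ {bytes_per_word:.3f}")
print(f"word PPL ~ {word_ppl:.1f}")
```

With the rounded inputs above, this gives a BPB of about 0.986 and a word perplexity of about 849, in line with the reported 0.9862 and 849.0.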

Usage

This is not a standard Transformers-format model. Do not use AutoModel.from_pretrained. It is a custom PyTorch checkpoint release for the included ksbyte source code.

Recommended checkpoint:

checkpoints/best.pt

best.pt is preferred over latest.pt because it had the best validation BPB.

1. Download and install

If the repository is private, first log in with an account that has access:

pip install -U huggingface_hub
huggingface-cli login

Download the full snapshot:

python - <<'PY'
from huggingface_hub import snapshot_download

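snapshot_download(
    repo_id="Omarrran/ks-byte-lm-spacebyte-v1",
    repo_type="model",
    # Recent huggingface_hub releases write real files into local_dir,
    # so the deprecated local_dir_use_symlinks argument is omitted.
    local_dir="ks-byte-lm-spacebyte-v1",
)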
PY
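
After the download finishes, a quick check that the main files arrived can save a confusing error later. The sketch below uses the paths from the file list in this README plus requirements.txt; missing_files is a hypothetical helper, and the periodic ckpt_*.pt files are skipped because their exact names are not listed.

```python
from pathlib import Path

# Paths taken from this README; ckpt_*.pt files are not checked
# because their exact names are not listed.
EXPECTED_FILES = [
    "checkpoints/best.pt",
    "checkpoints/latest.pt",
    "artifacts/final_report.json",
    "artifacts/train.log",
    "artifacts/data_meta.json",
    "requirements.txt",
]

def missing_files(repo_root):
    """Return the expected relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]

# An empty list means every expected file is present.
print(missing_files("ks-byte-lm-spacebyte-v1"))
```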

Install dependencies:

cd ks-byte-lm-spacebyte-v1
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

2. Generate text

Make the bundled source package importable and run generation:

export PYTHONPATH="$PWD/source:$PYTHONPATH"

PYTHONUTF8=1 python -m ksbyte.generate \
  --ckpt checkpoints/best.pt \
  --prompt "کٲشِرۍ پَلو چھِ اکثر خٕطہٕ کِس تٲریٖخٕچ دٔلیٖل ونان، یُس مُختٔلِف ثقافتَن ہٕنٛدۍ اثرات ظٲہِر کران" \
  --n 200 \
  --temperature 0.8 \
  --top_k 50 \
  --device auto

--device auto uses CUDA when available and falls back to CPU otherwise.
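
The fallback rule can be sketched as a small pure function. resolve_device is a hypothetical helper, not the function used inside ksbyte.generate, and raising an error for an explicit but unavailable "cuda" request is an assumption. With PyTorch, the cuda_available argument would come from torch.cuda.is_available().

```python
def resolve_device(requested, cuda_available):
    """Map a --device value to a concrete device name."""
    if requested == "auto":
        # Prefer the GPU when one is usable, otherwise fall back to the CPU.
        return "cuda" if cuda_available else "cpu"
    if requested == "cuda" and not cuda_available:
        # Assumption: an explicit GPU request should fail loudly rather than fall back.
        raise RuntimeError("CUDA requested but not available; use --device cpu or auto")
    return requested

print(resolve_device("auto", cuda_available=False))
```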

Example verified output from the trained checkpoint:

 کٲشِرۍ پَلو چھِ اکثر خٕطہٕ کِس تٲریٖخٕچ دٔلیٖل ونان، یُس مُختٔلِف ثقافتَن ہٕنٛدۍ اثرات ظٲہِر کران

3. Minimal Python API

import torch
from ksbyte.generate import load_model, encode_prompt, decode_tokens
from ksbyte.config import EOS_ID

ckpt = "checkpoints/best.pt"
prompt = "کٲشِرۍ پَلو چھِ اکثر خٕطہٕ کِس تٲریٖخٕچ دٔلیٖل ونان، یُس مُختٔلِف ثقافتَن ہٕنٛدۍ اثرات ظٲہِر کران"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and its config from the checkpoint.
model, cfg = load_model(ckpt, device)

# Encode the prompt to token ids and add a batch dimension.
idx = torch.tensor([encode_prompt(prompt, cfg)], dtype=torch.long, device=device)

# Sample up to 200 new tokens with the same settings as the CLI example.
out = model.generate(
    idx,
    max_new_tokens=200,
    temperature=0.8,
    top_k=50,
    eos_id=EOS_ID,
)

# Decode the first (and only) sequence in the batch back to text.
print(decode_tokens(out[0].tolist()))
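
The snippet above only works if the ksbyte package is importable. In a notebook or script where setting PYTHONPATH is inconvenient, one alternative is to extend sys.path before the import. add_source_to_path is a hypothetical helper and assumes the package lives under source/ in the repository root, as described in the file list.

```python
import sys
from pathlib import Path

def add_source_to_path(repo_root="."):
    """Put <repo_root>/source at the front of sys.path so `import ksbyte` can resolve."""
    source_dir = str((Path(repo_root) / "source").resolve())
    if source_dir not in sys.path:
        sys.path.insert(0, source_dir)
    return source_dir

# Run this from the repository root, before importing ksbyte.
added = add_source_to_path(".")
print(added)
```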

Troubleshooting:

If you see ModuleNotFoundError: No module named 'ksbyte', run export PYTHONPATH="$PWD/source:$PYTHONPATH" from the repo root.
If CUDA is unavailable, pass --device cpu or keep --device auto.
For more conservative generation, lower --temperature, for example to 0.7.
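
To see why a lower temperature is more conservative, the following sketch applies temperature scaling and top-k filtering to a toy list of scores in plain Python. It illustrates the general technique behind the --temperature and --top_k flags; the function names and logits are made up, and the actual sampling code in ksbyte may differ in detail.

```python
import math
import random

def top_k_temperature_probs(logits, temperature=0.8, top_k=50):
    """Return sampling probabilities after temperature scaling and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    scaled = [x / temperature for x in logits]
    # Keep only the k highest-scoring ids; everything else gets probability 0.
    k = min(top_k, len(scaled))
    keep = set(sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k])
    peak = max(scaled[i] for i in keep)  # subtract the max for numerical stability
    weights = [math.exp(scaled[i] - peak) if i in keep else 0.0
               for i in range(len(scaled))]
    total = sum(weights)
    return [w / total for w in weights]

def sample_id(logits, temperature=0.8, top_k=50, rng=random):
    """Draw one id according to the filtered, rescaled distribution."""
    probs = top_k_temperature_probs(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate bytes
p_default = top_k_temperature_probs(toy_logits, temperature=1.0, top_k=3)
p_cooler = top_k_temperature_probs(toy_logits, temperature=0.7, top_k=3)
print(p_default)
print(p_cooler)
```

Dividing by a temperature below 1 widens the gaps between scores, so the top candidate receives more probability mass, while top-k removes the low-scoring tail entirely.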

Checkpoint notes

Use checkpoints/best.pt for generation and evaluation. checkpoints/latest.pt is the final early-stopped checkpoint; its validation BPB (0.9862) is slightly worse than that of best.pt (0.9593).

Caveats

This is a small experimental byte-level LM. It produces Kashmiri-looking text and has learned script/byte structure, but generations can be incomplete, hallucinated, or semantically weak.

License

No explicit open-source license is declared for this experimental release. Contact the repository owner for reuse permissions.


Evaluation results

All values are self-reported. Metrics are measured on the validation set of the Kashmiri pretraining corpus; token counts describe the cached splits.

Best validation bits-per-byte: 0.959
Final validation bits-per-byte: 0.986
Final validation cross entropy: 0.684
Validation next-byte top-1 accuracy: 0.764
Final validation word perplexity estimate: 849.0
Training byte tokens: 45,362,173
Validation byte tokens: 1,622,371
Test byte tokens: 3,074,698

Duplicated from Omarrran/ks-byte-lm-spacebyte-v1
