ks_byte_lm SpaceByte v1

Small byte-level causal language model trained from scratch on a Kashmiri pretraining corpus.

This repository is a full project/checkpoint release: source code, Modal training artifacts, metrics, and training checkpoints are included.

Contents

  • checkpoints/best.pt — best validation checkpoint; recommended for generation.
  • checkpoints/latest.pt — final early-stopped checkpoint.
  • checkpoints/ckpt_*.pt — retained periodic checkpoints.
  • artifacts/final_report.json — final training report.
  • artifacts/train.log — full training log.
  • artifacts/accuracy_val_best.json and accuracy_val_latest.json — held-out next-byte accuracy estimates.
  • artifacts/data_meta.json — cached byte-shard metadata.
  • source/ — full ks_byte_lm project source used for the run.

Key metrics

Validation next-byte accuracy using best.pt: 76.42% on 3,276,800 evaluated byte tokens.

Training summary:

  • Best validation BPB: 0.9593
  • Final validation BPB: 0.9862
  • Final CE: 0.6836 nats per byte (0.6836 / ln 2 ≈ 0.9862 BPB)
  • Final word PPL estimate: 849.0
  • Stopped early at step 4751 of 5000

Dataset/cache metadata:

  • Train byte tokens: 45,362,173
  • Validation byte tokens: 1,622,371
  • Test byte tokens: 3,074,698
  • Words: 5,074,066
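The reported metrics are related by simple conversions, which the sketch below illustrates. It assumes the CE is measured in nats per byte and that the word count covers all three splits; the exact formula used by the training code is not stated here, so treat the word-perplexity step as an assumption. Under those assumptions the result lands close to the reported 849.0, with small differences expected because the CE is rounded to four decimals.

```python
import math

# Values copied from the metrics above.
FINAL_CE_NATS = 0.6836
TOTAL_BYTES = 45_362_173 + 1_622_371 + 3_074_698  # train + validation + test
TOTAL_WORDS = 5_074_066


def bits_per_byte(ce_nats: float) -> float:
    """Convert cross-entropy in nats per byte to bits per byte (BPB)."""
    return ce_nats / math.log(2)


def word_perplexity(ce_nats: float, bytes_per_word: float) -> float:
    """Estimate word-level perplexity from per-byte cross-entropy.

    Assumption: the per-word loss is the per-byte loss times the
    average number of bytes per word.
    """
    return math.exp(ce_nats * bytes_per_word)


bpb = bits_per_byte(FINAL_CE_NATS)
ppl = word_perplexity(FINAL_CE_NATS, TOTAL_BYTES / TOTAL_WORDS)
print(f"BPB: {bpb:.4f}")
print(f"Word PPL estimate: {ppl:.1f}")
```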

Usage

This is not a standard Transformers-format model. Do not use AutoModel.from_pretrained. It is a custom PyTorch checkpoint release for the included ksbyte source code.

Recommended checkpoint:

checkpoints/best.pt

best.pt is preferred over latest.pt because it has the lowest validation BPB (0.9593, versus 0.9862 for latest.pt).

1. Download and install

If the repository is private, first log in with an account that has access:

pip install -U huggingface_hub
huggingface-cli login

Download the full snapshot:

python - <<'PY'
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Omarrran/ks-byte-lm-spacebyte-v1",
    repo_type="model",
    local_dir="ks-byte-lm-spacebyte-v1",
)
PY
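After the download finishes, a quick check that the expected files are present can save debugging time later. The sketch below uses only the paths named in this README; the helper name missing_paths is hypothetical and not part of the ksbyte package.

```python
from pathlib import Path

# Paths taken from the Contents list and the install step in this README.
EXPECTED = [
    "checkpoints/best.pt",
    "checkpoints/latest.pt",
    "artifacts/final_report.json",
    "artifacts/train.log",
    "artifacts/data_meta.json",
    "source",
    "requirements.txt",
]


def missing_paths(root: str) -> list[str]:
    """Return the expected relative paths that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths("ks-byte-lm-spacebyte-v1")
    print("All expected files found." if not missing else f"Missing: {missing}")
```

An empty result means the snapshot contains every file this README refers to.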

Install dependencies:

cd ks-byte-lm-spacebyte-v1
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

2. Generate text

Make the bundled source package importable and run generation:

export PYTHONPATH="$PWD/source:$PYTHONPATH"

PYTHONUTF8=1 python -m ksbyte.generate \
  --ckpt checkpoints/best.pt \
  --prompt "کٲشِرۍ پَلو چھِ اکثر خٕطہٕ کِس تٲریٖخٕچ دٔلیٖل ونان، یُس مُختٔلِف ثقافتَن ہٕنٛدۍ اثرات ظٲہِر کران" \
  --n 200 \
  --temperature 0.8 \
  --top_k 50 \
  --device auto

--device auto uses CUDA when available and falls back to CPU otherwise.

Example verified output from the trained checkpoint:

 کٲشِرۍ پَلو چھِ اکثر خٕطہٕ کِس تٲریٖخٕچ دٔلیٖل ونان، یُس مُختٔلِف ثقافتَن ہٕنٛدۍ اثرات ظٲہِر کران 

3. Minimal Python API

import torch
from ksbyte.generate import load_model, encode_prompt, decode_tokens
from ksbyte.config import EOS_ID

ckpt = "checkpoints/best.pt"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model, cfg = load_model(ckpt, device)

prompt = "کٲشِرۍ پَلو چھِ اکثر خٕطہٕ کِس تٲریٖخٕچ دٔلیٖل ونان، یُس مُختٔلِف ثقافتَن ہٕنٛدۍ اثرات ظٲہِر کران"
# Encode the prompt and add a batch dimension of 1.
idx = torch.tensor([encode_prompt(prompt, cfg)], dtype=torch.long, device=device)

out = model.generate(
    idx,
    max_new_tokens=200,
    temperature=0.8,
    top_k=50,
    eos_id=EOS_ID,
)
print(decode_tokens(out[0].tolist()))
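Because the model works on bytes, token counts are byte counts, and Kashmiri text in Perso-Arabic script takes about two UTF-8 bytes per character. A budget of 200 new tokens therefore yields far fewer than 200 characters, and sampling can stop in the middle of a character. The sketch below illustrates both points with plain UTF-8 encoding. It does not call the ksbyte API, and it assumes token ids are raw byte values (0 to 255); the real encode_prompt and decode_tokens may add or strip special tokens.

```python
def utf8_stats(text: str) -> tuple[int, int]:
    """Return (number of characters, number of UTF-8 bytes) for `text`."""
    return len(text), len(text.encode("utf-8"))


def decode_lenient(byte_ids: list[int]) -> str:
    """Decode byte values to text, dropping bytes that do not form a
    complete UTF-8 character (for example, a truncated final character)."""
    return bytes(byte_ids).decode("utf-8", errors="ignore")


prompt = "کٲشِرۍ پَلو"
n_chars, n_bytes = utf8_stats(prompt)
print(f"{n_chars} characters -> {n_bytes} bytes")

# Simulate generation that stops one byte short of a full character.
raw = list(prompt.encode("utf-8"))
print(decode_lenient(raw[:-1]))
```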

Troubleshooting:

  • If you see ModuleNotFoundError: No module named 'ksbyte', run export PYTHONPATH="$PWD/source:$PYTHONPATH" from the repo root.
  • If CUDA is unavailable, use --device cpu or keep --device auto.
  • For more conservative generation, lower --temperature (for example, to 0.7).
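To see why a lower temperature gives more conservative text, here is a minimal sketch of temperature scaling combined with top-k filtering, written in pure Python. It illustrates the general technique only; the function name sample_top_k is hypothetical, and the sampling code inside ksbyte may differ in its details.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Sample one index from `logits` using temperature and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")

    # Keep only the k highest-scoring candidates.
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Dividing by a temperature below 1 widens the gaps between scores,
    # so the most likely candidates are chosen even more often.
    scaled = [logits[i] / temperature for i in top]

    # Softmax weights, shifted by the maximum for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    example_logits = [0.1, 5.0, 0.3, 2.0]
    print(sample_top_k(example_logits, temperature=0.7, top_k=2))
```

With top_k=1 the function always returns the highest-scoring index, which is the most conservative setting of all.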

Checkpoint notes

Use checkpoints/best.pt for generation and evaluation. checkpoints/latest.pt is the final early-stopped checkpoint and is slightly weaker than best.pt.

Caveats

This is a small experimental byte-level LM. It produces Kashmiri-looking text and has learned script/byte structure, but generations can be incomplete, hallucinated, or semantically weak.

License

No explicit open-source license is declared for this experimental release. Contact the repository owner for reuse permissions.
