ks_byte_lm SpaceByte v1
Small byte-level causal language model trained from scratch on a Kashmiri pretraining corpus.
This repository is a full project/checkpoint release: source code, Modal training artifacts, metrics, and training checkpoints are included.
Contents
- checkpoints/best.pt: best validation checkpoint; recommended for generation.
- checkpoints/latest.pt: final early-stopped checkpoint.
- checkpoints/ckpt_*.pt: retained periodic checkpoints.
- artifacts/final_report.json: final training report.
- artifacts/train.log: full training log.
- artifacts/accuracy_val_best.json and accuracy_val_latest.json: held-out next-byte accuracy estimates.
- artifacts/data_meta.json: cached byte-shard metadata.
- source/: full ks_byte_lm project source used for the run.
Key metrics
Validation next-byte accuracy using best.pt: 76.42% on 3,276,800 evaluated byte tokens.
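For readers unfamiliar with the metric, next-byte top-1 accuracy is the fraction of positions where the model's highest-scoring token equals the true next byte. The following is a minimal pure-Python sketch of that definition on toy data; the function name `next_byte_accuracy` and the 4-token vocabulary are illustrative assumptions, not the project's evaluation code.

```python
from typing import Sequence


def next_byte_accuracy(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Fraction of positions where the argmax of the logits equals the target."""
    if len(logits) != len(targets):
        raise ValueError("logits and targets must have the same length")
    if not targets:
        raise ValueError("need at least one position")
    correct = 0
    for row, target in zip(logits, targets):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax over the vocabulary
        correct += predicted == target
    return correct / len(targets)


# Toy example (hypothetical data): 4-token vocabulary, three positions.
toy_logits = [
    [0.1, 2.0, -1.0, 0.3],  # argmax = 1
    [1.5, 0.2, 0.1, 0.0],   # argmax = 0
    [0.0, 0.1, 0.2, 3.0],   # argmax = 3
]
toy_targets = [1, 2, 3]
accuracy = next_byte_accuracy(toy_logits, toy_targets)
print(f"{accuracy:.4f}")
```

Two of the three toy predictions match their targets, so the function returns 2/3.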
Training summary:
- Best validation BPB: 0.9593
- Final validation BPB: 0.9862
- Final CE: 0.6836
- Final word PPL estimate: 849.0
- Early stopped at step 4751 / 5000
Dataset/cache metadata:
- Train byte tokens: 45,362,173
- Validation byte tokens: 1,622,371
- Test byte tokens: 3,074,698
- Words: 5,074,066
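The reported figures appear to be related by standard conversions: bits-per-byte times ln 2 gives cross-entropy in nats per byte, and exponentiating nats per byte times the average bytes per word gives a word-level perplexity estimate. The sketch below checks this arithmetic using the numbers above. It assumes the word count covers all three splits combined, which the source does not state; under that assumption the results land close to the reported 0.6836 and 849.0, but this is not necessarily how the training code computes them.

```python
import math

# Figures copied from the metrics above.
final_bpb = 0.9862
train_bytes = 45_362_173
val_bytes = 1_622_371
test_bytes = 3_074_698
words = 5_074_066

# Bits per byte -> cross-entropy in nats per byte.
final_ce = final_bpb * math.log(2)

# ASSUMPTION: the word count covers train + validation + test together.
bytes_per_word = (train_bytes + val_bytes + test_bytes) / words

# Word-level perplexity estimate: exp(nats per byte * bytes per word).
word_ppl = math.exp(final_ce * bytes_per_word)

print(f"CE  = {final_ce:.4f} nats/byte")
print(f"B/W = {bytes_per_word:.3f} bytes/word")
print(f"PPL = {word_ppl:.1f}")
```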
Usage
This is not a standard Transformers-format model. Do not use
AutoModel.from_pretrained. It is a custom PyTorch checkpoint release for the
included ksbyte source code.
Recommended checkpoint:
checkpoints/best.pt
best.pt is preferred over latest.pt because it had the best validation BPB.
1. Download and install
If the repository is private, first log in with an account that has access:
pip install -U huggingface_hub
huggingface-cli login
Download the full snapshot:
python - <<'PY'
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="Omarrran/ks-byte-lm-spacebyte-v1",
    repo_type="model",
    local_dir="ks-byte-lm-spacebyte-v1",
    local_dir_use_symlinks=False,
)
PY
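If the full snapshot with all periodic checkpoints is larger than needed, `snapshot_download` accepts an `allow_patterns` argument that restricts which files are fetched. The sketch below is one possible variant; the pattern list is an assumption about what generation needs (the recommended checkpoint, the bundled source, and `requirements.txt`) and has not been verified against the repository layout beyond the paths named in this README.

```python
# ASSUMPTION: these are the only files needed for generation with best.pt.
ALLOW_PATTERNS = [
    "checkpoints/best.pt",
    "source/*",
    "requirements.txt",
]


def download_minimal(local_dir: str = "ks-byte-lm-spacebyte-v1") -> str:
    """Download only the files matched by ALLOW_PATTERNS; returns the local path."""
    # Imported here so the module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="Omarrran/ks-byte-lm-spacebyte-v1",
        repo_type="model",
        local_dir=local_dir,
        allow_patterns=ALLOW_PATTERNS,
    )
```

Calling `download_minimal()` performs the download; add `"artifacts/*"` to the list if the training log and reports are also wanted.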
Install dependencies:
cd ks-byte-lm-spacebyte-v1
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
2. Generate text
Make the bundled source package importable and run generation:
export PYTHONPATH="$PWD/source:$PYTHONPATH"
PYTHONUTF8=1 python -m ksbyte.generate \
--ckpt checkpoints/best.pt \
--prompt "کٲشِرۍ پَلو چھِ اکثر خٕطہٕ کِس تٲریٖخٕچ دٔلیٖل ونان، یُس مُختٔلِف ثقافتَن ہٕنٛدۍ اثرات ظٲہِر کران" \
--n 200 \
--temperature 0.8 \
--top_k 50 \
--device auto
--device auto uses CUDA when available and falls back to CPU otherwise.
Example verified output from the trained checkpoint:
کٲشِرۍ پَلو چھِ اکثر خٕطہٕ کِس تٲریٖخٕچ دٔلیٖل ونان، یُس مُختٔلِف ثقافتَن ہٕنٛدۍ اثرات ظٲہِر کران
3. Minimal Python API
import torch
from ksbyte.generate import load_model, encode_prompt, decode_tokens
from ksbyte.config import EOS_ID
ckpt = "checkpoints/best.pt"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, cfg = load_model(ckpt, device)
idx = torch.tensor([encode_prompt("کٲشِرۍ پَلو چھِ اکثر خٕطہٕ کِس تٲریٖخٕچ دٔلیٖل ونان، یُس مُختٔلِف ثقافتَن ہٕنٛدۍ اثرات ظٲہِر کران", cfg)], dtype=torch.long, device=device)
out = model.generate(
    idx,
    max_new_tokens=200,
    temperature=0.8,
    top_k=50,
    eos_id=EOS_ID,
)
print(decode_tokens(out[0].tolist()))
Troubleshooting:
- If you see ModuleNotFoundError: No module named 'ksbyte', run export PYTHONPATH="$PWD/source:$PYTHONPATH" from the repo root.
- If CUDA is unavailable, use --device cpu or keep --device auto.
- Lower --temperature (for example, to 0.7) for more conservative generation.
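To see why a lower temperature gives more conservative output, it can help to look at how temperature and top-k reshape a next-token distribution. The following pure-Python sketch shows the usual formulation (keep the k largest logits, divide by the temperature, apply softmax). The function names `sampling_probs` and `sample` are illustrative; the bundled `model.generate` may differ in its details.

```python
import math
import random
from typing import List, Optional, Sequence


def sampling_probs(logits: Sequence[float], temperature: float = 0.8, top_k: int = 50) -> List[float]:
    """Softmax over the top_k largest logits after dividing by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    k = min(top_k, len(logits))
    # Indices of the k largest logits; every other token gets probability 0.
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = {i: logits[i] / temperature for i in keep}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {i: math.exp(v - peak) for i, v in scaled.items()}
    total = sum(weights.values())
    return [weights.get(i, 0.0) / total for i in range(len(logits))]


def sample(logits: Sequence[float], temperature: float = 0.8, top_k: int = 50,
           rng: Optional[random.Random] = None) -> int:
    """Draw one token index from the filtered, temperature-scaled distribution."""
    rng = rng or random.Random()
    probs = sampling_probs(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy logits (hypothetical): compare temperature 1.0 with 0.7 at top_k=3.
toy_logits = [2.0, 1.0, 0.5, -1.0]
warm = sampling_probs(toy_logits, temperature=1.0, top_k=3)
cool = sampling_probs(toy_logits, temperature=0.7, top_k=3)
print([round(p, 3) for p in warm])
print([round(p, 3) for p in cool])
```

At the lower temperature the most likely token receives a larger share of the probability mass, and the token outside the top 3 is never sampled in either case.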
Checkpoint notes
Use checkpoints/best.pt for generation and evaluation. checkpoints/latest.pt is the final early-stopped checkpoint and is slightly weaker than best.pt.
Caveats
This is a small experimental byte-level LM. It produces Kashmiri-looking text and has learned script/byte structure, but generations can be incomplete, hallucinated, or semantically weak.
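One concrete source of incomplete generations in any byte-level model is that a Kashmiri character in Perso-Arabic script occupies more than one UTF-8 byte, so a byte sequence cut at a length limit can end in the middle of a character. The sketch below illustrates this with the Python standard library only. It does not use the project's `encode_prompt` or `decode_tokens`, which may add special tokens or handle invalid bytes differently.

```python
text = "کٲشِرۍ"  # first word of the example prompt

raw = text.encode("utf-8")
byte_ids = list(raw)  # one integer in 0..255 per byte
print(len(text), "code points ->", len(byte_ids), "bytes")

# Dropping the final byte, as a length limit on generation can do,
# leaves an incomplete UTF-8 sequence at the end.
truncated = raw[:-1]

try:
    truncated.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError as err:
    strict_ok = False
    print("strict decoding fails:", err.reason)

# errors="replace" substitutes U+FFFD for the broken tail instead of raising.
safe = truncated.decode("utf-8", errors="replace")
print(safe.endswith("\ufffd"))
```

Decoding with `errors="replace"` (or `errors="ignore"`) is a common way to display partial output without raising an exception.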
License
No explicit open-source license is declared for this experimental release. Contact the repository owner for reuse permissions.
Evaluation results
- Best validation bits-per-byte (Kashmiri pretraining corpus, validation set, self-reported): 0.959
- Final validation bits-per-byte (Kashmiri pretraining corpus, validation set, self-reported): 0.986
- Final validation cross entropy (Kashmiri pretraining corpus, validation set, self-reported): 0.684
- Validation next-byte top-1 accuracy (Kashmiri pretraining corpus, validation set, self-reported): 0.764
- Final validation word perplexity estimate (Kashmiri pretraining corpus, validation set, self-reported): 849.0
- Training byte tokens (Kashmiri pretraining corpus, self-reported): 45,362,173
- Validation byte tokens (Kashmiri pretraining corpus, self-reported): 1,622,371
- Test byte tokens (Kashmiri pretraining corpus, self-reported): 3,074,698