Rewrite model card with architecture, training stats, and usage

README.md CHANGED
@@ -14,77 +14,171 @@ pipeline_tag: text-to-image

- BitPixelLM is a
- - `train_bitnet.py` — training pipeline
- - `generate.py` — CLI generation
- - `app.py` — Gradio app
- - `data/generate_v3.py` — v3 dataset generator
- - `PixelArtGen_Colab.ipynb` — Colab training notebook
- 3. Train:
- - `huggingface_hub` installed
- - Legacy FP32 `PixelLM` artifacts remain in the repo only for historical reference.

# BitPixelLM

BitPixelLM is a small autoregressive language model trained to generate **32×32 pixel art** from short text prompts.
It was built entirely from scratch as a research and learning project, with no pre-trained weights or external image encoders.

> **Example prompts:** `a red pixel art sword` · `a blue pixel art knight` · `a green pixel art dragon`

---

## Model Architecture

BitPixelLM is a text-conditioned autoregressive decoder based on [BitNet b1.58](https://arxiv.org/abs/2402.17764).
Instead of full-precision weights, the decoder uses **ternary weights {−1, 0, +1}**, making it extremely memory-efficient (about 1.58 bits per weight).

| Component | Details |
|---|---|
| Parameters | 7,382,274 (~7.4M total) |
| Decoder layers | 6 (BitNet b1.58 — ternary weights) |
| Text encoder layers | 3 (standard FP32 transformer) |
| Model dimension | 256 |
| Attention heads | 8 |
| Feed-forward dim | 512 |
| Weight format | ~75% ternary (1.58 bits/weight), ~25% FP32 |
| Output | 32×32 RGB pixel art (256-color palette) |

**Key design choices:**
- **BitLinear b1.58**: decoder attention and feed-forward weights are quantized to {−1, 0, +1} during the forward pass via absmean quantization, `round(clip(W / mean(|W|), −1, +1))`. Embeddings, norms, and the text encoder remain FP32 (see the sketch after this list).
- **RMSNorm** instead of LayerNorm (pre-norm architecture).
- **SwiGLU** activation in feed-forward blocks.
- **2D positional encoding**: the 32×32 pixel grid is encoded with separate row/column sinusoidal embeddings, injected at each decoder layer.
- **Cross-attention**: the decoder attends to text encoder outputs at every layer.
- **Palette tokenization**: each pixel is one discrete token from a learned 256-color palette. Generation is autoregressive over 1,024 pixel tokens (32×32) per image.
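
For intuition, here is a minimal sketch of a BitLinear-style layer in plain PyTorch. It follows the generic BitNet b1.58 recipe (absmean ternary weight quantization with a straight-through estimator) rather than this repo's actual `BitLinear` module, so the class name and details are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinearSketch(nn.Linear):
    """Illustrative BitNet b1.58-style linear layer (not the repo's code).

    Weights are quantized to {-1, 0, +1} on the fly with absmean scaling;
    a straight-through estimator keeps the FP32 latent weights trainable.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Absmean scale: gamma = mean(|W|), per the BitNet b1.58 paper.
        gamma = w.abs().mean().clamp(min=1e-5)
        # Ternary quantization: round(clip(W / gamma, -1, +1)).
        w_q = (w / gamma).round().clamp(-1, 1)
        # Straight-through estimator: forward uses the quantized weights,
        # backward passes gradients to the latent FP32 weights.
        w_ste = w + (w_q * gamma - w).detach()
        return F.linear(x, w_ste, self.bias)

layer = BitLinearSketch(256, 512)
out = layer(torch.randn(1, 256))  # shape: (1, 512)
```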

---

## Training

The model was trained on a fully synthetic, procedurally generated dataset of labeled 32×32 pixel art sprites.

| Dataset | Samples | Categories | Vocabulary |
|---|---|---|---|
| v3 (current) | 23,648 | 199 | 222 words |

Categories include characters, creatures, items, terrain, food, vehicles, buildings, nature, UI icons, furniture, and instruments.
Each image has a short descriptive label (e.g. `a red pixel art sword`) used as the text conditioning signal.

**Training configuration:**

| Setting | Value |
|---|---|
| Epochs | 60 |
| Batch size | 32 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.01) |
| Learning rate | 5×10⁻⁴ with cosine annealing + 500-step warmup |
| Hardware | NVIDIA RTX 3080 (10 GB VRAM) |
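
As a reference point, the optimizer and schedule in the table map onto stock PyTorch roughly as follows. This is a hedged sketch, not the repo's `train_bitnet.py`; the step count assumes 23,648 samples at batch size 32 (739 steps per epoch).

```python
import math
import torch

model = torch.nn.Linear(8, 8)  # placeholder; use the BitPixelLM instance here

optimizer = torch.optim.AdamW(
    model.parameters(), lr=5e-4, betas=(0.9, 0.95), weight_decay=0.01
)

warmup_steps = 500
total_steps = 60 * 739  # 60 epochs × (23,648 / 32) steps per epoch

def lr_lambda(step: int) -> float:
    # Linear warmup for the first 500 steps, then cosine annealing to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```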

**Results (v3 dataset, best checkpoint at epoch 32):**

| Metric | Value |
|---|---|
| Best validation loss | 0.4015 |
| Perplexity | ~1.49 |
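
The two numbers are consistent: with a mean cross-entropy loss in nats, perplexity is `exp(loss)`, and exp(0.4015) ≈ 1.49, i.e. the model is on average nearly certain of each next pixel token.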

---

## Usage

### Requirements

```
torch
numpy
Pillow
```
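
All three are available from PyPI, e.g. `pip install torch numpy Pillow`.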

### Load and generate

```python
import json, torch
from PIL import Image
from model.tokenizer import PaletteTokenizer
from model.text_encoder import TextTokenizer, TextEncoder
from model.bit_pixel_decoder import BitPixelLMDecoder, BitPixelLM

# Load tokenizers
palette_tok = PaletteTokenizer(palette_path="palette_256.npy")
with open("vocab.json") as f:
    vocab = json.load(f)
text_tok = TextTokenizer(vocab)

# Build model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
text_encoder = TextEncoder(
    vocab_size=text_tok.vocab_size, d_model=256, nhead=8,
    num_layers=3, dim_feedforward=512, max_seq_len=32,
)
pixel_decoder = BitPixelLMDecoder(
    vocab_size=palette_tok.vocab_size, d_model=256, nhead=8,
    num_layers=6, dim_feedforward=512, img_size=32,
)
model = BitPixelLM(text_encoder, pixel_decoder).to(device)

# Load weights
ckpt = torch.load("best.pt", map_location=device, weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# Generate
prompt = "a red pixel art sword"
text_tokens = text_tok.encode(prompt).unsqueeze(0).to(device)
with torch.no_grad():
    generated = model.generate(
        text_tokens,
        sos_token=palette_tok.sos_token,
        eos_token=palette_tok.eos_token,
        temperature=0.8,
        top_k=40,
        top_p=0.9,
    )

# Decode to image
img_array = palette_tok.decode_tokens(generated[0].cpu().tolist())
img = Image.fromarray(img_array, "RGB").resize((256, 256), Image.NEAREST)
img.save("output.png")
```

### Vocabulary

The model knows 222 words covering colors, object types, and descriptors. Prompts that follow the training label pattern (`a <color> pixel art <object>`) work best. Words outside the vocabulary are silently mapped to `<unk>`.

Sample supported words:
`red`, `blue`, `green`, `yellow`, `orange`, `purple`, `gold`, `dark`, `teal`, `silver` ·
`sword`, `shield`, `bow`, `axe`, `staff`, `wand`, `armour` ·
`knight`, `wizard`, `archer`, `dragon`, `goblin`, `skeleton`, `ghost`, `vampire` ·
`castle`, `tree`, `flower`, `mushroom`, `chest`, `potion`, `gem`, `key`, `crown`, `ship`, `horse`, and more.
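
To make the `<unk>` behavior concrete, here is a purely illustrative toy tokenizer (not the repo's `TextTokenizer`, which is built from `vocab.json`) showing how out-of-vocabulary words collapse to a single unknown id:

```python
# Toy example only; vocabulary and ids are made up for illustration.
vocab = {"<unk>": 0, "a": 1, "red": 2, "pixel": 3, "art": 4, "sword": 5}

def encode(prompt: str) -> list[int]:
    # Any word missing from the vocabulary maps to the <unk> id.
    return [vocab.get(word, vocab["<unk>"]) for word in prompt.lower().split()]

print(encode("a red pixel art sword"))     # [1, 2, 3, 4, 5]
print(encode("a crimson pixel art sword")) # [1, 0, 3, 4, 5] ("crimson" -> <unk>)
```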

---

## Limitations

- Outputs are 32×32 pixels. Upscale with `Image.NEAREST` to preserve the pixel art look.
- Trained entirely on procedurally generated synthetic data — no real-world artwork knowledge.
- Generation quality is best for prompts close to the training label patterns.
- Color fidelity is bounded by the 256-color learned palette.

---

## Citations

```bibtex
@article{wang2023bitnet,
  title={BitNet: Scaling 1-bit Transformers for Large Language Models},
  author={Wang, Hongyu and others},
  journal={arXiv preprint arXiv:2310.11453},
  year={2023}
}

@article{ma2024bitnet158,
  title={The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits},
  author={Ma, Shuming and others},
  journal={arXiv preprint arXiv:2402.17764},
  year={2024}
}
```

---

## License

MIT