BlakePeavy committed
Commit a9e7f27 · verified · 1 Parent(s): a271056

Rewrite model card with architecture, training stats, and usage

Files changed (1)
  1. README.md +145 -51
README.md CHANGED
@@ -14,77 +14,171 @@ pipeline_tag: text-to-image
# BitPixelLM

- BitPixelLM is a text-to-pixel-art language model that generates 32x32 images from prompts like `a red pixel art sword`.
-
- It uses a BitNet b1.58-style ternary decoder (`-1, 0, +1`) with a lightweight text encoder.
-
- ## Current Model Snapshot
-
- - Model name: **BitPixelLM**
- - Architecture: 3-layer text encoder + 6-layer BitPixelLM decoder
- - Parameters: ~7.3M
- - Dataset (v3): 23,648 synthetic pixel-art samples
- - Vocab: 222 words
- - Best validation loss (v3): ~0.4015
-
- ## Project Layout
-
- - `model/bit_pixel_decoder.py` — BitPixelLM model
- - `train_bitnet.py` — training pipeline
- - `generate.py` — CLI generation
- - `app.py` — Gradio app
- - `data/generate_v3.py` — v3 dataset generator
- - `PixelArtGen_Colab.ipynb` — Colab training notebook
-
- ## Run Locally
-
- 1. Ensure Python 3.9 + CUDA-enabled PyTorch.
- 2. Place data in `D:\PixelArtGen_Data\processed`:
-    - `tokens.npy`, `labels.json`, `vocab.json`, `palette_256.npy`
- 3. Train:
-
- ```bash
- python train_bitnet.py --epochs 60 --batch-size 32 --lr 5e-4
- ```
-
- 4. Launch app:
-
- ```bash
- python app.py
- ```
-
- ## Publish to Hugging Face
-
- This repo includes `publish_hf.py` for one-step upload.
-
- ### Required
-
- - Hugging Face token with write access (`HF_TOKEN`)
- - `huggingface_hub` installed
-
- ### Command
-
- ```bash
- pip install huggingface_hub
- python publish_hf.py --repo-id YOUR_USERNAME/BitPixelLM --token $HF_TOKEN
- ```
-
- On Windows PowerShell:
-
- ```powershell
- $env:HF_TOKEN = "hf_xxx"
- python publish_hf.py --repo-id YOUR_USERNAME/BitPixelLM --token $env:HF_TOKEN
- ```
-
- This uploads:
-
- - `checkpoints_bit/best.pt`
- - `model/` Python files
- - `generate.py`
- - `app.py`
- - `README.md` (model card / usage overview)
-
- ## Notes
-
- - The active production model is **BitPixelLM**.
- - Legacy FP32 `PixelLM` artifacts remain in the repo only for historical reference.
+ BitPixelLM is a small autoregressive language model trained to generate **32×32 pixel art** from short text prompts.
+ It was built entirely from scratch as a research and learning project, with no pre-trained weights or external image encoders.
+
+ > **Example prompts:** `a red pixel art sword` · `a blue pixel art knight` · `a green pixel art dragon`
+
+ ---
+
+ ## Model Architecture
+
+ BitPixelLM is a text-conditioned autoregressive decoder based on [BitNet b1.58](https://arxiv.org/abs/2402.17764).
+ Instead of full-precision weights, the decoder uses **ternary weights {−1, 0, +1}**, making it extremely parameter-efficient.
+
+ | Component | Details |
+ |---|---|
+ | Parameters | 7,382,274 (~7.4M total) |
+ | Decoder layers | 6 (BitNet b1.58 — ternary weights) |
+ | Text encoder layers | 3 (standard FP32 transformer) |
+ | Model dimension | 256 |
+ | Attention heads | 8 |
+ | Feed-forward dim | 512 |
+ | Weight format | ~75% ternary (1.58 bits/weight), ~25% FP32 |
+ | Output | 32×32 RGB pixel art (256-color palette) |
+
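+ Back-of-envelope, that weight split puts the ternary portion (~75% of 7.38M weights at 1.58 bits) near 1.1 MB and the FP32 remainder (~25% at 32 bits) near 7.4 MB, so roughly 8.5 MB in total versus ~29.5 MB for an all-FP32 model of the same size.
+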
+ **Key design choices:**
+
+ - **BitLinear b1.58**: decoder attention and feed-forward weights are quantized to {−1, 0, +1} during the forward pass via `sign(W / mean(|W|))`. Embeddings, norms, and the text encoder remain FP32 (see the sketch after this list).
+ - **RMSNorm** instead of LayerNorm (pre-norm architecture).
+ - **SwiGLU** activation in feed-forward blocks.
+ - **2D positional encoding**: the 32×32 pixel grid is encoded with separate row/column sinusoidal embeddings, injected at each decoder layer.
+ - **Cross-attention**: the decoder attends to text encoder outputs at every layer.
+ - **Palette tokenization**: each pixel is one discrete token from a learned 256-color palette. Generation is autoregressive over 1,024 pixel tokens (32×32) per image.
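+
+ For illustration, here is a minimal sketch of such a ternary forward pass. The class name `BitLinear`, the straight-through estimator, and the rescaling by `mean(|W|)` are assumptions for this sketch; the repo's actual implementation lives in `model/bit_pixel_decoder.py` and may differ in detail.
+
+ ```python
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ class BitLinear(nn.Linear):
+     """Hypothetical sketch: linear layer with ternary {-1, 0, +1} weights."""
+     def forward(self, x):
+         scale = self.weight.abs().mean().clamp(min=1e-5)  # mean(|W|), per tensor
+         w_q = torch.sign(self.weight / scale)             # ternary sign(W / mean(|W|))
+         # Straight-through estimator: ternary weights in the forward pass,
+         # full-precision gradients flow to self.weight in the backward pass.
+         w = self.weight + (w_q * scale - self.weight).detach()
+         return F.linear(x, w, self.bias)
+ ```
+
+ Keeping embeddings, norms, and the text encoder in FP32, as noted above, is what yields the ~75%/25% ternary/FP32 split in the table.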
+
+ ---
+
+ ## Training
+
+ The model was trained on a fully synthetic, procedurally generated dataset of labeled 32×32 pixel-art sprites.
+
+ | Dataset | Samples | Categories | Vocabulary |
+ |---|---|---|---|
+ | v3 (current) | 23,648 | 199 | 222 words |
+
+ Categories include characters, creatures, items, terrain, food, vehicles, buildings, nature, UI icons, furniture, and instruments.
+ Each image has a short descriptive label (e.g. `a red pixel art sword`) used as the text-conditioning signal.
+
+ **Training configuration:**
+
+ | Setting | Value |
+ |---|---|
+ | Epochs | 60 |
+ | Batch size | 32 |
+ | Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.01) |
+ | Learning rate | 5×10⁻⁴ with cosine annealing + 500-step warmup |
+ | Hardware | NVIDIA RTX 3080 (10 GB VRAM) |
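+
+ A minimal PyTorch sketch of that optimizer and schedule (assuming `model` and a precomputed `total_steps`; the actual `train_bitnet.py` may wire this up differently):
+
+ ```python
+ import torch
+
+ optimizer = torch.optim.AdamW(
+     model.parameters(), lr=5e-4, betas=(0.9, 0.95), weight_decay=0.01
+ )
+ # 500 linear warmup steps, then cosine annealing over the remaining steps.
+ warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=500)
+ cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps - 500)
+ scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine], milestones=[500])
+ ```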
70
 
71
+ **Results (v3 dataset, best at epoch 32):**
72
 
73
+ | Metric | Value |
74
+ |---|---|
75
+ | Best validation loss | 0.4015 |
76
+ | Perplexity | ~1.49 |
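+
+ Perplexity is the exponentiated cross-entropy loss: exp(0.4015) ≈ 1.49, i.e. the model is on average choosing between roughly 1.5 equally likely palette tokens per pixel.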
77
 
78
+ ---
79
 
80
+ ## Usage
 
81
 
82
+ ### Requirements
83
 
84
+ ```
85
+ torch
86
+ numpy
87
+ Pillow
88
  ```
89
 
90
+ ### Load and generate
+
+ ```python
+ import json, torch
+ from PIL import Image
+ from model.tokenizer import PaletteTokenizer
+ from model.text_encoder import TextTokenizer, TextEncoder
+ from model.bit_pixel_decoder import BitPixelLMDecoder, BitPixelLM
+
+ # Load tokenizers
+ palette_tok = PaletteTokenizer(palette_path="palette_256.npy")
+ with open("vocab.json") as f:
+     vocab = json.load(f)
+ text_tok = TextTokenizer(vocab)
+
+ # Build model
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ text_encoder = TextEncoder(
+     vocab_size=text_tok.vocab_size, d_model=256, nhead=8,
+     num_layers=3, dim_feedforward=512, max_seq_len=32,
+ )
+ pixel_decoder = BitPixelLMDecoder(
+     vocab_size=palette_tok.vocab_size, d_model=256, nhead=8,
+     num_layers=6, dim_feedforward=512, img_size=32,
+ )
+ model = BitPixelLM(text_encoder, pixel_decoder).to(device)
+
+ # Load weights
+ ckpt = torch.load("best.pt", map_location=device, weights_only=False)
+ model.load_state_dict(ckpt["model_state_dict"])
+ model.eval()
+
+ # Generate
+ prompt = "a red pixel art sword"
+ text_tokens = text_tok.encode(prompt).unsqueeze(0).to(device)
+ with torch.no_grad():
+     generated = model.generate(
+         text_tokens,
+         sos_token=palette_tok.sos_token,
+         eos_token=palette_tok.eos_token,
+         temperature=0.8,
+         top_k=40,
+         top_p=0.9,
+     )
+
+ # Decode to image
+ img_array = palette_tok.decode_tokens(generated[0].cpu().tolist())
+ img = Image.fromarray(img_array, "RGB").resize((256, 256), Image.NEAREST)
+ img.save("output.png")
+ ```
+
+ ### Vocabulary
+
+ The model knows 222 words covering colors, object types, and descriptors. Prompts that follow the training label pattern (`a <color> pixel art <object>`) work best; words outside the vocabulary are silently mapped to `<unk>`.
+
+ Sample supported words:
+ `red`, `blue`, `green`, `yellow`, `orange`, `purple`, `gold`, `dark`, `teal`, `silver` ·
+ `sword`, `shield`, `bow`, `axe`, `staff`, `wand`, `armour` ·
+ `knight`, `wizard`, `archer`, `dragon`, `goblin`, `skeleton`, `ghost`, `vampire` ·
+ `castle`, `tree`, `flower`, `mushroom`, `chest`, `potion`, `gem`, `key`, `crown`, `ship`, `horse`, and more.
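+
+ A quick way to see the `<unk>` behavior (hypothetical usage of the `text_tok` tokenizer built in the example above):
+
+ ```python
+ # "crimson" is outside the 222-word vocab, so it encodes to the <unk> id;
+ # the model then conditions only on the remaining known words.
+ ids = text_tok.encode("a crimson pixel art sword")
+ print(ids)
+ ```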
+
+ ---
+
+ ## Limitations
+
+ - Outputs are 32×32 pixels. Upscale with `Image.NEAREST` to preserve the pixel-art look.
+ - Trained entirely on procedurally generated synthetic data; the model has no knowledge of real-world artwork.
+ - Generation quality is best for prompts close to the training label patterns.
+ - Color fidelity is bounded by the learned 256-color palette.
+
+ ---
+
+ ## Citations
+
+ ```bibtex
+ @article{wang2023bitnet,
+   title={BitNet: Scaling 1-bit Transformers for Large Language Models},
+   author={Wang, Hongyu and others},
+   journal={arXiv:2310.11453},
+   year={2023}
+ }
+
+ @article{ma2024bitnet158,
+   title={The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits},
+   author={Ma, Shuming and others},
+   journal={arXiv:2402.17764},
+   year={2024}
+ }
+ ```
+
+ ---
+
+ ## License
+
+ MIT