Falcon-Coder-3B (1.58-bit / TQ1_0)
A fine-tuned 1.58-bit ternary quantization of Falcon-E-3B-Instruct, optimized for CPU inference via vanilla llama.cpp.
This model produces code (Python, TypeScript, etc.) at ~24 tokens/sec on a typical laptop CPU with a 710 MB on-disk footprint.
Model Details
| Property | Value |
|---|---|
| Base model | tiiuae/Falcon-E-3B-Instruct |
| Training method | 1.58-bit full fine-tune via onebitllms |
| Training data | 365k coding instruction examples (Magicoder + CodeFeedback) |
| Training duration | ~92 hours on RTX 4090 (24 GB) |
| Final loss | 0.5008 (started at 0.91) |
| Effective batch size | 32 (per_device=1 × grad_accum=32) |
| Optimizer | paged_adamw_8bit |
| Learning rate | 1e-4 (cosine schedule, 3% warmup) |
| Sequence length | 1024 |
| Epochs | 2 |
| Stored precision | BF16 (5.7 GB) |
| Inference precision | TQ1_0 ternary (1.69 bpw, ~710 MB) |
| Inference engine | vanilla llama.cpp (TQ1_0 quant) |
| Inference speed | ~24 tok/s on laptop CPU |
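The size figures in the table can be sanity-checked with rough arithmetic. The sketch below is only an estimate: it assumes 1 GB = 10^9 bytes and that the BF16 checkpoint consists almost entirely of weights stored at 2 bytes each. Neither assumption comes from the model card.

```python
# Rough size check for the figures in the table above.
# Assumptions (not from the model card): 1 GB = 1e9 bytes, and the BF16
# checkpoint is almost entirely weights stored at 2 bytes each.

BF16_SIZE_GB = 5.7      # "Stored precision" row
TQ1_0_BPW = 1.69        # "Inference precision" row, bits per weight
REPORTED_TQ1_MB = 710   # reported on-disk size of the TQ1_0 GGUF


def estimate_params(bf16_size_gb: float) -> float:
    """Approximate parameter count from a BF16 checkpoint size."""
    return bf16_size_gb * 1e9 / 2


def estimate_quantized_mb(n_params: float, bits_per_weight: float) -> float:
    """Size in MB if every weight were stored at the given bits per weight."""
    return n_params * bits_per_weight / 8 / 1e6


n_params = estimate_params(BF16_SIZE_GB)
naive_mb = estimate_quantized_mb(n_params, TQ1_0_BPW)
print(f"~{n_params / 1e9:.2f}B parameters")
print(f"~{naive_mb:.0f} MB if every weight were stored at {TQ1_0_BPW} bpw")
print(f"reported: {REPORTED_TQ1_MB} MB")
```

The naive estimate comes out somewhat below the reported ~710 MB. A plausible reason, stated here as an assumption rather than a measured fact, is that some tensors (for example the embeddings) are kept at a higher precision than the ternary layers.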
What's in the repo
This is the training-time BF16 checkpoint. To use it on CPU, you must convert it to a 1.58-bit ternary GGUF. See "Usage" below.
Usage
Inference on CPU (recommended)
This BF16 model is too large for fast CPU inference. Convert to a 1.58-bit ternary GGUF first:
# 1. Download the BF16 model
hf download anthonylee991/falcon-coder-3b --local-dir falcon-coder-3b-bf16
# 2. Convert to F16 GGUF
python llama.cpp/convert_hf_to_gguf.py falcon-coder-3b-bf16 `
--outfile falcon-coder-3b.gguf --outtype f16
# 3. Quantize to TQ1_0 (1.58-bit ternary, ~710 MB)
llama.cpp/build/bin/Release/llama-quantize.exe falcon-coder-3b.gguf `
falcon-coder-3b-tq1.gguf TQ1_0 8
# 4. Run inference
llama.cpp/build/bin/Release/llama-cli.exe `
-m falcon-coder-3b-tq1.gguf `
-p "def fibonacci(n):" -n 100 --threads 8
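If you want to call the quantized model from a script, step 4 can be wrapped in a small Python helper. This is a minimal sketch that only uses the flags shown above (`-m`, `-p`, `-n`, `--threads`); the function names `build_llama_cli_command` and `generate` are hypothetical, and the binary path is whatever your own llama.cpp build produced.

```python
import subprocess
from pathlib import Path


def build_llama_cli_command(
    cli_path: str,
    model_path: str,
    prompt: str,
    n_predict: int = 100,
    threads: int = 8,
) -> list[str]:
    """Build the llama-cli argument list using the flags from step 4."""
    return [
        cli_path,
        "-m", model_path,
        "-p", prompt,
        "-n", str(n_predict),
        "--threads", str(threads),
    ]


def generate(cli_path: str, model_path: str, prompt: str, **kwargs) -> str:
    """Run llama-cli once and return its standard output as text."""
    if not Path(model_path).is_file():
        raise FileNotFoundError(f"GGUF file not found: {model_path}")
    cmd = build_llama_cli_command(cli_path, model_path, prompt, **kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Building the command does not require the binary or the model to exist.
cmd = build_llama_cli_command(
    "llama.cpp/build/bin/Release/llama-cli.exe",
    "falcon-coder-3b-tq1.gguf",
    "def fibonacci(n):",
)
print(" ".join(cmd))
```

Depending on the llama.cpp build, `llama-cli` may wait for interactive input instead of exiting after generation; check its `--help` output before relying on `generate` in automation.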
Inference on GPU (BF16)
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
"anthonylee991/falcon-coder-3b",
torch_dtype=torch.bfloat16,
device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained("anthonylee991/falcon-coder-3b")
prompt = "def quicksort(arr):"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Intended Use
This model is a code generation assistant. Verified strong performance on:
- ✅ Pure algorithms (binary search, sort, recursive functions)
- ✅ Type definitions (TypeScript interfaces, Pydantic models)
- ✅ Test scaffolding (pytest, Jest)
- ✅ Mechanical refactors (if/elif → dict dispatch)
- ✅ Docstrings (Google-style with examples)
Verified weak performance on:
- ⚠️ PowerShell (uses deprecated cmdlets like Get-WmiObject)
- ⚠️ Complex business logic with multiple interacting rules
- ⚠️ Anything requiring framework-specific knowledge not in training data
For a detailed 10-test evaluation, see the project repository or the companion HOW-TO guide.
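To make the "mechanical refactor" category listed above concrete, here is a hand-written illustration of an if/elif chain rewritten as a dictionary dispatch. It is not model output, and every name in it is hypothetical; it only shows the kind of transformation meant.

```python
import math

# Hand-written illustration (not model output). All names are hypothetical.


def area_if_elif(shape: str, size: float) -> float:
    """Before: one branch per shape."""
    if shape == "square":
        return size * size
    elif shape == "circle":
        return math.pi * size * size
    elif shape == "triangle":
        return math.sqrt(3) / 4 * size * size
    else:
        raise ValueError(f"unknown shape: {shape}")


# After: the branches become entries in a lookup table.
AREA_BY_SHAPE = {
    "square": lambda s: s * s,
    "circle": lambda s: math.pi * s * s,
    "triangle": lambda s: math.sqrt(3) / 4 * s * s,
}


def area_dispatch(shape: str, size: float) -> float:
    """Look up the handler for the shape and apply it."""
    try:
        handler = AREA_BY_SHAPE[shape]
    except KeyError:
        raise ValueError(f"unknown shape: {shape}") from None
    return handler(size)


for name in AREA_BY_SHAPE:
    print(name, area_if_elif(name, 2.0) == area_dispatch(name, 2.0))
```

The behavior is unchanged, including the error for unknown shapes, which is what makes the refactor mechanical.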
Training Data
Combined and deduplicated from:
| Dataset | Rows | Purpose |
|---|---|---|
| m-a-p/CodeFeedback-Filtered-Instruction | ~156k | High-quality code instructions with feedback |
| ise-uiuc/Magicoder-OSS-Instruct-75K | ~75k | Magicoder-style OSS examples |
| ise-uiuc/Magicoder-Evol-Instruct-110K | ~110k | Evolved instructions |
| Various smaller PowerShell/TypeScript corpora | ~30k | Multi-language coverage |
After deduplication via MinHash LSH @ 0.85: 365,251 train rows + 2,000 eval rows.
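The deduplication step can be sketched roughly as follows. This is not the project's actual pipeline: it is a minimal, standard-library MinHash that compares signatures pairwise and drops any row whose estimated Jaccard similarity to an already kept row reaches 0.85. The LSH banding step, which avoids comparing every pair on a dataset of this size, is omitted for brevity, and the shingle size and hash count are assumptions.

```python
import hashlib

NUM_HASHES = 64   # assumed signature length
THRESHOLD = 0.85  # similarity cutoff stated in the model card


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles for a text (k=3 is an assumption)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def hash64(shingle: str, seed: int) -> int:
    """Seeded 64-bit hash of one shingle."""
    salt = seed.to_bytes(4, "little")
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8, salt=salt)
    return int.from_bytes(digest.digest(), "little")


def minhash_signature(text: str, num_hashes: int = NUM_HASHES) -> tuple[int, ...]:
    """For each seed, keep the minimum hash over all shingles."""
    sh = shingles(text)
    return tuple(min(hash64(s, seed) for s in sh) for seed in range(num_hashes))


def estimated_jaccard(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """Fraction of signature positions that agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def deduplicate(texts: list[str], threshold: float = THRESHOLD) -> list[str]:
    """Keep a text only if it is not a near-duplicate of an earlier kept text."""
    kept, kept_sigs = [], []
    for text in texts:
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(text)
            kept_sigs.append(sig)
    return kept


rows = [
    "Write a function that reverses a string",
    "Write a function that reverses a string",
    "Explain how a binary search tree works",
]
unique_rows = deduplicate(rows)
print(len(unique_rows))
```

The pairwise loop is quadratic in the number of rows, so it is only suitable for illustration; an LSH index buckets signatures so that only likely duplicates are compared.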
The training data is mostly generic Python code, so PowerShell, FastAPI, and TypeScript output is weaker than Python output. See the V2 plan in the project docs for how to address this.
Limitations
- PowerShell quality is poor — the model defaults to deprecated cmdlets. Use a more recent code model for PowerShell or fine-tune on PS-specific data.
- Framework-specific code (FastAPI deps, SQLAlchemy patterns, React state management) is hit-or-miss.
- No held-out domain eval — the eval split was drawn from the same training distribution.
- Small model (3B) — complex reasoning across multiple files is out of scope.
- Output may include explanatory prose: extract the code blocks from the response rather than pasting the whole output into your code.
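For the last limitation, a small helper can pull the code out of a response. The sketch below assumes the model wraps code in Markdown-style triple-backtick fences, which the model card does not confirm; the name `extract_code_blocks` is hypothetical.

```python
import re

# The fence is built programmatically so this example contains no literal fence.
FENCE = "`" * 3

# Optional language tag after the opening fence, then everything up to the
# next fence (non-greedy, spanning newlines).
CODE_BLOCK_RE = re.compile(
    FENCE + r"[ \t]*([\w+#.-]*)[ \t]*\r?\n(.*?)" + FENCE,
    re.DOTALL,
)


def extract_code_blocks(response: str) -> list[tuple[str, str]]:
    """Return (language, code) pairs for each fenced block in a response."""
    return [(lang, code.rstrip()) for lang, code in CODE_BLOCK_RE.findall(response)]


response = (
    "Here is the function:\n"
    + FENCE + "python\n"
    + "def add(a, b):\n"
    + "    return a + b\n"
    + FENCE + "\n"
    + "It adds two numbers."
)
blocks = extract_code_blocks(response)
for lang, code in blocks:
    print(f"[{lang or 'no language tag'}]")
    print(code)
```

If the response contains no fenced block, the function returns an empty list, in which case the output needs manual review before use.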
Training Infrastructure
- Cloud GPU: Hivenet GPU-optimized container, single RTX 4090 (24 GB VRAM)
- ~92 hours wall time, approximately $40 total cost
- BF16 + 8-bit AdamW + gradient checkpointing to fit in 24 GB
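The training figures can be combined into a rough step count and a learning-rate curve. The sketch below assumes one optimizer step per effective batch, no sequence packing, a kept partial final batch, and a linear warmup followed by cosine decay to zero. Those details are assumptions; the model card only states "cosine schedule, 3% warmup".

```python
import math

TRAIN_ROWS = 365_251     # train rows after deduplication
EPOCHS = 2
EFFECTIVE_BATCH = 32     # per_device=1 x grad_accum=32
WALL_HOURS = 92
APPROX_COST_USD = 40
PEAK_LR = 1e-4
WARMUP_FRACTION = 0.03

# Assumption: one optimizer step per effective batch, last partial batch kept.
total_steps = math.ceil(TRAIN_ROWS / EFFECTIVE_BATCH) * EPOCHS
warmup_steps = round(WARMUP_FRACTION * total_steps)
seconds_per_step = WALL_HOURS * 3600 / total_steps
cost_per_hour = APPROX_COST_USD / WALL_HOURS


def lr_at(step: int, total: int, warmup: int, peak_lr: float = PEAK_LR) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"optimizer steps: {total_steps} (warmup: {warmup_steps})")
print(f"~{seconds_per_step:.1f} s per optimizer step")
print(f"~${cost_per_hour:.2f} per GPU hour")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step, total_steps, warmup_steps):.2e}")
```

Under these assumptions the run works out to roughly 14 to 15 seconds per optimizer step, which is a useful baseline when estimating the cost of a longer or larger run on the same hardware.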
License
This model is released under the Apache 2.0 license, consistent with the base Falcon-E-3B-Instruct license.
Citation
If you use this model, please cite the base model and the BitNet approach:
@misc{falcon-e-3b-instruct,
title={Falcon-E: A Family of Universal, Pre-trained 1.58-bit Models},
author={TII Falcon Team},
year={2025},
url={https://huggingface.co/tiiuae/Falcon-E-3B-Instruct}
}
@misc{bitnet2025,
title={bitnet.cpp: Efficient Edge Inference for Ternary LLMs},
author={Jinheng Wang and others},
year={2025},
url={https://github.com/microsoft/BitNet}
}