Falcon-Coder-3B (1.58-bit / TQ1_0)

A 1.58-bit (ternary) fine-tune of Falcon-E-3B-Instruct for code generation, intended for CPU inference with vanilla llama.cpp after conversion to a TQ1_0 GGUF.

Once quantized to TQ1_0, the model generates code (Python, TypeScript, etc.) at ~24 tokens/sec on a typical laptop CPU and occupies about 710 MB on disk.

Model Details

Property	Value
Base model	tiiuae/Falcon-E-3B-Instruct
Training method	1.58-bit full fine-tune via onebitllms
Training data	365k coding instruction examples (Magicoder + CodeFeedback)
Training duration	~92 hours on RTX 4090 (24 GB)
Final loss	0.5008 (started at 0.91)
Effective batch size	32 (per_device=1 × grad_accum=32)
Optimizer	paged_adamw_8bit
Learning rate	1e-4 (cosine schedule, 3% warmup)
Sequence length	1024
Epochs	2
Stored precision	BF16 (5.7 GB)
Inference precision	TQ1_0 ternary (1.69 bpw, ~710 MB)
Inference engine	vanilla llama.cpp (TQ1_0 quant)
Inference speed	~24 tok/s on laptop CPU
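
As a rough cross-check of the hyperparameters in the table, the sketch below derives the number of optimizer steps and the warmup length, then evaluates a cosine schedule with linear warmup. The 365,251-row training set size is taken from the Training Data section. The floor division per epoch, the rounding of the warmup length, and the decay to zero are assumptions modeled on common trainer defaults, not details confirmed by this card.

```python
import math

TRAIN_ROWS = 365_251        # training rows after deduplication (from this card)
EFFECTIVE_BATCH = 1 * 32    # per_device=1 x grad_accum=32
EPOCHS = 2
PEAK_LR = 1e-4
WARMUP_RATIO = 0.03

# Assumption: a trailing partial batch does not trigger an extra optimizer step.
steps_per_epoch = TRAIN_ROWS // EFFECTIVE_BATCH
total_steps = steps_per_epoch * EPOCHS
# Assumption: the warmup length is rounded up to a whole step.
warmup_steps = math.ceil(WARMUP_RATIO * total_steps)


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"optimizer steps: {total_steps}, warmup steps: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>6}: lr = {learning_rate(step):.2e}")
```

Dividing the ~92 hours of training by `total_steps` gives an approximate time per optimizer step under the same assumptions.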

What's in the repo

This is the training-time BF16 checkpoint. To use it on CPU, you must convert it to a 1.58-bit ternary GGUF. See "Usage" below.
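
The two sizes quoted in the table can be related with a little arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes) and that the BF16 checkpoint consists almost entirely of 2-byte weights. Both are assumptions, so treat the results as estimates.

```python
BF16_BYTES = 5.7e9     # stored checkpoint size from the table (assumed decimal GB)
TQ1_0_BYTES = 710e6    # quantized GGUF size from the table (assumed decimal MB)
NOMINAL_BPW = 1.69     # bits per weight quoted for TQ1_0

# BF16 stores each weight in 2 bytes, so the checkpoint size implies a parameter count.
approx_params = BF16_BYTES / 2

# Size the file would have if every weight used exactly the nominal bits per weight.
nominal_bytes = approx_params * NOMINAL_BPW / 8

# Average bits per weight implied by the quoted file size.
effective_bpw = TQ1_0_BYTES * 8 / approx_params

print(f"approx. parameters : {approx_params / 1e9:.2f} B")
print(f"size at {NOMINAL_BPW} bpw   : {nominal_bytes / 1e6:.0f} MB")
print(f"effective bpw      : {effective_bpw:.2f}")
print(f"reduction vs BF16  : {BF16_BYTES / TQ1_0_BYTES:.1f}x")
```

Under these assumptions the quoted file is somewhat larger than a pure 1.69 bpw encoding would be, which would be consistent with some tensors or metadata being stored at higher precision. This card does not say which.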

Usage

Inference on CPU (recommended)

This BF16 model is too large for fast CPU inference. Convert it to a 1.58-bit ternary GGUF first. The commands below use PowerShell syntax (a trailing backtick continues the line):

# 1. Download the BF16 model
hf download anthonylee991/falcon-coder-3b --local-dir falcon-coder-3b-bf16

# 2. Convert to F16 GGUF
python llama.cpp/convert_hf_to_gguf.py falcon-coder-3b-bf16 `
    --outfile falcon-coder-3b.gguf --outtype f16

# 3. Quantize to TQ1_0 (1.58-bit ternary, ~710 MB)
llama.cpp/build/bin/Release/llama-quantize.exe falcon-coder-3b.gguf `
    falcon-coder-3b-tq1.gguf TQ1_0 8

# 4. Run inference
llama.cpp/build/bin/Release/llama-cli.exe `
    -m falcon-coder-3b-tq1.gguf `
    -p "def fibonacci(n):" -n 100 --threads 8
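
For scripted use, step 4 can also be driven from Python. This is a minimal sketch that only assembles the same flags shown above (`-m`, `-p`, `-n`, `--threads`). The binary path and the helper name `build_llama_cli_command` are placeholders chosen for illustration, and the call that actually runs the binary is left commented out.

```python
import subprocess
from pathlib import Path

# Assumed location of the binary, mirroring the path used in step 4.
LLAMA_CLI = Path("llama.cpp/build/bin/Release/llama-cli.exe")


def build_llama_cli_command(model: str, prompt: str,
                            n_predict: int = 100, threads: int = 8) -> list[str]:
    """Return the argument list for the llama-cli call shown in step 4."""
    return [
        str(LLAMA_CLI),
        "-m", model,
        "-p", prompt,
        "-n", str(n_predict),
        "--threads", str(threads),
    ]


command = build_llama_cli_command("falcon-coder-3b-tq1.gguf", "def fibonacci(n):")
print(command)

# Uncomment to run once the binary and the quantized model exist:
# result = subprocess.run(command, capture_output=True, text=True, check=True)
# print(result.stdout)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with prompts that contain quotes or parentheses.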

Inference on GPU (BF16)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "anthonylee991/falcon-coder-3b",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained("anthonylee991/falcon-coder-3b")

prompt = "def quicksort(arr):"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Intended Use

This model is a code generation assistant. Verified strong performance on:

✅ Pure algorithms (binary search, sort, recursive functions)
✅ Type definitions (TypeScript interfaces, Pydantic models)
✅ Test scaffolding (pytest, Jest)
✅ Mechanical refactors (if/elif → dict dispatch)
✅ Docstrings (Google-style with examples)

Verified weak performance on:

⚠️ PowerShell (uses deprecated cmdlets like Get-WmiObject)
⚠️ Complex business logic with multiple interacting rules
⚠️ Anything requiring framework-specific knowledge not in training data

For a detailed 10-test evaluation, see the project repository or the companion HOW-TO guide.

Training Data

Combined and deduplicated from:

Dataset	Rows	Purpose
m-a-p/CodeFeedback-Filtered-Instruction	~156k	High-quality code instructions with feedback
ise-uiuc/Magicoder-OSS-Instruct-75K	~75k	Magicoder-style OSS examples
ise-uiuc/Magicoder-Evol-Instruct-110K	~110k	Evolved instructions
Various smaller PowerShell/TypeScript corpuses	~30k	Multi-language coverage

After deduplication with MinHash LSH at a 0.85 similarity threshold: 365,251 train rows + 2,000 eval rows.
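
To make the deduplication step concrete, here is a small pure-Python sketch of MinHash-based near-duplicate filtering at the 0.85 threshold. It is not the pipeline used for this model: the word 5-gram shingling, the 128 hash functions, and the brute-force pairwise comparison (standing in for LSH banding, which is what lets the real method scale) are all assumptions made to keep the example short.

```python
import hashlib

NUM_HASHES = 128   # assumed signature length
THRESHOLD = 0.85   # similarity threshold quoted in this card
SHINGLE_SIZE = 5   # assumed: word 5-grams


def shingles(text: str, k: int = SHINGLE_SIZE) -> set[str]:
    """Split text into overlapping word k-grams (the whole text if it has at most k words)."""
    words = text.split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(shingle: str, seed: int) -> int:
    """64-bit hash of a shingle; the seed selects one of NUM_HASHES hash functions."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str) -> list[int]:
    """For each hash function, keep the minimum hash over all shingles."""
    items = shingles(text)
    return [min(_hash(s, seed) for s in items) for seed in range(NUM_HASHES)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(rows: list[str]) -> list[str]:
    """Keep a row only if it is not a near-duplicate of an already kept row (O(n^2))."""
    kept, kept_sigs = [], []
    for row in rows:
        sig = minhash_signature(row)
        if all(estimated_jaccard(sig, other) < THRESHOLD for other in kept_sigs):
            kept.append(row)
            kept_sigs.append(sig)
    return kept


rows = [
    "Write a function that reverses a string in Python",
    "Write a function that reverses a string in Python",
    "Implement binary search over a sorted list of integers in TypeScript",
]
print(deduplicate(rows))
```

At the scale of several hundred thousand rows, the pairwise loop in `deduplicate` would be far too slow. LSH banding addresses this by comparing only rows whose signatures collide in at least one band.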

The training data is predominantly generic Python code, so PowerShell, FastAPI, and TypeScript quality is limited compared to Python. See the V2 plan in the project docs for how to address this.

Limitations

PowerShell quality is poor: the model defaults to deprecated cmdlets. Use a more recent code model for PowerShell, or fine-tune on PowerShell-specific data.
Framework-specific code (FastAPI dependencies, SQLAlchemy patterns, React state management) is hit-or-miss.
No held-out domain eval: the eval split was drawn from the same distribution as the training data.
Small model (3B): complex reasoning across multiple files is out of scope.
Output may include explanatory prose: extract the code blocks from the response rather than pasting the whole output into your code.
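
Since responses may mix prose and code, a small post-processing step helps. The sketch below assumes the model wraps code in Markdown-style triple-backtick fences, which this card does not guarantee. `extract_code_blocks` is a hypothetical helper name.

```python
import re
from typing import List, Optional

# Built programmatically so the fence marker never appears literally in this example.
FENCE = "`" * 3
_BLOCK_RE = re.compile(FENCE + r"([\w+#.-]*)[ \t]*\n(.*?)\n?" + FENCE, re.DOTALL)


def extract_code_blocks(response: str, language: Optional[str] = None) -> List[str]:
    """Return the bodies of fenced code blocks, optionally filtered by language tag."""
    blocks = []
    for tag, body in _BLOCK_RE.findall(response):
        if language is None or tag.lower() == language.lower():
            blocks.append(body)
    return blocks


response = "\n".join([
    "Here is the function you asked for:",
    FENCE + "python",
    "def add(a, b):",
    "    return a + b",
    FENCE,
    "It runs in constant time.",
])
print(extract_code_blocks(response, language="python")[0])
```

If a response contains no fences at all, the function returns an empty list, which the caller can treat as a signal to inspect the raw output by hand.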

Training Infrastructure

Cloud GPU: Hivenet GPU-optimized container, single RTX 4090 (24 GB VRAM)
92 hours of wall time at an approximate cost of $40
BF16 + 8-bit AdamW + gradient checkpointing to fit in 24 GB
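
A back-of-the-envelope budget suggests why this combination is needed on a 24 GB card. The sketch assumes BF16 weights and gradients (2 bytes each per parameter), two 8-bit AdamW states (1 byte each per parameter), and no separate FP32 master copy of the weights. None of this is confirmed by the card, and optimizer quantization metadata and framework overhead are ignored.

```python
PARAMS = 5.7e9 / 2   # parameter count implied by the 5.7 GB BF16 checkpoint
VRAM_GB = 24

# Assumed per-parameter storage for the static training state.
bytes_per_param = {
    "weights (BF16)": 2,
    "gradients (BF16, assumed)": 2,
    "AdamW first moment (8-bit)": 1,
    "AdamW second moment (8-bit)": 1,
}

static_gb = {name: PARAMS * size / 1e9 for name, size in bytes_per_param.items()}
total_static_gb = sum(static_gb.values())
headroom_gb = VRAM_GB - total_static_gb

for name, gb in static_gb.items():
    print(f"{name:<30} {gb:5.2f} GB")
print(f"{'static total':<30} {total_static_gb:5.2f} GB")
print(f"{'left for activations':<30} {headroom_gb:5.2f} GB")
```

Under these assumptions roughly 7 GB remain for activations at sequence length 1024, which is the part of the budget that gradient checkpointing reduces.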

License

This model is released under the Apache 2.0 license, consistent with the base Falcon-E-3B-Instruct license.

Citation

If you use this model, please cite the base model and the BitNet approach:

@misc{falcon-e-3b-instruct,
  title={Falcon-E: A Family of Universal, Pre-trained 1.58-bit Models},
  author={TII Falcon Team},
  year={2025},
  url={https://huggingface.co/tiiuae/Falcon-E-3B-Instruct}
}

@misc{bitnet2025,
  title={bitnet.cpp: Efficient Edge Inference for Ternary LLMs},
  author={Jinheng Wang and others},
  year={2025},
  url={https://github.com/microsoft/BitNet}
}
