
Falcon-Coder-3B (1.58-bit / TQ1_0)

A 1.58-bit ternary fine-tune of Falcon-E-3B-Instruct for code generation, optimized for CPU inference via vanilla llama.cpp.

Once converted to a TQ1_0 GGUF, this model produces code (Python, TypeScript, etc.) at ~24 tokens/sec on a typical laptop CPU, with a ~710 MB on-disk footprint.

Model Details

| Property | Value |
| --- | --- |
| Base model | tiiuae/Falcon-E-3B-Instruct |
| Training method | 1.58-bit full fine-tune via onebitllms |
| Training data | 365k coding instruction examples (Magicoder + CodeFeedback) |
| Training duration | ~92 hours on RTX 4090 (24 GB) |
| Final loss | 0.5008 (started at 0.91) |
| Effective batch size | 32 (per_device=1 × grad_accum=32) |
| Optimizer | paged_adamw_8bit |
| Learning rate | 1e-4 (cosine schedule, 3% warmup) |
| Sequence length | 1024 |
| Epochs | 2 |
| Stored precision | BF16 (5.7 GB) |
| Inference precision | TQ1_0 ternary (1.69 bpw, ~710 MB) |
| Inference engine | vanilla llama.cpp (TQ1_0 quant) |
| Inference speed | ~24 tok/s on laptop CPU |
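
The learning-rate row (1e-4, cosine schedule, 3% warmup) can be made concrete with a small sketch. It assumes linear warmup, a cosine decay down to zero, and one optimizer step per 32 examples; the step count is derived from the figures on this card, and the actual trainer may round or schedule differently.

```python
import math

TRAIN_ROWS = 365_251      # train rows after deduplication (from this card)
EPOCHS = 2
EFFECTIVE_BATCH = 32      # per_device=1 x grad_accum=32
PEAK_LR = 1e-4
WARMUP_FRACTION = 0.03

# Assumption: partial batches are rounded up to a full optimizer step.
TOTAL_STEPS = math.ceil(TRAIN_ROWS * EPOCHS / EFFECTIVE_BATCH)
WARMUP_STEPS = int(TOTAL_STEPS * WARMUP_FRACTION)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed floor)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


if __name__ == "__main__":
    for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Under these assumptions the run is roughly 22.8k optimizer steps, of which the first ~680 are warmup.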

What's in the repo

This is the training-time BF16 checkpoint. To use it on CPU, you must convert it to a 1.58-bit ternary GGUF. See "Usage" below.
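
As a rough sanity check on why the conversion matters, the sizes quoted on this card can be related with a few lines of arithmetic. This is a back-of-the-envelope sketch: it treats the 5.7 GB figure as decimal gigabytes and assumes the checkpoint is almost entirely BF16 weights.

```python
BF16_CHECKPOINT_BYTES = 5.7e9   # "Stored precision" figure from this card
BYTES_PER_BF16_WEIGHT = 2
TQ1_0_BITS_PER_WEIGHT = 1.69    # "Inference precision" figure from this card

# Assumption: the checkpoint is almost entirely BF16 weights.
approx_params = BF16_CHECKPOINT_BYTES / BYTES_PER_BF16_WEIGHT

# Size if every weight were stored at the TQ1_0 rate.
ternary_bytes = approx_params * TQ1_0_BITS_PER_WEIGHT / 8

print(f"approx. parameters: {approx_params / 1e9:.2f}B")
print(f"all-ternary size:   {ternary_bytes / 1e6:.0f} MB")
```

The all-ternary estimate lands somewhat below the ~710 MB reported here. A plausible explanation, not verified on this card, is that some tensors (for example the embeddings) are kept at a higher precision in the GGUF.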

Usage

Inference on CPU (recommended)

This BF16 model is too large for fast CPU inference. Convert to a 1.58-bit ternary GGUF first:

# Note: these commands use PowerShell syntax (a trailing backtick continues the line).
# In bash, use a trailing backslash instead; binary paths and the .exe suffix depend on your build.

# 1. Download the BF16 model
hf download anthonylee991/falcon-coder-3b --local-dir falcon-coder-3b-bf16

# 2. Convert to F16 GGUF
python llama.cpp/convert_hf_to_gguf.py falcon-coder-3b-bf16 `
    --outfile falcon-coder-3b.gguf --outtype f16

# 3. Quantize to TQ1_0 (1.58-bit ternary, ~710 MB); the trailing 8 is the thread count
llama.cpp/build/bin/Release/llama-quantize.exe falcon-coder-3b.gguf `
    falcon-coder-3b-tq1.gguf TQ1_0 8

# 4. Run inference
llama.cpp/build/bin/Release/llama-cli.exe `
    -m falcon-coder-3b-tq1.gguf `
    -p "def fibonacci(n):" -n 100 --threads 8
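
For scripting steps 2 to 4 across platforms, the same commands can be assembled as argument lists in Python. This is a minimal sketch: `build_commands` and its parameters are hypothetical names introduced here, and the binary directory is an assumption that depends on how llama.cpp was built (`build/bin/Release` with `.exe` on Windows, often `build/bin` elsewhere). Only the flags shown in the commands above are used.

```python
from pathlib import Path


def build_commands(llama_cpp: Path, model_dir: Path, bin_dir: Path,
                   exe_suffix: str = "", threads: int = 8) -> list[list[str]]:
    """Return argv lists for the convert, quantize, and run steps (hypothetical helper)."""
    f16 = "falcon-coder-3b.gguf"
    tq1 = "falcon-coder-3b-tq1.gguf"
    return [
        # Step 2: BF16 checkpoint -> F16 GGUF
        ["python", str(llama_cpp / "convert_hf_to_gguf.py"), str(model_dir),
         "--outfile", f16, "--outtype", "f16"],
        # Step 3: F16 GGUF -> TQ1_0 GGUF
        [str(bin_dir / f"llama-quantize{exe_suffix}"), f16, tq1, "TQ1_0", str(threads)],
        # Step 4: run a short completion
        [str(bin_dir / f"llama-cli{exe_suffix}"), "-m", tq1,
         "-p", "def fibonacci(n):", "-n", "100", "--threads", str(threads)],
    ]


if __name__ == "__main__":
    commands = build_commands(
        Path("llama.cpp"), Path("falcon-coder-3b-bf16"), Path("llama.cpp/build/bin")
    )
    for cmd in commands:
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell-specific quoting and line-continuation rules.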

Inference on GPU (BF16)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "anthonylee991/falcon-coder-3b",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained("anthonylee991/falcon-coder-3b")

prompt = "def quicksort(arr):"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Intended Use

This model is a code generation assistant. Verified strong performance on:

  • ✅ Pure algorithms (binary search, sort, recursive functions)
  • ✅ Type definitions (TypeScript interfaces, Pydantic models)
  • ✅ Test scaffolding (pytest, Jest)
  • ✅ Mechanical refactors (if/elif → dict dispatch)
  • ✅ Docstrings (Google-style with examples)

Verified weak performance on:

  • ⚠️ PowerShell (uses deprecated cmdlets like Get-WmiObject)
  • ⚠️ Complex business logic with multiple interacting rules
  • ⚠️ Anything requiring framework-specific knowledge not in training data

For a detailed 10-test evaluation, see the project repository or the companion HOW-TO guide.

Training Data

Combined and deduplicated from:

| Dataset | Rows | Purpose |
| --- | --- | --- |
| m-a-p/CodeFeedback-Filtered-Instruction | ~156k | High-quality code instructions with feedback |
| ise-uiuc/Magicoder-OSS-Instruct-75K | ~75k | Magicoder-style OSS examples |
| ise-uiuc/Magicoder-Evol-Instruct-110K | ~110k | Evolved instructions |
| Various smaller PowerShell/TypeScript corpuses | ~30k | Multi-language coverage |

After deduplication via MinHash LSH @ 0.85: 365,251 train rows + 2,000 eval rows.
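
To illustrate what near-duplicate removal at a 0.85 similarity threshold means, here is a minimal pure-Python sketch of MinHash. It is not the pipeline used for this model: the shingle size, the number of hash functions, and the helper names are all assumptions, and the LSH banding step is replaced by a brute-force comparison, which is only practical for small inputs.

```python
import hashlib
import re

NUM_PERM = 64       # assumed number of hash functions
THRESHOLD = 0.85    # similarity threshold quoted on this card


def shingles(text: str, k: int = 3) -> set[str]:
    """Word k-grams of a lowercased text (k=3 is an assumption)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash(items: set[str], num_perm: int = NUM_PERM) -> list[int]:
    """One minimum per salted hash; the salts stand in for random permutations."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "little")
            for s in items
        ))
    return signature


def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def deduplicate(rows: list[str], threshold: float = THRESHOLD) -> list[str]:
    """Keep a row only if it is below the threshold against every kept row."""
    kept, kept_signatures = [], []
    for row in rows:
        sig = minhash(shingles(row))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_signatures):
            kept.append(row)
            kept_signatures.append(sig)
    return kept
```

An LSH index would replace the inner `all(...)` scan by grouping signatures into bands, so that only rows sharing a band are compared; that is what makes the approach feasible at the scale of several hundred thousand rows.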

The training data is mostly generic Python code, with only a small share (~30k rows) of PowerShell and TypeScript. As a result, PowerShell, FastAPI, and TypeScript quality is limited compared to Python. See the V2 plan in the project docs for how to address this.

Limitations

  • PowerShell quality is poor: the model defaults to deprecated cmdlets. Use a more recent code model for PowerShell or fine-tune on PS-specific data.
  • Framework-specific code (FastAPI deps, SQLAlchemy patterns, React state management) is hit-or-miss.
  • No held-out domain eval: the eval split was drawn from the same distribution as the training data, so the reported loss says little about out-of-domain quality.
  • Small model (3B): complex reasoning across multiple files is out of scope.
  • Output may include explanatory prose: extract the code blocks from the response rather than pasting the whole output into your code.
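
The last point can be handled with a small post-processing step. The sketch below assumes the model wraps code in Markdown-style triple-backtick fences, which is not guaranteed; `extract_code_blocks` is a hypothetical helper name, and the fence marker is built programmatically only to keep this example easy to embed.

```python
import re

TICKS = "`" * 3  # Markdown fence marker

# Opening fence, optional language tag, then everything up to the next fence.
FENCE = re.compile(TICKS + r"[ \t]*([\w+#.-]*)[^\n]*\n(.*?)" + TICKS, re.DOTALL)


def extract_code_blocks(response: str) -> list[tuple[str, str]]:
    """Return (language, code) pairs for each fenced block in a model response."""
    return [(lang.lower(), code.rstrip("\n")) for lang, code in FENCE.findall(response)]


if __name__ == "__main__":
    reply = (
        "Here is the function:\n"
        + TICKS + "python\n"
        + "def add(a, b):\n    return a + b\n"
        + TICKS + "\nIt adds two numbers."
    )
    for language, code in extract_code_blocks(reply):
        print(f"[{language or 'unknown'}]")
        print(code)
```

If a response contains no fences at all, the function returns an empty list; how to treat that case (reject, or use the raw text) is left to the caller.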

Training Infrastructure

  • Cloud GPU: Hivenet GPU-optimized container, single RTX 4090 (24 GB VRAM)
  • ~92 hours wall time, approximately $40 total cost
  • BF16 + 8-bit AdamW + gradient checkpointing to fit in 24 GB
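
A rough memory budget shows why this combination is needed on a 24 GB card. The numbers below are a back-of-the-envelope sketch, not a measurement: they assume gradients are held in BF16, that 8-bit AdamW keeps two states at about one byte per parameter each, and they ignore quantization constants, paging, any higher-precision weight copies, and framework overhead.

```python
GB = 1e9
VRAM_GB = 24                      # RTX 4090, treated as decimal GB here

# Parameter count inferred from the 5.7 GB BF16 checkpoint (2 bytes per weight).
params = 5.7e9 / 2

weights_gb = params * 2 / GB      # BF16 weights
grads_gb = params * 2 / GB        # assumption: BF16 gradients
adam_gb = params * 2 * 1 / GB     # two optimizer states at ~1 byte each

static_gb = weights_gb + grads_gb + adam_gb
headroom_gb = VRAM_GB - static_gb

print(f"weights + grads + optimizer: {static_gb:.1f} GB")
print(f"left for activations:        {headroom_gb:.1f} GB")
```

Under these assumptions only a few gigabytes remain for activations at sequence length 1024, which is consistent with using a per-device batch size of 1 and gradient checkpointing.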

License

This model is released under the Apache 2.0 license, consistent with the base Falcon-E-3B-Instruct license.

Citation

If you use this model, please cite the base model and the BitNet approach:

@misc{falcon-e-3b-instruct,
  title={Falcon-E: A Family of Universal, Pre-trained 1.58-bit Models},
  author={TII Falcon Team},
  year={2025},
  url={https://huggingface.co/tiiuae/Falcon-E-3B-Instruct}
}

@misc{bitnet2025,
  title={bitnet.cpp: Efficient Edge Inference for Ternary LLMs},
  author={Jinheng Wang and others},
  year={2025},
  url={https://github.com/microsoft/BitNet}
}