Falcon-H1-1.5B-Deep-Reasoning
QLoRA math reasoning adapter for the deepest, narrowest Mamba-2 hybrid model.
This is a LoRA adapter trained on tiiuae/Falcon-H1-1.5B-Deep-Instruct using MetaMathQA to improve math and reasoning capabilities. The base model has 66 layers at only 1.5B parameters, making it the deepest, narrowest model in the Falcon-H1 family, and already performs on par with many 7-10B models.
Results
20-question math/reasoning benchmark, greedy decoding:
| Model | Score | Change |
|---|---|---|
| Falcon-H1-1.5B-Deep-Instruct (base) | 10/20 (50%) | - |
| + Reasoning adapter | 13/20 (65%) | +30% relative |
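The relative figure in the table follows directly from the raw counts. As a quick check (variable names are illustrative), the same numbers give both the relative gain and the absolute gain in percentage points:

```python
# Scores from the table above: correct answers out of 20 questions.
TOTAL = 20
base_correct = 10
adapter_correct = 13

base_acc = base_correct / TOTAL
adapter_acc = adapter_correct / TOTAL

# Absolute gain in percentage points, and gain relative to the base score.
absolute_gain_pp = (adapter_acc - base_acc) * 100
relative_gain_pct = (adapter_correct - base_correct) / base_correct * 100

print(f"absolute: +{absolute_gain_pp:.0f} points, relative: +{relative_gain_pct:.0f}%")
```

In other words, the adapter answers 3 more questions out of 20: +15 percentage points, or +30% relative to the base score.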
What Improved
The adapter's primary effect was eliminating the base model's degenerate repetition loops. The base instruct model frequently fell into patterns like "Three friends, Andy, Beth..." or "Three\nThree\nThree..." instead of solving problems. The reasoning adapter replaced every one of these failures with actual step-by-step math chains.
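Repetition loops like the ones quoted above can be flagged automatically. The following is a minimal sketch, not the check used for this benchmark: the function name, the n-gram size, and the repeat threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def looks_degenerate(text: str, n: int = 3, max_repeats: int = 4) -> bool:
    """Return True if any word n-gram occurs more than `max_repeats` times.

    `n` and `max_repeats` are illustrative defaults, not tuned values.
    """
    words = text.split()
    if len(words) < n:
        return False
    # Count every sliding window of n consecutive words.
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(ngrams.values()) > max_repeats
```

For example, `looks_degenerate("Three\n" * 20)` flags the loop, while a normal worked solution with no heavily repeated phrases does not.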
8 questions gained: fuel calculation, geometric sequences, prime counting, syllogistic logic, integer sums, hexagon diagonals, GCD, and algebra. All of these were previously repetition-loop failures.
5 questions lost: 2 are parser artifacts (the model outputs the correct answer in a format the extractor misreads) and 3 are genuine regressions on arithmetic.
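The gains and losses reconcile with the table, and the two parser artifacts suggest the extractor could be made more tolerant. The sketch below is hypothetical: `extract_final_number` is not the extractor used for this benchmark, and the formats it handles (`\boxed{}`, thousands separators, currency signs) are assumed examples of what might trip a strict parser.

```python
import re

# Tally from the text above: base score plus gains minus losses.
base_correct, gained, lost = 10, 8, 5
adapter_correct = base_correct + gained - lost  # matches 13/20 in the table

# Two of the five losses are described as parser artifacts, so a more
# tolerant extractor could in principle recover up to two more questions.
parser_artifacts = 2
upper_bound = adapter_correct + parser_artifacts


def extract_final_number(text: str):
    """Return the last number in `text` as a float, or None if there is none."""
    # Prefer the contents of a LaTeX \boxed{...} answer when one is present.
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", text)
    candidate = boxed[-1] if boxed else text
    # Drop thousands separators and currency signs before matching.
    cleaned = candidate.replace(",", "").replace("$", "")
    numbers = re.findall(r"-?\d+(?:\.\d+)?", cleaned)
    return float(numbers[-1]) if numbers else None
```

Whether the adapter would actually reach 15/20 under a different extractor is untested; the base model's outputs would need to be rescored with the same extractor for a fair comparison.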
Reasoning Quality
The fine-tuned model shows clear reasoning patterns:
- Uses <|begin_of_thought|> tags for structured reasoning
- Applies formulas explicitly: n(n+1)/2, difference of squares (a+b)(a-b)
- Shows work step by step before arriving at answers
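The two formulas named in the list can be checked numerically. This is only an illustration of the identities themselves, with hypothetical helper names; it is not output from the model.

```python
def sum_first_n(n: int) -> int:
    """Closed form for 1 + 2 + ... + n."""
    return n * (n + 1) // 2


def difference_of_squares(a: int, b: int) -> int:
    """a^2 - b^2 computed as (a + b)(a - b)."""
    return (a + b) * (a - b)


# Each closed form agrees with the brute-force computation.
assert sum_first_n(100) == sum(range(1, 101)) == 5050
assert difference_of_squares(51, 49) == 51**2 - 49**2 == 200
```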
Training Details
- Method: QLoRA (4-bit NF4, double quantization)
- Dataset: MetaMathQA, 2000 examples, 1 epoch
- LoRA rank: 16, alpha 32
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, in_proj (attention + Mamba)
- Sequence length: 512
- Batch size: 2 (effective 4 with grad accumulation)
- Learning rate: 2e-4 with warmup
- Training time: ~70 minutes on RTX 3060 12GB
- Final loss: ~0.28, token accuracy ~91%
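A few derived quantities follow from the settings above. This sketch assumes a single GPU, that all 2000 examples were used for training with no held-out split, and the standard LoRA scaling of alpha / rank; none of these is stated explicitly, and the time per step is only as accurate as the "~70 minutes" figure.

```python
# Values taken from the list above.
num_examples = 2000
epochs = 1
per_device_batch = 2
effective_batch = 4
lora_rank, lora_alpha = 16, 32
train_minutes = 70  # approximate

# Gradient accumulation needed to reach the effective batch size.
grad_accum_steps = effective_batch // per_device_batch

# Number of optimizer updates over the whole run.
optimizer_steps = epochs * num_examples // effective_batch

# Rough wall-clock cost of one optimizer update.
seconds_per_step = train_minutes * 60 / optimizer_steps

# Standard LoRA scales the low-rank update by alpha / rank.
lora_scaling = lora_alpha / lora_rank

print(grad_accum_steps, optimizer_steps, seconds_per_step, lora_scaling)
```

Under these assumptions the run is about 500 optimizer steps at roughly 8.4 seconds each, with a LoRA scaling factor of 2.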
Note on Mamba LoRA Targets
PEFT explicitly blocks out_proj and conv1d as LoRA targets for Mamba-based models. This aligns with TII's documentation that out_proj weights are used directly within the Mamba kernel and should not be modified. The adapter targets in_proj for the Mamba layers instead.
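The target list and the blocked modules can be expressed as a small guard around the LoRA settings. This is a sketch under assumptions: `check_targets`, `LORA_KWARGS`, and `build_lora_config` are hypothetical names, `task_type` is assumed, and options the card does not state (dropout, bias) are left at PEFT defaults. The actual training script is not shown here.

```python
# Modules the note above says must not be LoRA targets on Mamba layers.
BLOCKED_FOR_MAMBA = {"out_proj", "conv1d"}

# Target modules as listed under Training Details. Note that o_proj
# (attention) is a different module from the blocked Mamba out_proj.
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj", "in_proj",
]


def check_targets(targets, blocked=BLOCKED_FOR_MAMBA):
    """Return `targets` unchanged, or raise if any blocked module is listed."""
    bad = sorted(set(targets) & set(blocked))
    if bad:
        raise ValueError(f"blocked LoRA targets for Mamba layers: {bad}")
    return list(targets)


LORA_KWARGS = dict(
    r=16,
    lora_alpha=32,
    target_modules=check_targets(TARGET_MODULES),
    task_type="CAUSAL_LM",  # assumed; not stated in the card
)


def build_lora_config():
    """Build a peft.LoraConfig from the settings above (requires peft)."""
    from peft import LoraConfig  # imported lazily so the guard runs without peft
    return LoraConfig(**LORA_KWARGS)
```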
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# 4-bit NF4 quantization with double quantization, matching the training setup.
# The Mamba out_proj is excluded from quantization (see the note above).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    llm_int8_skip_modules=["mamba.out_proj"],
)

# Load the quantized base model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/Falcon-H1-1.5B-Deep-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/Falcon-H1-1.5B-Deep-Instruct")

# Attach the LoRA adapter on top of the base model.
model = PeftModel.from_pretrained(model, "iAmBoosted/falcon-h1-1.5b-deep-reasoning")

prompt = "Solve step by step: If 4 workers can build a wall in 6 days, how many days would it take 3 workers?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding, as in the benchmark above.
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Hardware
- Training: RTX 3060 12GB, QLoRA 4-bit, ~70 minutes
- VRAM usage: ~4GB during training
- Requires: mamba-ssm CUDA kernels for reasonable training speed (naive path is ~10x slower)
- Container: nvcr.io/nvidia/pytorch:24.12-py3
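For context on the VRAM figure, the quantized weights alone can be estimated with simple arithmetic. This is a rough sketch under stated assumptions: it uses the nominal 1.5B parameter count, treats every weight as NF4-quantized (in practice embeddings, norms, and skipped modules such as mamba.out_proj stay in higher precision), and takes the roughly 0.127 bits per weight of quantization-constant overhead reported in the QLoRA paper for double quantization.

```python
# Nominal parameter count from the model name; the exact count differs slightly.
num_params = 1.5e9

# NF4 stores 4 bits per weight. The extra term is the per-weight overhead of
# the quantization constants under double quantization (QLoRA paper figure).
bits_per_weight = 4 + 0.127

weight_bytes = num_params * bits_per_weight / 8
weight_gb = weight_bytes / 1e9

print(f"quantized weights: ~{weight_gb:.2f} GB")
```

Under these assumptions the weights account for well under 1 GB, so most of the ~4GB observed during training likely goes to activations, LoRA parameters and optimizer state, unquantized modules, and CUDA overhead.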
Part of a Series
This is one of several experiments exploring non-transformer architectures:
- Falcon-H1 SLERP Merge: First SLERP merge of Mamba-2 hybrids
- Zamba2 SLERP Merge: Weight-sharing breaks standard merge tooling
License
Apache 2.0 (inherited from base model). Training data (MetaMathQA) is MIT licensed.