Orchid 1.0 β€” How and why I built Colombia's first competitive LLM on a 4 GB laptop

#1
by MicheRomChis - opened

Hi HuggingFace community πŸ‘‹

I'm Michelangelo, a 16-year-old developer from BogotΓ‘, Colombia. I want to share what I've been building for the past several months β€” a complete ternary LLM stack built from scratch on consumer hardware.


The problem I ran into

I wanted to fine-tune Microsoft's BitNet b1.58-2B-4T with LoRA and serve it. Every inference engine I tried failed:

  • llama.cpp: crashes with a type-36 error on I2_S ternary weights
  • bitnet.cpp: loads the base model, but has no runtime LoRA support
  • Merging the adapter first: the fine-tuning silently disappears

That last one took me three weeks to understand. The problem is fundamental: LoRA deltas have a mean absolute value of ~0.00001. Ternary base weights have a scale of ~1.2. When you merge and re-quantize, every delta rounds to zero. The alignment training is completely erased.

I call this the ternary merge problem.


What I built to solve it

ternative

A C++/CUDA inference engine that never merges. It loads the I2_S base GGUF and the LoRA adapter GGUF separately, dequantizes the base to F32, applies the delta at full precision, then casts to F16 for inference.

  • OpenAI-compatible server (/v1/chat/completions, /v1/completions with logprobs/echo)
  • All 30 layers on a 4 GB GPU (F16 + INT8 auto-quantization)
  • ~6–7 tok/s GPU decode, ~6 tok/s CPU with AVX2

Orchid 1.0

Using ternative as the serving layer, I trained and aligned a 2B ternary model through three stages on the same RTX 3050 laptop:

  • SFT-A: Reasoning and chain-of-thought
  • SFT-B: Identity, knowledge, multilingual alignment
  • ORPO-3: Preference optimization without a reference model (saves ~1.2 GB VRAM vs DPO)

Standard benchmark results (lm-eval-harness methodology, 50Q each):

Benchmark Orchid 1.0 BitNet base Delta
ARC-Challenge 56.0% 49.9% +6.1 pp
HellaSwag 52.0% 68.4% βˆ’16.4 pp
WinoGrande 74.0% β€” β€”
MMLU 38.6% 53.2% βˆ’14.6 pp

The ARC improvement confirms the reasoning fine-tuning transferred. HellaSwag and MMLU regressions are the expected ORPO alignment tax β€” same pattern documented in the DPO/ORPO literature.

WinoGrande at 74.0% is comparable to Llama 3.2 3B despite being a 2B ternary model.

Full methodology, failure modes, and architecture analysis: technical paper (PDF)


What's next β€” Terse

Orchid proved the recipe works at 2B scale. Terse is the next step: a clean-room ternary sparse transformer family (Mini 1.5B/4.5B, Medium 9B/27B, Pro 27B/81B) with MoE routing, hybrid linear+full attention, and recurrent depth β€” targeting the same consumer hardware envelope as Orchid.


Happy to answer questions about the ternary merge problem, the CUDA kernels, the ORPO alignment process, or anything else.

Sign up or log in to comment