Instructions to use MicheRomChis/orchid-1.0 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use MicheRomChis/orchid-1.0 with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="MicheRomChis/orchid-1.0",
	filename="dpo_aligned-lora.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use MicheRomChis/orchid-1.0 with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf MicheRomChis/orchid-1.0
# Run inference directly in the terminal:
llama-cli -hf MicheRomChis/orchid-1.0

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf MicheRomChis/orchid-1.0
# Run inference directly in the terminal:
llama-cli -hf MicheRomChis/orchid-1.0

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf MicheRomChis/orchid-1.0
# Run inference directly in the terminal:
./llama-cli -hf MicheRomChis/orchid-1.0

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf MicheRomChis/orchid-1.0
# Run inference directly in the terminal:
./build/bin/llama-cli -hf MicheRomChis/orchid-1.0

Use Docker

docker model run hf.co/MicheRomChis/orchid-1.0

LM Studio
Jan
Ollama
How to use MicheRomChis/orchid-1.0 with Ollama:
```
ollama run hf.co/MicheRomChis/orchid-1.0
```

Unsloth Studio new

How to use MicheRomChis/orchid-1.0 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for MicheRomChis/orchid-1.0 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for MicheRomChis/orchid-1.0 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for MicheRomChis/orchid-1.0 to start chatting

Docker Model Runner
How to use MicheRomChis/orchid-1.0 with Docker Model Runner:
```
docker model run hf.co/MicheRomChis/orchid-1.0
```

Lemonade

How to use MicheRomChis/orchid-1.0 with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull MicheRomChis/orchid-1.0

Run and chat with the model

lemonade run user.orchid-1.0-{{QUANT_TAG}}

List all available models

lemonade list

Orchid 1.0 — How and why I built Colombia's first competitive LLM on a 4 GB laptop

by MicheRomChis - opened 7 days ago

Discussion

MicheRomChis

Owner 7 days ago

Hi HuggingFace community 👋

I'm Michelangelo, a 16-year-old developer from Bogotá, Colombia. I want to share what I've been building for the past several months — a complete ternary LLM stack built from scratch on consumer hardware.

The problem I ran into

I wanted to fine-tune Microsoft's BitNet b1.58-2B-4T with LoRA and serve it. Every inference engine I tried failed:

llama.cpp: crashes with a type-36 error on I2_S ternary weights
bitnet.cpp: loads the base model, but has no runtime LoRA support
Merging the adapter first: the fine-tuning silently disappears

That last one took me three weeks to understand. The problem is fundamental: LoRA deltas have a mean absolute value of ~0.00001. Ternary base weights have a scale of ~1.2. When you merge and re-quantize, every delta rounds to zero. The alignment training is completely erased.

I call this the ternary merge problem.

What I built to solve it

ternative

A C++/CUDA inference engine that never merges. It loads the I2_S base GGUF and the LoRA adapter GGUF separately, dequantizes the base to F32, applies the delta at full precision, then casts to F16 for inference.

OpenAI-compatible server (/v1/chat/completions, /v1/completions with logprobs/echo)
All 30 layers on a 4 GB GPU (F16 + INT8 auto-quantization)
~6–7 tok/s GPU decode, ~6 tok/s CPU with AVX2

Orchid 1.0

Using ternative as the serving layer, I trained and aligned a 2B ternary model through three stages on the same RTX 3050 laptop:

SFT-A: Reasoning and chain-of-thought
SFT-B: Identity, knowledge, multilingual alignment
ORPO-3: Preference optimization without a reference model (saves ~1.2 GB VRAM vs DPO)

Standard benchmark results (lm-eval-harness methodology, 50Q each):

Benchmark	Orchid 1.0	BitNet base	Delta
ARC-Challenge	56.0%	49.9%	+6.1 pp
HellaSwag	52.0%	68.4%	−16.4 pp
WinoGrande	74.0%	—	—
MMLU	38.6%	53.2%	−14.6 pp

The ARC improvement confirms the reasoning fine-tuning transferred. HellaSwag and MMLU regressions are the expected ORPO alignment tax — same pattern documented in the DPO/ORPO literature.

WinoGrande at 74.0% is comparable to Llama 3.2 3B despite being a 2B ternary model.

Full methodology, failure modes, and architecture analysis: technical paper (PDF)

What's next — Terse

Orchid proved the recipe works at 2B scale. Terse is the next step: a clean-room ternary sparse transformer family (Mini 1.5B/4.5B, Medium 9B/27B, Pro 27B/81B) with MoE routing, hybrid linear+full attention, and recurrent depth — targeting the same consumer hardware envelope as Orchid.

Happy to answer questions about the ternary merge problem, the CUDA kernels, the ORPO alignment process, or anything else.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment