YapLlama-1B

Llama 3.2-1B fine-tuned on 600 OpenThoughts rows for chain-of-thought reasoning.

Named honestly. It will show its work, at length, whether you asked for that or not.

What it is

QLoRA fine-tune of Llama 3.2-1B-Instruct on a sampled subset of OpenThoughts-114k. Goal was to transfer structured CoT reasoning behavior into a 1B model quickly and cheaply. It didn't quite get the memo.

Training

Setting Value
Base model Llama-3.2-1B-Instruct
Method QLoRA (4-bit NF4, LoRA r=16)
Dataset OpenThoughts-114k, 600 rows sampled
Hardware RTX 4060 8GB
Attention FlashAttention 2
Packing Enabled

Short eval results

"A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost?"

Clean algebra, correct calculations, structured steps, passed!

"I have a 3 gallon jug and a 5 gallon jug. I need exactly 4 gallons. How?"

Right intuition, wrong intermediate steps, got lucky, C+.

"There are 12 fish in a tank. Half of them drown. How many are left?"

Accepted false premise, confidently answered 6, failed.

"A train leaves Chicago at 60mph. Another leaves New York 2 hours later at 90mph. The cities are 790 miles apart. Where do they meet?"

Yapped for 3 minutes at ~200 tk/s, filled its context, and had a meltdown.

Honest assessment

CoT format transferred cleanly on well-formed algebra problems. Verbosity is through the roof. Llama 3.2's base personality bleeds through, producing longer and sometimes circular reasoning before landing on an answer. Fits the name.

State tracking is marginally better than previous tests but still unreliable, often gets correct intuitions through broken intermediate reasoning rather than genuine simulation. Premise checking is absent entirely, consistent with a training set of well-formed problems where the model never had to question the question.

Roughly ties with my other model Qwemini-0.5B-Alpha on eval despite 2x the parameters. Dataset quality and premise-checking coverage matter more than model size at this scale.

Inference speed (llama.cpp, GGUF, 1B, RTX 4060)

Format Speed
f16 ~90 tok/s
Q5_K_S ~220 tok/s

Run Q5_K_S. The quality difference from f16 is negligible at 1B, the speed and VRAM difference is not.

What would improve it

  • Premise-checking traces (~50 examples where the model catches and rejects a false setup)
  • More data โ€” 600 rows is enough to transfer the format, not enough to deeply generalize
  • Bigger base โ€” 3B would close the state tracking gap significantly
Downloads last month
151
GGUF
Model size
1B params
Architecture
llama
Hardware compatibility
Log In to add your hardware

5-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Collection including NotHereNorThere/YapLlama-1b