PhyChip-SmolLM3-3B-instruct-GRPO-v1

LoRA adapter for PhyChip - a study of using ngspice as a verifiable RL reward to train a 1-3B language model to design analog circuits from natural-language specs.

Base model: HuggingFaceTB/SmolLM3-3B
Stage: instruct -> SFT -> GRPO (RLVR, ngspice reward, KL 0.01, 150-step budget). Checkpoint-100.
Role in the study: negative result. RL on top of the collapsed instruct-SFT checkpoint cannot recover circuit-design ability (reward stays at ~0.05).

Evaluation (pass@1, greedy, ngspice + 23 spec harnesses)

benchmark	pass@1
phy-chip-bench-v1 (40)	0/40 (0%)
phy-chip-bench-v2 (50, novel circuits)	0/50 (0%)
AnalogCoder (24 textbook)	0/24 (0%)
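With greedy decoding there is exactly one generation per task, so pass@1 reduces to the fraction of tasks whose single generation passes its spec check. The sketch below shows that arithmetic and the row format used in the table above. The helper names `pass_at_1` and `format_row` are hypothetical and not part of the PhyChip code; the per-task booleans are assumed to come from the ngspice harnesses.

```python
from typing import Sequence, Tuple


def pass_at_1(passed: Sequence[bool]) -> Tuple[int, int, float]:
    """Return (n_passed, n_total, percent) given one pass/fail flag per task."""
    n_total = len(passed)
    n_passed = sum(bool(p) for p in passed)
    # Guard against an empty benchmark to avoid dividing by zero.
    percent = 100.0 * n_passed / n_total if n_total else 0.0
    return n_passed, n_total, percent


def format_row(name: str, passed: Sequence[bool]) -> str:
    """Format one benchmark row as 'name<TAB>k/n (p%)'."""
    n_passed, n_total, percent = pass_at_1(passed)
    return f"{name}\t{n_passed}/{n_total} ({percent:.0f}%)"


# Illustrative input: 24 tasks, none passing, mirroring the AnalogCoder row.
print(format_row("AnalogCoder (24 textbook)", [False] * 24))
```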

Adapter: LoRA (PEFT), rank r=16, alpha=32.
Eval protocol: pass@1, greedy decoding (temperature=0); each generation simulated in ngspice and scored by 23 deterministic spec-check harnesses (the same code for every model).
Decontamination: AnalogCoder is eval-only (never in training data). phy-chip-bench-v2 was built with an 8-gram overlap gate (block < 0.40; achieved max overlap 0.018).
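A minimal sketch of what an 8-gram overlap gate could look like. The exact definition used for phy-chip-bench-v2 is not given here, so this assumes whitespace tokenization, defines overlap as the fraction of a candidate's 8-grams that also occur in the training corpus, and rejects candidates at or above the 0.40 threshold. The function names are hypothetical.

```python
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 8) -> Set[NGram]:
    """Return the set of whitespace-token n-grams in text (assumed tokenization)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(candidate: str, corpus: Iterable[str], n: int = 8) -> float:
    """Fraction of the candidate's n-grams that also appear in the corpus."""
    cand = ngrams(candidate, n)
    if not cand:
        # Text shorter than n tokens has no n-grams, so nothing can overlap.
        return 0.0
    seen: Set[NGram] = set()
    for doc in corpus:
        seen |= ngrams(doc, n)
    return len(cand & seen) / len(cand)


def passes_gate(candidate: str, corpus: Iterable[str],
                threshold: float = 0.40, n: int = 8) -> bool:
    """Keep a candidate only if its overlap is below the threshold (assumption)."""
    return overlap_fraction(candidate, corpus, n) < threshold
```

Under this definition, the reported maximum overlap of 0.018 would mean that fewer than 2% of the 8-grams in the most-overlapping benchmark item also appear in the training data.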

How to load

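from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the instruct base model first; device_map="auto" requires the accelerate package.
base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B", torch_dtype="bfloat16", device_map="auto")
# Attach the LoRA adapter on top of the base weights.
model = PeftModel.from_pretrained(base, "NithinReddyG/PhyChip-SmolLM3-3B-instruct-GRPO-v1")
# The tokenizer comes from the base model.
tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")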

The base-vs-instruct finding (why this model exists)

This adapter is one of six in a controlled base-vs-instruct study for PhyChip (analog-circuit design with ngspice as a verifiable RL reward). The same SFT/RL recipe was applied starting from SmolLM3-3B-Base and from SmolLM3-3B (instruct):

start	+ SFT (bench_v1)	+ SFT + GRPO (AnalogCoder)
base	0 -> 16/40 (SFT lifts it)	22/24 (RL generalizes)
instruct	0 -> 0/40 (SFT collapses it)	0/24 (RL cannot recover)

SFT on the instruct model degenerated into syntactically valid but device-less SPICE ("resistor salad": netlists with no active devices), a case of catastrophic forgetting on an extreme out-of-distribution domain. RL could not climb out: the verifiable reward stayed at ~0.05. Conclusion: fine-tune the base model, not the instruct model. This supports the canonical base -> SFT -> RL ordering with a direct before/after measurement.
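The "device-less" failure mode can be detected with a simple netlist scan. This is an illustrative sketch, not the study's actual check. It relies on the SPICE conventions that the first line of a deck is a title and that the first letter of an element line names its type (R resistor, M MOSFET, Q BJT, J JFET, Z MESFET, X subcircuit). Treating X as potentially active and excluding diodes are assumptions, and the function names are hypothetical.

```python
from typing import List

# Element letters counted as "active" here (an assumption; adjust as needed).
ACTIVE_PREFIXES = frozenset("MQJZX")


def element_letters(netlist: str, skip_title: bool = True) -> List[str]:
    """Return the upper-cased first letter of every element line in a deck."""
    lines = netlist.splitlines()
    if skip_title:
        lines = lines[1:]  # SPICE treats the first line as a title
    letters = []
    for line in lines:
        stripped = line.strip()
        # Skip blanks, comments (*), directives (.), and continuations (+).
        if not stripped or stripped[0] in "*.+":
            continue
        letters.append(stripped[0].upper())
    return letters


def has_active_device(netlist: str, skip_title: bool = True) -> bool:
    """True if any element line starts with a letter in ACTIVE_PREFIXES."""
    return any(c in ACTIVE_PREFIXES for c in element_letters(netlist, skip_title))


resistor_salad = "divider\nV1 in 0 DC 1\nR1 in out 1k\nR2 out 0 1k\n.op\n.end"
common_source = "cs amp\nVDD vdd 0 1.8\nM1 out in 0 0 nmos W=1u L=180n\nR1 vdd out 10k\n.end"
print(has_active_device(resistor_salad), has_active_device(common_source))
```

A netlist that fails this scan may still simulate cleanly in ngspice. Passing syntax and simulation checks without implementing the specified circuit is what "syntactically valid but device-less" describes.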

License

CC-BY-NC-SA-4.0 (research / educational use). The adapter inherits PhyChip's project license posture; the base model carries its own license.
