PhyChip-SmolLM3-3B-instruct-GRPO-v1

LoRA adapter for PhyChip - a study of using ngspice as a verifiable RL reward to train a 1-3B language model to design analog circuits from natural-language specs.

  • Base model: HuggingFaceTB/SmolLM3-3B
  • Stage: instruct -> SFT -> GRPO (RLVR with an ngspice reward, KL 0.01, 150-step budget). This adapter is checkpoint-100.
  • Role in the study: negative result. RL on top of the collapsed instruct-SFT checkpoint could not recover circuit ability (reward stuck at ~0.05).
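
For readers less familiar with GRPO, the sketch below shows the group-relative advantage from the standard GRPO formulation: each sampled completion's reward is normalized against the mean and standard deviation of its own sampling group. The function name and the `eps` value are illustrative, and the training code used in this study is not shown in this card, so treat this as an assumption-laden illustration rather than the project's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds one scalar reward per completion sampled for the same prompt.
    `eps` is an illustrative stabilizer, not a value taken from the study.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every completion scores the same low reward: all advantages are zero.
flat = group_relative_advantages([0.05, 0.05, 0.05, 0.05])

# One completion scores higher: it gets a positive advantage, the rest negative.
mixed = group_relative_advantages([0.05, 0.05, 0.05, 1.0])
```

One possible reading, offered here as an observation and not as the author's analysis: when every completion in a group earns the same low reward, the normalized advantages vanish, so the policy gradient carries little signal to move away from the collapsed behavior.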

Evaluation (pass@1, greedy, ngspice + 23 spec harnesses)

| benchmark | pass@1 |
|---|---|
| phy-chip-bench-v1 (40) | 0/40 (0%) |
| phy-chip-bench-v2 (50, novel circuits) | 0/50 (0%) |
| AnalogCoder (24 textbook) | 0/24 (0%) |

  • Adapter: LoRA (PEFT), rank r=16, alpha=32.
  • Eval protocol: pass@1, greedy decoding (temperature=0); each generation simulated in ngspice and scored by 23 deterministic spec-check harnesses (the same code for every model).
  • Decontamination: AnalogCoder is eval-only (never in training data). phy-chip-bench-v2 was built with an 8-gram overlap gate (items at or above 0.40 overlap are blocked; the maximum observed overlap was 0.018).
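
The 8-gram overlap gate mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: overlap is taken to be the fraction of a candidate's whitespace-token 8-grams that also occur anywhere in the training corpus, and an item passes when that fraction is below 0.40. The card does not specify the tokenization or the exact overlap definition, and the function names are hypothetical.

```python
def ngrams(tokens, n=8):
    """Return the set of contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(candidate, reference_corpus, n=8):
    """Fraction of the candidate's n-grams that also appear in the corpus.

    Assumption: whitespace tokenization; the study's tokenizer is not stated.
    """
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0  # text shorter than n tokens has no n-grams to match
    ref = set()
    for doc in reference_corpus:
        ref |= ngrams(doc.split(), n)
    return len(cand & ref) / len(cand)


def passes_gate(candidate, reference_corpus, threshold=0.40, n=8):
    """Keep a benchmark item only if its overlap is below the threshold."""
    return overlap_fraction(candidate, reference_corpus, n) < threshold


train = ["a b c d e f g h i j"]
print(passes_gate("k l m n o p q r s t", train))  # unrelated item is kept
print(passes_gate("a b c d e f g h i j", train))  # verbatim copy is blocked
```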

How to load

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the instruct base model, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B", torch_dtype="bfloat16", device_map="auto")
model = PeftModel.from_pretrained(base, "NithinReddyG/PhyChip-SmolLM3-3B-instruct-GRPO-v1")
# The adapter reuses the base model's tokenizer.
tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

The base-vs-instruct finding (why this model exists)

This adapter is one of six in a controlled base-vs-instruct study for PhyChip (analog-circuit design with ngspice as a verifiable RL reward). The same SFT/RL recipe was applied starting from SmolLM3-3B-Base and from SmolLM3-3B (instruct):

| start | + SFT (bench_v1) | + SFT + GRPO (AnalogCoder) |
|---|---|---|
| base | 0 -> 16/40 (SFT lifts it) | 22/24 (RL generalizes) |
| instruct | 0 -> 0/40 (SFT collapses it) | 0/24 (RL cannot recover) |

SFT on the instruct model degenerated into syntactically valid but device-less SPICE ("resistor salad": netlists with no active devices), a case of catastrophic forgetting on an extreme out-of-distribution domain. RL could not climb out: the verifiable reward stayed stuck at ~0.05. Conclusion: fine-tune the base model, not the instruct model. This result supports the canonical base -> SFT -> RL ordering with a direct before/after measurement.
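
To make "device-less SPICE" concrete, here is a rough sketch of how such a netlist could be flagged. It relies only on the standard SPICE convention that an element's first letter gives its type (M for MOSFET, Q for BJT, J for JFET). It is not one of the study's 23 spec harnesses, the function names are hypothetical, and it deliberately ignores subcircuit instances (X), which may themselves contain transistors.

```python
ACTIVE_PREFIXES = ("M", "Q", "J")  # MOSFET, BJT, JFET element letters in SPICE


def active_device_lines(netlist):
    """Return the element lines that instantiate a transistor.

    Limitations of this sketch: X (subcircuit) instances are not expanded,
    and an uncommented title line starting with M, Q, or J would be miscounted.
    """
    found = []
    for raw in netlist.splitlines():
        line = raw.strip()
        # Skip blanks, comments (*), control cards (.), and continuations (+).
        if not line or line[0] in "*.+":
            continue
        if line[0].upper() in ACTIVE_PREFIXES:
            found.append(line)
    return found


def is_deviceless(netlist):
    """True when the netlist contains no transistor elements at all."""
    return not active_device_lines(netlist)


resistor_salad = """* divider
V1 in 0 DC 1
R1 in out 1k
R2 out 0 1k
.op
.end"""

common_source = """* cs amp
Vdd vdd 0 1.8
M1 out in 0 0 nmos W=1u L=180n
R1 vdd out 10k
.end"""

print(is_deviceless(resistor_salad))
print(is_deviceless(common_source))
```

A check like this only detects the absence of transistors; whether a circuit actually meets its spec still requires simulation, as in the ngspice-based evaluation described above.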

License

CC-BY-NC-SA-4.0 (research / educational use). The adapter inherits PhyChip's project license posture; the base model carries its own license.
