smollm2-135m-instruct-block: cut-point ONNX (choochoo rung 2)

Prefill-only ONNX with an extra cut_hidden output (x1, the residual stream just before the last block's MLP), plus block.bin and block.json: the frozen last-block MLP, both norms, and the head, stored in fp16, so that a LoRA adapter can be trained on the last block's MLP in the browser.
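
To make the cut point concrete, here is a minimal sketch of how logits could be recomputed from x1 for a single position, with an optional LoRA adapter on the MLP. It is not taken from this repository. It assumes the usual Llama-style layout (RMSNorm, SwiGLU MLP, residual add, final norm, head), weights stored as [out, in] rows, and eps = 1e-5; the tensor names follow the block.json list, while `block_tail`, `linear`, and the adapter format are hypothetical.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    # x / sqrt(mean(x^2) + eps), then scaled element-wise by the learned weight.
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale * w for v, w in zip(x, weight)]

def matvec(W, x):
    # W is a list of rows with shape [out, in] (assumed layout).
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def silu(v):
    return v / (1.0 + math.exp(-v))

def linear(W, x, adapter=None, scale=1.0):
    # Frozen projection W x, plus the low-rank update scale * B (A x) if present.
    y = matvec(W, x)
    if adapter is not None:
        A, B = adapter  # A: [rank, in], B: [out, rank]
        delta = matvec(B, matvec(A, x))
        y = [yi + scale * di for yi, di in zip(y, delta)]
    return y

def block_tail(x1, t, adapters=None, eps=1e-5):
    # x1: one position of cut_hidden; t: dict of frozen tensors by name.
    adapters = adapters or {}
    h = rms_norm(x1, t["n2w"], eps)                      # assumed: norm before the MLP
    gate = linear(t["Wgate"], h, adapters.get("Wgate"))
    up = linear(t["Wup"], h, adapters.get("Wup"))
    act = [silu(g) * u for g, u in zip(gate, up)]        # SwiGLU (assumed)
    mlp_out = linear(t["Wdown"], act, adapters.get("Wdown"))
    x2 = [a + b for a, b in zip(x1, mlp_out)]            # residual add
    return matvec(t["Wlm"], rms_norm(x2, t["nfw"], eps))  # final norm, then head
```

With B initialised to zeros the adapter contributes nothing, so training starts from the frozen model's logits. A real browser implementation would batch this over [batch, seq, 576] with a tensor library rather than Python lists.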

  • ONNX outputs: logits, cut_hidden [batch, seq, 576]
  • block.json lists fp16 tensors n2w, Wgate, Wup, Wdown, nfw, Wlm (concatenated in block.bin)
  • ONNX dtype: q8 (onnx/model_quantized.onnx); the block.bin tensors are fp16
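
As an illustration of how the concatenated fp16 tensors might be sliced back out of block.bin, the sketch below walks a manifest in order and unpacks each tensor. The manifest schema shown here (a list of entries with `name` and `shape`) and little-endian byte order are assumptions; check them against the actual block.json before relying on this. `load_block` is a hypothetical helper, not part of this repository.

```python
import struct

def load_block(manifest, blob):
    """Slice a concatenated fp16 blob into named tensors (flat float lists)."""
    tensors = {}
    offset = 0
    for entry in manifest:
        count = 1
        for dim in entry["shape"]:
            count *= dim
        nbytes = 2 * count  # fp16 is 2 bytes per element
        chunk = blob[offset:offset + nbytes]
        if len(chunk) != nbytes:
            raise ValueError(f"block.bin too short for tensor {entry['name']}")
        # 'e' is the struct format code for IEEE 754 half precision.
        data = list(struct.unpack(f"<{count}e", chunk))
        tensors[entry["name"]] = {"shape": list(entry["shape"]), "data": data}
        offset += nbytes
    if offset != len(blob):
        raise ValueError("block.bin has bytes not covered by the manifest")
    return tensors
```

Usage would look like `load_block(json.load(open("block.json")), open("block.bin", "rb").read())` if block.json matches the assumed schema. For the full-size head a typed-array or NumPy view would be far faster than `struct`; the loop above only shows the layout.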