ConeML 348M Alpha Polish900

ConeML 348M Alpha Polish900 is a 348M-parameter alpha model trained from scratch on a custom layered curriculum and then activated through staged supervised fine-tuning (SFT). This release is a research artifact and alpha candidate, not a polished general assistant.

The main result is format activation: raw-completion probes understated the base checkpoint, while the trained chat format activated transitive binding and simple code-body generation. Arithmetic remains unresolved.

Why ConeML Exists

ConeML is an independent research effort testing whether compact language models can be built from scratch through deliberately staged curricula rather than scale alone.

The central question is whether small models can develop a usable reasoning substrate through corpus design, curriculum order, and staged activation training. In v10, the clearest signal was transitive relation binding for name-like entities. In raw base probes this was near-absent; after focused SFT it reached 100% on a fixed-template internal chat probe (depths 1-3, N=128 per depth). A held-out probe (2026-06-23, N=128 per depth, depths 1-5) shows that this generalizes to new names and new relation wording: with held-out names and a new older/younger relation, chat first-choice accuracy was 79% / 89% / 88% / 77% / 71% across depths 1-5, against chance of 50% / 33% / 25% / 20% / 17%. Generalization is weaker under unseen query phrasing (56% / 73% / 59% / 48% / 34%). For non-name entities such as colored cards it is at chance at depth 1 and only about 10 to 17 points above chance at depths 2-5 (51% / 50% / 41% / 31% / 28%). The result is real and held-out, but the binding is name-shaped and surface-sensitive, not general abstract transitive reasoning.

coneml-348m-alpha-polish900 is the first public artifact from that work. Its strongest result is not that every capability is solved, but that raw completion understated parts of the model: transitive reasoning and simple code-body behavior became much more visible after targeted SFT, while arithmetic remained a real unresolved weakness.

Intended Format

Use the role-marker chat format:

User:
<instruction>
Assistant:

Raw completion is not the intended use surface for the tuned checkpoint.
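
As a minimal sketch, a small helper can produce this format. The function name build_prompt is hypothetical, and the exact newline placement is an assumption taken from the prompt string used in the Loading section.

```python
def build_prompt(instruction: str) -> str:
    """Wrap a single-turn instruction in the User:/Assistant: role markers."""
    # Strip surrounding whitespace so each role marker sits on its own line.
    return f"User:\n{instruction.strip()}\nAssistant:\n"


prompt = build_prompt("What is 2 + 3? Return only the number.")
print(prompt)
```

The trailing newline after "Assistant:" leaves the model to begin its reply on a fresh line, matching the Loading example.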

License

Released for non-commercial use under CC BY-NC 4.0. Commercial use is not granted by this release.

Loading

This is a text-only causal language model. Use AutoModelForCausalLM or LlamaForCausalLM, not a multimodal model class.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "ConeML/coneml-348m-alpha-polish900"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
# Load the weights in float32. device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float32,
    device_map="auto",
)

# Role-marker chat format: "User:", the instruction, then "Assistant:".
prompt = "User:\nWhat is 2 + 3? Return only the number.\nAssistant:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding (do_sample=False) keeps the output deterministic.
outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=False,
    eos_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Architecture

  • Family: Llama-style decoder
  • Parameters: approximately 348M
  • Layers: 30
  • Hidden size: 1024
  • Attention heads: 8
  • KV heads: 2
  • Vocab size: 32768
  • Context length: 512
  • RoPE theta: 1000000
  • Tokenizer: custom 32K tokenizer
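
With 8 attention heads and 2 KV heads, each KV head is shared by 4 query heads (grouped-query attention), which keeps the KV cache small. The sketch below estimates the cache size at full context. It assumes head_dim = hidden size / attention heads, the usual Llama-style default but not confirmed by this card, and 4-byte float32 cache entries; the function name is illustrative.

```python
def kv_cache_bytes(layers, hidden_size, attention_heads, kv_heads,
                   context_length, bytes_per_value=4):
    """Estimate KV-cache size in bytes for one sequence at full context."""
    # Assumption: head_dim follows the Llama-style default.
    head_dim = hidden_size // attention_heads
    # Keys and values (factor 2) are cached per layer and per KV head.
    values_per_token = 2 * layers * kv_heads * head_dim
    return values_per_token * context_length * bytes_per_value


size = kv_cache_bytes(layers=30, hidden_size=1024, attention_heads=8,
                      kv_heads=2, context_length=512)
print(f"{size / 2**20:.1f} MiB per sequence")
```

Under these assumptions the cache comes to 30 MiB per 512-token sequence in float32.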

Training Lineage

  • Base selected near the balanced pretrain region: ckpt_0210000.pt
  • SFT stages: base210 -> SFT300 -> focus600 -> polish900
  • Final exported checkpoint: runs/v10_348m_cone_sft_polish900/sft_ckpt_0000300.pt

Internal Probe Results

These are internal diagnostic probes, not public benchmark claims.

Activation Progression

Raw-completion transitive first-name accuracy improved at every SFT stage and every depth, with the largest jump at Focus600:

Stage           Depth 1   Depth 2   Depth 3
Base210 raw     32.03%    24.22%    5.47%
SFT300 raw      32.81%    35.94%    12.50%
Focus600 raw    46.88%    61.72%    43.75%
Polish900 raw   53.91%    64.84%    53.12%

Chat-format transitive binding reached 100% on the fixed-template Focus600 and Polish900 internal probes (depths 1-3, in-distribution name pool). A separate held-out probe (2026-06-23) confirms generalization to new names and a new relation wording (79-89% at depths 1-3) but shows near-chance performance on non-name entities; see Held-out Transitive Validation below.

Stage            Depth 1   Depth 2   Depth 3   Math final numeric   Code strict exec
Focus600 chat    100.00%   100.00%   100.00%   28.12%               0.00%
Polish900 chat   100.00%   100.00%   100.00%   35.94%               16.67%

Chat-Format Probe

Probe                           Result
Transitive depth 1 first-name   100.00%
Transitive depth 2 first-name   100.00%
Transitive depth 3 first-name   100.00%
Math final numeric              35.94%
Math answer-anywhere            35.94%
Code strict exec                16.67%

Code note: strict execution fails mostly on indentation. The Polish900 chat probe generated plausible, correct return expressions for all 6 simple function probes, but 5 of the 6 were emitted with the wrong leading whitespace, which is why strict execution passes only 1 of 6 (16.67%). With indentation normalized, those return bodies execute.
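
A minimal sketch of what indentation normalization could look like follows. It is not the probe's actual harness: the function name, the 4-space indent, and the example signature and body are all hypothetical.

```python
import textwrap


def normalize_body(signature: str, body: str, indent: str = "    ") -> str:
    """Re-indent a generated function body under the given signature line."""
    # Drop blank edge lines, then remove any common leading whitespace.
    dedented = textwrap.dedent(body.strip("\n"))
    # Apply one consistent indentation level to every non-empty line.
    return signature.rstrip() + "\n" + textwrap.indent(dedented, indent) + "\n"


# Hypothetical generation: a correct expression emitted with no indentation,
# which would raise IndentationError if executed as-is.
source = normalize_body("def add(a, b):", "return a + b")
namespace = {}
exec(source, namespace)
print(namespace["add"](2, 3))
```

This only repairs bodies whose lines are uniformly shifted; a body with inconsistent indentation between its own lines would need a stricter repair step.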

Raw-Completion Probe

Probe                           Result
Transitive depth 1 first-name   53.91%
Transitive depth 2 first-name   64.84%
Transitive depth 3 first-name   53.12%
Math final numeric              17.97%
Code body rate                  0.00%

Held-out Transitive Validation (2026-06-23)

Polish900, chat surface, first-choice accuracy, N=128 per depth.

Suite                                           D1      D2      D3      D4      D5
SFT template + held-out names                   94.5%   96.1%   94.5%   94.5%   82.8%
Held-out names + new relation (older/younger)   78.9%   89.1%   88.3%   76.6%   71.1%
Unseen query phrasing + held-out names          56.3%   73.4%   59.4%   48.4%   33.6%
Non-name entities (cards, comes before)         50.8%   50.0%   41.4%   30.5%   28.1%
Chance                                          50%     33%     25%     20%     17%

Takeaway: held-out validation supports generalization across new names and relation wording for name-like entities in chat format. Accuracy is weaker under unseen query phrasing and declines on the deepest chains. For non-name entities it is at chance at depth 1 and only about 10 to 17 points above chance at depths 2-5. Raw completion is weaker than chat in every suite.
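
To put the table's numbers next to chance, the sketch below computes each cell's margin over chance and a binomial standard error. It rests on two assumptions not stated in this card: that chance at depth d is a uniform guess over d + 1 candidates (which reproduces the Chance row), and that the 128 trials per depth are independent. The function names are illustrative.

```python
import math

N_PER_DEPTH = 128

# First-choice accuracy (%) at depths 1-5, copied from the table above.
ACCURACY = {
    "SFT template + held-out names": [94.5, 96.1, 94.5, 94.5, 82.8],
    "Held-out names + new relation": [78.9, 89.1, 88.3, 76.6, 71.1],
    "Unseen query phrasing + held-out names": [56.3, 73.4, 59.4, 48.4, 33.6],
    "Non-name entities": [50.8, 50.0, 41.4, 30.5, 28.1],
}


def chance(depth: int) -> float:
    """Assumed chance rate: uniform guess over depth + 1 candidates."""
    return 1.0 / (depth + 1)


def margin_and_se(accuracy_pct: float, depth: int, n: int = N_PER_DEPTH):
    """Return (accuracy - chance, binomial standard error) as fractions."""
    p = accuracy_pct / 100.0
    return p - chance(depth), math.sqrt(p * (1.0 - p) / n)


for suite, row in ACCURACY.items():
    cells = []
    for depth, acc in enumerate(row, start=1):
        margin, se = margin_and_se(acc, depth)
        cells.append(f"D{depth} {margin * 100:+.1f} (se {se * 100:.1f})")
    print(f"{suite}: " + ", ".join(cells))
```

With N=128 the standard error is at most about 4.4 percentage points, so margins of a few points should be read as indistinguishable from chance.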

Strengths

  • Scratch-trained 348M model from a custom layered curriculum.
  • Strong SFT activation curve on transitive relation binding.
  • Chat-format transitive relation binding reaches 100% on a fixed-template internal probe (depths 1-3) and is held-out validated for name-like entities (79-89% at depths 1-3 with new names and a new relation wording). It degrades under unseen query phrasing and is at or only modestly above chance for non-name entities, so it is name-shaped binding rather than general transitive reasoning.
  • Simple code return bodies appear in chat format; the remaining failure is mostly indentation/formatting, not missing return-body content on the internal probe.

Known Limitations

  • Arithmetic remains the weakest major capability lane. Chat-format final numeric accuracy reached 35.94% on the internal probe, but reliable multi-digit arithmetic is not solved.
  • Raw completion is poor for code bodies and is not the intended tuned interface.
  • Code indentation is unstable without postprocessing.
  • Internal probes only; this card makes no public benchmark claims.
  • This is an alpha/research release, not a replacement for larger general assistants.

Reproducibility Artifacts

This local release directory includes:

  • training_summary.json
  • evals/v10_diagnostic_probe_0210000_gpu_matched_sft300.json
  • evals/diag_postsft_sft300_gpu.json
  • evals/diag_focus600_raw_gpu.json
  • evals/chat_activation_focus600_gpu.json
  • evals/chat_activation_polish900_gpu.json
  • evals/diag_polish900_raw_gpu.json
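
A quick way to confirm that a local copy is complete is to check for each listed file. This is a sketch only: the function name is hypothetical, "." stands in for wherever the release directory lives, and the internal layout of the JSON files is not assumed.

```python
from pathlib import Path

# File names copied from the list above.
EXPECTED_ARTIFACTS = [
    "training_summary.json",
    "evals/v10_diagnostic_probe_0210000_gpu_matched_sft300.json",
    "evals/diag_postsft_sft300_gpu.json",
    "evals/diag_focus600_raw_gpu.json",
    "evals/chat_activation_focus600_gpu.json",
    "evals/chat_activation_polish900_gpu.json",
    "evals/diag_polish900_raw_gpu.json",
]


def missing_artifacts(release_dir):
    """Return the expected artifact paths that are absent from release_dir."""
    root = Path(release_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]


# "." is a placeholder for the release directory.
print(missing_artifacts("."))
```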