fyi

by jukefr - opened 6 days ago

•

sample on this subset of term-bench2.0 tasks was already enough to me feel free to bench more if you want, tested with pi-agent

  ┌───────────────────────────┬─────────────┬────────────┬───────┬───────┬───────┬───────┐
  │           Task            │   Qwen3.6   │   Darwin   │ Q dur │ D dur │ Q out │ D out │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ fix-git                   │         3/3 │        2/3 │   41s │   31s │  2.3K │  1.6K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ prove-plus-comm           │         2/3 │        2/3 │  377s │   36s │   11K │  1.9K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ cobol-modernization       │         1/3 │        2/3 │  439s │  215s │   26K │   13K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ overfull-hbox             │         0/3 │        0/3 │  484s │  103s │   29K │  5.9K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ break-filter-js-from-html │         0/3 │        0/3 │  297s │  275s │   18K │   16K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ filter-js-from-html       │         0/3 │        0/3 │   80s │  671s │  4.6K │   33K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ kv-store-grpc             │         2/3 │        0/3 │   34s │   42s │  1.3K │  1.8K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ multi-source-data-merger  │         3/3 │        1/3 │   64s │   98s │  3.5K │  5.8K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ regex-log                 │         1/3 │        0/3 │  461s │  580s │   28K │   34K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ git-leak-recovery         │         2/3 │        1/3 │   35s │   39s │  1.8K │  1.9K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ pypi-server               │         0/3 │        0/3 │   23s │   46s │  0.9K │  2.5K │
  ├───────────────────────────┼─────────────┼────────────┼───────┼───────┼───────┼───────┤
  │ TOTAL                     │ 14/33 (42%) │ 8/33 (24%) │       │       │       │       │
  └───────────────────────────┴─────────────┴────────────┴───────┴───────┴───────┴───────┘

SeaWolf-AI

FINAL_Bench org 6 days ago

Thanks for running the benchmark and sharing the numbers.Quick note on positioning: Darwin-36B-Opus is published as a reasoning-focused evolutionary merge (GPQA Diamond 88.4%, tying Qwen3.5-397B-A17B), not as an agentic coder. The Darwin Opus line is bred for graduate-level scientific reasoning — physics, chemistry, biology Q&A in the GPQA style — and is not tuned for terminal/agent workflows. For agent and coding tasks we'd recommend the Qwen Coder line.Two observations on your runs that may explain part of the gap:
System prompt: Darwin needs enable_thinking=true via the Qwen chat template, and the agent harness needs to leave room for the ... block before tool calls. If pi-agent strips or truncates the thinking trace, Darwin loses most of its reasoning lift. You can confirm in the output — if you don't see a block, the harness is filtering it.

Output token compactness is by design: Darwin Opus inherits a Father with 75% Gated-DeltaNet + 25% Gated-Attention. Post-thinking responses are deliberately compressed (FFN α asymmetry from the merge genome), which is the opposite of what agent benchmarks reward — they reward verbose step-by-step tool chains. That's a known trade-off for this checkpoint, not a regression.
We'd be very interested to see your numbers on the same subset with (a) enable_thinking=true set in the request, and (b) the agent template that preserves the thinking trace. Happy to help if there's a specific task where you'd like to dig in.For full context: the Darwin Family methodology is currently under peer review at ARR May 2026 (training-free reasoning scaling) — coding/agent performance is explicitly out of scope of that submission.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment