Qwen3-0.6B-v0.1 — p3_decide_no_ex GRPO checkpoint (step 2000)

A GRPO-trained Qwen3-0.6B variant from the reason-over-search Phase-1 sweep. This is the no-example ablation: the system prompt gives explicit decision rules for when to call the retriever but provides no in-context demonstration. It is the companion to pantomiman/Qwen3-0.6B-v0, which uses the same algorithm, reward, and data with a prompt that includes a worked example (run id z7kcxfof, "p1_basic_w_ex").

Field            Value
Run id (verl)    p3_decide_no_ex_el6s2d2h
Step / horizon   2000 / 9968 (peak end-of-run reward 0.215, +43% rel)
Base             Qwen/Qwen3-0.6B (post-trained chat, hybrid enable_thinking)
Algorithm        GRPO (verl-legacy), paper-faithful Search-R1 EM-only reward
Training data    PeterJinGo/nq_hotpotqa_train (NQ + HotpotQA mixture)
Action format    <search>…</search> / <information>…</information> (Search-R1 / ReSearch)
Hardware         1× A100-40GB (ALICE cluster)
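
The Algorithm row can be illustrated with a minimal sketch. It assumes SQuAD-style answer normalization for the exact-match (EM) reward and a population standard deviation for GRPO's within-group normalization. Neither detail is confirmed by this card, and all function names are hypothetical rather than taken from verl or Search-R1.

```python
import re
import statistics
import string
from typing import List, Optional


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, squeeze whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def extract_answer(rollout: str) -> Optional[str]:
    """Return the content of the last <answer>...</answer> block, or None."""
    matches = re.findall(r"<answer>(.*?)</answer>", rollout, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def em_reward(rollout: str, gold_answers: List[str]) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer, else 0.0."""
    prediction = extract_answer(rollout)
    if prediction is None:
        return 0.0
    normalized = normalize_answer(prediction)
    return float(any(normalized == normalize_answer(g) for g in gold_answers))


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mean = statistics.fmean(rewards)
    # Assumption: population std; a given trainer may use the sample std instead.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With a binary EM reward, a group in which every rollout is right (or every rollout is wrong) yields all-zero advantages, so only prompts with mixed outcomes contribute a learning signal.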

Why two checkpoints (v0 vs v0.1)

Two prompt variants from the same Phase-1 sweep:

  • v0 (p1_basic_w_ex_z7kcxfof) — system prompt includes a worked tool-use example.
  • v0.1 (p3_decide_no_ex_el6s2d2h, this repo) — system prompt states the decision rules verbatim without an example.

The pair isolates one question: are decision rules alone sufficient, or is a demonstration needed? Everything else is held fixed (algorithm, reward, data, base model). For the head-to-head eval and the matched training-curve panel, see the project's RESULTS_v2.md / SUPERVISOR_MEETING_2026-05-07.md (Milestone 3.1).

Action format

The model emits <search>QUERY</search> to invoke a wiki-18 retriever and consumes the top-K passages wrapped in <information>…</information> before it continues reasoning. The final answer is wrapped in <answer>…</answer>. This matches the published ReSearch / Search-R1 schemes; it is not the <tool_call> JSON variant from the local v1 ablation block.
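
As a rough illustration of the tag scheme above, the sketch below classifies a generated segment and wraps retrieved passages. The function names and the "Doc N:" passage layout are hypothetical; the layout used at training time is defined by the project's prompt template and should take precedence.

```python
import re
from typing import List, Tuple

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_action(generation: str) -> Tuple[str, str]:
    """Classify a segment as ("search", query), ("answer", text) or ("none", "")."""
    search = SEARCH_RE.search(generation)
    answer = ANSWER_RE.search(generation)
    # Act on whichever complete tag appears first in the segment.
    if search and (not answer or search.start() < answer.start()):
        return "search", search.group(1).strip()
    if answer:
        return "answer", answer.group(1).strip()
    return "none", ""


def format_information(passages: List[str]) -> str:
    """Wrap passages in an <information> block (the "Doc N:" layout is an assumption)."""
    body = "\n".join(f"Doc {i}: {p.strip()}" for i, p in enumerate(passages, start=1))
    return f"\n\n<information>{body}</information>\n\n"
```

For example, parse_action("<search> capital of France </search>") returns ("search", "capital of France").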

Quickstart (SGLang)

python -m sglang.launch_server \
  --model-path pantomiman/Qwen3-0.6B-v0.1 \
  --host 127.0.0.1 --port 3000 \
  --tp 1 --context-length 8192 --dtype bfloat16 --trust-remote-code

Pair the server with a wiki-18 retriever that answers <search> queries and an inference loop that injects the retrieved passages back as <information>…</information>. The full pipeline and prompt template are in pantomiman/reason-over-search (project README); the prompt the model was trained with lives at evaluation_research/flashrag/search_r1/templates.py::P3_DECIDE_NO_EX_TEMPLATE and must be used byte-for-byte.
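
A minimal sketch of such an inference loop follows, assuming the server above exposes SGLang's OpenAI-compatible /v1/completions endpoint and that the prompt has already been rendered from the training template. The `retrieve` callable stands in for the wiki-18 retriever; it is hypothetical, as are `max_tokens`, `max_turns`, and the plain newline-joined passage layout. This is not the project's own pipeline.

```python
import json
import re
import urllib.request
from typing import Callable, List, Optional, Tuple


def sglang_generate(prompt: str, base_url: str = "http://127.0.0.1:3000") -> str:
    """Request one completion that stops at the end of a search or answer tag."""
    payload = {
        "model": "pantomiman/Qwen3-0.6B-v0.1",
        "prompt": prompt,
        "max_tokens": 512,  # illustrative value, not taken from the project
        "stop": ["</search>", "</answer>"],
    }
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["text"]


def restore_closing_tag(text: str) -> str:
    """Re-append a closing tag that was trimmed because it was the stop string."""
    for tag in ("search", "answer"):
        if text.rfind(f"<{tag}>") > text.rfind(f"</{tag}>"):
            return text + f"</{tag}>"
    return text


def run_episode(
    prompt: str,
    generate: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    max_turns: int = 4,
) -> Tuple[Optional[str], str]:
    """Alternate generation and retrieval until an answer appears or turns run out."""
    transcript = prompt
    for _ in range(max_turns):
        segment = restore_closing_tag(generate(transcript))
        transcript += segment
        answer = re.search(r"<answer>(.*?)</answer>", segment, re.DOTALL)
        if answer:
            return answer.group(1).strip(), transcript
        search = re.search(r"<search>(.*?)</search>", segment, re.DOTALL)
        if search is None:
            break  # neither tag was produced; stop rather than loop without progress
        passages = retrieve(search.group(1).strip())
        transcript += "\n\n<information>" + "\n".join(passages) + "</information>\n\n"
    return None, transcript
```

Passing `generate` and `retrieve` as callables keeps the loop independent of the serving stack: `run_episode(prompt, sglang_generate, my_retriever)` would talk to the live server, while stub functions allow the control flow to be checked offline.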

Provenance

This is a verl FSDP shard (global_step_2000/actor/model_world_size_1_rank_0.pt) merged to HF safetensors via:

python -m verl.model_merger merge \
    --backend fsdp \
    --local_dir <run>/global_step_2000/actor \
    --target_dir <hf_out_dir>

Tokenizer is the upstream Qwen/Qwen3-0.6B tokenizer (no vocabulary changes; <search> / <information> are taught to the policy at training time, not added as new tokens).

License & base model

Apache-2.0, inherited from Qwen/Qwen3-0.6B. See the base-model card for sampling defaults (thinking / non-thinking modes), agentic-use guidance, and best practices.

Citation

If this checkpoint is useful in your work, please cite the upstream Search-R1 + ReSearch papers and the Qwen3 technical report.

@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388}
}