OPID-ALFWorld-1.7B

This model was trained with the OPID framework, proposed in the paper:

OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning

Paper: https://arxiv.org/abs/2606.26790
Hugging Face paper page: https://huggingface.co/papers/2606.26790
Code: https://github.com/jinyangwu/OPID

Overview

OPID is an On-Policy Skill Distillation framework for long-horizon language agents. It turns completed on-policy trajectories into hierarchical hindsight skills and uses them as dense training supervision. During inference, the policy acts directly from the environment history without an analyzer, skill retrieval, or privileged context.

This checkpoint is Qwen3-1.7B fine-tuned with OPID on ALFWorld.

Key Features

On-policy skill distillation: OPID distills behavioral knowledge from the model's own completed trajectories instead of relying on offline demonstrations.
Hierarchical hindsight skills: Episode-level skills capture global workflows or failure-avoidance rules, while step-level skills capture local critical decisions.
Critical-first routing: Step-level skills are used at identified critical timesteps, and episode-level skills are used elsewhere.
Dense token-level supervision: OPID re-scores the same sampled response under original and skill-augmented contexts, turning the log-probability shift into a self-distillation advantage.
No inference overhead: The released policy does not require an analyzer, skill memory, retrieval module, or extra privileged prompt at test time.
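
The routing and re-scoring ideas listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the OPID implementation: the function names `route_skill` and `self_distillation_advantages` are hypothetical, the per-token log-probabilities are passed in as plain floats rather than computed by a model, and the advantage is taken to be the raw difference between the two scorings. The paper's objective may additionally normalize, clip, or combine this signal with an outcome-based advantage.

```python
from typing import Dict, List, Sequence


def route_skill(t: int, episode_skill: str, step_skills: Dict[int, str]) -> str:
    """Critical-first routing (sketch).

    `step_skills` maps critical timesteps to their step-level skill.
    Any timestep without an entry falls back to the episode-level skill.
    """
    return step_skills.get(t, episode_skill)


def self_distillation_advantages(
    logp_original: Sequence[float],
    logp_skill: Sequence[float],
) -> List[float]:
    """Per-token log-probability shift for one sampled response (sketch).

    Both inputs score the SAME response tokens: once under the original
    context and once under the skill-augmented context. A positive value
    means the skill made the sampled token more likely.
    """
    if len(logp_original) != len(logp_skill):
        raise ValueError("both scorings must cover the same response tokens")
    return [with_skill - without for without, with_skill in zip(logp_original, logp_skill)]


# Toy values, invented for illustration only.
episode_skill = "Locate the object first, then carry it to the target receptacle."
step_skills = {3: "Open the fridge before trying to cool the object."}

print(route_skill(3, episode_skill, step_skills))  # step-level skill at the critical step
print(route_skill(1, episode_skill, step_skills))  # episode-level skill elsewhere
print(self_distillation_advantages([-2.0, -0.5, -1.0], [-1.0, -0.5, -1.5]))
# [1.0, 0.0, -0.5]
```

Because the same sampled response is scored twice, every token receives its own signal, which is what makes the supervision dense compared with a single episode-level reward.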

Performance Highlights

On ALFWorld with Qwen3-1.7B-Instruct, OPID achieves the highest average success rate among the compared methods in our results (58.9, ahead of SDAR at 53.9 and GRPO at 46.1).

Method	Pick	Look	Clean	Heat	Cool	Pick2	Avg.
Vanilla	25.0	22.2	3.1	0.0	21.4	4.2	12.5
Skill-Prompt*	10.3	50.0	16.1	0.0	0.0	5.0	9.4
OPSD	26.3	33.3	9.1	0.0	4.5	5.3	14.1
GRPO	71.1	41.7	36.4	40.0	31.8	31.6	46.1
Skill-GRPO	27.6	54.5	22.7	27.3	0.0	19.2	21.1
Skill-GRPO*	31.4	42.9	51.9	8.3	11.5	7.1	28.1
GRPO+OPSD	38.2	50.0	30.8	28.6	30.0	21.1	32.0
Skill-SD	52.9	37.5	69.2	42.9	60.0	36.8	52.3
RLSD	50.0	37.5	61.5	19.0	50.0	21.1	42.2
SDAR	73.5	25.0	76.9	33.3	40.0	36.8	53.9
OPID	65.9	72.7	66.7	40.0	63.2	45.0	58.9

Numbers are success rates (%) from the ALFWorld Qwen3-1.7B-Instruct block in the OPID results figure.
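
As a quick sanity check on the table, the snippet below ranks the methods by their reported average and computes OPID's margin over the runner-up and over GRPO. The values are copied from the Avg. column above; the variable names are illustrative.

```python
# Average success rates (%) copied from the "Avg." column of the table above.
avg_success = {
    "Vanilla": 12.5, "Skill-Prompt*": 9.4, "OPSD": 14.1, "GRPO": 46.1,
    "Skill-GRPO": 21.1, "Skill-GRPO*": 28.1, "GRPO+OPSD": 32.0,
    "Skill-SD": 52.3, "RLSD": 42.2, "SDAR": 53.9, "OPID": 58.9,
}

# Sort methods from highest to lowest average.
ranked = sorted(avg_success.items(), key=lambda item: item[1], reverse=True)
best, runner_up = ranked[0], ranked[1]

# Round to one decimal to match the table's precision.
margin_over_runner_up = round(best[1] - runner_up[1], 1)
margin_over_grpo = round(avg_success["OPID"] - avg_success["GRPO"], 1)

print(best[0], runner_up[0], margin_over_runner_up, margin_over_grpo)
# OPID SDAR 5.0 12.8
```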

Quickstart

Here is a standard Transformers loading example:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Jinyang23/OPID-ALFWorld-1.7B"

# Load the weights in the checkpoint's dtype and place them on the available
# device(s); device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Placeholder prompt; a real ALFWorld prompt would also carry the task and the
# environment history.
prompt = "You are in an ALFWorld environment. Decide the next valid action."

messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
# Render the conversation with the model's chat template and open the assistant turn.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
# Strip the prompt tokens so that only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
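
Since the policy acts from the environment history, a multi-step rollout needs some way to fold earlier observations and actions into the prompt. The helper below is a minimal sketch of that idea. The function name `build_messages` and the prompt layout are hypothetical, not the format used to train this checkpoint; the training-time format lives in the OPID repository, and the model may produce invalid actions under a different layout.

```python
from typing import Dict, List, Sequence, Tuple


def build_messages(
    task: str,
    history: Sequence[Tuple[str, str]],
    current_observation: str,
    admissible_actions: Sequence[str],
) -> List[Dict[str, str]]:
    """Fold an interaction history into one user message (hypothetical layout).

    `history` holds (observation, action) pairs from earlier steps, oldest first.
    """
    lines = [f"Task: {task}", ""]
    for step, (observation, action) in enumerate(history, start=1):
        lines.append(f"Step {step} observation: {observation}")
        lines.append(f"Step {step} action: {action}")
    lines.append(f"Current observation: {current_observation}")
    lines.append("Admissible actions: " + ", ".join(admissible_actions))
    lines.append("Decide the next valid action.")
    return [{"role": "user", "content": "\n".join(lines)}]


# Toy episode, invented for illustration only.
example = build_messages(
    task="put a clean mug in the cabinet",
    history=[("You are in the middle of a room.", "go to sinkbasin 1")],
    current_observation="You arrive at sinkbasin 1. You see a mug 1.",
    admissible_actions=["take mug 1 from sinkbasin 1", "look"],
)
print(example[0]["content"])
```

The returned list can be passed to `tokenizer.apply_chat_template(..., tokenize=False, add_generation_prompt=True)` in place of `messages` in the loading example.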

ALFWorld Evaluation

The OPID repository provides local vLLM evaluation scripts for ALFWorld. After installing the repository and ALFWorld dependencies, run:

git clone https://github.com/jinyangwu/OPID.git
cd OPID

MODEL_PATH=Jinyang23/OPID-ALFWorld-1.7B \
MODEL_NAME=OPID-ALFWorld-1.7B \
bash examples/prompt_agent/run_local_vllm_alfworld.sh

You can also serve the model manually with vLLM:

vllm serve Jinyang23/OPID-ALFWorld-1.7B \
  --served-model-name OPID-ALFWorld-1.7B \
  --dtype bfloat16
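
Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library. It assumes the server listens on `http://localhost:8000/v1` (vLLM's default host and port; adjust if you pass `--host` or `--port`), and the helper names `build_chat_payload` and `send_chat_request` are illustrative.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_chat_payload(
    messages: List[Dict[str, str]],
    model: str = "OPID-ALFWorld-1.7B",  # must match --served-model-name
    max_tokens: int = 512,
) -> Dict[str, Any]:
    """Assemble the JSON body for a chat completions request."""
    return {"model": model, "messages": messages, "max_tokens": max_tokens}


def send_chat_request(
    payload: Dict[str, Any],
    base_url: str = "http://localhost:8000/v1",  # assumed default address
) -> str:
    """POST the payload to the server and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_payload(
    [{"role": "user", "content": "You are in an ALFWorld environment. Decide the next valid action."}]
)
print(json.dumps(payload, indent=2))
```

With the server up, `send_chat_request(payload)` returns the model's reply as a string.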

Model Details

Base Model: Qwen/Qwen3-1.7B
Training Method: OPID (On-Policy Skill Distillation)
Training Environment: ALFWorld
Model Architecture: Qwen3ForCausalLM
Precision: bfloat16
Context Length: 40960 tokens in the released config

Limitations

This checkpoint is specialized for ALFWorld-style text interaction tasks.
It may output invalid actions when used outside the ALFWorld prompting format.
The model inherits limitations from the Qwen3 base model and from RL training in simulated environments.
This release is intended for research use in language-agent evaluation.

Citation

If you use this model or the OPID framework in your research, please cite:

@misc{yang2026opid,
  title={OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning},
  author={Shuo Yang and Jinyang Wu and Zhengxi Lu and Yuhao Shen and Fan Zhang and Lang Feng and Shuai Zhang and Haoran Luo and Zheng Lian and Zhengqi Wen and Jianhua Tao},
  year={2026},
  eprint={2606.26790},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2606.26790}
}
