OPID-ALFWorld-1.7B

This model is trained using the OPID framework proposed in the paper:

OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning

Overview

OPID is an On-Policy Skill Distillation framework for long-horizon language agents. It turns completed on-policy trajectories into hierarchical hindsight skills and uses them as dense training supervision. During inference, the policy acts directly from the environment history without an analyzer, skill retrieval, or privileged context.

This checkpoint is Qwen3-1.7B fine-tuned with OPID on ALFWorld.

Key Features

  • On-policy skill distillation: OPID distills behavioral knowledge from the model's own completed trajectories instead of relying on offline demonstrations.
  • Hierarchical hindsight skills: Episode-level skills capture global workflows or failure-avoidance rules, while step-level skills capture local critical decisions.
  • Critical-first routing: Step-level skills are used at identified critical timesteps, and episode-level skills are used elsewhere.
  • Dense token-level supervision: OPID re-scores the same sampled response under original and skill-augmented contexts, turning the log-probability shift into a self-distillation advantage.
  • No inference overhead: The released policy does not require an analyzer, skill memory, retrieval module, or extra privileged prompt at test time.
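The routing and supervision bullets above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `route_skill` and `token_advantages` are hypothetical, and the sign convention (skill-augmented log-probability minus original log-probability), along with the absence of any clipping or normalization, is an assumption made for illustration.

```python
from typing import Dict, List, Set


def route_skill(
    step: int,
    critical_steps: Set[int],
    step_skills: Dict[int, str],
    episode_skill: str,
) -> str:
    """Critical-first routing (hypothetical helper).

    Use the step-level skill at an identified critical timestep,
    and fall back to the episode-level skill everywhere else.
    """
    if step in critical_steps and step in step_skills:
        return step_skills[step]
    return episode_skill


def token_advantages(
    logp_original: List[float],
    logp_skill: List[float],
) -> List[float]:
    """Per-token log-probability shift for one sampled response.

    Both inputs score the SAME response tokens: once under the original
    context and once under the skill-augmented context.
    Assumption: advantage_t = logp_skill_t - logp_original_t.
    """
    if len(logp_original) != len(logp_skill):
        raise ValueError("both scorings must cover the same response tokens")
    return [s - o for o, s in zip(logp_original, logp_skill)]


# Toy values, not taken from the paper.
skill = route_skill(
    step=2,
    critical_steps={2},
    step_skills={2: "open the fridge before cooling the object"},
    episode_skill="find the object, cool it, then place it",
)
advantages = token_advantages([-2.0, -1.0], [-1.5, -1.25])
```

In this toy case the first token becomes more likely under the skill-augmented context and receives a positive advantage, while the second becomes less likely and receives a negative one.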

Performance Highlights

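On ALFWorld with Qwen3-1.7B-Instruct, OPID achieves the highest average success rate among the compared methods (58.9, versus 53.9 for the next-best method, SDAR). It also has the highest score on the Look, Cool, and Pick2 task types.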

Method         Pick  Look  Clean  Heat  Cool  Pick2  Avg.
Vanilla        25.0  22.2    3.1   0.0  21.4    4.2  12.5
Skill-Prompt*  10.3  50.0   16.1   0.0   0.0    5.0   9.4
OPSD           26.3  33.3    9.1   0.0   4.5    5.3  14.1
GRPO           71.1  41.7   36.4  40.0  31.8   31.6  46.1
Skill-GRPO     27.6  54.5   22.7  27.3   0.0   19.2  21.1
Skill-GRPO*    31.4  42.9   51.9   8.3  11.5    7.1  28.1
GRPO+OPSD      38.2  50.0   30.8  28.6  30.0   21.1  32.0
Skill-SD       52.9  37.5   69.2  42.9  60.0   36.8  52.3
RLSD           50.0  37.5   61.5  19.0  50.0   21.1  42.2
SDAR           73.5  25.0   76.9  33.3  40.0   36.8  53.9
OPID           65.9  72.7   66.7  40.0  63.2   45.0  58.9

Numbers are success rates from the ALFWorld Qwen3-1.7B-Instruct block in the OPID results figure.

Quickstart

Here is a standard Transformers loading example:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Jinyang23/OPID-ALFWorld-1.7B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "You are in an ALFWorld environment. Decide the next valid action."

messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

ALFWorld Evaluation

The OPID repository provides local vLLM evaluation scripts for ALFWorld. After installing the repository and ALFWorld dependencies, run:

git clone https://github.com/jinyangwu/OPID.git
cd OPID

MODEL_PATH=Jinyang23/OPID-ALFWorld-1.7B \
MODEL_NAME=OPID-ALFWorld-1.7B \
bash examples/prompt_agent/run_local_vllm_alfworld.sh

You can also serve the model manually with vLLM:

vllm serve Jinyang23/OPID-ALFWorld-1.7B \
  --served-model-name OPID-ALFWorld-1.7B \
  --dtype bfloat16
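
Once the server is up, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library. It assumes the server is listening on vLLM's default address (`http://localhost:8000`) and that the model name matches the `--served-model-name` value above. The helper names `build_chat_payload` and `query_server` are hypothetical, and the single-message prompt is a placeholder rather than the ALFWorld prompting format used by the evaluation scripts.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # vLLM default; change if you pass --host/--port
SERVED_NAME = "OPID-ALFWorld-1.7B"     # must match --served-model-name


def build_chat_payload(observation: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-style chat completion request body (hypothetical helper)."""
    return {
        "model": SERVED_NAME,
        "messages": [{"role": "user", "content": observation}],
        "max_tokens": max_tokens,
    }


def query_server(observation: str, base_url: str = BASE_URL, timeout: float = 60.0) -> str:
    """POST the request to /chat/completions and return the generated text."""
    body = json.dumps(build_chat_payload(observation)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(query_server("You are in an ALFWorld environment. Decide the next valid action."))` sends the same toy prompt as the Quickstart.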

Model Details

  • Base Model: Qwen/Qwen3-1.7B
  • Training Method: OPID (On-Policy Skill Distillation)
  • Training Environment: ALFWorld
  • Model Architecture: Qwen3ForCausalLM
  • Precision: bfloat16
  • Context Length: 40960 tokens in the released config

Limitations

  • This checkpoint is specialized for ALFWorld-style text interaction tasks.
  • It may output invalid actions when used outside the ALFWorld prompting format.
  • The model inherits limitations from the Qwen3 base model and from RL training in simulated environments.
  • This release is intended for research use in language-agent evaluation.

Citation

If you use this model or the OPID framework in your research, please cite:

@misc{yang2026opid,
  title={OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning},
  author={Shuo Yang and Jinyang Wu and Zhengxi Lu and Yuhao Shen and Fan Zhang and Lang Feng and Shuai Zhang and Haoran Luo and Zheng Lian and Zhengqi Wen and Jianhua Tao},
  year={2026},
  eprint={2606.26790},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2606.26790}
}

Links

  • Paper: https://arxiv.org/abs/2606.26790
  • Code: https://github.com/jinyangwu/OPID
