OPID-ALFWorld-1.7B
This model is trained using the OPID framework proposed in the paper:
OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning
- Paper: https://arxiv.org/abs/2606.26790
- Hugging Face paper page: https://huggingface.co/papers/2606.26790
- Code: https://github.com/jinyangwu/OPID
Overview
OPID is an On-Policy Skill Distillation framework for long-horizon language agents. It turns completed on-policy trajectories into hierarchical hindsight skills and uses them as dense training supervision. During inference, the policy acts directly from the environment history without an analyzer, skill retrieval, or privileged context.
This checkpoint is the Qwen3-1.7B model trained with OPID on ALFWorld.
Key Features
- On-policy skill distillation: OPID distills behavioral knowledge from the model's own completed trajectories instead of relying on offline demonstrations.
- Hierarchical hindsight skills: Episode-level skills capture global workflows or failure-avoidance rules, while step-level skills capture local critical decisions.
- Critical-first routing: Step-level skills are used at identified critical timesteps, and episode-level skills are used elsewhere.
- Dense token-level supervision: OPID re-scores the same sampled response under original and skill-augmented contexts, turning the log-probability shift into a self-distillation advantage.
- No inference overhead: The released policy does not require an analyzer, skill memory, retrieval module, or extra privileged prompt at test time.
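The routing and re-scoring ideas above can be illustrated with a minimal sketch. It is not the OPID implementation: the function names, the dictionary layout for skills, and the sign convention (skill-augmented minus original) are assumptions made for illustration, and any clipping, normalization, or mixing with the RL advantage that the paper applies is left out.

```python
from typing import Dict, List, Sequence, Set


def route_skill(step: int, critical_steps: Set[int],
                step_skills: Dict[int, str], episode_skill: str) -> str:
    """Critical-first routing (hypothetical helper).

    Use the step-level skill at an identified critical timestep,
    and fall back to the episode-level skill everywhere else.
    """
    if step in critical_steps and step in step_skills:
        return step_skills[step]
    return episode_skill


def logprob_shift(logp_original: Sequence[float],
                  logp_skill: Sequence[float]) -> List[float]:
    """Per-token log-probability shift for the SAME sampled response.

    logp_original: token log-probs scored under the original context.
    logp_skill:    token log-probs scored under the skill-augmented context.
    Assumed convention: positive values mean the skill makes the token more likely.
    """
    if len(logp_original) != len(logp_skill):
        raise ValueError("both scorings must cover the same response tokens")
    return [s - o for o, s in zip(logp_original, logp_skill)]


# Toy values, invented for illustration only.
skill = route_skill(
    step=3,
    critical_steps={3},
    step_skills={3: "Open the fridge before trying to cool the object."},
    episode_skill="Locate the object first, then carry it to the target receptacle.",
)
shift = logprob_shift([-1.2, -0.7, -2.0], [-0.9, -0.7, -2.5])
```

In this sketch the shift acts as a dense, per-token signal: tokens that the skill-augmented context favors get a positive value, and tokens it disfavors get a negative one.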
Performance Highlights
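On ALFWorld with Qwen3-1.7B-Instruct, OPID achieves the highest average success rate among the compared methods in our results (58.9, versus 53.9 for the next-best method, SDAR). It also leads on Look, Cool, and Pick2, while other methods score higher on Pick, Clean, and Heat.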
| Method | Pick | Look | Clean | Heat | Cool | Pick2 | Avg. |
|---|---|---|---|---|---|---|---|
| Vanilla | 25.0 | 22.2 | 3.1 | 0.0 | 21.4 | 4.2 | 12.5 |
| Skill-Prompt* | 10.3 | 50.0 | 16.1 | 0.0 | 0.0 | 5.0 | 9.4 |
| OPSD | 26.3 | 33.3 | 9.1 | 0.0 | 4.5 | 5.3 | 14.1 |
| GRPO | 71.1 | 41.7 | 36.4 | 40.0 | 31.8 | 31.6 | 46.1 |
| Skill-GRPO | 27.6 | 54.5 | 22.7 | 27.3 | 0.0 | 19.2 | 21.1 |
| Skill-GRPO* | 31.4 | 42.9 | 51.9 | 8.3 | 11.5 | 7.1 | 28.1 |
| GRPO+OPSD | 38.2 | 50.0 | 30.8 | 28.6 | 30.0 | 21.1 | 32.0 |
| Skill-SD | 52.9 | 37.5 | 69.2 | 42.9 | 60.0 | 36.8 | 52.3 |
| RLSD | 50.0 | 37.5 | 61.5 | 19.0 | 50.0 | 21.1 | 42.2 |
| SDAR | 73.5 | 25.0 | 76.9 | 33.3 | 40.0 | 36.8 | 53.9 |
| OPID | 65.9 | 72.7 | 66.7 | 40.0 | 63.2 | 45.0 | 58.9 |
Numbers are success rates from the ALFWorld Qwen3-1.7B-Instruct block in the OPID results figure.
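As a quick check of the comparison, the following sketch ranks the methods by the Avg. column copied from the table above and reports the margin between the top two. The variable names are illustrative.

```python
# Avg. column copied from the results table above.
avg_success = {
    "Vanilla": 12.5, "Skill-Prompt*": 9.4, "OPSD": 14.1, "GRPO": 46.1,
    "Skill-GRPO": 21.1, "Skill-GRPO*": 28.1, "GRPO+OPSD": 32.0,
    "Skill-SD": 52.3, "RLSD": 42.2, "SDAR": 53.9, "OPID": 58.9,
}

# Sort methods from highest to lowest average success rate.
ranked = sorted(avg_success.items(), key=lambda kv: kv[1], reverse=True)
best, runner_up = ranked[0], ranked[1]
margin = round(best[1] - runner_up[1], 1)

print(f"{best[0]} leads {runner_up[0]} by {margin} points")
# prints "OPID leads SDAR by 5.0 points"
```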
Quickstart
Here is a standard Transformers loading example:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Jinyang23/OPID-ALFWorld-1.7B"

# Load the checkpoint with the dtype and device placement chosen automatically.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "You are in an ALFWorld environment. Decide the next valid action."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

# Render the chat messages into the model's prompt format.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)

# Drop the prompt tokens so only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
ALFWorld Evaluation
The OPID repository provides local vLLM evaluation scripts for ALFWorld. After installing the repository and ALFWorld dependencies, run:
git clone https://github.com/jinyangwu/OPID.git
cd OPID
MODEL_PATH=Jinyang23/OPID-ALFWorld-1.7B \
MODEL_NAME=OPID-ALFWorld-1.7B \
bash examples/prompt_agent/run_local_vllm_alfworld.sh
You can also serve the model manually with vLLM:
vllm serve Jinyang23/OPID-ALFWorld-1.7B \
    --served-model-name OPID-ALFWorld-1.7B \
    --dtype bfloat16
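Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server listens at its default address, `http://localhost:8000`; adjust `base_url` if you pass `--host` or `--port`. The helper names `build_chat_request` and `query_server` are illustrative and not part of the OPID repository.

```python
import json
import urllib.request


def build_chat_request(observation: str, model: str = "OPID-ALFWorld-1.7B",
                       max_tokens: int = 512) -> dict:
    """Build a chat-completions payload; `model` must match --served-model-name."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": observation}],
        "max_tokens": max_tokens,
    }


def query_server(payload: dict, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload to the chat-completions endpoint and return the reply text.

    Assumes a vLLM server is already running at `base_url`.
    """
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request(
    "You are in an ALFWorld environment. Decide the next valid action."
)
```

With the server up, `print(query_server(payload))` sends the request and prints the model's reply.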
Model Details
- Base Model: Qwen/Qwen3-1.7B
- Training Method: OPID (On-Policy Skill Distillation)
- Training Environment: ALFWorld
- Model Architecture: Qwen3ForCausalLM
- Precision: bfloat16
- Context Length: 40960 tokens in the released config
Limitations
- This checkpoint is specialized for ALFWorld-style text interaction tasks.
- It may output invalid actions when used outside the ALFWorld prompting format.
- The model inherits limitations from the Qwen3 base model and from RL training in simulated environments.
- This release is intended for research use in language-agent evaluation.
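Because the model may emit invalid actions, a caller can guard against them by checking each response against the commands the environment currently accepts. The sketch below assumes the caller has such a list available (ALFWorld can expose admissible commands in its step info); the helper names and the last-line fallback are illustrative choices, and the output format used by the OPID evaluation scripts may differ.

```python
from typing import List, Optional


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.strip().lower().split())


def select_valid_action(response: str, admissible: List[str]) -> Optional[str]:
    """Return the admissible command matching the response, or None.

    First tries the whole response, then its last non-empty line,
    which covers responses that reason before stating an action.
    """
    lookup = {normalize(command): command for command in admissible}
    cleaned = normalize(response)
    if cleaned in lookup:
        return lookup[cleaned]
    lines = [normalize(line) for line in response.splitlines() if line.strip()]
    if lines and lines[-1] in lookup:
        return lookup[lines[-1]]
    return None


# Toy admissible commands, invented for illustration.
admissible = ["go to fridge 1", "take mug 1 from countertop 1"]
action = select_valid_action("I should check the fridge.\nGo to Fridge 1", admissible)
```

When `select_valid_action` returns `None`, the caller can re-prompt the model or count the step as a failure, depending on the evaluation protocol.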
Citation
If you use this model or the OPID framework in your research, please cite:
@misc{yang2026opid,
title={OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning},
author={Shuo Yang and Jinyang Wu and Zhengxi Lu and Yuhao Shen and Fan Zhang and Lang Feng and Shuai Zhang and Haoran Luo and Zheng Lian and Zhengqi Wen and Jianhua Tao},
year={2026},
eprint={2606.26790},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2606.26790}
}
Links
- Paper: https://arxiv.org/abs/2606.26790
- Hugging Face paper page: https://huggingface.co/papers/2606.26790
- Code: https://github.com/jinyangwu/OPID
- Model: https://huggingface.co/Jinyang23/OPID-ALFWorld-1.7B