Question about OSWorld-Verified evaluation settings for Qwen3.5-9B

#54
by goonco - opened

Hi Qwen Team,

Thank you for releasing Qwen3.5-9B.

We are trying to reproduce the OSWorld-Verified result reported for Qwen3.5-9B, but we observe a large gap between the reported score and our evaluation result.

Since OSWorld-Verified appears to be sensitive to the evaluation settings and the agent implementation, we would appreciate it if you could share more details about the OSWorld evaluation setup used for the reported score.

For our reproduction, we used the following setup:

  1. We used the official OSWorld evaluation pipeline with an agent adapted from the Qwen3-VL agent implementation (qwen3vl_agent.py).

  2. We followed the base settings described in the OSWorld README, including:

    • observation_type: screenshot
    • sleep_after_execution: 3
    • max_steps: 15
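
To make the comparison easier, here is a minimal Python sketch of how we record these settings and diff them against another configuration. The `EvalSettings` class and `diff_settings` helper are hypothetical names of our own, not part of OSWorld, and the `max_steps=50` value is a placeholder for illustration only, not a claim about the settings behind the reported score.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalSettings:
    """The three settings from our run (values taken from the list above)."""
    observation_type: str = "screenshot"
    sleep_after_execution: float = 3.0
    max_steps: int = 15


def diff_settings(ours: EvalSettings, theirs: EvalSettings) -> dict:
    """Return {field: (ours, theirs)} for every field whose values differ."""
    a, b = asdict(ours), asdict(theirs)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


ours = EvalSettings()

# Placeholder configuration, purely illustrative: we do not know the
# settings actually used for the reported score.
other = EvalSettings(max_steps=50)

print(diff_settings(ours, other))
```

If the official settings were shared in a similar form, a diff like this would show which fields we need to change.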

Could you please let us know whether the reported Qwen3.5-9B OSWorld-Verified score was obtained using the same agent adapter and settings?
Any configuration files, scripts, commit hashes, or additional details would be very helpful.

Thank you!
