Question about input/output semantics for RDT-1B-LIBERO-Object checkpoint

#1
by ANUNM - opened

Hi TJ-chen,

Thank you for releasing the RDT-1B-LIBERO-Object checkpoint.

I am currently trying to integrate this checkpoint into a LIBERO Object inference/evaluation pipeline. I was able to find the checkpoint structure and the recommended loading method from the model card, especially the use of the EMA weights under the ema/ subfolder. However, I could not find a detailed description of the input/output semantics required for inference.

Could you please clarify the expected data semantics for this checkpoint?

Specifically, I would like to know:

  1. Observation inputs

    • What exact observation fields are expected by the model?
    • Does the model use RGB images, proprioceptive states, language instructions, or other inputs?
    • If images are used, which camera views are expected, e.g. agentview, wrist/eye-in-hand, or others?
    • What are the expected image resolution, channel order, dtype, and normalization method?
  2. Language input

    • What is the expected format of the language instruction?
    • Should it be the raw LIBERO task description, a processed prompt, or tokenized text?
    • Which tokenizer/text encoder should be used?
  3. Proprioceptive/state input

    • What is the expected state vector layout?
    • What does each dimension represent?
    • Is the state normalized? If so, where can I find the normalization statistics?
  4. Action output

    • What is the output action dimension?
    • What does each action dimension represent?
    • Is the action in end-effector space, joint space, or another representation?
    • Are the actions absolute or delta commands?
    • What coordinate frame is used?
    • What is the gripper action convention, e.g. open/close values?
    • Is action unnormalization required before passing it to the LIBERO environment?
  5. Temporal setup

    • What observation horizon and action horizon are used?
    • Does the model output a single action or an action chunk?
    • What control frequency or action execution scheme was used during evaluation?
  6. Reference implementation

    • Is there an official inference or evaluation script for this checkpoint on LIBERO Object?
    • If so, could you point me to the repository/file that defines the preprocessing, postprocessing, and environment interaction logic?

My goal is to correctly reproduce the inference behavior of the released checkpoint and avoid making incorrect assumptions about observation preprocessing, action representation, coordinate frames, or normalization.

Thank you very much for your help.

Best regards,

Please see this repo for more information. If you still have questions after reading it, feel free to contact me!
repo: https://github.com/tj-chen-1209/Libero_RDT
email: chentingjia1209@gmail.com
