Description:

GR00T-N1.7-ApplePnP-V1 is a fine-tuned NVIDIA Isaac GR00T N1.7 model for an apple pick-and-place task performed on a Unitree G1 humanoid robot.

Isaac GR00T N1.7-3B is the medium-sized version of the model, built using pretrained vision and language encoders. It uses a flow-matching action transformer to model a chunk of actions conditioned on vision, language, and proprioception.

This model is distributed in ONNX format. The export for deployment covers the backbone, action head, action decoding, and state/video preprocessing modules.

This model is ready for commercial or non-commercial use.

License/Terms of Use:

GOVERNING DOWNLOAD TERMS: Use of the model is governed by the NVIDIA Open Model Agreement.

Deployment Geography:

Global

Use Case:

  • Researchers, Academics, Open-Source Community: AI-driven robotics research and algorithm development.
  • Developers: Integrate and customize AI for various robotic applications.
  • Startups & Companies: Accelerate robotics development and reduce training costs.

Release Date:

Hugging Face: 07/01/2026 via nvidia/GR00T-N1.7-ApplePnP-V1

Reference(s):

GR00T N1 White Paper: "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots." (2025).

Liu, Xingchao, Chengyue Gong, and Qiang Liu. "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow." The Eleventh International Conference on Learning Representations (2023).

Flow Matching Policy: Black, Kevin, et al. "π0: A Vision-Language-Action Flow Model for General Robot Control." arXiv preprint (2024).

Model Architecture:

Architecture Type: Vision-Language Backbone, Multilayer Perceptron, Flow Matching Transformer

GR00T N1.7 uses a Cosmos-Reason2-2B vision-language backbone based on the Qwen3-VL architecture, replacing the Eagle backbone used in N1.6. It uses a flow-matching action transformer to model chunks of actions conditioned on vision, language, and proprioception.

RGB camera frames are processed through a pretrained vision transformer (SigLIP2), and text is encoded by a pretrained transformer (T5). N1.7 supports flexible image resolution and encodes images in their native aspect ratio without padding. Robot proprioception is encoded using a multi-layer perceptron (MLP) indexed by the embodiment ID. To handle variable-dimension proprioception, inputs are padded to a configurable maximum length before being fed into the MLP.
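The padding and embodiment-indexed encoding described above can be illustrated with a minimal NumPy sketch. It is not the model's actual implementation: MAX_STATE_DIM, the zero right-padding, the two-layer ReLU MLP, and the random weights are all assumptions chosen for illustration.

```python
import numpy as np

MAX_STATE_DIM = 64  # hypothetical; the real maximum is a model configuration value


def pad_state(state, max_dim: int = MAX_STATE_DIM) -> np.ndarray:
    """Right-pad a 1D proprioception vector with zeros up to max_dim (assumed scheme)."""
    state = np.asarray(state, dtype=np.float32)
    if state.ndim != 1:
        raise ValueError("state must be a 1D vector")
    if state.shape[0] > max_dim:
        raise ValueError(f"state has {state.shape[0]} dims, max is {max_dim}")
    padded = np.zeros(max_dim, dtype=np.float32)
    padded[: state.shape[0]] = state
    return padded


class EmbodimentMLP:
    """One two-layer MLP per embodiment, selected by an integer embodiment ID."""

    def __init__(self, num_embodiments: int, in_dim: int, hidden: int, out_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Weights are stacked along a leading embodiment axis (random, for illustration only).
        self.w1 = rng.standard_normal((num_embodiments, in_dim, hidden)).astype(np.float32) * 0.02
        self.b1 = np.zeros((num_embodiments, hidden), dtype=np.float32)
        self.w2 = rng.standard_normal((num_embodiments, hidden, out_dim)).astype(np.float32) * 0.02
        self.b2 = np.zeros((num_embodiments, out_dim), dtype=np.float32)

    def __call__(self, padded_state: np.ndarray, embodiment_id: int) -> np.ndarray:
        # Index the weight stack by embodiment ID, then apply Linear -> ReLU -> Linear.
        h = np.maximum(padded_state @ self.w1[embodiment_id] + self.b1[embodiment_id], 0.0)
        return h @ self.w2[embodiment_id] + self.b2[embodiment_id]
```

Stacking the weights along an embodiment axis lets robots with different state sizes share one input width while keeping separate parameters per embodiment.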

Actions are encoded and velocity predictions are decoded by an MLP, one per unique embodiment. The flow-matching transformer is implemented as a diffusion transformer (DiT), with diffusion-step conditioning implemented using adaptive layer normalization (AdaLN).
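As a rough illustration of how a flow-matching policy turns velocity predictions into an action chunk, the sketch below integrates a velocity field with fixed Euler steps. The function names, the step count, and the convention of starting from Gaussian noise at t = 0 and reaching actions at t = 1 are assumptions, not details confirmed by this card. A toy closed-form velocity stands in for the DiT.

```python
import numpy as np


def sample_action_chunk(velocity_fn, horizon: int, action_dim: int,
                        num_steps: int = 4, seed: int = 0) -> np.ndarray:
    """Euler-integrate da/dt = v(a, t) from t=0 (noise) to t=1 (actions)."""
    rng = np.random.default_rng(seed)
    actions = rng.standard_normal((horizon, action_dim)).astype(np.float32)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        # In the real model, velocity_fn would be the action transformer,
        # conditioned on vision, language, and proprioception.
        actions = actions + dt * velocity_fn(actions, t)
    return actions


def make_toy_velocity(target: np.ndarray):
    """Velocity of a straight-line (rectified) path that ends at `target`."""
    def velocity(actions: np.ndarray, t: float) -> np.ndarray:
        return (target - actions) / (1.0 - t)
    return velocity


# Usage: with the toy field, the sampler lands on the target chunk.
target_chunk = np.full((16, 7), 0.5, dtype=np.float32)  # hypothetical 16-step, 7-DoF chunk
chunk = sample_action_chunk(make_toy_velocity(target_chunk), horizon=16, action_dim=7)
```

Because the toy path is a straight line, Euler integration is exact here. A learned velocity field is only approximately straight, so the step count trades latency against accuracy.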

Network Architecture: [Image: schematic diagram of the network architecture]

Number of Model Parameters: 3B

Input:

Input Type:

  • Vision: Image Frames
  • State: Robot Proprioception
  • Language Instruction: Text
  • Embodiment ID: Integer

Input Format:

  • Vision: Variable number of uint8 image frames, coming from robot cameras
  • State: Floating Point
  • Language Instruction: String
  • Embodiment ID: Integer indicating which of the training embodiments is observed

Input Parameters:

  • Vision: Two-Dimensional (2D) - Red, Green, Blue (RGB) image
  • State: One-Dimensional (1D) - Floating number vector
  • Language Instruction: One-Dimensional (1D) - String
  • Embodiment ID: One-Dimensional (1D) - Integer

Other Properties Related to Input: Variable resolution RGB images (native aspect ratio, no padding); evaluated on 640x480 egocentric camera feeds. Uint8 format. No Alpha Channel. 8-bit color depth. No additional pre-processing required beyond standard normalization.
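The input constraints above (uint8, three RGB channels, no alpha) can be checked before a frame is passed to the model. The following is a minimal sketch with hypothetical helper names. Scaling to [0, 1] is only one common reading of "standard normalization" and is an assumption here, since the exported preprocessing modules may already handle it.

```python
import numpy as np


def check_frame(frame) -> np.ndarray:
    """Validate one camera frame: height x width x 3, uint8, no alpha channel."""
    frame = np.asarray(frame)
    if frame.dtype != np.uint8:
        raise TypeError(f"expected uint8 frame, got {frame.dtype}")
    if frame.ndim != 3 or frame.shape[2] != 3:
        raise ValueError(f"expected HxWx3 RGB frame, got shape {frame.shape}")
    return frame


def to_unit_range(frame) -> np.ndarray:
    """Assumed normalization: map 8-bit values to float32 in [0, 1]."""
    return check_frame(frame).astype(np.float32) / 255.0


# Usage with a blank 640x480 frame (NumPy shape is height x width x channels).
blank = np.zeros((480, 640, 3), dtype=np.uint8)
normalized = to_unit_range(blank)
```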

Output:

Output Type(s): Actions
Output Format: Continuous-value vectors
Output Parameters: Two-Dimensional (2D)
Other Properties Related to Output: The continuous-value vectors correspond to different motor controls on a robot, and their size depends on the degrees of freedom of the robot embodiment.
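One way to consume a 2D action output is to treat each row as the command for one control step and slice it into joint groups. The sketch below assumes a (steps, degrees of freedom) layout. The JOINT_GROUPS names and index ranges are hypothetical and do not describe the Unitree G1 action layout.

```python
import numpy as np

# Hypothetical layout of one action vector; the real layout depends on the embodiment.
JOINT_GROUPS = {"left_arm": slice(0, 7), "right_arm": slice(7, 14)}


def iter_commands(chunk, groups=JOINT_GROUPS):
    """Yield one dict of per-group motor targets for each step in the chunk."""
    chunk = np.asarray(chunk, dtype=np.float32)
    if chunk.ndim != 2:
        raise ValueError(f"expected a 2D (steps, dof) array, got shape {chunk.shape}")
    for step in chunk:
        yield {name: step[idx] for name, idx in groups.items()}


# Usage: a dummy chunk with 2 steps and 14 degrees of freedom.
dummy_chunk = np.arange(2 * 14).reshape(2, 14)
commands = list(iter_commands(dummy_chunk))
```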

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration

Runtime Engine(s): ONNX Runtime

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Lovelace

Supported Operating System(s): Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

GR00T-N1.7-ApplePnP-V1

Training and Evaluation Datasets:

Post-Training Dataset:

Trained with dataset nvidia/GR00T-N1.7-AppleToPlate

Data Modality: Other: Real Robot

Training Data Size: 402 demonstrations

Data Collection Method: Hybrid: Manually-Collected, Robot

Labeling Method: Manually-Labeled

Properties: All 402 demonstrations were manually collected through human teleoperation on a Unitree G1 robot using an XR headset. Each demonstration was recorded at 30 Hz.

Evaluation

Data Modality: Other: Real Robot

Data Collection Method: Robot

Labeling Method: Not Applicable

Properties: The evaluation was performed on a physical Unitree G1 robot. The evaluation data consists of apple pick-and-place task episodes.

Inference:

Acceleration Engine(s): ONNX Runtime, TensorRT
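With ONNX Runtime, acceleration is selected through execution providers. The sketch below prefers TensorRT, then CUDA, then CPU, depending on what the local ONNX Runtime build reports. The helper names and the preference order are assumptions, and model_path is a placeholder because the file names of the exported modules are not listed in this card.

```python
from typing import List, Sequence

# Provider names as registered by ONNX Runtime, in an assumed order of preference.
PREFERRED_PROVIDERS = (
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
)


def pick_providers(available: Sequence[str],
                   preferred: Sequence[str] = PREFERRED_PROVIDERS) -> List[str]:
    """Keep the preferred providers that are actually available, in preference order."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError("none of the preferred execution providers is available")
    return chosen


def open_session(model_path: str):
    """Open one exported module; model_path is a placeholder for a local .onnx file."""
    import onnxruntime as ort  # imported lazily so pick_providers works without it

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

ONNX Runtime tries the listed providers in order, so keeping the CPU provider last gives a fallback for operators that the GPU providers do not support.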

Test Hardware:

  • NVIDIA RTX 6000 Ada

Resources

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. Developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please make sure you have proper rights and permissions for all input image and video content; if image or video includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of image subjects included.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
