GR00T N1.5 Bimanual SO-101 LoRA Adapter

This repository contains a LoRA adapter for NVIDIA's GR00T N1.5 model, fine-tuned for bimanual SO-101 robot arms performing pick-and-place tasks.

Model Description

Base Model: nvidia/GR00T-N1.5-3B
Adapter Size: 13 MB (LoRA rank 16)
Task: Bimanual red block pick-and-place
Training Data: 100 teleoperation episodes (~84K frames)
Action Space: 12D (2 arms × (5 joints + 1 gripper))
Camera Setup: 3 RGB cameras (left_gripper, right_gripper, top)
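
As a rough sanity check on the dataset size, the figures above imply an average episode length. The sketch below assumes that the ~84K frames count timesteps (not frames summed over the three cameras) and that episodes were recorded at the 30 fps listed under Camera Setup; neither assumption is stated explicitly in this card.

```python
def average_episode_stats(total_frames: int, episodes: int, fps: float) -> tuple[float, float]:
    """Return (frames per episode, seconds per episode) for a dataset."""
    frames_per_episode = total_frames / episodes
    seconds_per_episode = frames_per_episode / fps
    return frames_per_episode, seconds_per_episode


# 84_000 approximates the "~84K frames" figure; 30.0 fps is an assumption.
frames, seconds = average_episode_stats(84_000, 100, 30.0)
print(f"{frames:.0f} frames/episode, about {seconds:.0f} s/episode")
```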

Training Details

Configuration:

Action Horizon: 64 steps
Training Steps: 20,000
Final Loss: ~0.04
LoRA Rank: 16, Alpha: 32
Frozen: Vision encoder, LLM, Diffusion model
Trained: Projector + LoRA adapters
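
For readers unfamiliar with the rank and alpha settings: in the standard LoRA formulation, a frozen weight matrix is augmented with a trainable low-rank update scaled by alpha / rank. The block below states that general formulation with this adapter's values plugged in; it is not taken from the Isaac-GR00T source, and the dimensions d and k are generic symbols for the wrapped layer's shape.

```latex
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},
\qquad \frac{\alpha}{r} = \frac{32}{16} = 2
```

Only A and B are trained for each adapted layer, which adds r(d + k) parameters per layer while W_0 stays frozen.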

Hardware:

GPU: NVIDIA RTX 5090 (32GB)
Training Time: 4.5 hours
Framework: Isaac-GR00T + PyTorch 2.7.0

Installation

# Clone Isaac-GR00T repository
git clone https://github.com/NVIDIA-Omniverse/Isaac-GR00T
cd Isaac-GR00T

# Install dependencies
conda create -n groot python=3.10
conda activate groot
pip install -e .[base]
pip install flash-attn==2.8.2  # Required for GR00T

# Download this adapter
huggingface-cli download Hrishnugg/groot-recode-bimanual-v2-lora \
    --local-dir ./adapters/recode-bimanual-v2

Usage

Option 1: Inference with Isaac-GR00T

from gr00t.model.policy import Gr00tPolicy
import numpy as np

# Load base model + LoRA adapter
policy = Gr00tPolicy.from_checkpoint(
    checkpoint_path="./adapters/recode-bimanual-v2",  # Your LoRA adapter
    embodiment_tag="new_embodiment",
    data_config="recode_data_config:RecodeBimanualDataConfig"
)

# Prepare observations
observations = {
    "video.left_gripper": left_camera_image,    # Shape: (1, 480, 640, 3)
    "video.right_gripper": right_camera_image,  # Shape: (1, 480, 640, 3)
    "video.top": top_camera_image,              # Shape: (1, 480, 640, 3)
    "state.left_arm": left_arm_joint_positions,      # Shape: (1, 5)
    "state.left_gripper": left_gripper_position,     # Shape: (1, 1)
    "state.right_arm": right_arm_joint_positions,    # Shape: (1, 5)
    "state.right_gripper": right_gripper_position,   # Shape: (1, 1)
    "annotation.human.task_description": ["Grab the red cube and put it in a red basket"]
}

# Get action prediction (returns 64-step horizon, use first step)
actions = policy.get_action(observations)
action_t0 = actions["action"][0]  # Shape: (12,) - first timestep

# Extract per-arm commands
left_arm_cmd = action_t0[0:5]      # 5 joint angles
left_gripper_cmd = action_t0[5]    # Gripper position
right_arm_cmd = action_t0[6:11]    # 5 joint angles  
right_gripper_cmd = action_t0[11]  # Gripper position

# Send to robot
robot.set_left_arm_position(left_arm_cmd)
robot.set_left_gripper(left_gripper_cmd)
robot.set_right_arm_position(right_arm_cmd)
robot.set_right_gripper(right_gripper_cmd)
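
Shape mismatches in the observation dictionary are easy to introduce when wiring up cameras and joint readers. The sketch below checks array shapes against the ones given in the comments of the example above before calling the policy. `EXPECTED_SHAPES` and `find_shape_mismatches` are hypothetical helpers written for illustration, not part of Isaac-GR00T.

```python
# Expected shapes, copied from the comments in the Option 1 example.
EXPECTED_SHAPES = {
    "video.left_gripper": (1, 480, 640, 3),
    "video.right_gripper": (1, 480, 640, 3),
    "video.top": (1, 480, 640, 3),
    "state.left_arm": (1, 5),
    "state.left_gripper": (1, 1),
    "state.right_arm": (1, 5),
    "state.right_gripper": (1, 1),
}


def find_shape_mismatches(observations: dict) -> dict:
    """Return {key: (expected, actual)} for missing or mis-shaped entries.

    `actual` is None when the key is absent or the value has no `.shape`.
    """
    problems = {}
    for key, expected in EXPECTED_SHAPES.items():
        value = observations.get(key)
        actual = tuple(value.shape) if hasattr(value, "shape") else None
        if actual != expected:
            problems[key] = (expected, actual)
    return problems
```

An empty result means every array entry has the expected shape; the language instruction is not checked because it is a list of strings.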

Option 2: Using Inference Server

Start server:

python scripts/inference_service.py \
    --server \
    --model_path ./adapters/recode-bimanual-v2 \
    --embodiment_tag new_embodiment \
    --data_config recode_data_config:RecodeBimanualDataConfig \
    --denoising_steps 4 \
    --port 5555

Connect client:

from gr00t.eval.service import ExternalRobotInferenceClient

client = ExternalRobotInferenceClient(host="localhost", port=5555)
actions = client.get_action(observations)
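
A deployment loop around either option might look like the following sketch. It takes the policy call, the observation reader, and the command sender as plain callables, so it does not depend on any particular robot API. `run_control_loop` and its parameters are hypothetical, and the sketch assumes `actions["action"]` is indexable by timestep, as in Option 1.

```python
import time
from typing import Callable, Mapping, Sequence


def run_control_loop(
    get_observation: Callable[[], Mapping],
    get_action: Callable[[Mapping], Mapping],
    send_command: Callable[[Sequence[float]], None],
    num_queries: int,
    steps_per_query: int = 1,
    control_hz: float = 30.0,
) -> int:
    """Query the policy `num_queries` times and execute the first
    `steps_per_query` actions of each predicted chunk.

    Returns the total number of commands sent.
    """
    period = 1.0 / control_hz
    sent = 0
    for _ in range(num_queries):
        chunk = get_action(get_observation())["action"]
        for step in range(steps_per_query):
            start = time.monotonic()
            send_command(chunk[step])
            sent += 1
            # Sleep off whatever remains of this control period.
            time.sleep(max(0.0, period - (time.monotonic() - start)))
    return sent
```

With the client above, this could be called as `run_control_loop(read_sensors, client.get_action, send_to_robot, num_queries=100)`, where `read_sensors` and `send_to_robot` are placeholders for your own robot code.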

Data Configuration Required

This adapter expects a data configuration that matches the training setup. Create recode_data_config.py:

from gr00t.experiment.data_config import BaseDataConfig
from gr00t.data.transform.base import ComposedModalityTransform
# ... (full config from training)

Alternatively, download recode_data_config.py from this repository.

Important Notes

Action Smoothing Recommended

Because inference uses only 4 denoising steps (chosen for real-time performance), predicted actions may contain high-frequency noise. We strongly recommend temporal smoothing during deployment:

# Exponential moving average
alpha = 0.3  # Weight on the new action; the remaining 0.7 comes from the previous smoothed action
smoothed_action = alpha * new_action + (1 - alpha) * previous_action

See eval_bimanual_lerobot.py in this repository for full implementation.
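
For reference, the one-line update above can be wrapped in a small stateful helper. This is a minimal sketch and not the implementation in eval_bimanual_lerobot.py; the `EmaSmoother` class is hypothetical, and passing the first action through unchanged is a design assumption.

```python
from typing import Optional, Sequence


class EmaSmoother:
    """Exponential moving average over successive action vectors."""

    def __init__(self, alpha: float = 0.3):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self._previous: Optional[list] = None

    def update(self, new_action: Sequence[float]) -> list:
        if self._previous is None:
            # First call: nothing to blend with, so pass the action through.
            self._previous = [float(a) for a in new_action]
        else:
            self._previous = [
                self.alpha * a + (1.0 - self.alpha) * p
                for a, p in zip(new_action, self._previous)
            ]
        return list(self._previous)

    def reset(self) -> None:
        """Forget the history, e.g. at the start of a new episode."""
        self._previous = None
```

Calling `reset()` between episodes avoids blending the first action of a new episode with the last action of the previous one.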

Camera Setup

Cameras must match training configuration:

left_gripper: Wrist camera on left arm (640x480 @ 30fps)
right_gripper: Wrist camera on right arm (640x480 @ 30fps)
top: Overhead camera (640x480 @ 30fps)
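
A quick way to catch a misconfigured camera before running the policy is to compare the live settings with the list above. The sketch below is illustrative only: `EXPECTED_CAMERAS` and `camera_config_errors` are hypothetical names, and how you obtain the actual width, height, and fps depends on your camera driver.

```python
# Training-time camera settings, taken from the list above.
EXPECTED_CAMERAS = {
    "left_gripper": {"width": 640, "height": 480, "fps": 30},
    "right_gripper": {"width": 640, "height": 480, "fps": 30},
    "top": {"width": 640, "height": 480, "fps": 30},
}


def camera_config_errors(actual: dict) -> list:
    """Return human-readable problems; an empty list means the setup matches."""
    errors = []
    for name, expected in EXPECTED_CAMERAS.items():
        if name not in actual:
            errors.append(f"missing camera: {name}")
            continue
        # Compare only the fields that training fixed; ignore extra keys.
        got = {key: actual[name].get(key) for key in expected}
        if got != expected:
            errors.append(f"{name}: expected {expected}, got {got}")
    return errors
```

This only checks resolution and frame rate; camera placement and viewing angle still have to be matched by hand.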

Action Space

12D continuous:

Dimensions 0-4: Left arm joints (degrees)
Dimension 5: Left gripper (0-47 range)
Dimensions 6-10: Right arm joints (degrees)
Dimension 11: Right gripper (0-47 range)
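
The layout above can be turned into a small helper that splits a 12D action into named parts. The sketch below also clamps gripper values to the documented 0-47 range; that clamping is a safety assumption added here, not something the adapter or Isaac-GR00T is known to do, and both function names are hypothetical.

```python
from typing import Sequence


def clamp_gripper(value: float, low: float = 0.0, high: float = 47.0) -> float:
    """Clamp a gripper command to the documented 0-47 range."""
    return min(max(float(value), low), high)


def split_action(action: Sequence[float]) -> dict:
    """Split a 12D action vector into per-arm joint and gripper commands."""
    if len(action) != 12:
        raise ValueError(f"expected 12 action dimensions, got {len(action)}")
    return {
        "left_arm": [float(v) for v in action[0:5]],     # degrees
        "left_gripper": clamp_gripper(action[5]),
        "right_arm": [float(v) for v in action[6:11]],   # degrees
        "right_gripper": clamp_gripper(action[11]),
    }
```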

Performance

Open-loop Evaluation (on training data):

MSE: ~10-12 (varies by episode)
Action horizon: 64 steps
Denoising steps: 4

Deployment:

Inference speed: ~200ms per prediction (4 denoising steps)
Control frequency: 30Hz recommended
Temporal smoothing: Strongly recommended for smooth execution
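
The inference speed and control frequency above interact: at roughly 200 ms per prediction, the policy cannot be queried once per 30 Hz control tick. One way to reconcile the two, sketched below as an assumption rather than the author's method, is to execute several steps of each 64-step chunk per query. The function name is hypothetical.

```python
import math


def steps_needed_per_query(inference_ms: float, control_hz: float) -> int:
    """Minimum number of chunk steps to execute per query so that
    execution takes at least as long as one inference call."""
    return math.ceil(inference_ms * control_hz / 1000.0)


steps = steps_needed_per_query(200.0, 30.0)
print(steps)  # 6, well within the 64-step action horizon
```

Executing more steps per query lowers the load on the GPU but makes the robot react more slowly to changes in the scene, so the right value is a trade-off to tune on hardware.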

Limitations

Trained on only 100 episodes (limited generalization)
Single task: "Grab red cube, place in red basket"
May show jittering without temporal smoothing
Requires specific camera angles matching training setup

Citation

Built using NVIDIA Isaac GR00T:

@software{isaac_groot_2025,
  title = {NVIDIA Isaac GR00T},
  author = {NVIDIA Corporation},
  year = {2025},
  url = {https://github.com/NVIDIA-Omniverse/Isaac-GR00T}
}

License

Apache 2.0 (following GR00T base model)

Contact

For issues or questions, please contact the repository owner.
