---
library_name: ml-agents
tags:
  - Huggy
  - deep-reinforcement-learning
  - reinforcement-learning
  - ML-Agents-Huggy
---

PPO Agent playing Huggy

This is a trained model of a PPO agent playing Huggy using the Unity ML-Agents Library.

Huggy PPO Agent - Training Documentation

Model Overview

Huggy is a PPO (Proximal Policy Optimization) agent trained with the Unity ML-Agents toolkit. The agent learns its target behavior in a custom Unity environment over 2 million training steps.

Training Environment

  • Environment: Unity ML-Agents custom environment "Huggy"
  • ML-Agents Version: 1.2.0.dev0
  • ML-Agents Envs: 1.2.0.dev0
  • Communicator API: 1.5.0
  • PyTorch Version: 2.7.1+cu126
  • Unity Package Version: 2.2.1-exp.1
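
To confirm that a local setup matches the versions above, a quick check of the installed Python packages can help. This is a minimal sketch; it assumes the packages from the Prerequisites section below are already installed:

import importlib.metadata
import torch

# Print installed package versions to compare against the list above
for package in ("mlagents", "mlagents-envs"):
    try:
        print(f"{package}: {importlib.metadata.version(package)}")
    except importlib.metadata.PackageNotFoundError:
        print(f"{package}: not installed")

print(f"torch: {torch.__version__}")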

Training Configuration

PPO Hyperparameters

  • Batch Size: 2,048
  • Buffer Size: 20,480
  • Learning Rate: 0.0003 (linear schedule)
  • Beta (entropy regularization): 0.005 (linear schedule)
  • Epsilon (PPO clip parameter): 0.2 (linear schedule)
  • Lambda (GAE parameter): 0.95
  • Number of Epochs: 3
  • Shared Critic: False

Network Architecture

  • Normalization: Enabled
  • Hidden Units: 512
  • Number of Layers: 3
  • Visual Encoding Type: Simple
  • Memory: None
  • Goal Conditioning Type: Hyper
  • Deterministic: False

Reward Configuration

  • Reward Type: Extrinsic
  • Gamma (discount factor): 0.995
  • Reward Strength: 1.0
  • Reward Network Hidden Units: 128
  • Reward Network Layers: 2

Training Parameters

  • Maximum Steps: 2,000,000
  • Time Horizon: 1,000
  • Summary Frequency: 50,000 steps
  • Checkpoint Interval: 200,000 steps
  • Keep Checkpoints: 15
  • Threaded Training: False
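
For reference, the settings above map onto a standard ML-Agents trainer configuration. The sketch below regenerates such a file from the reported values; the output file name Huggy.yaml is an assumption, and fields not listed in this document (for example, schedule settings beyond the learning rate, memory settings, or the reward signal's own network settings) are omitted, so compare against the original config.yaml before reusing it:

import yaml  # PyYAML is installed as a dependency of mlagents

huggy_config = {
    "behaviors": {
        "Huggy": {
            "trainer_type": "ppo",
            "hyperparameters": {
                "batch_size": 2048,
                "buffer_size": 20480,
                "learning_rate": 0.0003,
                "beta": 0.005,
                "epsilon": 0.2,
                "lambd": 0.95,
                "num_epoch": 3,
                "learning_rate_schedule": "linear",
            },
            "network_settings": {
                "normalize": True,
                "hidden_units": 512,
                "num_layers": 3,
                "vis_encode_type": "simple",
                "goal_conditioning_type": "hyper",
            },
            "reward_signals": {
                "extrinsic": {"gamma": 0.995, "strength": 1.0},
            },
            "max_steps": 2000000,
            "time_horizon": 1000,
            "summary_freq": 50000,
            "checkpoint_interval": 200000,
            "keep_checkpoints": 15,
            "threaded": False,
        }
    }
}

# Write a trainer config that mirrors the values documented above
with open("Huggy.yaml", "w") as f:
    yaml.safe_dump(huggy_config, f, sort_keys=False)

A file like this can then be passed to mlagents-learn (for example, mlagents-learn Huggy.yaml --run-id=Huggy2) to launch a comparable run.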

Training Performance

Performance Progression

The agent showed steady improvement throughout training:

Early Training (0-200k steps):

  • Step 50k: Mean Reward = 1.840 ± 0.925
  • Step 100k: Mean Reward = 2.747 ± 1.096
  • Step 150k: Mean Reward = 3.031 ± 1.174
  • Step 200k: Mean Reward = 3.538 ± 1.370

Mid Training (200k-1M steps):

  • Performance stabilized around 3.6-3.9 mean reward
  • Peak performance at 500k steps: 3.873 ± 1.783

Late Training (1M-2M steps):

  • Consistent performance around 3.5-3.8 mean reward
  • Final performance at 2M steps: 3.718 ± 2.132
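
The progression reported above can be visualized quickly with matplotlib. This sketch only plots the checkpoints quoted in this document, not the full TensorBoard logs:

import matplotlib.pyplot as plt

# Mean rewards at the steps reported above (steps in thousands)
steps = [50, 100, 150, 200, 500, 2000]
mean_rewards = [1.840, 2.747, 3.031, 3.538, 3.873, 3.718]

plt.plot(steps, mean_rewards, marker="o")
plt.xlabel("Training steps (thousands)")
plt.ylabel("Mean reward")
plt.title("Huggy PPO - reported mean reward progression")
plt.show()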

Key Performance Metrics

  • Training Duration: 2,350.439 seconds (~39 minutes)
  • Final Mean Reward: 3.718
  • Final Standard Deviation: 2.132
  • Peak Mean Reward: 3.873 (at 500k steps)
  • Lowest Standard Deviation: 0.925 (at 50k steps)

Training Characteristics

Learning Curve Analysis

  1. Rapid Initial Learning: Significant improvement in first 200k steps (1.84 → 3.54)
  2. Plateau Phase: Performance stabilized between 200k-2M steps
  3. Variance Increase: Standard deviation increased over time, suggesting more varied behavior or greater variability in episode outcomes

Model Checkpoints

Regular ONNX model exports were created every 200k steps:

  • Huggy-199933.onnx
  • Huggy-399938.onnx
  • Huggy-599920.onnx
  • Huggy-799966.onnx
  • Huggy-999748.onnx
  • Huggy-1199265.onnx
  • Huggy-1399932.onnx
  • Huggy-1599985.onnx
  • Huggy-1799997.onnx
  • Huggy-1999614.onnx
  • Final Model: Huggy-2000364.onnx
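
To load a particular checkpoint programmatically, a small helper can pick the newest export by step number. This is a sketch; the results/Huggy2/Huggy default reflects the usual results/<run-id>/<behavior-name> layout that ML-Agents writes checkpoints to, so adjust the path if your run differs:

import glob
import os
import re

def latest_checkpoint(results_dir="results/Huggy2/Huggy"):
    """Return the path of the ONNX checkpoint with the highest step count."""
    checkpoints = glob.glob(os.path.join(results_dir, "Huggy-*.onnx"))
    if not checkpoints:
        raise FileNotFoundError(f"No checkpoints found under {results_dir}")
    # File names look like Huggy-1999614.onnx; sort numerically by the step suffix
    return max(checkpoints, key=lambda p: int(re.search(r"Huggy-(\d+)\.onnx$", p).group(1)))

print(latest_checkpoint())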

Technical Implementation

Training Framework

  • Unity ML-Agents with PPO algorithm
  • Custom Unity environment integration
  • ONNX model export for deployment
  • Real-time training monitoring

Model Architecture Details

  • Multi-layer perceptron with 3 hidden layers
  • 512 hidden units per layer
  • Input normalization enabled
  • Separate actor-critic networks (shared_critic = False)
  • Hypernetwork goal conditioning
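
These details can be sanity-checked against the exported ONNX file itself. The sketch below prints the model's input and output tensors with onnxruntime, which also confirms the obs_0 input name assumed by the inference examples later in this guide (the model path reuses the assumed results/Huggy2/Huggy.onnx location from those examples):

import onnxruntime as ort

session = ort.InferenceSession("results/Huggy2/Huggy.onnx")

# Print every input and output tensor with its shape,
# so observation sizes and action tensor names can be verified
print("Inputs:")
for inp in session.get_inputs():
    print(f"  {inp.name}: shape={inp.shape}, type={inp.type}")

print("Outputs:")
for out in session.get_outputs():
    print(f"  {out.name}: shape={out.shape}, type={out.type}")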

Reward Signal Processing

  • Single extrinsic reward signal
  • Discount factor of 0.995 for long-term planning
  • Dedicated reward network with 2 layers and 128 units

Performance Insights

Strengths

  • Consistent learning progression
  • Stable final performance around 3.7 mean reward
  • Successful completion of 2M training steps
  • Regular checkpoint generation for model versioning

Observations

  • Standard deviation increased over training, which may reflect more diverse strategies or simply greater variability in episode outcomes
  • The performance plateau after 200k steps suggests the policy converged early; most of the remaining training maintained rather than improved performance
  • The agent maintained stable performance without significant degradation

Training Efficiency

  • Steps per Second: ~851 on average (2,000,000 steps / 2,350 seconds)
  • Episodes per Checkpoint: approximately 200-250
  • Memory Usage: efficient, with a 20,480-sample buffer and a time horizon of 1,000

This training session demonstrates a successful PPO implementation in a Unity environment, with consistent performance and robust learning characteristics.

Huggy PPO Agent - Usage Guide

Prerequisites

Before using the Huggy model, ensure you have the following installed:

# Install Unity ML-Agents
pip install mlagents==1.2.0

# Install required dependencies
pip install torch==2.7.1
pip install onnx
pip install onnxruntime

Model Files

After training, you'll have these key files:

  • Huggy.onnx - The trained model (final version)
  • Huggy-2000364.onnx - Final checkpoint model
  • config.yaml - Training configuration file
  • training logs - performance metrics and TensorBoard event data

Loading and Using the Model

Method 1: Using ML-Agents Python API

from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.base_env import ActionTuple
import numpy as np

# Load the Unity environment
env = UnityEnvironment(file_name="path/to/your/huggy_environment")

# Reset the environment
env.reset()

# Get behavior specs
behavior_names = list(env.behavior_specs.keys())
behavior_name = behavior_names[0]  # "Huggy"
spec = env.behavior_specs[behavior_name]

print(f"Observation space: {spec.observation_specs}")
print(f"Action space: {spec.action_spec}")

Method 2: Using ONNX Runtime for Inference

import onnxruntime as ort
import numpy as np

# Load the trained ONNX model
model_path = "results/Huggy2/Huggy.onnx"
ort_session = ort.InferenceSession(model_path)

# Get model input/output info
input_name = ort_session.get_inputs()[0].name
output_name = ort_session.get_outputs()[0].name

def predict_action(observation):
    """
    Predict action using the trained model
    """
    # Prepare observation: add a batch dimension so the input shape is (1, obs_size)
    obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)

    # Run inference
    action_probs = ort_session.run([output_name], {input_name: obs_input})

    # Take the deterministic action, or sample if the output is a probability vector
    action = np.argmax(action_probs[0])  # Deterministic
    # OR: action = np.random.choice(action_probs[0].shape[-1], p=action_probs[0].ravel())  # Stochastic
    
    return action

Method 3: Running Trained Agent in Unity

from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.base_env import ActionTuple
import onnxruntime as ort
import numpy as np

# Initialize environment and model
env = UnityEnvironment(file_name="HuggyEnvironment")
ort_session = ort.InferenceSession("results/Huggy2/Huggy.onnx")

# Get behavior name
behavior_names = list(env.behavior_specs.keys())
behavior_name = behavior_names[0]

# Run episodes
for episode in range(10):
    env.reset()
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    
    episode_reward = 0
    step_count = 0
    
    while len(decision_steps) > 0:
        # Get observations
        observations = decision_steps.obs[0]
        
        # Predict actions using trained model
        actions = []
        for obs in observations:
            action_probs = ort_session.run(None, {"obs_0": obs.reshape(1, -1)})
            action = np.argmax(action_probs[0])
            actions.append(action)
        
        # Send actions to environment
        action_tuple = ActionTuple(discrete=np.array(actions).reshape(-1, 1))  # shape (num_agents, num_branches)
        env.set_actions(behavior_name, action_tuple)
        
        # Step environment
        env.step()
        decision_steps, terminal_steps = env.get_steps(behavior_name)
        
        # Track rewards
        if len(terminal_steps) > 0:
            episode_reward += terminal_steps.reward[0]
            break
        if len(decision_steps) > 0:
            episode_reward += decision_steps.reward[0]
        
        step_count += 1
    
    print(f"Episode {episode + 1}: Reward = {episode_reward:.3f}, Steps = {step_count}")

env.close()

Evaluation and Testing

Performance Evaluation Script

import numpy as np
from mlagents_envs.base_env import ActionTuple

def evaluate_model(env, model_session, num_episodes=100):
    """
    Evaluate the trained model performance
    """
    results = {
        'rewards': [],
        'episode_lengths': []
    }
    
    behavior_name = list(env.behavior_specs.keys())[0]
    
    for episode in range(num_episodes):
        env.reset()
        decision_steps, terminal_steps = env.get_steps(behavior_name)
        
        episode_reward = 0
        episode_length = 0
        
        while len(decision_steps) > 0:
            # Get actions from model
            observations = decision_steps.obs[0]
            actions = []
            
            for obs in observations:
                action_probs = model_session.run(None, {"obs_0": obs.reshape(1, -1)})
                action = np.argmax(action_probs[0])  # Deterministic policy
                actions.append(action)
            
            # Step environment
            action_tuple = ActionTuple(discrete=np.array(actions).reshape(-1, 1))  # shape (num_agents, num_branches)
            env.set_actions(behavior_name, action_tuple)
            env.step()
            
            decision_steps, terminal_steps = env.get_steps(behavior_name)
            episode_length += 1

            # Accumulate intermediate rewards and check for episode termination
            if len(decision_steps) > 0:
                episode_reward += decision_steps.reward[0]
            if len(terminal_steps) > 0:
                episode_reward += terminal_steps.reward[0]
                break
        
        results['rewards'].append(episode_reward)
        results['episode_lengths'].append(episode_length)
    
    # Calculate statistics
    mean_reward = np.mean(results['rewards'])
    std_reward = np.std(results['rewards'])
    mean_length = np.mean(results['episode_lengths'])
    
    print(f"Evaluation Results ({num_episodes} episodes):")
    print(f"Mean Reward: {mean_reward:.3f} ± {std_reward:.3f}")
    print(f"Mean Episode Length: {mean_length:.1f}")
    print(f"Min Reward: {np.min(results['rewards']):.3f}")
    print(f"Max Reward: {np.max(results['rewards']):.3f}")
    
    return results
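
A minimal way to run this evaluation end to end, assuming the same placeholder environment path and model location used in the earlier examples:

from mlagents_envs.environment import UnityEnvironment
import onnxruntime as ort

env = UnityEnvironment(file_name="path/to/your/huggy_environment", no_graphics=True)
env.reset()  # populates env.behavior_specs before evaluation starts

model_session = ort.InferenceSession("results/Huggy2/Huggy.onnx")
results = evaluate_model(env, model_session, num_episodes=20)

env.close()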

Deployment Options

Option 1: Unity Standalone Build

  1. Build your Unity environment with the trained model
  2. The model will automatically use the ONNX file for inference
  3. Deploy as a standalone executable

Option 2: Python Integration

# For integration into larger Python applications
import numpy as np
import onnxruntime as ort

class HuggyAgent:
    def __init__(self, model_path):
        self.session = ort.InferenceSession(model_path)
        self.input_name = self.session.get_inputs()[0].name
        
    def act(self, observation):
        """Get action from observation"""
        obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)
        action_probs = self.session.run(None, {self.input_name: obs_input})
        return np.argmax(action_probs[0])
    
    def act_stochastic(self, observation):
        """Get stochastic action (assumes the first model output is a probability vector)"""
        obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)
        action_probs = self.session.run(None, {self.input_name: obs_input})[0].ravel()
        return np.random.choice(len(action_probs), p=action_probs)

# Usage
agent = HuggyAgent("results/Huggy2/Huggy.onnx")
action = agent.act(current_observation)

Option 3: Web Deployment

# For web applications using Flask/FastAPI
from flask import Flask, request, jsonify
import onnxruntime as ort
import numpy as np

app = Flask(__name__)
model = ort.InferenceSession("Huggy.onnx")

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    observation = np.array(data['observation'], dtype=np.float32)
    
    action_probs = model.run(None, {"obs_0": observation.reshape(1, -1)})
    action = int(np.argmax(action_probs[0]))
    
    return jsonify({'action': action, 'confidence': float(np.max(action_probs[0]))})

if __name__ == '__main__':
    app.run(debug=True)
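
A matching client call, assuming the service above is running locally on Flask's default port (5000); the observation size below is a placeholder:

import requests

OBS_SIZE = 16  # placeholder; use the observation size reported by the model's obs_0 input

response = requests.post(
    "http://127.0.0.1:5000/predict",
    json={"observation": [0.0] * OBS_SIZE},
)
print(response.json())  # {"action": ..., "confidence": ...}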

Troubleshooting

Common Issues

  1. ONNX Model Loading Errors

    • Ensure ONNX runtime version compatibility
    • Check model file path and permissions
  2. Unity Environment Connection

    • Verify Unity environment executable path
    • Check port availability (the Unity Editor uses port 5004 by default); see the connection sketch after this list
  3. Observation Shape Mismatches

    • Ensure observation preprocessing matches training
    • Check input normalization requirements
  4. Performance Issues

    • Use deterministic policy for consistent results
    • Consider batch inference for multiple agents
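
For connection problems in particular (issue 2 above), it usually helps to be explicit about the executable path, port, and worker ID when creating the environment. A sketch using standard UnityEnvironment arguments, with a placeholder path:

from mlagents_envs.environment import UnityEnvironment

# Be explicit about the build path, port, and worker ID to avoid clashes with other runs;
# no_graphics speeds up headless inference
env = UnityEnvironment(
    file_name="path/to/your/huggy_environment",
    worker_id=0,       # increment if another environment instance is already running
    base_port=5005,    # standalone builds default to 5005; the Unity Editor listens on 5004
    no_graphics=True,
    timeout_wait=60,   # seconds to wait for the Unity process to connect
)
env.reset()
env.close()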

Performance Optimization

# Batch processing for multiple agents
import numpy as np

def batch_predict(model_session, observations):
    """Process multiple observations at once"""
    batch_obs = np.array(observations, dtype=np.float32)
    action_probs = model_session.run(None, {"obs_0": batch_obs})
    actions = np.argmax(action_probs[0], axis=1)
    return actions

This guide provides comprehensive instructions for deploying and using your trained Huggy PPO agent in various scenarios, from simple testing to production deployment.