How does Unity ML-Agents work?
Before training our agent, we need to understand what ML-Agents is and how it works.
What is Unity ML-Agents?
Unity ML-Agents is a toolkit for the Unity game engine that allows us to create environments in Unity, or use pre-made ones, to train our agents.
It's developed by Unity Technologies, the makers of Unity, one of the most popular game engines, used by the creators of Firewatch, Cuphead, and Cities: Skylines.
The six components
With Unity ML-Agents, you have six essential components:
- The first is the Learning Environment, which contains the Unity scene (the environment) and the environment elements (game characters).
- The second is the Python Low-level API, which contains the low-level Python interface for interacting with and manipulating the environment. It's the API we use to launch the training (a short connection sketch follows the list).
- Then, we have the External Communicator, which connects the Learning Environment (written in C#) with the low-level Python API.
- The Python trainers: the reinforcement learning algorithms implemented with PyTorch (PPO, SAC…).
- The Gym wrapper: encapsulates the RL environment in a Gym-style interface.
- The PettingZoo wrapper: the multi-agent version of the Gym wrapper.
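As a rough sketch of how these pieces fit together, here's how the Python Low-level API is typically used to connect to a compiled Unity environment and inspect its behaviors. The build path is a placeholder, and attribute names can differ slightly between ML-Agents releases:

```python
from mlagents_envs.environment import UnityEnvironment

# Connect to a compiled Unity build (the External Communicator handles
# the C# <-> Python exchange under the hood). The path is a placeholder.
env = UnityEnvironment(file_name="path/to/YourEnvironmentBuild")
env.reset()

# Each group of agents that shares a policy is exposed as a "behavior".
behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]
print("Observation shapes:", [obs.shape for obs in spec.observation_specs])
print("Action spec:", spec.action_spec)

env.close()
```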
Inside the Learning Environment
Inside the Learning Environment, we have two important elements:
- The first is the Agent component, the actor of the scene. We'll train the Agent by optimizing its policy (which tells us what action to take in each state); the policy is called the Brain (a minimal policy sketch follows this list).
- The second is the Academy. This component orchestrates agents and their decision-making processes. Think of the Academy as a teacher that handles requests from the Python API.
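Conceptually, the policy (the Brain) is nothing more than a mapping from the current state to an action. A minimal, purely illustrative sketch, with made-up action names:

```python
import random

def policy(state):
    """A toy 'Brain': given the current state, return an action.
    Training gradually replaces this random choice with learned behavior."""
    return random.choice(["move_left", "move_right", "jump"])
```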
To better understand the Academy's role, let's recall the RL process, which can be modeled as a loop.
Now, let’s imagine an agent learning to play a platform game. The RL process looks like this:
- Our Agent receives a state from the Environment: the first frame of our game.
- Based on that state, the Agent takes an action: our Agent moves to the right.
- The Environment transitions to a new state: a new frame.
- The Environment gives the Agent a reward: we're not dead (positive reward, +1).
This RL loop outputs a sequence of state, action, reward and next state. The goal of the agent is to maximize the expected cumulative reward.
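To make the loop concrete, here is a minimal, self-contained sketch of that state/action/reward cycle. `PlatformEnv` is a hypothetical toy environment standing in for the Unity scene, following the common Gym-style `reset`/`step` convention:

```python
import random

class PlatformEnv:
    """A hypothetical toy environment standing in for the Unity scene."""

    def reset(self):
        self.t = 0
        return "frame_0"  # the initial state: the first frame

    def step(self, action):
        self.t += 1
        next_state = f"frame_{self.t}"  # the environment moves to a new frame
        reward = 1.0                    # +1: we're not dead
        done = self.t >= 10             # the episode ends after 10 steps
        return next_state, reward, done

env = PlatformEnv()
state = env.reset()
done = False
while not done:
    action = random.choice(["move_left", "move_right", "jump"])  # toy policy
    state, reward, done = env.step(action)  # new state and reward
```

Each pass through the loop produces one (state, action, reward, next state) tuple; the Agent's job is to pick actions that maximize the sum of those rewards.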
The Academy is the component that sends orders to our Agents and ensures that they stay in sync (sketched in code below):
- Collect Observations
- Select your action using your policy
- Take the Action
- Reset if you reached the max step count or if the episode is done.
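Using the Python Low-level API, one pass of that orchestration loop looks roughly like the sketch below. Random actions stand in for a trained policy, the build path is a placeholder, and exact method names may vary between ML-Agents releases:

```python
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(file_name="path/to/YourEnvironmentBuild")  # placeholder path
env.reset()
behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]

done = False
while not done:
    # 1. Collect observations: agents that request a decision this step.
    decision_steps, terminal_steps = env.get_steps(behavior_name)

    # 2. Select an action using your policy (random here).
    actions = spec.action_spec.random_action(len(decision_steps))

    # 3. Take the action.
    env.set_actions(behavior_name, actions)
    env.step()

    # 4. Reset when the episode is done: terminated agents (max step
    #    reached or episode ended) show up in terminal_steps.
    done = len(terminal_steps) > 0

env.close()
```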
Now that we understand how ML-Agents works, we’re ready to train our agents.