daenielkim-66 committed
Commit 0d8b86a
1 Parent(s): a471d9b

Update README.md

Files changed (1)
  1. README.md +76 -7
README.md CHANGED
@@ -22,16 +22,85 @@ model-index:
  ---

  # **A2C** Agent playing **PandaReachDense-v3**
  This is a trained model of an **A2C** agent playing **PandaReachDense-v3**
- using the [stable-baselines3 library](https://github.com/DLR-RM/stable-baselines3).

- ## Usage (with Stable-baselines3)
- TODO: Add your code

- ```python
- from stable_baselines3 import ...
- from huggingface_sb3 import load_from_hub

- ...
  ```
 
 
  ---

  # **A2C** Agent playing **PandaReachDense-v3**
+ ## General information about the project:
  This is a trained model of an **A2C** agent playing **PandaReachDense-v3**
+ using the [stable-baselines3 library](https://github.com/DLR-RM/stable-baselines3). It controls a robotic arm so that its end-effector reaches a target position.

+ ### What I did:
+ I manually tuned the hyperparameters by passing `learning_rate=0.0007, n_steps=5, gamma=0.99, gae_lambda=0.95` to the A2C model:
+ ```
+ model = A2C(policy="MultiInputPolicy",
+             env=env,
+             learning_rate=0.0007,
+             n_steps=5,
+             gamma=0.99,
+             gae_lambda=0.95,
+             verbose=1)
+ ```
+
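+ For context, here is a minimal sketch of how that constructor could fit into a full training run. It assumes `panda_gym` is installed (importing it registers `PandaReachDense-v3`) and uses `make_vec_env` from stable-baselines3; the timestep budget and save name are illustrative, not the exact values used for this model.
+ ```
+ import panda_gym  # assumed installed; importing it registers PandaReachDense-v3
+ from stable_baselines3 import A2C
+ from stable_baselines3.common.env_util import make_vec_env
+
+ # Vectorized environment; the dict observations require MultiInputPolicy
+ env = make_vec_env("PandaReachDense-v3", n_envs=4)
+
+ model = A2C(policy="MultiInputPolicy",
+             env=env,
+             learning_rate=0.0007,
+             n_steps=5,
+             gamma=0.99,
+             gae_lambda=0.95,
+             verbose=1)
+
+ model.learn(total_timesteps=1_000_000)  # illustrative budget
+ model.save("a2c-PandaReachDense-v3")    # illustrative file name
+ ```
+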
+ ## Links to relevant resources such as tutorials
+ Reinforcement Learning Tips and Tricks: https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html
+
+ A GitHub training framework (RL Baselines3 Zoo): https://github.com/DLR-RM/rl-baselines3-zoo
+
+ Poe (MrProgrammer Bot) suggested the grid-search workflow below; I tried to follow it, but I had a hard time understanding it.
+ ```
49
+ import gym
50
+ from stable_baselines3 import A2C
51
+ from stable_baselines3.common.envs import DummyVecEnv
52
+ from stable_baselines3.common.evaluation import evaluate_policy
53
+ from stable_baselines3.common.callbacks import EvalCallback
54
+ from stable_baselines3.common.env_checker import check_env
55
+ from stable_baselines3.common.vec_env import VecNormalize
56
+ ```
+
+ ### Next, load and prepare your environment:
+ ```
+ env = gym.make('your_environment_name')  # Replace with the name of your environment
+ env = DummyVecEnv([lambda: env])
+ env = VecNormalize(env, norm_obs=True, norm_reward=False, clip_obs=10.)
+ ```
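+ One caveat the suggestion above does not cover: the normalization statistics that VecNormalize accumulates during training should be saved and reused when evaluating, otherwise the evaluation environment scales observations differently. A minimal sketch, continuing from the code above (the file name is an arbitrary choice):
+ ```
+ # Save the running normalization statistics after training
+ env.save("vec_normalize.pkl")
+
+ # Wrap a fresh evaluation environment with the saved statistics
+ eval_env = DummyVecEnv([lambda: gym.make('your_environment_name')])
+ eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)
+ eval_env.training = False     # do not update statistics during evaluation
+ eval_env.norm_reward = False  # report raw (unnormalized) rewards
+ ```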
+
+ ### Now, define a function to train and evaluate your A2C agent:
+ ```
+ def train_and_evaluate(hyperparameters):
+     model = A2C("MlpPolicy", env, verbose=0, **hyperparameters)
+
+     eval_env = gym.make('your_evaluation_environment_name')  # Replace with the name of your evaluation environment
+     eval_env = DummyVecEnv([lambda: eval_env])
+     eval_env = VecNormalize(eval_env, norm_obs=True, norm_reward=False, clip_obs=10.)
+
+     eval_callback = EvalCallback(eval_env, best_model_save_path='./logs/',
+                                  log_path='./logs/', eval_freq=10000,
+                                  deterministic=True, render=False)
+
+     model.learn(total_timesteps=int(1e5), callback=eval_callback)
+
+     mean_reward, _ = evaluate_policy(model, eval_env, n_eval_episodes=10)
+
+     return mean_reward
+ ```
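+ For example, a single call with one (purely illustrative) combination would look like:
+ ```
+ mean_reward = train_and_evaluate({'gamma': 0.99, 'learning_rate': 0.0007, 'ent_coef': 0.01})
+ print("Mean reward:", mean_reward)
+ ```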
+
+ ### Now, we can define the hyperparameters grid and start the grid search:
+ ```
+ import itertools
+
+ hyperparameters_grid = {
+     'gamma': [0.99, 0.95],
+     'learning_rate': [0.001, 0.0001],
+     'ent_coef': [0.01, 0.1],
+     # Add other hyperparameters of interest
+ }
+
+ best_reward = float('-inf')
+ best_hyperparameters = {}
+
+ # Iterate over every combination of values in the grid
+ # (iterating over the dict directly would only yield its keys)
+ for values in itertools.product(*hyperparameters_grid.values()):
+     hyperparameters = dict(zip(hyperparameters_grid.keys(), values))
+     mean_reward = train_and_evaluate(hyperparameters)
+
+     if mean_reward > best_reward:
+         best_reward = mean_reward
+         best_hyperparameters = hyperparameters
+
+ print("Best hyperparameters:", best_hyperparameters)
  ```
+ In this grid search, we specify a set of candidate values for each hyperparameter of interest. The train_and_evaluate function trains the A2C agent with a given combination and evaluates its performance; we keep the combination that achieves the highest mean reward.
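+
+ A natural follow-up (not part of the suggested code above) would be to retrain a final agent with the best combination found and save it; a short sketch under the same assumptions:
+ ```
+ final_model = A2C("MlpPolicy", env, verbose=1, **best_hyperparameters)
+ final_model.learn(total_timesteps=int(1e5))
+ final_model.save("a2c_best_hyperparameters")  # hypothetical file name
+ ```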