Commit a42cd06

Yangrui authored and committed
first version published
1 parent 5c989ad commit a42cd06

23 files changed: +2024 -0 lines changed

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2018 andrew-j-levy

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
# Hierarchical Actor-Critic (HAC)

This repository contains the code to implement the *Hierarchical Actor-Critic (HAC)* algorithm. HAC helps agents learn tasks more quickly by enabling them to break problems down into short sequences of actions. The paper describing the algorithm is available [here](https://openreview.net/pdf?id=ryzECoAcY7).

To run HAC, execute the command *"python3 initialize_HAC.py --retrain"*. By default, this will train a UR5 agent with a 3-level hierarchy to learn to achieve certain poses. This UR5 agent should achieve a 90+% success rate in around 350 episodes. The following [video](https://www.youtube.com/watch?v=R86Vs9Vb6Bc) shows how a 3-layered agent performed after 450 episodes of training. In order to watch your trained agent, execute the command *"python3 initialize_HAC.py --test --show"*. Please note that in order to run this repository, you must have (i) a MuJoCo [license](https://www.roboti.us/license.html), (ii) the required MuJoCo software [libraries](https://www.roboti.us/index.html), and (iii) the MuJoCo Python [wrapper](https://github.com/openai/mujoco-py) from OpenAI.

To run HAC with your own agents and MuJoCo environments, complete the template in the *"design_agent_and_env.py"* file. The *"example_designs"* folder contains other examples of design templates that build different agents in the UR5 reacher and inverted pendulum environments.

I am happy to answer any questions you have. Please email me at andrew_levy2@brown.edu.

## UPDATE LOG

### 10/12/2018 - Key Changes

1. Bounded Q-Values

The Q-values output by the critic network at each level are now bounded to *[-T,0]*, in which *T* is the max sequence length in which each policy specializes and also the negative of the subgoal penalty. We use an upper bound of 0 because our code uses a nonpositive reward function, so Q-values should never be positive. However, we noticed that sometimes the critic function approximator would make small mistakes and assign positive Q-values, which occasionally proved harmful to results. In addition, we observed improved results when we used the tighter lower bound of *-T* (i.e., the subgoal penalty). The improved results may come from the increased flexibility that the bounded Q-values give the critic: the critic can assign a value of *-T* to any (state, action, goal) tuple in which the action does not bring the agent close to the goal, instead of having to learn the exact value. A short numerical illustration of this bounding follows this update log.

2. Removed Target Networks

We also noticed improved results when we used the regular Q-networks to determine the Bellman target updates (i.e., *reward + Q(next state, pi(next state), goal)*) instead of the separate target networks used in DDPG. The default setting of our code base therefore no longer uses target networks. However, the target networks can be easily activated by making the changes specified in (i) the *"learn"* method in the *"layer.py"* file and (ii) the *"update"* method in the *"critic.py"* file.

3. Centralized Design Template

Users can now configure the agent and environment in a single file, *"design_agent_and_env.py"*. This template file contains most of the significant hyperparameters in HAC. We have removed the command-line options that change the architecture of the agent's hierarchy.

4. Added UR5 Reacher Environment

We have added a new UR5 reacher environment, in which a UR5 agent can learn to achieve various poses. The *"ur5.xml"* MuJoCo file also contains commented code for a Robotiq gripper if you would like to augment the agent. Additional environments will hopefully be added shortly.
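
The following is a minimal NumPy sketch (an editorial illustration, not part of the repository) of the bounding described in item 1 above. It mirrors the sigmoid parameterization used in *"critic.py"* in this commit; the value *T = 10* is only an example stand-in for *FLAGS.time_scale*.

```python
import numpy as np

# Illustration only: how critic.py squashes Q-values into [-T, 0].
# The critic's last layer computes  Q = sigmoid(logits + q_offset) * q_limit,  with q_limit = -T,
# and q_offset chosen so that a pre-activation of 0 yields an optimistic initial value q_init.
T = 10                                   # example max sequence length (FLAGS.time_scale)
q_limit = -float(T)
q_init = -0.067                          # optimistic initialization target, as in critic.py
q_offset = -np.log(q_limit / q_init - 1)

def bounded_q(logits):
    return 1.0 / (1.0 + np.exp(-(logits + q_offset))) * q_limit

print(bounded_q(0.0))     # ~ -0.067, i.e. q_init (optimistic start)
print(bounded_q(-50.0))   # ~  0.0, the upper bound
print(bounded_q(50.0))    # ~ -10.0, the lower bound q_limit = -T
```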

actor_critic_layer/RND.py

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
import tensorflow as tf
import numpy as np


def create_nn(input, input_num, output_num, init_val=0.001, relu=True, trainable=True, name=''):
    shape = [input_num, output_num]

    w_init = tf.random_uniform_initializer(minval=-init_val, maxval=init_val)
    b_init = tf.random_uniform_initializer(minval=-init_val, maxval=init_val)

    weights = tf.get_variable(name + "weights", shape, initializer=w_init, trainable=trainable)
    biases = tf.get_variable(name + "biases", [output_num], initializer=b_init, trainable=trainable)

    dot = tf.matmul(input, weights) + biases

    if not relu:
        return dot

    dot = tf.nn.relu(dot)
    return dot


class RND:
    def __init__(self, s_features, out_features=3, learning_rate=0.001):
        self.s_features = s_features
        self.out_features = out_features
        self.lr = learning_rate
        self._build_net()
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())

    def _build_net(self):  # Build two randomly initialized networks: one is trained, the other stays fixed
        self.state = tf.placeholder(tf.float32, [None, self.s_features], name='state')  # input

        with tf.variable_scope('train_net'):
            l1 = create_nn(self.state, self.s_features, 64, relu=True, trainable=True, name='l1')
            self.train_net_output = create_nn(l1, 64, self.out_features, relu=False, trainable=True, name='output')

        with tf.variable_scope('target_net'):
            l1_ = create_nn(self.state, self.s_features, 64, init_val=10, relu=True, trainable=False, name='l1')
            self.target_net_output = create_nn(l1_, 64, self.out_features, init_val=10, relu=False, trainable=False, name='output')

        # Prediction error between the trained and fixed networks; the per-state error is the intrinsic reward
        self.loss = tf.reduce_mean(tf.squared_difference(self.train_net_output, self.target_net_output))
        self.intrinsic_reward = tf.reduce_mean(tf.squared_difference(self.train_net_output, self.target_net_output), axis=1)
        self._train_op = tf.train.AdamOptimizer(self.lr).minimize(self.loss)

    def train(self, state):
        loss, _ = self.sess.run([self.loss, self._train_op], feed_dict={
            self.state: state,
        })
        return loss

    def get_intrinsic_reward(self, state):
        return self.sess.run(self.intrinsic_reward, feed_dict={self.state: state})

    def get_target(self, state):
        return self.sess.run(self.target_net_output, feed_dict={self.state: state})
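
As a usage note (not part of the commit), the class above can be driven roughly as follows. The state dimension, batch contents, and import path are assumptions made for illustration.

```python
import numpy as np
from actor_critic_layer.RND import RND  # assumes actor_critic_layer is importable as a package

rnd = RND(s_features=17)                  # 17 is an arbitrary example state dimension
states = np.random.randn(32, 17)          # a fake batch of visited states

loss = rnd.train(states)                  # fit the trainable predictor toward the fixed random target net
bonus = rnd.get_intrinsic_reward(states)  # per-state prediction error, usable as an exploration bonus
print(loss, bonus.shape)                  # bonus has shape (32,)
```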

actor_critic_layer/actor.py

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
import tensorflow as tf
import numpy as np
from .utils import layer
import time

class Actor():

    def __init__(self,
                 sess,
                 env,
                 batch_size,
                 layer_number,
                 FLAGS,
                 learning_rate=0.001,
                 tau=0.05):

        self.sess = sess
        # self.seed = FLAGS.seed

        # Determine range of actor network outputs.  This will be used to configure the outer layer of the neural network
        if layer_number == 0:  # The bottom layer outputs primitive actions rather than subgoals
            self.action_space_bounds = env.action_bounds
            self.action_offset = env.action_offset
        else:
            self.action_space_bounds = env.subgoal_bounds_symmetric
            self.action_offset = env.subgoal_bounds_offset

        # Dimensions of action will depend on layer level
        if layer_number == 0:
            self.action_space_size = env.action_dim
        else:
            self.action_space_size = env.subgoal_dim

        self.actor_name = 'actor_' + str(layer_number) + str(time.time())

        # Dimensions of goal placeholder will differ depending on layer level
        if layer_number == FLAGS.layers - 1:
            self.goal_dim = env.end_goal_dim
        else:
            self.goal_dim = env.subgoal_dim

        self.state_dim = env.state_dim

        self.learning_rate = learning_rate
        # self.exploration_policies = exploration_policies
        self.tau = tau  # Polyak averaging coefficient for the target network update
        self.batch_size = batch_size

        self.state_ph = tf.placeholder(tf.float32, shape=(None, self.state_dim))
        self.goal_ph = tf.placeholder(tf.float32, shape=(None, self.goal_dim))
        self.features_ph = tf.concat([self.state_ph, self.goal_ph], axis=1)

        # Create actor network
        self.infer = self.create_nn(self.features_ph, self.actor_name)

        # Target network code "repurposed" from Patrick Emani :^)
        self.weights = [v for v in tf.trainable_variables() if self.actor_name in v.op.name]
        # self.num_weights = len(self.weights)

        # Create target actor network
        self.target = self.create_nn(self.features_ph, name=self.actor_name + '_target')
        self.target_weights = [v for v in tf.trainable_variables() if self.actor_name in v.op.name][len(self.weights):]  # Variables created after the original network's variables

        self.update_target_weights = \
            [self.target_weights[i].assign(tf.multiply(self.weights[i], self.tau) +
                                           tf.multiply(self.target_weights[i], 1. - self.tau))
             for i in range(len(self.target_weights))]  # Smoothly (Polyak) update the target network

        self.action_derivs = tf.placeholder(tf.float32, shape=(None, self.action_space_size))  # Gradient of Q with respect to the action; the deterministic policy outputs a single action
        self.unnormalized_actor_gradients = tf.gradients(self.infer, self.weights, -self.action_derivs)
        self.policy_gradient = list(map(lambda x: tf.div(x, self.batch_size), self.unnormalized_actor_gradients))  # map applies the function to each element of the iterable (here: average over the batch)

        # self.policy_gradient = tf.gradients(self.infer, self.weights, -self.action_derivs)
        self.train = tf.train.AdamOptimizer(learning_rate).apply_gradients(zip(self.policy_gradient, self.weights))


    def get_action(self, state, goal):
        actions = self.sess.run(self.infer,
                                feed_dict={
                                    self.state_ph: state,
                                    self.goal_ph: goal
                                })

        return actions

    def get_target_action(self, state, goal):
        actions = self.sess.run(self.target,
                                feed_dict={
                                    self.state_ph: state,
                                    self.goal_ph: goal
                                })

        return actions

    def update(self, state, goal, action_derivs):
        weights, policy_grad, _ = self.sess.run([self.weights, self.policy_gradient, self.train],
                                                feed_dict={
                                                    self.state_ph: state,
                                                    self.goal_ph: goal,
                                                    self.action_derivs: action_derivs
                                                })

        return len(weights)

        # self.sess.run(self.update_target_weights)

    # def create_nn(self, state, goal, name='actor'):
    def create_nn(self, features, name=None):

        if name is None:
            name = self.actor_name

        with tf.variable_scope(name + '_fc_1'):
            fc1 = layer(features, 64)
        # with tf.variable_scope(name + '_fc_2'):
        #     fc2 = layer(fc1, 64)
        # with tf.variable_scope(name + '_fc_3'):
        #     fc3 = layer(fc2, 64)
        with tf.variable_scope(name + '_fc_4'):
            fc4 = layer(fc1, self.action_space_size, is_output=True)

        output = tf.tanh(fc4) * self.action_space_bounds + self.action_offset
        return output
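
The update above implements the DDPG-style deterministic policy gradient: the critic supplies dQ/da, and the actor follows that gradient through its own network (negated and averaged over the batch in the graph construction). The actual call sequence lives in *layer.py*, which is not part of this excerpt; the following is only a hedged sketch of how the two pieces are intended to fit together.

```python
def ddpg_actor_step(actor, critic, states, goals):
    """Hedged sketch only: the real wiring lives in layer.py (not shown in this commit excerpt).

    The critic provides the gradient of Q with respect to the action, evaluated at the
    actor's current actions; Actor.update then pushes the policy in that direction.
    """
    actions = actor.get_action(states, goals)             # a(s, g) from the current policy
    dq_da = critic.get_gradients(states, goals, actions)  # dQ/da from Critic.get_gradients
    return actor.update(states, goals, dq_da)             # applies the averaged policy gradient
```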

actor_critic_layer/critic.py

Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
import tensorflow as tf
import numpy as np
from .utils import layer
import time

class Critic():

    def __init__(self, sess, env, layer_number, FLAGS, learning_rate=0.001, gamma=0.98, tau=0.05):
        self.sess = sess
        self.critic_name = 'critic_' + str(layer_number) + str(time.time())
        self.learning_rate = learning_rate
        self.gamma = gamma
        self.tau = tau

        self.q_limit = -FLAGS.time_scale

        # Dimensions of goal placeholder will differ depending on layer level
        if layer_number == FLAGS.layers - 1:
            self.goal_dim = env.end_goal_dim
        else:
            self.goal_dim = env.subgoal_dim

        self.loss_val = 0
        self.state_dim = env.state_dim
        self.state_ph = tf.placeholder(tf.float32, shape=(None, env.state_dim), name=self.critic_name + 'state_ph')
        self.goal_ph = tf.placeholder(tf.float32, shape=(None, self.goal_dim))

        # Dimensions of action placeholder will differ depending on layer level
        if layer_number == 0:
            action_dim = env.action_dim
        else:
            action_dim = env.subgoal_dim

        self.action_ph = tf.placeholder(tf.float32, shape=(None, action_dim), name=self.critic_name + 'action_ph')

        self.features_ph = tf.concat([self.state_ph, self.goal_ph, self.action_ph], axis=1)

        # Set parameters to give the critic an optimistic initialization near q_init
        self.q_init = -0.067
        self.q_offset = -np.log(self.q_limit/self.q_init - 1)

        # Create critic network graph
        self.infer = self.create_nn(self.features_ph, self.critic_name)
        self.weights = [v for v in tf.trainable_variables() if self.critic_name in v.op.name]

        # Create target critic network graph.  Please note that by default the target networks are not used or updated.  To use target networks, follow the instructions in the "update" method in this file and the "learn" method in the "layer.py" file.

        # Target network code "repurposed" from Patrick Emani :^)
        self.target = self.create_nn(self.features_ph, name=self.critic_name + '_target')
        self.target_weights = [v for v in tf.trainable_variables() if self.critic_name in v.op.name][len(self.weights):]

        self.update_target_weights = \
            [self.target_weights[i].assign(tf.multiply(self.weights[i], self.tau) +
                                           tf.multiply(self.target_weights[i], 1. - self.tau))
             for i in range(len(self.target_weights))]

        self.wanted_qs = tf.placeholder(tf.float32, shape=(None, 1))  # Target (expected) Q-values for the Bellman update

        self.loss = tf.reduce_mean(tf.square(self.wanted_qs - self.infer))

        self.train = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)

        self.gradient = tf.gradients(self.infer, self.action_ph)


    def get_Q_value(self, state, goal, action):
        return self.sess.run(self.infer,
                             feed_dict={
                                 self.state_ph: state,
                                 self.goal_ph: goal,
                                 self.action_ph: action
                             })[0]

    def get_target_Q_value(self, state, goal, action):
        return self.sess.run(self.target,
                             feed_dict={
                                 self.state_ph: state,
                                 self.goal_ph: goal,
                                 self.action_ph: action
                             })[0]


    def update(self, old_states, old_actions, rewards, new_states, goals, new_actions, is_terminals):

        # By default, the repo does not use target networks.  To use target networks, comment out the "wanted_qs" assignment directly below and uncomment the next "wanted_qs" block.  This will make the Bellman update use Q(next state, pi(next state), goal) from the target Q-network instead of the regular Q-network.  Make sure you also make the updates specified in the "learn" method in the "layer.py" file.
        wanted_qs = self.sess.run(self.infer,
                                  feed_dict={
                                      self.state_ph: new_states,
                                      self.goal_ph: goals,
                                      self.action_ph: new_actions
                                  })

        """
        # Uncomment to use target networks
        wanted_qs = self.sess.run(self.target,
                                  feed_dict={
                                      self.state_ph: new_states,
                                      self.goal_ph: goals,
                                      self.action_ph: new_actions
                                  })
        """

        for i in range(len(wanted_qs)):
            if is_terminals[i]:
                wanted_qs[i] = rewards[i]
            else:
                wanted_qs[i] = rewards[i] + self.gamma * wanted_qs[i][0]

            # Ensure the Q target is within the bounds [self.q_limit, 0]
            wanted_qs[i] = max(min(wanted_qs[i], 0), self.q_limit)
            assert wanted_qs[i] <= 0 and wanted_qs[i] >= self.q_limit, "Q-Value target not within proper bounds"

        self.loss_val, _ = self.sess.run([self.loss, self.train],
                                         feed_dict={
                                             self.state_ph: old_states,
                                             self.goal_ph: goals,
                                             self.action_ph: old_actions,
                                             self.wanted_qs: wanted_qs
                                         })

    def get_gradients(self, state, goal, action):
        grads = self.sess.run(self.gradient,
                              feed_dict={
                                  self.state_ph: state,
                                  self.goal_ph: goal,
                                  self.action_ph: action
                              })

        return grads[0]

    # Function that creates the graph for the critic network.  The output uses a sigmoid, which bounds the Q-values to [-Policy Length, 0].
    def create_nn(self, features, name=None):

        if name is None:
            name = self.critic_name

        with tf.variable_scope(name + '_fc_1'):
            fc1 = layer(features, 64)
        # with tf.variable_scope(name + '_fc_2'):
        #     fc2 = layer(fc1, 64)
        # with tf.variable_scope(name + '_fc_3'):
        #     fc3 = layer(fc2, 64)
        with tf.variable_scope(name + '_fc_4'):
            fc4 = layer(fc1, 1, is_output=True)

        # A q_offset is used to give the critic function an optimistic initialization near 0
        output = tf.sigmoid(fc4 + self.q_offset) * self.q_limit

        return output
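
For reference, here is a small NumPy sketch (an editorial illustration, not from the repository) of the target computation performed in the *update* method above: targets come from the regular Q-network (no target network by default), terminal transitions use the raw reward, and all targets are clamped to *[q_limit, 0]*. The value *q_limit = -10* stands in for *-FLAGS.time_scale*.

```python
import numpy as np

# Sketch of the Bellman target computation in Critic.update above.
def bellman_targets(rewards, next_qs, is_terminals, gamma=0.98, q_limit=-10.0):
    targets = np.where(is_terminals, rewards, rewards + gamma * next_qs)
    return np.clip(targets, q_limit, 0.0)   # keep targets within [-T, 0]

r = np.array([-1.0, -1.0, 0.0])
next_q = np.array([-4.0, -9.8, -2.0])
done = np.array([False, False, True])
print(bellman_targets(r, next_q, done))     # [-4.92, -10.0 (clamped), 0.0]
```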
