Commit a42cd06

Yangrui authored and committed
first version published
1 parent 5c989ad commit a42cd06

23 files changed: +2024 -0 lines changed

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2018 andrew-j-levy

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
# Hierarchical Actor-Critic (HAC)

This repository contains the code to implement the *Hierarchical Actor-Critic (HAC)* algorithm. HAC helps agents learn tasks more quickly by enabling them to break problems down into short sequences of actions. The paper describing the algorithm is available [here](https://openreview.net/pdf?id=ryzECoAcY7).

To run HAC, execute the command *"python3 initialize_HAC.py --retrain"*. By default, this will train a UR5 agent with a 3-level hierarchy to learn to achieve certain poses. This UR5 agent should achieve a 90+% success rate in around 350 episodes. The following [video](https://www.youtube.com/watch?v=R86Vs9Vb6Bc) shows how a 3-layered agent performed after 450 episodes of training. In order to watch your trained agent, execute the command *"python3 initialize_HAC.py --test --show"*. Please note that in order to run this repository, you must have (i) a MuJoCo [license](https://www.roboti.us/license.html), (ii) the required MuJoCo software [libraries](https://www.roboti.us/index.html), and (iii) the MuJoCo Python [wrapper](https://github.com/openai/mujoco-py) from OpenAI.

To run HAC with your own agents and MuJoCo environments, complete the template in the *"design_agent_and_env.py"* file. The *"example_designs"* folder contains other examples of design templates that build different agents in the UR5 reacher and inverted pendulum environments.

I am happy to answer any questions you have. Please email me at andrew_levy2@brown.edu.

## UPDATE LOG

### 10/12/2018 - Key Changes

1. Bounded Q-Values

The Q-values output by the critic network at each level are now bounded to *[-T,0]*, in which *T* is the max sequence length in which each policy specializes and also the negative of the subgoal penalty. We use an upper bound of 0 because our code uses a nonpositive reward function, so Q-values should never be positive. However, we noticed that sometimes the critic function approximator would make small mistakes and assign positive Q-values, which occasionally proved harmful to results. In addition, we observed improved results when we used the tighter lower bound of *-T* (i.e., the subgoal penalty). The improved results may come from the increased flexibility that the bounded Q-values give the critic: the critic can assign a value of *-T* to any (state, action, goal) tuple in which the action does not bring the agent close to the goal, instead of having to learn the exact value. A short numerical illustration of this bounding follows this update log.

2. Removed Target Networks

We also noticed improved results when we used the regular Q-networks to determine the Bellman target updates (i.e., *reward + Q(next state, pi(next state), goal)*) instead of the separate target networks used in DDPG. The default setting of our code base therefore no longer uses target networks. However, the target networks can be easily activated by making the changes specified in (i) the *"learn"* method in the *"layer.py"* file and (ii) the *"update"* method in the *"critic.py"* file.

3. Centralized Design Template

Users can now configure the agent and environment in a single file, *"design_agent_and_env.py"*. This template file contains most of the significant hyperparameters in HAC. We have removed the command-line options that change the architecture of the agent's hierarchy.

4. Added UR5 Reacher Environment

We have added a new UR5 reacher environment, in which a UR5 agent can learn to achieve various poses. The *"ur5.xml"* MuJoCo file also contains commented code for a Robotiq gripper if you would like to augment the agent. Additional environments will hopefully be added shortly.
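
The following is a minimal NumPy sketch (an editorial illustration, not part of the repository) of the bounding described in item 1 above. It mirrors the sigmoid parameterization used in *"critic.py"* in this commit; the value *T = 10* is only an example stand-in for *FLAGS.time_scale*.

```python
import numpy as np

# Illustration only: how critic.py squashes Q-values into [-T, 0].
# The critic's last layer computes  Q = sigmoid(logits + q_offset) * q_limit,  with q_limit = -T,
# and q_offset chosen so that a pre-activation of 0 yields an optimistic initial value q_init.
T = 10                                   # example max sequence length (FLAGS.time_scale)
q_limit = -float(T)
q_init = -0.067                          # optimistic initialization target, as in critic.py
q_offset = -np.log(q_limit / q_init - 1)

def bounded_q(logits):
    return 1.0 / (1.0 + np.exp(-(logits + q_offset))) * q_limit

print(bounded_q(0.0))     # ~ -0.067, i.e. q_init (optimistic start)
print(bounded_q(-50.0))   # ~  0.0, the upper bound
print(bounded_q(50.0))    # ~ -10.0, the lower bound q_limit = -T
```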

actor_critic_layer/RND.py

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
import tensorflow as tf
import numpy as np


def create_nn(input, input_num, output_num, init_val=0.001, relu=True, trainable=True, name=''):
    shape = [input_num, output_num]

    w_init = tf.random_uniform_initializer(minval=-init_val, maxval=init_val)
    b_init = tf.random_uniform_initializer(minval=-init_val, maxval=init_val)

    weights = tf.get_variable(name + "weights", shape, initializer=w_init, trainable=trainable)
    biases = tf.get_variable(name + "biases", [output_num], initializer=b_init, trainable=trainable)

    dot = tf.matmul(input, weights) + biases

    if not relu:
        return dot

    dot = tf.nn.relu(dot)
    return dot


class RND:
    def __init__(self, s_features, out_features=3, learning_rate=0.001):
        self.s_features = s_features
        self.out_features = out_features
        self.lr = learning_rate
        self._build_net()
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())

    def _build_net(self):  # Build two randomly initialized networks: one is trained, the other stays fixed
        self.state = tf.placeholder(tf.float32, [None, self.s_features], name='state')  # input

        with tf.variable_scope('train_net'):
            l1 = create_nn(self.state, self.s_features, 64, relu=True, trainable=True, name='l1')
            self.train_net_output = create_nn(l1, 64, self.out_features, relu=False, trainable=True, name='output')

        with tf.variable_scope('target_net'):
            l1_ = create_nn(self.state, self.s_features, 64, init_val=10, relu=True, trainable=False, name='l1')
            self.target_net_output = create_nn(l1_, 64, self.out_features, init_val=10, relu=False, trainable=False, name='output')

        # Prediction error between the trained and fixed networks; the per-state error is the intrinsic reward
        self.loss = tf.reduce_mean(tf.squared_difference(self.train_net_output, self.target_net_output))
        self.intrinsic_reward = tf.reduce_mean(tf.squared_difference(self.train_net_output, self.target_net_output), axis=1)
        self._train_op = tf.train.AdamOptimizer(self.lr).minimize(self.loss)

    def train(self, state):
        loss, _ = self.sess.run([self.loss, self._train_op], feed_dict={
            self.state: state,
        })
        return loss

    def get_intrinsic_reward(self, state):
        return self.sess.run(self.intrinsic_reward, feed_dict={self.state: state})

    def get_target(self, state):
        return self.sess.run(self.target_net_output, feed_dict={self.state: state})
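
As a usage note (not part of the commit), the class above can be driven roughly as follows. The state dimension, batch contents, and import path are assumptions made for illustration.

```python
import numpy as np
from actor_critic_layer.RND import RND  # assumes actor_critic_layer is importable as a package

rnd = RND(s_features=17)                  # 17 is an arbitrary example state dimension
states = np.random.randn(32, 17)          # a fake batch of visited states

loss = rnd.train(states)                  # fit the trainable predictor toward the fixed random target net
bonus = rnd.get_intrinsic_reward(states)  # per-state prediction error, usable as an exploration bonus
print(loss, bonus.shape)                  # bonus has shape (32,)
```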

actor_critic_layer/actor.py

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
import tensorflow as tf
import numpy as np
from .utils import layer
import time

class Actor():

    def __init__(self,
                 sess,
                 env,
                 batch_size,
                 layer_number,
                 FLAGS,
                 learning_rate=0.001,
                 tau=0.05):

        self.sess = sess
        # self.seed = FLAGS.seed

        # Determine range of actor network outputs.  This will be used to configure the outer layer of the neural network
        if layer_number == 0:  # The bottom layer outputs primitive actions rather than subgoals
            self.action_space_bounds = env.action_bounds
            self.action_offset = env.action_offset
        else:
            self.action_space_bounds = env.subgoal_bounds_symmetric
            self.action_offset = env.subgoal_bounds_offset

        # Dimensions of action will depend on layer level
        if layer_number == 0:
            self.action_space_size = env.action_dim
        else:
            self.action_space_size = env.subgoal_dim

        self.actor_name = 'actor_' + str(layer_number) + str(time.time())

        # Dimensions of goal placeholder will differ depending on layer level
        if layer_number == FLAGS.layers - 1:
            self.goal_dim = env.end_goal_dim
        else:
            self.goal_dim = env.subgoal_dim

        self.state_dim = env.state_dim

        self.learning_rate = learning_rate
        # self.exploration_policies = exploration_policies
        self.tau = tau  # Polyak averaging coefficient for the target network update
        self.batch_size = batch_size

        self.state_ph = tf.placeholder(tf.float32, shape=(None, self.state_dim))
        self.goal_ph = tf.placeholder(tf.float32, shape=(None, self.goal_dim))
        self.features_ph = tf.concat([self.state_ph, self.goal_ph], axis=1)

        # Create actor network
        self.infer = self.create_nn(self.features_ph, self.actor_name)

        # Target network code "repurposed" from Patrick Emani :^)
        self.weights = [v for v in tf.trainable_variables() if self.actor_name in v.op.name]
        # self.num_weights = len(self.weights)

        # Create target actor network
        self.target = self.create_nn(self.features_ph, name=self.actor_name + '_target')
        self.target_weights = [v for v in tf.trainable_variables() if self.actor_name in v.op.name][len(self.weights):]  # Variables created after the original network's variables

        self.update_target_weights = \
            [self.target_weights[i].assign(tf.multiply(self.weights[i], self.tau) +
                                           tf.multiply(self.target_weights[i], 1. - self.tau))
             for i in range(len(self.target_weights))]  # Smoothly (Polyak) update the target network

        self.action_derivs = tf.placeholder(tf.float32, shape=(None, self.action_space_size))  # Gradient of Q with respect to the action; the deterministic policy outputs a single action
        self.unnormalized_actor_gradients = tf.gradients(self.infer, self.weights, -self.action_derivs)
        self.policy_gradient = list(map(lambda x: tf.div(x, self.batch_size), self.unnormalized_actor_gradients))  # map applies the function to each element of the iterable (here: average over the batch)

        # self.policy_gradient = tf.gradients(self.infer, self.weights, -self.action_derivs)
        self.train = tf.train.AdamOptimizer(learning_rate).apply_gradients(zip(self.policy_gradient, self.weights))


    def get_action(self, state, goal):
        actions = self.sess.run(self.infer,
                                feed_dict={
                                    self.state_ph: state,
                                    self.goal_ph: goal
                                })

        return actions

    def get_target_action(self, state, goal):
        actions = self.sess.run(self.target,
                                feed_dict={
                                    self.state_ph: state,
                                    self.goal_ph: goal
                                })

        return actions

    def update(self, state, goal, action_derivs):
        weights, policy_grad, _ = self.sess.run([self.weights, self.policy_gradient, self.train],
                                                feed_dict={
                                                    self.state_ph: state,
                                                    self.goal_ph: goal,
                                                    self.action_derivs: action_derivs
                                                })

        return len(weights)

        # self.sess.run(self.update_target_weights)

    # def create_nn(self, state, goal, name='actor'):
    def create_nn(self, features, name=None):

        if name is None:
            name = self.actor_name

        with tf.variable_scope(name + '_fc_1'):
            fc1 = layer(features, 64)
        # with tf.variable_scope(name + '_fc_2'):
        #     fc2 = layer(fc1, 64)
        # with tf.variable_scope(name + '_fc_3'):
        #     fc3 = layer(fc2, 64)
        with tf.variable_scope(name + '_fc_4'):
            fc4 = layer(fc1, self.action_space_size, is_output=True)

        output = tf.tanh(fc4) * self.action_space_bounds + self.action_offset
        return output
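
The update above implements the DDPG-style deterministic policy gradient: the critic supplies dQ/da, and the actor follows that gradient through its own network (negated and averaged over the batch in the graph construction). The actual call sequence lives in *layer.py*, which is not part of this excerpt; the following is only a hedged sketch of how the two pieces are intended to fit together.

```python
def ddpg_actor_step(actor, critic, states, goals):
    """Hedged sketch only: the real wiring lives in layer.py (not shown in this commit excerpt).

    The critic provides the gradient of Q with respect to the action, evaluated at the
    actor's current actions; Actor.update then pushes the policy in that direction.
    """
    actions = actor.get_action(states, goals)             # a(s, g) from the current policy
    dq_da = critic.get_gradients(states, goals, actions)  # dQ/da from Critic.get_gradients
    return actor.update(states, goals, dq_da)             # applies the averaged policy gradient
```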

actor_critic_layer/critic.py

Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
import tensorflow as tf
import numpy as np
from .utils import layer
import time

class Critic():

    def __init__(self, sess, env, layer_number, FLAGS, learning_rate=0.001, gamma=0.98, tau=0.05):
        self.sess = sess
        self.critic_name = 'critic_' + str(layer_number) + str(time.time())
        self.learning_rate = learning_rate
        self.gamma = gamma
        self.tau = tau

        self.q_limit = -FLAGS.time_scale

        # Dimensions of goal placeholder will differ depending on layer level
        if layer_number == FLAGS.layers - 1:
            self.goal_dim = env.end_goal_dim
        else:
            self.goal_dim = env.subgoal_dim

        self.loss_val = 0
        self.state_dim = env.state_dim
        self.state_ph = tf.placeholder(tf.float32, shape=(None, env.state_dim), name=self.critic_name + 'state_ph')
        self.goal_ph = tf.placeholder(tf.float32, shape=(None, self.goal_dim))

        # Dimensions of action placeholder will differ depending on layer level
        if layer_number == 0:
            action_dim = env.action_dim
        else:
            action_dim = env.subgoal_dim

        self.action_ph = tf.placeholder(tf.float32, shape=(None, action_dim), name=self.critic_name + 'action_ph')

        self.features_ph = tf.concat([self.state_ph, self.goal_ph, self.action_ph], axis=1)

        # Set parameters to give the critic an optimistic initialization near q_init
        self.q_init = -0.067
        self.q_offset = -np.log(self.q_limit/self.q_init - 1)

        # Create critic network graph
        self.infer = self.create_nn(self.features_ph, self.critic_name)
        self.weights = [v for v in tf.trainable_variables() if self.critic_name in v.op.name]

        # Create target critic network graph.  Please note that by default the target networks are not used or updated.  To use target networks, follow the instructions in the "update" method in this file and the "learn" method in the "layer.py" file.

        # Target network code "repurposed" from Patrick Emani :^)
        self.target = self.create_nn(self.features_ph, name=self.critic_name + '_target')
        self.target_weights = [v for v in tf.trainable_variables() if self.critic_name in v.op.name][len(self.weights):]

        self.update_target_weights = \
            [self.target_weights[i].assign(tf.multiply(self.weights[i], self.tau) +
                                           tf.multiply(self.target_weights[i], 1. - self.tau))
             for i in range(len(self.target_weights))]

        self.wanted_qs = tf.placeholder(tf.float32, shape=(None, 1))  # Target (expected) Q-values for the Bellman update

        self.loss = tf.reduce_mean(tf.square(self.wanted_qs - self.infer))

        self.train = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)

        self.gradient = tf.gradients(self.infer, self.action_ph)


    def get_Q_value(self, state, goal, action):
        return self.sess.run(self.infer,
                             feed_dict={
                                 self.state_ph: state,
                                 self.goal_ph: goal,
                                 self.action_ph: action
                             })[0]

    def get_target_Q_value(self, state, goal, action):
        return self.sess.run(self.target,
                             feed_dict={
                                 self.state_ph: state,
                                 self.goal_ph: goal,
                                 self.action_ph: action
                             })[0]


    def update(self, old_states, old_actions, rewards, new_states, goals, new_actions, is_terminals):

        # By default, the repo does not use target networks.  To use target networks, comment out the "wanted_qs" assignment directly below and uncomment the next "wanted_qs" block.  This will make the Bellman update use Q(next state, pi(next state), goal) from the target Q-network instead of the regular Q-network.  Make sure you also make the updates specified in the "learn" method in the "layer.py" file.
        wanted_qs = self.sess.run(self.infer,
                                  feed_dict={
                                      self.state_ph: new_states,
                                      self.goal_ph: goals,
                                      self.action_ph: new_actions
                                  })

        """
        # Uncomment to use target networks
        wanted_qs = self.sess.run(self.target,
                                  feed_dict={
                                      self.state_ph: new_states,
                                      self.goal_ph: goals,
                                      self.action_ph: new_actions
                                  })
        """

        for i in range(len(wanted_qs)):
            if is_terminals[i]:
                wanted_qs[i] = rewards[i]
            else:
                wanted_qs[i] = rewards[i] + self.gamma * wanted_qs[i][0]

            # Ensure the Q target is within the bounds [self.q_limit, 0]
            wanted_qs[i] = max(min(wanted_qs[i], 0), self.q_limit)
            assert wanted_qs[i] <= 0 and wanted_qs[i] >= self.q_limit, "Q-Value target not within proper bounds"

        self.loss_val, _ = self.sess.run([self.loss, self.train],
                                         feed_dict={
                                             self.state_ph: old_states,
                                             self.goal_ph: goals,
                                             self.action_ph: old_actions,
                                             self.wanted_qs: wanted_qs
                                         })

    def get_gradients(self, state, goal, action):
        grads = self.sess.run(self.gradient,
                              feed_dict={
                                  self.state_ph: state,
                                  self.goal_ph: goal,
                                  self.action_ph: action
                              })

        return grads[0]

    # Function that creates the graph for the critic network.  The output uses a sigmoid, which bounds the Q-values to [-Policy Length, 0].
    def create_nn(self, features, name=None):

        if name is None:
            name = self.critic_name

        with tf.variable_scope(name + '_fc_1'):
            fc1 = layer(features, 64)
        # with tf.variable_scope(name + '_fc_2'):
        #     fc2 = layer(fc1, 64)
        # with tf.variable_scope(name + '_fc_3'):
        #     fc3 = layer(fc2, 64)
        with tf.variable_scope(name + '_fc_4'):
            fc4 = layer(fc1, 1, is_output=True)

        # A q_offset is used to give the critic function an optimistic initialization near 0
        output = tf.sigmoid(fc4 + self.q_offset) * self.q_limit

        return output
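
For reference, here is a small NumPy sketch (an editorial illustration, not from the repository) of the target computation performed in the *update* method above: targets come from the regular Q-network (no target network by default), terminal transitions use the raw reward, and all targets are clamped to *[q_limit, 0]*. The value *q_limit = -10* stands in for *-FLAGS.time_scale*.

```python
import numpy as np

# Sketch of the Bellman target computation in Critic.update above.
def bellman_targets(rewards, next_qs, is_terminals, gamma=0.98, q_limit=-10.0):
    targets = np.where(is_terminals, rewards, rewards + gamma * next_qs)
    return np.clip(targets, q_limit, 0.0)   # keep targets within [-T, 0]

r = np.array([-1.0, -1.0, 0.0])
next_q = np.array([-4.0, -9.8, -2.0])
done = np.array([False, False, True])
print(bellman_targets(r, next_q, done))     # [-4.92, -10.0 (clamped), 0.0]
```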
