Continuous control with deep reinforcement learning 2016-06-28 Taehoon Kim
Motivation • DQN can only handle • discrete (not continuous) • low-dimensional action spaces • A simple approach to adapting DQN to a continuous domain is to discretize the action space • e.g. a 7-degree-of-freedom system with discretization $a_i \in \{-k, 0, k\}$ • the action space size becomes $3^7 = 2187$ • an explosion in the number of discrete actions
Contribution • Present a model-free, off-policy actor-critic algorithm • learns policies in high-dimensional, continuous action spaces • Builds on DPG (Deterministic Policy Gradient)
Background • actions $a_t \in \mathbb{R}^N$, action space $\mathcal{A} = \mathbb{R}^N$ • history of observation-action pairs $s_t = (x_1, a_1, \ldots, a_{t-1}, x_t)$ • assume full observability, so $s_t = x_t$ • policy $\pi: \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})$ • Model the environment as a Markov decision process • initial state distribution $p(s_1)$ • transition dynamics $p(s_{t+1} \mid s_t, a_t)$
Background • Discounted future reward $R_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i)$ (worked example below) • The goal of RL is to learn a policy $\pi$ which maximizes the expected return from the start distribution, $J = \mathbb{E}_{r_i, s_i \sim E, a_i \sim \pi}[R_1]$ • Discounted state visitation distribution for a policy $\pi$: $\rho^\pi$
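As a quick illustration of the discounted return $R_t$, a minimal Python sketch with made-up rewards and $\gamma = 0.99$:

```python
def discounted_return(rewards, gamma=0.99):
    """R_t = sum_{i=t}^{T} gamma^(i-t) * r(s_i, a_i), evaluated at t = 0."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# Made-up reward sequence: later rewards are down-weighted geometrically.
print(discounted_return([1.0, 0.0, 0.5, 1.0]))  # 1.0 + 0.99**2 * 0.5 + 0.99**3 * 1.0 ≈ 2.46
```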
Background • action-value function $Q^\pi(s_t, a_t) = \mathbb{E}_{r_{i \ge t}, s_{i>t} \sim E, a_{i>t} \sim \pi}[R_t \mid s_t, a_t]$ • the expected return after taking action $a_t$ in state $s_t$ and then following policy $\pi$ • Bellman equation • $Q^\pi(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\big[r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}[Q^\pi(s_{t+1}, a_{t+1})]\big]$ • With a deterministic policy $\mu: \mathcal{S} \rightarrow \mathcal{A}$ • $Q^\mu(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\big[r(s_t, a_t) + \gamma\, Q^\mu(s_{t+1}, \mu(s_{t+1}))\big]$
Background • The expectation depends only on the environment • so it is possible to learn $Q^\mu$ off-policy, using transitions generated by a different stochastic behaviour policy $\beta$ • Q-learning (a commonly used off-policy algorithm) uses the greedy policy $\mu(s) = \arg\max_a Q(s, a)$ • $L(\theta^Q) = \mathbb{E}_{s_t \sim \rho^\beta, a_t \sim \beta, r_t \sim E}\big[(Q(s_t, a_t \mid \theta^Q) - y_t)^2\big]$ • where $y_t = r(s_t, a_t) + \gamma\, Q(s_{t+1}, \mu(s_{t+1}) \mid \theta^Q)$ • To scale Q-learning to large non-linear approximators, DQN adds • a replay buffer and a separate target network (critic-loss sketch below)
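A minimal PyTorch sketch of this squared-error loss with a target network. The names `critic`, `target_critic`, and `target_actor` are assumed network objects and the batch tensors are assumed to come from a replay buffer; this is an illustrative reconstruction, not the paper's code:

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, target_critic, target_actor,
                state, action, reward, next_state, done, gamma=0.99):
    """L(theta_Q) = E[(Q(s_t, a_t | theta_Q) - y_t)^2], with the target
    y_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1})) computed from target networks."""
    with torch.no_grad():  # the target y_t is treated as a constant
        next_action = target_actor(next_state)
        y = reward + gamma * (1.0 - done) * target_critic(next_state, next_action)
    q = critic(state, action)
    return F.mse_loss(q, y)
```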
Deterministic Policy Gradient (DPG) • In a continuous action space, finding the greedy policy requires optimizing over $a_t$ at every timestep • too slow for large, unconstrained function approximators and nontrivial action spaces • Instead, use an actor-critic approach based on the DPG algorithm • actor: $\mu(s \mid \theta^\mu): \mathcal{S} \rightarrow \mathcal{A}$ • critic: $Q(s, a \mid \theta^Q)$
Learning algorithm • The actor is updated by applying the chain rule to the expected return from the start distribution $J$ w.r.t. the actor parameters $\theta^\mu$ • $\nabla_{\theta^\mu} J \approx \mathbb{E}_{s \sim \rho^\beta}\big[\nabla_{\theta^\mu} Q(s, a \mid \theta^Q)\big|_{s=s_t,\, a=\mu(s_t \mid \theta^\mu)}\big] = \mathbb{E}_{s \sim \rho^\beta}\big[\nabla_a Q(s, a \mid \theta^Q)\big|_{s=s_t,\, a=\mu(s_t)}\, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s=s_t}\big]$ • Silver et al. (2014) proved this is the policy gradient • i.e. the gradient of the policy's performance (sketch below)
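In an automatic-differentiation framework the chain rule above does not have to be written out by hand: ascending $Q(s, \mu(s \mid \theta^\mu))$ w.r.t. $\theta^\mu$ yields the same gradient. A hedged PyTorch sketch, where `actor`, `critic`, and `actor_optimizer` are placeholder objects:

```python
def actor_update(actor, critic, actor_optimizer, state):
    """Ascend E[Q(s, mu(s | theta_mu))]; autograd applies the DPG chain rule
    grad_a Q(s, a) * grad_theta mu(s) automatically."""
    actor_optimizer.zero_grad()
    actor_loss = -critic(state, actor(state)).mean()  # minimize -Q == maximize Q
    actor_loss.backward()
    actor_optimizer.step()
```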
Contributions • Introducing non-linear function approximators means that convergence is no longer guaranteed • but they are essential to learn and generalize on large state spaces • Contribution • modifications to DPG, inspired by the success of DQN • that allow neural network function approximators to learn online in large state and action spaces
Challenges 1 • NNs used in RL usually assume that samples are i.i.d. • but when samples are generated by exploring sequentially in an environment, this assumption no longer holds • As in DQN, a replay buffer is used to address this issue • As in DQN, target networks are used for stable learning, but with "soft" target updates • $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$, with $\tau \ll 1$ • Target networks change slowly, which greatly improves the stability of learning (sketch below)
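The soft update $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$ fits in a few lines; a sketch assuming both networks are PyTorch modules with identical architectures:

```python
import torch

@torch.no_grad()
def soft_update(target_net, source_net, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta', applied parameter-wise."""
    for target_param, param in zip(target_net.parameters(), source_net.parameters()):
        target_param.mul_(1.0 - tau).add_(tau * param)
```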
Challenges 2 • When learning from low-dimensional feature vectors, observations may have different physical units (e.g. positions and velocities) • this makes it difficult to learn effectively and to find hyper-parameters that generalize across environments • Use batch normalization [Ioffe & Szegedy, 2015] to normalize each dimension across the samples in a minibatch to unit mean and variance • It also maintains a running average of the mean and variance for normalization during testing (exploration or evaluation) • Applied to all layers of $\mu$ and to all layers of $Q$ prior to the action input (sketch below) • Lets the algorithm train on tasks with different units without manually ensuring the inputs lie within a set range
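One possible reading of this batch-norm placement, sketched in PyTorch for the critic: normalize the observation pathway, and stop applying batch norm once the action enters the network. Layer sizes follow the experiment slide; the exact layout is an assumption, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.bn_state = nn.BatchNorm1d(state_dim)    # normalize raw observations
        self.fc1 = nn.Linear(state_dim, 400)
        self.bn1 = nn.BatchNorm1d(400)               # batch norm only prior to the action input
        self.fc2 = nn.Linear(400 + action_dim, 300)  # action joins at the 2nd hidden layer
        self.out = nn.Linear(300, 1)

    def forward(self, state, action):
        x = F.relu(self.bn1(self.fc1(self.bn_state(state))))
        x = F.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.out(x)
```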
Challenges 3 • An advantage of an off-policy algorithm such as DDPG is that exploration can be treated independently from the learning algorithm • Construct an exploration policy $\mu'$ by adding noise sampled from a noise process $\mathcal{N}$ • $\mu'(s_t) = \mu(s_t \mid \theta_t^\mu) + \mathcal{N}$ • Use an Ornstein-Uhlenbeck process to generate temporally correlated exploration, for exploration efficiency in physical control problems with inertia (sketch below)
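A minimal NumPy sketch of the Ornstein-Uhlenbeck noise process, using the $\theta$ and $\sigma$ values from the experiment slide; the zero mean and unit time step are assumptions, since the slides do not spell out the discretization:

```python
import numpy as np

class OUNoise:
    """Temporally correlated noise: x_{k+1} = x_k + theta * (mu - x_k) + sigma * N(0, 1)."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(action_dim)
        self.theta, self.sigma = theta, sigma
        self.reset()

    def reset(self):
        self.state = self.mu.copy()

    def sample(self):
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(*self.state.shape)
        self.state = self.state + dx
        return self.state

# Exploration action: a_t = mu(s_t | theta_mu) + noise.sample(), clipped to the valid action range.
```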
Experiment details • Adam optimizer, $lr_\mu = 10^{-4}$, $lr_Q = 10^{-3}$ • $Q$ includes an $L_2$ weight decay of $10^{-2}$; $\gamma = 0.99$ • $\tau = 0.001$ • ReLU for hidden layers, tanh for the output layer of the actor to bound the actions • Networks: 2 hidden layers with 400 and 300 units • The action is not included until the 2nd hidden layer of $Q$ • Final layer weights and biases are initialized from a uniform distribution $[-3 \times 10^{-3}, 3 \times 10^{-3}]$ • to ensure the initial outputs of the policy and value estimates are near zero • Other layers are initialized from uniform distributions $[-1/\sqrt{f}, 1/\sqrt{f}]$, where $f$ is the fan-in of the layer (sketch below) • Replay buffer size $|\mathcal{R}| = 10^6$; Ornstein-Uhlenbeck process with $\theta = 0.15$, $\sigma = 0.2$
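A sketch of the actor with the initialization scheme above (fan-in uniform for hidden layers, $\pm 3 \times 10^{-3}$ for the output layer, tanh to bound actions). This is an illustrative PyTorch reconstruction from the slide, not the authors' code; the optimizers would then be Adam with the learning rates and weight decay listed above:

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def fan_in_init(layer):
    """Uniform init in [-1/sqrt(f), 1/sqrt(f)], where f is the layer's fan-in."""
    bound = 1.0 / np.sqrt(layer.weight.size(1))
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.out = nn.Linear(300, action_dim)
        fan_in_init(self.fc1)
        fan_in_init(self.fc2)
        nn.init.uniform_(self.out.weight, -3e-3, 3e-3)  # keep initial policy outputs near zero
        nn.init.uniform_(self.out.bias, -3e-3, 3e-3)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.out(x))  # tanh bounds the actions to [-1, 1]
```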
