Reinforcement Learning Papers
Mistakes teach us to clarify what we really want and how we want to live. That’s the spirit of reinforcement learning: learning from mistakes. Let’s be explorers in reinforcement learning!
Deep Reinforcement Learning
 Speaker: John Schulman, OpenAI
 Gradient Estimation Using Stochastic Computation Graphs [NIPS 2015]
 John Schulman, Nicolas Heess, Theophane Weber, Pieter Abbeel
 Maximum Entropy Inverse Reinforcement Learning [AAAI 2008]
 Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey
 Close to the real case :point_right: handles the suboptimal case, where optimal demonstrations can’t cover the whole state space, and alleviates the reward function ambiguity
 Basic concept: plans with equivalent rewards have equal probabilities, and plans with higher rewards are exponentially more preferred.
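A compact way to state this (a sketch of the MaxEnt IRL trajectory distribution for the deterministic-dynamics case; θ are the reward weights and f_τ the trajectory feature counts):

```latex
P(\tau \mid \theta) = \frac{\exp(\theta^\top \mathbf{f}_\tau)}{Z(\theta)}
```

Trajectories with equal reward θᵀf_τ get equal probability, and a reward advantage enters through the exponent, exactly as the bullet above says.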
 Reinforcement Learning with Unsupervised Auxiliary Tasks [arXiv 2016]
 Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, Koray Kavukcuoglu
 Introduces an agent that also maximises many other pseudo-reward functions simultaneously by reinforcement learning
 The advantage of auxiliary tasks: in many environments the extrinsic reward is very sparse, which makes the feature extractor hard to learn at the beginning. Giving some pseudo-reward lets the learner know how to interpret the image in the initial stage.
 The paper proposes two main auxiliary tasks: pixel changes and network features.
 Pixel changes: maximally changing the pixels in each cell of an n×n non-overlapping grid placed over the input image :point_right: encourages the learner to move faster or avoid stopping (I guess); see the sketch after this entry
 Network features: maximally activating each of the units in a specific hidden layer :point_right: to fully use the hidden units
 Section 4.1 “Unsupervised Reinforcement Learning” discusses why the pixel reconstruction loss is not used
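The pixel-change pseudo-reward is easy to sketch. A minimal version, assuming greyscale frames and a mean-absolute-change target (the function name and the exact change measure are illustrative, not the authors' code):

```python
import numpy as np

def pixel_change_rewards(frame_prev, frame_next, grid=20):
    """Pseudo-reward per cell of an n x n non-overlapping grid over the
    input image: the mean absolute pixel change inside each cell."""
    diff = np.abs(frame_next.astype(np.float32) - frame_prev.astype(np.float32))
    h, w = diff.shape[:2]
    ch, cw = h // grid, w // grid
    rewards = np.zeros((grid, grid), dtype=np.float32)
    for i in range(grid):
        for j in range(grid):
            rewards[i, j] = diff[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw].mean()
    return rewards  # an auxiliary Q head learns to maximise each cell's change
```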
 Apprenticeship Learning via Inverse Reinforcement Learning [ICML 2004]
 Pieter Abbeel, Andrew Y. Ng
 The first time apprenticeship learning is proposed
 Most of these methods try to directly mimic the demonstrator by applying a supervised learning algorithm to learn a direct mapping from states to actions :point_right: only suitable for the case where the task is to mimic the expert’s trajectory
 The reward function, rather than the policy or the value function, is the most succinct, robust, and transferable definition of the task
 Basic concept: use inverse reinforcement learning to recover the reward function from the expert, and use that reward function to find the optimal policy.
 Actually, apprenticeship learning doesn’t need to find the correct reward function. Instead, it uses the predicted reward function to find a policy that is similar to the expert’s (see the sketch after this entry).
 From the experiments, apprenticeship learning needs fewer sample trajectories than action mimicking.
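As a sketch of how "find a policy similar to the expert" works, here is the projection variant of the algorithm; the three callables are assumed interfaces, not code from the paper:

```python
import numpy as np

def apprenticeship_projection(mu_expert, compute_mu, solve_mdp, eps=1e-3, n_iters=50):
    """mu_expert     : expert feature expectations
    compute_mu(pi)   : Monte-Carlo estimate of a policy's feature expectations
    solve_mdp(w)     : inner RL solver, returns an optimal policy for R(s) = w . phi(s)"""
    pi = solve_mdp(np.zeros_like(mu_expert))   # arbitrary initial policy
    mu_bar = compute_mu(pi)
    for _ in range(n_iters):
        w = mu_expert - mu_bar                 # reward weights = unmatched features
        if np.linalg.norm(w) <= eps:           # feature expectations matched
            break                              # -> policy performs like the expert
        pi = solve_mdp(w)
        mu = compute_mu(pi)
        d = mu - mu_bar                        # project mu_bar toward the new mu
        mu_bar = mu_bar + ((d @ w) / (d @ d)) * d
    return pi, w
```

Note the loop never certifies that w is the expert's true reward; it only drives the feature expectations together, which is the point of the bullet above.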
 Algorithms for Inverse Reinforcement Learning [ICML 2000]
 Andrew Y. Ng, Stuart Russell
 In examining animal and human behavior we must consider the reward function as an unknown to be ascertained through empirical investigation.
 Recovers the expert’s reward function and uses this to generate desired behavior
 Uses eq. 4 (see the derivation in the paper, which is quite clear)
 Avoids reward function ambiguity by adding a margin objective with a λ‖R‖ penalty
 Uses a feature representation for the reward function
 Since the reward function is assumed to be a linear combination of features, two policies with similar accumulated feature expectations have similar accumulated rewards (see the bound after this entry)
 The experiment section shows IRL is solvable at least in moderately sized discrete and continuous spaces
 Reference: Inverse Reinforcement Learning, by Pieter Abbeel.
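The claim about similar feature expectations follows from a standard one-line Cauchy-Schwarz argument, with μ(π) the discounted feature expectations:

```latex
R(s) = w^\top \phi(s) \;\Rightarrow\;
\mathbb{E}\Big[\textstyle\sum_t \gamma^t R(s_t)\Big] = w^\top \mu(\pi),
\qquad
|V(\pi_1) - V(\pi_2)| \le \|w\|_2\, \|\mu(\pi_1) - \mu(\pi_2)\|_2
```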
Deep Reinforcement Learning for Motion Planning
 Deep Reinforcement Learning with a Natural Language Action Space [ACL 2016]
 Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, Mari Ostendorf
 Task: a string of text :point_right: state; several strings of text :point_right: potential actions
 Other
 Imitation Learning with Recurrent Neural Networks [arXiv 2016]
 Khanh Nguyen
 Aims to unify two sequence prediction frameworks: learning to search (L2S) and recurrent neural networks
 A supervised recurrent neural network makes an independent prediction at each time step, which suffers seriously from compounding errors, since the input observations are correlated and thus violate the independent and identically distributed (i.i.d.) assumption required by any supervised approach.
 L2S algorithms reduce a sequential prediction problem to learning a policy to traverse a search space with minimum cost. This guarantees that compounding errors grow linearly with trajectory length.
 Map the RNN components to L2S (taking sequence-to-sequence as an example; see the sketch after this list):
 hidden representation :point_right: s_t
 decoded word :point_right: a_t
 encoded word :point_right: x_t (new information from the environment)
 The nonlinear gates in the RNN serve as the transition function
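A toy decoding step, annotated with the L2S reading of each quantity (the tanh/softmax parameterisation here is illustrative, not the paper's exact model):

```python
import numpy as np

def decoder_step(s_t, x_t, W_h, W_x, W_out):
    # transition function: the nonlinear gates map (state, new input) -> next state
    s_next = np.tanh(W_h @ s_t + W_x @ x_t)   # s_t: hidden representation (search state)
    logits = W_out @ s_next
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # the policy over the vocabulary
    a_t = int(np.argmax(probs))               # a_t: decoded word (action)
    return s_next, a_t                        # x_t was the encoded word (new observation)
```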
 Language Understanding for Text-based Games Using Deep Reinforcement Learning [EMNLP 2015]
 Karthik Narasimhan, Tejas Kulkarni, Regina Barzilay
 Uses natural language as the state representation, with a fixed action space (does not output free-form natural language :point_right: major restriction)
 :star: challenging part: the environment is not directly observable
 The task includes text interpretation and learning a strategy built on that interpretation
 Uses an LSTM to interpret the state (in natural language form), and a DQN to select the corresponding action (see the sketch after this entry)
 Basically follows the DeepMind DQN paper, with experience replay and mini-batch updates
 Using t-SNE for the representation analysis is really cool (fig. 5)
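A minimal sketch of the LSTM-state-encoder + Q-head idea (the two-head split into action and object words follows the paper's description, but the layer sizes, pooling, and module names are assumptions):

```python
import torch
import torch.nn as nn

class LSTMDQN(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128,
                 n_actions=10, n_objects=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.q_action = nn.Linear(hidden_dim, n_actions)  # Q over action words
        self.q_object = nn.Linear(hidden_dim, n_objects)  # Q over object words

    def forward(self, tokens):                 # tokens: (batch, seq_len) word ids
        h, _ = self.lstm(self.embed(tokens))   # LSTM interprets the state text
        state = h.mean(dim=1)                  # pooled hidden states = state vector
        return self.q_action(state), self.q_object(state)
```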
 :star: High-Dimensional Continuous Control Using Generalized Advantage Estimation [ICLR 2016]
 John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, Pieter Abbeel
 In extremely high-dimensional tasks (like continuous control in 3D environments), stability is a key point.
 Proposes an effective variance reduction scheme for policy gradients, called generalized advantage estimation (GAE)
 Motivation of GAE: suppose we have a fixed number of steps; from eq. 15 we know the bias of each advantage estimator is k-dependent. As k increases, the bias term becomes more negligible while the variance increases, and vice versa. (If this concept feels abstract, recall that MC is unbiased but has high variance, while TD is biased but has low variance.)
 λ is a new concept included in this paper.
 If λ = 0 (as in eq. 17), we have low variance, but the estimator is biased
 If λ = 1 (as in eq. 18), we have high variance, but the estimator is unbiased
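The λ trade-off is clearest in code. A standard GAE implementation as a sketch, assuming `values` carries one extra bootstrap entry V(s_T):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """A_t = sum_l (gamma * lam)^l * delta_{t+l},
    with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    lam=0 recovers the one-step TD residual (low variance, biased);
    lam=1 recovers the MC-style estimate (high variance, unbiased)."""
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD residuals
    advantages = np.zeros_like(deltas)
    acc = 0.0
    for t in reversed(range(len(deltas))):               # backward recursion
        acc = deltas[t] + gamma * lam * acc
        advantages[t] = acc
    return advantages
```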
 :star: Recurrent Models of Visual Attention [NIPS 2014]
 Volodymyr Mnih, Nicolas Heess, Alex Graves, Koray Kavukcuoglu
 Motivation: it is computationally expensive to deal with large images; many attention methods’ computation cost is proportional to the image size.
 Uses action control to attend to part of the image (defines a Gaussian and treats the location, the mean of the Gaussian, as the action)
 Can be viewed as a POMDP (partially observable Markov decision process)
 The location network outputs a 2D Gaussian distribution
 The action is stochastically drawn from the location network’s distribution
 The reward can be task-dependent (this paper uses classification)
 Uses the policy gradient to optimize (see the sketch after this entry)
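A sketch of the stochastic location policy and the REINFORCE term it needs (the isotropic Gaussian with a fixed sigma is an assumption for illustration):

```python
import numpy as np

def sample_location(mean, sigma=0.1, rng=np.random):
    """The location network outputs `mean`; the glimpse location (the action)
    is sampled from N(mean, sigma^2 I)."""
    loc = mean + sigma * rng.standard_normal(2)
    # REINFORCE needs grad_mean log N(loc; mean, sigma^2 I):
    grad_logp = (loc - mean) / sigma**2   # later scaled by (reward - baseline)
    return loc, grad_logp
```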
 Deterministic Policy Gradient Algorithms [ICML 2014]
 D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, M. Riedmiller
 The deterministic policy gradient is a special (limiting) case of the stochastic policy gradient
 Problem with the stochastic policy gradient: as the policy becomes more and more deterministic, the variance of the policy gradient becomes larger and larger. In the limit the policy collapses to a delta function :point_right: we end up computing the gradient of Q
 The stochastic policy gradient ends up calculating the gradient of Q(s, a), where a is the mean of the policy (assuming the policy is a normal distribution)
 Intuition: update the policy in the direction that most improves Q (formalised in the gradient after this entry)
 The tradeoff of a deterministic policy: exploration :point_right: off-policy deterministic actor-critic (exploration as in Q-learning)
 In continuous action spaces, greedy policy improvement becomes problematic, requiring a global maximisation at every step. Instead, a simple and computationally attractive alternative is to move the policy in the direction of the gradient of Q, rather than globally maximising Q
 David Silver’s talk about deterministic policy gradients at ICML :point_right: very clear!
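That intuition is exactly the deterministic policy gradient theorem from the paper: chain the critic's action gradient through the actor,

```latex
\nabla_\theta J(\mu_\theta)
  = \mathbb{E}_{s \sim \rho^\mu}\!\Big[
      \nabla_\theta \mu_\theta(s)\,
      \nabla_a Q^\mu(s, a)\big|_{a=\mu_\theta(s)}
    \Big]
```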
 Prioritized Experience Replay [ICLR 2016]
 Tom Schaul, John Quan, Ioannis Antonoglou, David Silver
 Uses prioritized sampling rather than uniform sampling
 Use transition’s TD error δ, which indicates how “surprising” or “unexpected” the transition is
 Alleviates the loss of diversity with stochastic prioritization; prioritization introduces bias, which is corrected with importance sampling
 Stochastic prioritization: a mixture of pure greedy prioritization and uniform random sampling
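A sketch of the proportional variant (α and β defaults follow the paper's typical settings; the function shape is illustrative):

```python
import numpy as np

def per_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6, rng=np.random):
    """Sample indices with P(i) proportional to (|delta_i| + eps)^alpha and
    return importance-sampling weights w_i = (N * P(i))^(-beta), max-normalised."""
    p = (np.abs(td_errors) + eps) ** alpha
    probs = p / p.sum()                       # stochastic prioritization
    idx = rng.choice(len(p), size=batch_size, p=probs)
    w = (len(p) * probs[idx]) ** (-beta)      # corrects the introduced bias
    return idx, w / w.max()
```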
 Deep Reinforcement Learning with Double Q-learning [AAAI 2016]
 Hado van Hasselt, Arthur Guez, David Silver
 Deal with overestimation of Q-values
 Separates the Q-network that selects the action from the one that evaluates it (see the target below)
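The resulting Double DQN target: the online network θ selects the action, the target network θ⁻ evaluates it,

```latex
y_t = r_{t+1} + \gamma\, Q\big(s_{t+1},\, \arg\max_a Q(s_{t+1}, a;\, \theta_t);\; \theta^-_t\big)
```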
 :star: Asynchronous Methods for Deep Reinforcement Learning [ICML 2016]
 Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu
 On-policy updates
 Implementation from others: async-rl
 Asynchronous SGD: an explanation of what “asynchronous” means (see the toy sketch after this entry)
 Tuning Deep Learning Episode 1: DeepMind’s A3C in Torch
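A toy illustration of asynchronous (Hogwild-style) SGD: several threads compute gradients against possibly stale snapshots and write to shared parameters without locks. The quadratic loss stands in for the actor-critic loss; everything here is illustrative, not the paper's code:

```python
import threading
import numpy as np

shared_theta = np.zeros(10)                 # shared parameters, updated lock-free

def worker(seed, n_steps=1000, lr=1e-2):
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        theta = shared_theta.copy()         # thread-local (possibly stale) snapshot
        target = rng.standard_normal(10)    # stand-in for an env-dependent target
        grad = theta - target               # gradient of 0.5 * ||theta - target||^2
        shared_theta[:] -= lr * grad        # asynchronous in-place update

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```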
 :star: Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection [arXiv 2016]
 Sergey Levine, Peter Pastor, Alex Krizhevsky, Deirdre Quillen
 [Deep Learning for Robots: Learning from Large-Scale Interaction](https://research.googleblog.com/2016/03/deep-learning-for-robots-learning-from.html)
 :star: Dueling Network Architectures for Deep Reinforcement Learning [ICML 2016]
 Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, Nando de Freitas
 Best Paper in ICML 2016
 Poses the question: is a conventional CNN suitable for RL tasks?
 Two-stream network (state-value and advantage function)
 Focuses on innovating a neural network architecture that is better suited for model-free RL (see the aggregation sketch after this entry)
 Torch blog: Dueling Deep Q-Networks
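The two streams are recombined with a mean-subtracted aggregation (this identifiability trick is from the paper; the tensor shapes are assumptions):

```python
import torch

def dueling_aggregate(value, advantages):
    """Q(s,a) = V(s) + (A(s,a) - mean_a' A(s,a')).
    value: (batch, 1) from the state-value stream,
    advantages: (batch, n_actions) from the advantage stream."""
    return value + advantages - advantages.mean(dim=1, keepdim=True)

q = dueling_aggregate(torch.zeros(2, 1), torch.randn(2, 5))  # (2, 5) Q-values
```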
 Control of Memory, Active Perception, and Action in Minecraft [arXiv 2016]
 Junhyuk Oh, Valliappa Chockalingam, Satinder Singh, Honglak Lee
 Solves problems concerning partial observability
 Proposes Minecraft tasks
 Memory QNetwork (MQN), Recurrent Memory QNetwork (RMQN), and Feedback Recurrent Memory QNetwork (FRMQN)
 :star: Continuous Control With Deep Reinforcement Learning [ICLR 2016]
 Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra
 Solves continuous control tasks and avoids the curse of dimensionality
 Deep version of DPG (deterministic policy gradient)
 When going deep, some issues appear: it is unstable to use nonlinear function approximators
 The different components of the observation may have different physical units, and the ranges may vary across environments :point_right: solved by batch normalization
 For exploration, add noise to the actor policy: μ′(s_t) = μ(s_t | θ_t^μ) + N, where N is a noise process (see the sketch after this entry)
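Two stabilisers from the paper in sketch form: the soft target-network update and the noisy exploration policy (representing parameters as a plain dict of arrays is a simplification; τ, θ, σ follow the paper's reported defaults):

```python
import numpy as np

def soft_update(target, source, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta': slowly tracking targets."""
    for k in target:
        target[k] = tau * source[k] + (1.0 - tau) * target[k]

def exploration_action(mu_s, noise, theta=0.15, sigma=0.2, rng=np.random):
    """mu'(s_t) = mu(s_t) + N, with N an Ornstein-Uhlenbeck process."""
    noise += theta * (0.0 - noise) + sigma * rng.standard_normal(len(noise))
    return mu_s + noise, noise
```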
 Active Object Localization with Deep Reinforcement Learning [ICCV 2015]
 Juan C. Caicedo, Svetlana Lazebnik
 The agent learns to deform a bounding box using simple transformation actions (maps the object detection task to RL)
 Ideas similar to G-CNN: An Iterative Grid Based Object Detector
 Memory-based control with recurrent neural networks [NIPS 2015 Deep Reinforcement Learning Workshop](http://arxiv.org/abs/1512.04455)
 Nicolas Heess, Jonathan J Hunt, Timothy P Lillicrap, David Silver
 Uses an RNN to solve partially observed problems
 Playing Atari with Deep Reinforcement Learning [NIPS 2013 Deep Learning Workshop]
 Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra
Suggested Papers
 Maximum Entropy Inverse Reinforcement Learning [AAAI 2008]