Hands-On Intelligent Agents with OpenAI Gym

Deep reinforcement learning

With a basic understanding of reinforcement learning, you are now in a better state (hopefully not a strictly Markov state, in which you would have forgotten the history of everything you have learned so far) to understand the basics of the cool new suite of algorithms that has been rocking the field of AI in recent times.

Deep reinforcement learning emerged naturally when people made advancements in the deep learning field and applied them to reinforcement learning. We learned about the state-value function, the action-value function, and the policy. Let's briefly look at how they can be represented mathematically or realized through computer code. The state-value function V(s) is a real-valued function that takes the current state s as input and outputs a real number (such as 4.57). This number is the agent's prediction of how good it is to be in state s, and the agent keeps updating the value function based on the new experience it gains. Likewise, the action-value function Q(s, a) is also a real-valued function, which takes an action a as input in addition to the state s, and outputs a real number. One way to represent these functions is to use neural networks, because neural networks are universal function approximators, capable of representing complex, non-linear functions. For an agent trying to play an Atari game by just looking at the images on the screen (like we do), the state s could be the pixel values of the image on the screen. In such cases, we could use a deep neural network with convolutional layers to extract visual features from the state/image, and then a few fully connected layers to finally output V(s) or Q(s, a), depending on which function we want to approximate.

Recall from the earlier sections of this chapter that V(s) is the state-value function, which provides an estimate of the value of being in state s, and Q(s, a) is the action-value function, which provides an estimate of the value of each action a given the state s.
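To make this concrete, here is a minimal sketch in PyTorch of such a value network, assuming an Atari-style observation made of four stacked 84x84 grayscale frames and a discrete action space. The class name AtariQNetwork, the layer sizes, and the six-action example are illustrative assumptions for this sketch, not an implementation prescribed by the text:

```python
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """A small convolutional network that maps a stack of screen frames
    to one value per discrete action, approximating Q(s, a)."""
    def __init__(self, num_actions, in_channels=4):
        super().__init__()
        # Convolutional layers extract visual features from the state/image
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # Fully connected layers map the extracted features to one Q-value per action
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, state):
        # state: a batch of frame stacks with shape (batch, in_channels, 84, 84)
        return self.head(self.features(state))

# Example usage: Q-value estimates for one random "screen" observation
q_net = AtariQNetwork(num_actions=6)
dummy_state = torch.rand(1, 4, 84, 84)
q_values = q_net(dummy_state)   # shape (1, 6): one estimate per action
```

A network approximating V(s) instead of Q(s, a) would look the same, except that the final layer would output a single number for the state rather than one number per action.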

If we do this, then we are doing deep reinforcement learning! Easy enough to understand? I hope so. Let's look at some other ways in which we can use deep learning in reinforcement learning. 

Recall that a policy is represented as π(s) in the case of deterministic policies, and as π(a|s) in the case of stochastic policies, where the action a could be discrete (such as "move left," "move right," or "move straight ahead") or continuous (such as "0.05" for acceleration, "0.67" for steering, and so on), and can be single- or multi-dimensional. Therefore, a policy can be a complicated function at times! It might have to take a multi-dimensional state (such as an image) as input and produce a multi-dimensional vector of probabilities as output (in the case of stochastic policies). So, this does look like it will be a monster function, doesn't it? Yes it does. That's where deep neural networks come to the rescue! We could approximate an agent's policy using a deep neural network and directly learn to update the policy (by updating the parameters of the deep neural network). This is called policy optimization-based deep reinforcement learning, and it has been shown to be quite efficient in solving several challenging control problems, especially in robotics.
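As a rough illustration (a sketch under assumed details, not a prescribed architecture), the following PyTorch module represents a stochastic policy π(a|s) over three discrete actions. The state dimension, hidden size, and the name DiscretePolicyNetwork are made up for the example:

```python
import torch
import torch.nn as nn

class DiscretePolicyNetwork(nn.Module):
    """A small network representing a stochastic policy pi(a|s):
    it maps a state vector to a probability for each discrete action."""
    def __init__(self, state_dim, num_actions, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, state):
        logits = self.net(state)
        # Softmax turns the raw scores into a probability distribution over actions
        return torch.softmax(logits, dim=-1)

# Example usage: sample one of three actions
# ("move left", "move right", "move straight ahead") from the policy
policy = DiscretePolicyNetwork(state_dim=8, num_actions=3)
state = torch.rand(1, 8)          # a made-up 8-dimensional state vector
action_probs = policy(state)      # e.g. tensor([[0.31, 0.42, 0.27]])
action = torch.distributions.Categorical(action_probs).sample()
```

Policy optimization methods update the parameters of such a network directly, so that actions with higher long-term returns become more probable under the policy.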

So, in summary, deep reinforcement learning is the application of deep learning to reinforcement learning, and so far researchers have applied it successfully in two ways: one is to use deep neural networks to approximate the value functions, and the other is to use a deep neural network to represent the policy.

These ideas have been around since the early days, when researchers were trying to use neural networks as value function approximators, even back in 2005. But they rose to stardom only recently because, although neural networks and other non-linear value function approximators can better represent the complex values of environment states and actions, they were prone to instability and often led to sub-optimal value functions. Only recently have researchers such as Volodymyr Mnih and his colleagues at DeepMind (now part of Google) figured out the trick of stabilizing the learning and trained agents, with deep non-linear function approximators, that converged to near-optimal value functions. In the later chapters of this book, we will in fact reproduce some of their then-groundbreaking results, which surpassed human Atari game-playing capabilities!