# What is Reinforcement Learning

Recently, the study of sequential decisions have been grouped under a growing umbrella of research called “reinforcement learning.”  The problem today is that when you say you are going to solve a problem with “reinforcement learning” (or you want to take a course in “reinforcement learning”), not even specialists in the field know what this is.

When reinforcement learning first emerged as a solution approach in the 1980’s (the term dates back to work in psychology in the early 1900s), it was a well defined method for a well defined class of problems (which I review below).  Today, it has broadened to a much wider class of methods, and a much wider class of problems, which evade clear description.

I start with a description offered by Professor Ben van Roy (Stanford) which associates “reinforcement learning” with a community. I then offer an alternative strategy, which is to define a field that I am calling sequential decision analytics which is centered on a clearly defined class of problems that can be solved with clearly defined classes of methods.

## A little history

In the 1990’s, reinforcement learning emerged as a method for solving (approximately) sequential decision problems using the framework of “Markov decision processes.” The idea proceeded by estimating the value $Q(s,a)$ of being in a state “$s$” and taking an action “$a$” using two basic equations:

${\hat q}^n(s^n,a^n) = r(s,a) + \max_{a'} {\bar Q}^{n-1}(s^{n+1},a') \\{\bar Q}^n(s^n,a^n) = (1-\alpha) {\bar Q}^{n-1}(s^n,a^n) + \alpha {\hat q}^n(s^n,a^n)$

where $r(s^n,a^n)$ is the reward from being in a state $s^n$ and taking an action $a^n$ (chosen according to some rule), and where we have to simulate our way to the next state $s^{n+1}$ given that we are in state $s^n$ and take action $a^n$. The parameter $\alpha$ is a smoothing factor (also called a stepsize or learning rate) that may decline over time. These updating equations became known as Q-learning.  They are simple and elegant, and minimize the complexities of modeling sequential decision problems.

Over time, researchers found that Q-learning did not always work (in fact, it often does not work, a statement that applies to virtually every method for solving sequential decision problems). The second edition of Sutton and Barto’s classic book Reinforcement Learning (which appeared in 2018) now includes parameterized policies, upper confidence bounding, Q-learning and a class of lookahead policies known as Monte Carlo tree search.  There is, of course, an intense flurry of research proposing new methods under the umbrella “reinforcement learning.”

## Reinforcement learning through supervised learning

Note that you can use machine learning to make decisions if you are trying to mimic what a human would do.  In this case, the observations “y” would be actual decisions made in the past.  You have a “cost function” which is to minimize the differences between the model $f(x|\theta)$ given input data x and the observed decision y.   This would be viewed as a form of supervised learning where a (typically human) “superviser” would look at the inputs (the state variable) x and then make a decision y.  A standard statistical model would then be used to “predict” the decision.  This is just a form of (supervised) machine learning.

When we are making decisions through optimization, we assume we have a performance metric $C(S,x)$ (a reward to be maximized, or a cost to be minimized), along with a model of the dynamics of the problem.  In this case, we do not need the training dataset (of the y values).  I revisit this below.

## Reinforcement learning as defined by a community I

At a workshop a few years ago, Ben van Roy (a renowned RL researcher at Stanford, and early pioneer of the field) described reinforcement learning as:

• A problem class consisting of an agent acting on an environment receiving a reward.
• A community that identifies its work as “reinforcement learning.”
• The set of methods developed by the community using the methods it self-identifies as “reinforcement learning” applied to the problem class.

This description effectively describes “reinforcement learning” as any method used by the “reinforcement learning community” for solving problems that can be described as an agent acting on the environment receiving a reward.  The community used to be a fairly well defined set of researchers in computer science, but as the visibility of the field grew (through successes solving games such as chess and Go), so did the breadth of the community identifying itself with “reinforcement learning.”

Ben’s characterization of reinforcement learning effectively says that it is whatever the “RL community” is doing to solve a problem that involves “an agent acting on an environment receiving a reward.” I would argue that while this characterization may have been useful years ago, today it is simply too vague.  This is partly because the field has grown very large in terms of the size of the community, the breadth of problems, and the variety of solution methods.  Further complicating the situation is that other communities working in this space (dynamic programming, optimal control, simulation optimization, stochastic search) are going through the same growth in terms of the scope of problems and variety of methods.  Today, the overlap between these communities is significant, taking away any meaning to a statement of solving a problem using “reinforcement learning.”

Below, I am going to make a clear distinction between a problem class, and methods for solving the problem class.  I will start by defining a problem class called “sequential decision problems” that are solved by finding the best policy that can be computed.  I then provide an explicit organization of the different types of policies that make it possible for people to be more specific about the solution approach they are using.  After this, I revisit the problem of defining “reinforcement learning” at the end.

## Sequential decision problems

Sequential decision problems describe any setting where we intermingle making decisions and then making observations that affect the performance of the decision. Any sequential decision problem can be written as the sequence

$(S^0,x^0,W^1,S^1,x^1,\ldots,S^n,x^n,W^{n+1},X^{n+1},\ldots,S^N)$

We are indexing “time” using an iteration counter n, but we could use time t, in which case we would write the state $S_t$, decision $x_t$, and exogenous information $W_{t+1}$.

Each time we make a decision we receive a contribution $C(S^n,x^n)$. We make decisions $x^n$ with a function (or policy) $X^\pi(S^n)$.  For these problems, the decisions $x^n$ could be binary, discrete or continuous, scalar or vector-valued.

The goal is to find the best policy where we might optimize the objective function

$\max_\pi \mathbb{E} \left\{\sum_{n=0}^N C(S^n,X^\pi(S^n))|S^0\right\}$

There are variations of this objective function; here we are maximizing the cumulative reward, but we might also maximize the final reward after a series of learning iterations.

There are five components of any sequential decision problem:

• The state variable $S^n$ that contains whatever we know (or believe) after n iterations.
• The decision $x^n$ which is made using a (yet to be designed) policy $X^\pi(S^n)$.
• The exogenous information (not present in all problems) $W^{n+1}$ that is the information that arrives after we choose $x^n$.
• The transition function which describes how the state variables $S^n$ given the decision $x^n$ and exogenous information $W^{n+1}$.  I like to write this as
$S^{n+1}=S^M(S^n,x^n,W^{n+1})$
• The objective function
$\max_\pi \mathbb{E}\left\{\sum_{n=0}^N C(S^n,X^\pi(S^n))|S^0\right\}$

Recognizing that objective functions can come in different styles, I claim that these five components describe any sequential decision problem.

Both of these optimization formulations involving searching over policies. Searching over policies (which are forms of functions) is comparable to the challenge faced in machine learning where the search is for the best statistical model (which is also a function). However, the class of policies for making decisions is much broader than the set of functions used in machine learning. I describe these next.

## Designing policies

The core challenge of sequential decision problems is the design of policies (modeling uncertainty is also a nice challenge).  There are two broad classes for designing policies:

1. The policy search class – These are policies that have to be tuned over time to work well in terms of the performance metric (maximizing rewards, minimizing costs, …). This tuning is usually done offline in a simulator, but may be done in the field.
2. The lookahead class – These policies work by finding the action now, given the state we are in, that maximizes the one-period reward plus an approximation of the downstream value of the state that the action takes us to (this may be random).

Each of these classes are divided into two additional subclasses. The policy search class includes:

• Policy function approximations (PFAs) – This is the simplest class, and includes analytical functions that map state to action. These include lookup tables (if it is cold, wear a coat; buy-low, sell-high policies in finance), parametric models (bid for $ad-clicks = a-b \times (sales)$), neural networks.
• Cost function approximations (CFAs) – These are parametrically modified optimization problems. For example, let $x$ be the reduction in blood sugar from drug $x$, and let $\bar{\mu}_x$ be our estimate of the reduction for a new patient. We could just choose the drug that maximizes $\bar{\mu}_x$, but we have a lot of uncertainty. Let $\bar{\sigma}_x$ be the standard deviation of this estimate. A good policy for choosing a drug to test would be given by $x=argmax_x\left(\bar{\mu}_x + \theta \bar{\sigma}_x\right)$, where $\theta$ is a tunable parameter. Increasing $\theta$ encourages trying drugs where there is more uncertainty. Parametric CFAs are widely used in practice, since they can be any deterministic optimization problem where tunable parameters have been introduced to handle uncertainty. For example, using a navigation system to estimate the travel time to the destination, but then adding a buffer for traffic delays, is a form of CFA.

The lookahead class can also be divided into two subclasses:

• Policies based on value function approximations (VFAs) – If we are in a state (this might capture how much inventory we have on hand), and take an action (e.g. ordering more inventory) that takes us to a new state, we might estimate the value of being in this new state. This estimate (we can rarely do this exactly) is known as a value function approximation. VFAs are comparable to the Q-factors in Q-learning, and can be approximated using any of the tools from machine learning.
• Direct lookaheads (DLAs) – The easiest example of a DLA is a navigation system that decides whether you should turn or go straight by finding the best path (or at least an approximation) to the destination. This path can be updated periodically during the trip. Another example of a DLA is when you plan several moves into the future when playing a game to determine what move you make now.

These four classes of policies (PFAs, CFAs, VFAs and DLAs) cover every possible method for making decisions over time that has been used by any of the research communities working on sequential decisions, along with any method used in practice. In other words, I claim that these four classes are universal, which means they include whatever you might be using now (without knowing it).  This does not mean I have solved everyone’s problems… these are meta-classes.  Picking any of the four (meta) classes means you still have work to do choosing the precise policy within the meta-class, and you may even design hybrids.

I close by noting that these four classes of policies span all the methods currently used in reinforcement learning for sequential decision problems, but the four classes are much broader, and covers a broader class of problems.  See the decision in my upcoming book Reinforcement Learning and Stochastic Optimization: A unified framework for sequential decisions

## Reinforcement learning as defined by a community II

I are going to revisit Ben’s three characteristics of reinforcement learning, beginning with one fairly major tweak.  I have two issues with his characterization of “an agent acting on the environment receiving a reward:”

• First, we have to define what it means to “act on the environment.”  Is observing the environment an instance of “acting on the environment”?  If someone pulls out a pair of binoculars to observe what types of birds are in an area, is that “acting on an environment”?
• Second, what if our “agent” has only one action to choose from, so she is always making the same action.  I am quite sure that no-one would view that as a reinforcement learning problem.  Implicit in the idea of “acting on the environment” is the idea that there is a choice of actions.  In other words, the agent needs to make a decision.

So, my “tweak” to Ben’s characterization of reinforcement learning would go as follows:

• An agent making a decision that produces a contribution (or cost, or some performance metric).
• A community that identifies its work as “reinforcement learning.”
• The set of methods developed by the community using the methods it self-identifies as “reinforcement learning” applied to the problem class.

Now, replace “reinforcement learning” with the name of any field working in the broad area of sequential decision problems.  This could be dynamic programming, stochastic programming, optimal control (in discrete time), stochastic search (derivative-free or derivative-based), simulation-optimization, or active learning (also known as the multiarmed bandit problem).

Or, we could draw all these communities together by using:

• Problem class: any sequential decision problem (characterized by describing the application setting).
• Methods: any of the four classes of policies (and possibly a hybrid).

With these two dimensions, we can describe work we are doing by a) describing the application context, and then b) describing the class of policy (or policies if developing a hybrid) that you are using.  This is much more meaningful than saying that you are solving a problem with “reinforcement learning.”