Warren B Powell – Princeton University

(I welcome comments on the thoughts on this webpage – Please enter them here.)

There are a lot of articles appearing about “What is AI” (along with “What is machine learning” and “What is reinforcement learning”) that talk about these terms using vague language. This web page discusses these topics in more concrete terms, but with minimal mathematics.

There are two primary forms in which computers can exhibit intelligence:

1) **Making inferences about the environment** – This is the domain of machine learning, also known as statistics, statistical learning and, more broadly, data sciences.

2) **Making decisions that affect the environment** – This falls under headings such as “reinforcement learning” and “optimization,” but many other terms are used by different communities (dynamic programming, optimal control, and stochastic optimization are a few examples).

**Making inferences about the environment**

Inferences come in three flavors:

a) **Classification** – e.g. identifying the name of a person from an image, words being spoken, or the identity of a song from a recording. This tends to have right vs. wrong answers.

b) **Estimation** – For example, estimating the deterioration of a piece of machinery, or the number of people infected by a disease. Estimates typically imply some error, where we wish to minimize some measure of the errors.

c) **Prediction** – These are estimates about the future, such as the amount of snowfall or the movement of a stock price.

In all three cases, we have observable (input) data x (such as an image, the characteristics of a patient, or a history of stock prices) from which we wish to make inferences about something that is not known (or observable) to us right now (the identity of the image, diagnosing a disease, or future stock prices).

While these are different settings, they all require the same machinery of statistical modeling (all forms, not just neural nets). Central to making inferences is that you have some model f(x|\theta) where “*x*” is a set of known inputs, and f(x|\theta) is a function that predicts a response y. Imagine that you have a set of inputs x^1, ..., x^N, and the corresponding responses y^1,...,y^N . We then want to find the function f(x|\theta), and the parameters \theta, that solves the model fitting problem

\min_\theta \sum_{n=1}^N [y^n-f(x^n|\theta)]^2.

But to determine the best function f(x|\theta), and to find the best \theta, you need a set of these responses. This is known as *supervised learning*.

Statistical models can be classified under three headings:

**Lookup table**– Here “*x*” is discrete (e.g. the gender of a patient) or discretized (e.g. the blood sugar of a patient). A lookup table requires a different estimate (or classification) for each value of “*x.”*“*x*” may have multiple dimensions (age, gender, ethnicity) in which case the number of possible values of “x” can grow dramatically – this is the classic “curse of dimensionality.”**Parametric models**– A simple linear parametric model might write demand as a function of price using D(p|\theta) = \theta_0 - \theta_1 p + \theta_2 p^2. More complex functions might be nonlinear in the parameters such as D(p|\theta) = \theta_0 e^{-\theta_1 p}. Neural networks are a type of nonlinear model, but typically with a*lot*of parameters, which means they need a lot of data.**Nonparametric/locally parametric**– This is a class of advanced models that includes deep neural networks, but could also include models that use simple approximations (perhaps constants or linear models) that only apply to specific regions. These models require more data (and sometimes a lot of data).

**Making decisions**

In the context of this discussion, we identify two broad classes of decisions:

**o Decisions that change the environment.** These decisions might change the physical device (such as moving from one location to another, acquiring/using inventory, taking a drug, or investing in stocks) or setting parameters such as the price of a product. Choosing website designs or restaurant menus fall in this category.

**o Decisions to exchange information. ** It helps to identify two classes of “information decisions”:

**Decisions to acquire information**– We might run a laboratory or field experiment, take a medical image, or simulate an operational problem in the computer to collect information that changes what we know about the environment. This information is then used to make better decisions that directly change the environment.**Decisions to communicate/disseminate information.**We might send a signal, email a coupon or broadcast an ad to inform other decision makers.

Some decisions may change the physical environment while also producing information, such as traversing a network (or making a move in a game) that also allows us to learn about the network (or the response of an opponent).

It helps to have the notion of the “state” of our system which contains all the information needed to make a decision. The state, which we denote S_t, can be a physical state R_t (the location of a vehicle, how much water is in a reservoir), other information I_t (weather, forecasts, prices), and possibly beliefs B_t about quantities we cannot observe directly.

For example, in our demand model D(p|\theta) = \theta_0 - \theta_1 p, we may not know (\theta_0,\theta_1), but we might have estimates along with information about the accuracy of the estimates (such as a mean and variance). We might also have beliefs about the deterioration of a piece of machinery, or the presence of disease in a population.

Pure information collection decisions only change the belief state B_t, which does not directly affect the environment, but will have an impact on decisions that do affect the environment. Problems with just a belief state are pure learning problems, known in some communities as “multiarmed bandit” problems, but there are many problems with physical and/or information states that also have belief states. When decisions may change the belief state, they become *active learning problems.*

We evaluate decisions using a cost or contribution function C(S,x) that, given a “state” *S* (which contains all the information we need to make a decision) and a decision *x* (that you control), C(S,x) is a metric of how well we have done. If we have a deterministic problem, we might be looking for an optimal decision x*.

For sequential problems (and especially sequential problems where there is some form of uncertainty), we are looking for a decision function, or *policy*, for making decisions that we like to write x_t = X^\pi(S_t). Rather than finding optimal decisions, we aspire to find optimal policies, although in practice this is quite rare. The challenge of finding policies is an essential feature of problems where decisions have to be made over time.

There are many communities that address the challenges of making decisions under uncertainty. We refer to these communities as the “jungle of stochastic optimization” using names such as dynamic programming (or Markov decision processes), optimal control (or stochastic control), simulation-optimization, stochastic programming, decision analysis, and stochastic search, along with communities that use names such as multiarmed bandit problems and active learning. However, in recent years one name has caught the attention of the broader public, and that is “reinforcement learning” which emerged from the work in the 1980’s of Rich Sutton and Andy Barto in computer science.

**What is reinforcement learning?**

Recently, the study of sequential decisions have been grouped under a growing umbrella of research called “reinforcement learning.” In the 1990’s, reinforcement learning emerged as a method for solving (approximately) sequential decision problems using the framework of “Markov decision processes.” The idea proceeded by estimating the value Q(s,a) of being in a state “*s*” and taking an action “*a*” using two basic equations:

where r(s^n,a^n) is the reward from being in a state s^n and taking an action a^n (chosen according to some rule), and where we have to simulate our way to the next state s^{n+1} given that we are in state s^n and take action a^n. The parameter \alpha is a smoothing factor (also called a stepsize or learning rate) that may decline over time. These updating equations became known as Q-learning.

Over time, researchers found that Q-learning did not always work (in fact, it often did not work). In our own work, we have found that exploiting structure is important: linearity, convexity, monotonicity, or perhaps just smoothness, are all important properties that can dramatically improve the performance of methods for approximating value functions. My own work in this area (under the umbrella of “approximate dynamic programming”) exploited convexity to model large trucking companies and locomotives for a major railroad.

However, the strategy of approximating value functions (“*Q*” factors in the parlance of reinforcement learning) simply does not work for all problems, and as a result other tools emerged, paralleling tools developed by other communities in the jungle. All of these tools can be organized into two broad classes of policies:

- The policy search class – These are policies that have to be tuned over time to work well in terms of the performance metric (maximizing rewards, minimizing costs, …). This tuning is usually done offline in a simulator, but may be done in the field.
- The lookahead class – These policies work by finding the action now, given the state we are in, that maximizes the one-period reward plus an approximation of the downstream value of the state that the action takes us to (this may be random).

Each of these classes are divided into two additional classes. The policy search class can be divided between policy function approximations (PFAs) which are functions that map a state analytically to an action (this could be a lookup table, a linear model, or a neural network), and cost function approximations (CFAs), which are typically parametrically modified optimization problems (this would include upper confidence bounding, or adding slack when scheduling resources).

The lookahead class can also be divided between policies that depend on value function approximations (VFAs) that approximate the downstream value of landing in a state with some form of approximation (again, this could be a lookup table, a linear or nonlinear statistical model, or a neural network), and direct lookaheads (DLAs) where we optimize over some horizon to make a better decision now (see jungle.princeton.edu for more detailed descriptions). Of course, *Q*-learning falls in the VFA class.

As the reinforcement learning community has broadened the range of problems being tackled, it has grown past Q-learning (a VFA) into other methods using names such as “policy gradients” (used to optimize PFAs), “upper confidence bounding” (a form of CFA), and “Monte Carlo tree search” (a form of DLA).

These four classes of policies (PFAs, CFAs, VFAs and DLAs) cover *every* possible method for making decisions over time that has been used by any of the research communities working on sequential decisions, along with any method used in practice. In other words, I claim that these four classes are *universal*, which means they include whatever you might be using now (without knowing it).

As of this writing, reinforcement learning has been described as spanning a problem class and a set of methods. The problem class is typically described as an agent acting on an environment receiving a reward, which describes every sequential decision problem (with or without uncertainty). The methods have grown to include examples from each of the four classes of policies.

Since these four classes are universal, saying that you are using a “reinforcement learning method” has started to lose its meaning. In fact, “reinforcement learning” is emerging as an umbrella term for anyone using any form of adaptive (or iterative) procedure for solving decision problems. The distinguishing feature of “reinforcement learning” today is that it is making a decision to optimize some metric while interacting with an environment, a description that covers any sequential decision problem.

The different communities in the jungle have begun to recognize that “reinforcement learning” has become the term by which the general public is identifying tools for making decisions. The only question right now is whether this will expand past the class of tools that involve adaptive learning to include older (but very powerful) methods for making decisions such as linear, nonlinear or integer programming. These tools have been evolving since the 1950’s, supported by powerful software packages that make them usable even by people without formal training in these fields.

**The intersection of “learning” and “making decisions.”**

Note that you can use machine learning to make decisions if you are trying to mimic what a human would do. In this case, the observations “*y*” would be actual decisions made in the past. You have a “cost” function which is to minimize the differences between the model f(x|\theta) and the observed decision. However, you need these observations, whereas when you are doing “optimization” (making decisions), you do not need these responses – you just need the performance metric C(S,x).