We have a very good handle on modeling deterministic optimization problems, but the universe of problems that involve sequencing decisions and information have resisted being expressed in the kind of common canonical framework that has become universal in mathematical programming.
Stochastic optimization problems are so diverse that the field has become fragmented into a Balkanized set of communities with competing algorithmic strategies and modeling styles. It is easy to spend a significant part of a career mastering the subtleties of each perspective on stochastic optimization, without recognizing the common themes.
We have developed a unified framework that covers all of these books. We break all sequential decision problems into five components: state variables, decision variables, exogenous information variables, the transition function and the objective function, where we search over policies (functions) for making decisions. We have identified two core strategies for designing policies (policy search, and policies based on lookahead approximations), each of which further divides into two classes, creating four classes that cover all of the solution approaches in the books illustrated above.
There is a growing community, which started in computer science, which uses the term “reinforcement learning” to refer to sequential decision problems. Click here for my explanation of “artificial intelligence,” “machine learning,” and “reinforcement learning.”
Our approach does not replace any prior work: rather, it brings all of this work together, and helps to identify opportunities for cross-fertilization. A good way to think of the unified framework is “standing on the shoulders of giants.”
All of the material below, including the material for the graduate course, is written for students with an undergraduate course in probability and statistics. Occasionally a second course in probability may be useful, and I occasionally present problems that use linear programming, but in a way that can be read without a full course in math programming (there may be occasional exceptions).
- W.B. Powell, The Next Generation of Optimization: A Unified Framework for Dynamic resource Allocation Problems, to appear in a volume on Optimization in Large Scale Problems, Springer. – This is a five-page introduction to the unified framework.
- W.B. Powell, A unified framework for stochastic optimization, European Journal of Operational Research, Vol. 275, No. 3, pp. 795-821 (2019). – An invited review article to appear in European J. Operational Research. This is my latest attempt to pull this field together in a 25 page tutorial.
- I teach this material in an undergraduate course at Princeton: Sequential Decision Analytics and Modeling.
- Course webpage, with a complete set of lecture notes in powerpoint, is available at: http://www.castlelab.princeton.edu/orf-411/.
- There is an online textbook for the course at Sequential Decision Analytics and Modeling course notes. Cite this reference as: Warren B. Powell, Sequential Decision Analytics and Modeling, Department of Operations Research and Financial Engineering, Princeton University, 2019. A compact URL for the notes is http://tinyurl.com/sequentialdecisionanalytics
- You are invited to contribute chapters to the book. An editable version is available at http://tinyurl.com/sequentialdecisionanalyticspub
- Python modules that accompany “Sequential Decision Analytics and Modeling.” – There is a python module for most of the chapters. Most of these have been used in an undergraduate course at Princeton.
- I will make the problem sets available soon.
- There is also a graduate level course on Stochastic Optimization and Learning.
- The webpage for the course, complete with all lecture notes in powerpoint, is at http://www.castlelab.princeton.edu/orf-544/
- A book (in progress) written entirely around this framework can be downloaded from Reinforcement Learning and Stochastic Optimization: A Unified Framework (updated June 19, 2019) – This is a book (in progress ~700 pages) that is designed entirely around the unified framework. This book is used in a graduate course on stochastic optimization. Cite this reference as: Warren B. Powell, Reinforcement Learning and Stochastic Optimization and Learning: A Unified Framework, Department of Operations Research and Financial Engineering, Princeton University, 2019.
- A very short presentation illustrating the jungle of stochastic optimization (updated April 12, 2019). The last slide shows the evolution of seven major communities from their origin using one of the four classes of policies, to where they stand now (using two, three or often all four classes of policies.
Some presentations on the unified framework:
- Unified Framework for Optimization under Uncertainty in Transportation and Logistics – Presentation given at the workshop for Transportation Science and Logistics at the University of Vienna, Austria on the sharing economy – July 15, 2019.
- Talk on the unified framework given to the Army Research Office workshop on “reinforcement learning” April 12, 2019. This is a 20 minute talk on the unified framework.
- Click here for a list of past and pending talks. If you are interested in me giving a talk at your school or company, email me at email@example.com.
Some key points
- State variables – We claim that all properly modeled dynamic systems are Markovian, and provide a teachable, implementable definition that students can use to guide their efforts to model these systems (see chapter 9 in Stochastic Optimization and Learning). (If you have a counterexample, email it to me at firstname.lastname@example.org).
- A dynamic program is a sequential decision problem (it is not a method). Bellman’s equations are a) conditions for an optimal policy and b) a path to designing good policies.
- We introduce the distinction between the base model, which is our model of the problem we are trying to solve, and a lookahead model, widely used as one class of policy (especially in stochastic programming). The base model is often referred to as a “simulator” without recognizing that this is the problem we are trying to solve. The base model is comparable to the objective function in a deterministic optimization model.
- When solving a deterministic optimization problem, we want to find the best decision (often a real-valued vector). When solving a stochastic optimization problem, we typically want to find the best function (known as a policy). Notable exception is asymptotic stochastic search problems (min_x E F(x,W)), but practical stochastic search problems are solved in finite time (which requires searching over policies).
- There are two core strategies for finding policies: policy search, and policies based on approximations of a lookahead model. Each of these can be further divided into two strategies, producing four fundamental classes of policies that have been used to solve a wide range of dynamic programs (only one of which uses value functions).
- A multistage stochastic program (using scenario trees) is a lookahead model for designing a policy (called a lookahead policy) for solving dynamic programs. An optimal solution of a lookahead model is generally not an optimal policy (even if it is hard to solve).
- “Robust optimization” (for sequential problems) is a lookahead policy (using a min-max objective) to solve a dynamic program where the objective may be formulated using an expectation (which is what is implied if you simulate the policy and average the results), or the worst case (which is consistent with the robust concept) or any other risk measure.
Our canonical model
The slides below provide a thumbnail sketch of the central idea behind our modeling and algorithmic strategy.
The first slide below contrasts the standard canonical form for a multiperiod, deterministic linear program (on the left), and the approach we are advocating to be the canonical form for a sequential, stochastic optimization problem:
The key idea is that when working on sequential stochastic problems, you are searching for the best function (known as a policy), rather than the more familiar problem of finding the best real-valued scalar or vector.
This raises the obvious question – how do you search over a space of functions???
We have taken a practical path for solving this question. We have identified four fundamental (meta) classes of policies, called PFAs, CFAs, VFAs and DLAs (direct lookaheads). These are summarized on the following slide:
We let \pi[\latex] be the variable that captures both the type of function, and any parameters that determine the behavior of the function. A tunable parameter could be a regression parameter, or variables that determine the planning horizon, number of stages and number of scenarios.
In the equations above, we use tildes on variables that describe the lookahead model to avoid confusing them with the base model. This is only an issue when we are using lookahead policies.
Note that these four classes of policies span all the standard modeling and algorithmic paradigms, including dynamic programming (including approximate/adaptive dynamic programming and reinforcement learning), stochastic programming, and optimal control (including model predictive control).
It is important to recognize that even for the same problem, each of the four classes of policies may work best, depending on the data. We demonstrate this in the context of a simple energy storage problem. We created five problems, which we solved using each of the four classes of policies, plus a hybrid. In the graph below, we report the performance of each policy as a fraction of the posterior optimal (the best you could do with perfect information). Each policy works best on one of the five problems.
We prefer the approach universally used in deterministic optimization where we formulate the problem first, and then we design methods to find a solution (in the form of a policy). But this requires accepting that in sequential problems, we are not looking for decisions (as we do in deterministic models), but rather functions (or policies). Classical strategies in stochastic optimization (which are described using familiar labels such as dynamic programming, stochastic programming, robust optimization and optimal control) actually represent particular classes of policies. So, to solve a problem using one of these approaches means you are actually choosing the class of policy before you have even started modeling the problem.