Reinforcement Learning and Stochastic Optimization

Reinforcement Learning and Stochastic Optimization: A unified framework for sequential decisions is the first textbook to offer a comprehensive, unified framework of the rich field of sequential decisions under uncertainty. Up to now, this rich problem class has been fragmented into at least 15 distinct fields that have been studied under names such as dynamic programming, stochastic programming, optimal control, simulation optimization, optimal learning, and multiarmed bandit problems. Recently these have been primarily associated with “reinforcement learning” and “stochastic optimization.”

Audience: The book requires little more than a good course in probability and statistics/machine learning (and supporting linear algebra). There are occasional forays that draw on linear programming. The presentation is designed for people who want to plan sequential decision problems, with an emphasis on modeling and computation. Applications are drawn from numerous topics in engineering (electrical, civil, mechanical, chemical), physical sciences, computer science, social sciences, economics and finance, operations research and industrial engineering, business applications and statistics/machine learning.

Major book themes and features:

  1. Central to the book is that we are addressing sequential decision problems: decision, information, decision, information, … which combines making decisions over time (or iterations/experiments), under uncertainty. Every problem addressed in the dramatically growing literature on reinforcement learning can be described in this way.
  2. We use a universal modeling framework with five-parts (state variables, decision variables, exogenous information, transition function, and objective function) that applies to any sequential decision problem (the framework draws heavily from the optimal control literature). Central to the framework is optimizing over policies.
  3. We introduce four (meta)classes of policies (PFAs, CFAs, VFAs and DLAs) that we claim are universal – they include any method proposed in the literature, or used in practice. The policies are distinguished by their computational characteristics. Each of the four classes can produce an optimal policy for special problem classes, but this is rare.
  4. The book is written for people who want to write software for real-world problems, using methods that scale not just in terms of size, but also complexity. Mathematical notation is designed to be translated directly into software.
  5. To help first-time readers, many sections are marked with * indicating they can be skipped on a first-time through the book. Sections marked with ** indicate more advanced mathematics (we do not use measure-theoretic terminology, but we have a section that introduces the interested reader who would like to learn it).
  6. There are over 370 exercises, organized into seven categories at the end of each chapter. These consist of review questions, modeling, computation, theory, problem solving, problems drawn from the companion volume Sequential Decision Analytics and Modeling (with supporting python modules), and finally a ““”diary” problem that the reader chooses in chapter 1, and then uses to answer a question at the end of each chapter. This allows readers to work on an application familiar to them.
  7. There is a supplementary materials webpage here.

Chapter summaries:

Table of contents

Chapter 1 – Introduction – Includes table of contents – Updated third pass. The book presents a new universal framework for sequential decisions under uncertainty, where the focus is designing policies (methods for making decisions). We present four classes of policies that cover all possible methods. Chapter 1 provides a one-chapter overview of this entire framework.

Chapter 2 – Canonical models and applications – Updated third pass. This chapter provides very brief summaries of 14 different fields that all deal with sequential decisions. We give the five-part universal framework in a bit more detail, and a series of examples.

Chapter 3 – Online learning – Updated third pass. This is a single-chapter tutorial of all the major methods for online (that is, adaptive) learning, spanning lookup tables (with independent and correlated beliefs), parametric and nonparametric models (including neural networks).

Chapter 4 – Introduction to stochastic search – Updated third pass. This chapter talks about three strategies for solving different types of sequential decision problems: methods using deterministic mathematics (where you can compute expectations), sampled methods (where we approximate expectations with samples), and adaptive methods, which dominates the rest of the book

Chapter 5 – Derivative-based stochastic optimization – Updated third pass. This is classic material on stochastic gradient methods, but includes a description of how to model a stochastic gradient algorithm as a sequential decision problem (the decision is the stepsize).

Chapter 6 – Stepsize policies – Updated third pass. An entire chapter dedicated to stepsize policies. We cover deterministic policies, adaptive (stochastic) policies, and then some results on optimal policies.

Chapter 7 – Derivative-free stochastic optimization – Updated third pass. We describe derivative-free stochastic search (also known as a multi-armed bandit problem) using our universal framework (since it is a sequential decision problem), and then outline examples of all four classes of policies that have been applied to this broad problem class.

Chapter 8 – State-dependent problems – Updated third pass. This chapter provides a tour of a wide range of general dynamic problems which we call “state-dependent” problems, to distinguish them from “state-independent” problems which are the pure learning problems of chapters 5 and 7.

Chapter 9 – Modeling sequential decision problems – Updated third pass. This chapter presents the five-part universal modeling framework, first in the context of simple problems, and then a much more detailed presentation that provides the foundation for modeling much more complex problems (think of energy systems, transportation systems, health applications, supply chains).

Chapter 10 – Uncertainty modeling – Updated third pass. We describe 12 sources of uncertainty, and then provide a brief tutorial into Monte Carlo sampling and uncertainty quantification.

Chapter 11 – Designing policies – Updated third pass (just updated July 7 – See section 11.10 on choosing across policy classes). This is a much more in-depth presentation of the four classes of policies, and includes guidance on how to choose a policy class for a problem you might be working on. Chapters 12-19 cover each of the four classes in depth (chapters 14-18 are dedicated to value function approximations).

Chapter 12 – Policy function approximations and policy search – Updated third pass. We discuss the idea of PFAs in much greater depth (PFAs consist of any function class covered in chapter 3 for online learning), and describe four methods for performing a search over tunable parameters spanning derivative-free stochastic search, and three types of derivative-based methods: numerical derivatives, backpropagation (for control problems), and the policy-gradient method.

Chapter 13 – Cost function approximations – Updated third pass. Cost function approximations are parameterized optimization models. Widely used in industry on an ad-hoc basis, they have been largely overlooked by the research literature. This book is the first to deal with them as a fundamental class of policy. CFAs feature much simpler scaling issues than the simpler PFAs.

Chapter 14 – Exact Dynamic Programming – Updated third pass. This is classic material on Markov decision processes, as well as some other examples of dynamic programs that can be solved exactly, including linear quadratic regulation from optimal control.

Chapter 15 – Backward approximate dynamic programming – Updated third pass. Backward approximate dynamic programming is a relatively recent methodology (it parallels fitted value iteration for infinite horizon problems), but we have had considerable success with it. It overcomes problems with high-dimensional and/or continuous states and uncertainties and, for some problems, high-dimensional and/or continuous decisions.

Chapter 16 – Forward ADP I: The value of a policy – Updated third pass. This is from my 2011 ADP book, and introduces, for finite and infinite horizon problems, classical TD(\lambda), approximate value iteration (single pass and double pass), LSTD and LSPE, projected Bellman minimization for linear architectures, Bayesian learning for value functions, designing stepsizes for approximate value iteration.

Chapter 17 – Forward ADP II: Policy optimization – Updated third pass. Also from my 2011 ADP book, here we describe the rich methods for approximating value functions while simultaneously searching for policies. This, with chapter 16, is the material that was the original heart of reinforcement learning before the introduction (in this book) of the four classes of policies.

Chapter 18 – Forward ADP III: Convex functions – Updated third pass. This is methods for approximate dynamic programming where we exploit convexity (concavity for maximizing), which arises frequently with high-dimensional resource allocation problems. We show how to use Benders cuts, piecewise linear separable and linear (in the resource variable) value function approximations that have been applied to real resource allocation problems.

Chapter 19 – Direct lookahead policies – Updated third pass. We cover both deterministic lookahead policies (most often associated with “model predictive control”) and policies based on stochastic lookahead models. We describe six classes of approximation strategies, and then illustrate all four classes of policies in the context of policies to solve a lookahead model (the “policy-within-a-policy”).

Chapter 20 – Multiagent modeling and learning – Updated third pass. Here we adopt our universal framework and illustrate it in the context of the rich domain of multiagent problems, beginning with two-agent problems for learning where we present our perspective on POMDPs.

Index – Check out the entry “Applications”

The roots of the book:

Note: This book used my 2011 book, Approximate Dynamic Programming: Solving the curses of dimensionality as a starting point.  Chapter 3 on online learning evolved from chapter 8 in ADP on approximating value functions. The modeling chapter 5 from ADP is now chapter 9 (with  major modifications) in RLSO.  Chapter 14 in RLSO is based on the old chapter 3 on Markov decision processes.  Chapter 15 is an entirely new chapter on backward approximate dynamic programming. Chapters 16-18 are based directly on the chapters in the ADP book for approximating value functions (they are now labeled as “Forward approximate dynamic programming”).