Courses in reinforcement learning and sequential decision problems

This webpage is designed to help you develop teaching material to accompany Reinforcement Learning and Stochastic Optimization: A unified framework for sequential decisions.

We can envision the following ways to teach this material: 

Courses I taught at Princeton (full course materials are provided):

  • A graduate course (masters to Ph.D.) for a broad audience from engineering and the sciences 
  • A full course (12-15 weeks) on sequential decision problems at an introductory level (this is a course I taught to undergraduates at Princeton) 

Ideas for new ways of teaching this material 

  • 2-4 lectures within an existing course focusing on a problem domain.
  • Self-taught graduate seminar course – This is for students who want to learn this material when there is not a course available.
  • Non-analytical lectures for an MBA course on supply chain management

If you are interested in joining the community of people interested in teaching this material (in whatever format fits you) please sign up here.

Graduate course in reinforcement learning and stochastic optimization (alternatively, sequential decision analytics)

Description: This course is designed for students who are interested in developing methods, and implementing these methods in software, for solving any sequential decision problem under uncertainty.  These problems range from pure learning problems, to optimal stopping (selling an asset, A/B testing), to complex resource allocation problems that may or may not involve active learning.  The course has a methodological bias, supported by an extensive array of applications.  There is a strong emphasis on proper modeling as a stepping stone to implementation on the computer.

Audience: The course is suitable for students coming from either a methodological background (computer science, operations research) with a strong interest in computation and applications, or from a domain background (any field of engineering, the sciences, economics).  Students should have a good course in probability and statistics, as well as knowledge of programming.  Some problems draw on mathematical programming (e.g. linear/nonlinear/integer programming), but these involve algorithms for which packages are available.

The course is geared toward students with an interest in modeling and computation over theory, but there are theoretical exercises at the end of most (but not all) chapters.  Sections marked with ** contain more advanced material.  The presentation does not use measure theory, but there is one section that offers a “measure theory for beginners” presentation.  The book does, however, provide a solid foundation in modeling that will prove useful whether a student wants to focus on computation or theory.

Readings:

This course is designed to be taught from my book: Reinforcement Learning and Stochastic Optimization: A unified framework for sequential decisions.  It is 1100 pages, and the print version is in black and white.  The e-book version is much easier to use.  If you prefer a printed version, there are websites that offer printing services for as little as $0.05 per page, and the book can be ordered in 3-hole punch format.

The book has 370 exercises organized into seven categories: review questions, modeling questions, problem-solving questions, computational exercises, exercises from the companion book Sequential Decision Analytics and Modeling (see the undergraduate course below), which builds on a series of python modules, and a “diary problem,” which is a problem chosen by the reader at the beginning of the book, with questions to be answered in the context of that problem (this is like “keeping a diary,” since the questions all relate to the same problem).

Lectures: 

Below are the lectures from the course I taught in Spring 2019.  Each lecture is given in pdf and powerpoint formats.

Week 1 – Introduction and overview (pdf) (powerpoint)

Week 2 – Adaptive learning (pdf) (powerpoint)

Week 3 – Derivative-based stochastic optimization (pdf) (powerpoint)

Week 3 – (by post-doc Saeed Ghadimi) – Gradient-based stochastic search for nonconvex problems (pdf)

Week 4 – Derivative-free stochastic optimization Part I: PFAs and CFAs (pdf) (powerpoint)

Week 5 – Derivative-free stochastic optimization Part II: VFAs and DLAs (pdf) (powerpoint)

Week 6 – Notation and the unified modeling framework (pdf) (powerpoint)

Week 7 – Uncertainty and designing policies (pdf) (powerpoint)

Week 8 – Policy function approximations (PFAs) and policy search (pdf) (powerpoint)

Week 9 – Cost function approximations (CFAs) and introduction to Markov decision processes (pdf) (powerpoint)

Week 10 – Approximate dynamic programming and Q-learning (VFAs) (pdf) (powerpoint)

Week 11 – Approximate dynamic programming – Monotonicity and convexity (pdf) (powerpoint)

Week 12 – Lookahead policies (pdf) (powerpoint)

Undergraduate/masters course in sequential decision analytics

Description: This course was taught at Princeton to undergraduates in operations research.  It introduces the general field of sequential decision problems, using a teach-by-example style.  The course follows a downloadable book (see below) that provides the general framework for sequential decision problems, followed by a series of examples that are all modeled using the same style.  The examples were chosen to illustrate the four classes of policies, as well as other issues such as modeling uncertainty.

This course will not provide the comprehensive overview of methodology provided by the graduate course.  It is focused on teaching students how to think clearly about sequential decision problems using a powerful, universal modeling framework.  All of this is done through a series of carefully chosen examples designed to illustrate creating and solving models.  

Audience: The course was originally written for undergraduates in Operations Research and Financial Engineering at Princeton.  The only mathematical prerequisite is a course in probability and statistics.  One chapter (on blood management) uses linear programming, but the material can be covered without prior knowledge of linear programming.

Notes: It is extremely easy to create new lectures built around different applications.  On two occasions I invited students doing senior thesis research with me to help me run a lecture, where they would provide the answers to questions needed to fill in the five elements of the modeling framework.  Students best understand material that is centered on applications they are familiar with, and this is where the material can be easily customized to new problem settings.

Readings:

This course is taught from the downloadable book: Sequential Decision Analytics and Modeling.

Programming: There are exercises at the end of each chapter.  Most chapters are accompanied by a python module that can be found at https://tinyurl.com/sdagithub/.  Often the exercises can be completed without touching the code, but the software was designed to allow students to modify the logic.  I suggest designing questions that fit the programming skills of the audience.

Lectures: 

Below are the lectures from the course I taught in Fall 2018.  Each lecture is given in pdf and powerpoint formats.

Lecture slides:

First half of course (pdf).  Second half of course (pdf)

Individual lectures (powerpoint format):

  • Lecture 1 – Introduction, applications, modeling (powerpoint)
  • Lecture 2 – Adaptive market planning (powerpoint)
  • Lecture 3 – Machine learning (powerpoint)
  • Lecture 4 – Learning diabetes medications (powerpoint)
  • Lecture 5 – Learning policies (powerpoint)
  • Lecture 6 – Stochastic shortest paths I (powerpoint)
  • Lecture 7 – Stochastic shortest paths II (powerpoint)
  • Lecture 8 – Designing policies (powerpoint)
  • Lecture 9 – Energy storage I (powerpoint)
  • Lecture 10 – Energy storage II (powerpoint)
  • Lecture 11 – Energy storage III (powerpoint)
  • Lecture 12 – Uncertainty modeling (powerpoint)
  • Lecture 13 – Uncertainty modeling (powerpoint)
  • Lecture 14 – The beer game I – Introduction (powerpoint)
  • Lecture 15  – The beer game II (powerpoint, spreadsheet)
  • Lecture 16 – The beer game III – Discussion (powerpoint)
  • Lecture 17 – The blood management problem (powerpoint)
  • Lecture 18 – Hotel revenue management (powerpoint)
  • Lecture 19 – Class decision problem – Booking flights (powerpoint)
  • Lecture 20 – Clinical drug trials (powerpoint)
  • Lecture 21 – Student decision problems – Rowing strategies, twitter trading (powerpoint)
  • Lecture 22 – Summary lecture (powerpoint)

Introducing sequential decision analytics in an existing course

Imagine that you are teaching a course in any of the following topics:

  • Financial trading – Economists love Bellman’s equation, but Wall St. traders use a variety of strategies, typically in an ad hoc way.
  • Energy systems (any of a subset of topics) – Energy systems is such a rich topic, but classical training that engineers are given in controls is heavily biased toward deterministic models. Decision problems span design and control of renewables, storage systems, grids, demand response, R&D for new technology, …
  • Supply chain management – Another incredibly rich topic where uncertainty has become especially important.  Simply identifying the decisions is a start, but making decisions in the presence of uncertainty is attracting growing attention.
  • Transportation (traffic, freight transportation, …) – Another tremendously rich topic that has a wide range of decision problems (design and control) along with different forms of uncertainty.
  • Health (public health, personal health, medical decision making) – Health stands out given the importance of information collection (testing, monitoring, collecting samples, ….)
  • Decision analytics for scientists – Scientists have to make many decisions.  There is more to choosing which experiment to run than just pulling experimental choices out of thin air.  Scientists tend to have less training in analytics, but they still need to make decisions that depend on belief models.

For any problem setting, you can present the fundamental approach for sequential decision problems in a few simple steps:

  • Pick a decision problem (something on the simple side) and use it to walk through the five elements of any sequential decision problem: state variables, decision variables, exogenous information (what we observe or learn), the transition function (how the state variable is updated given the decision and what we observe), and the objective function. 
  • I find it best to start by identifying decision variables, performance metrics, and the exogenous information.  I introduce the concept of a policy, and then emphasize “we will design this later.”
  • Explain that the state variable is all the information needed to compute the performance metrics, the decision (including constraints and any information needed by the policy we have not yet designed), and information needed to model the system as it evolves into the future.  Be sure to emphasize that state variables are information, all the information you need (and only the information you need) to model your system as you move forward in time. State variables can include:
    • Physical state variables (inventories, location of a drone, the state of a patient, investments, …). Decisions tend to act on the physical state variables.
    • Other information that may be needed, but which does not belong to the physical state variables (prices, weather, environment, market conditions).
    • Belief state variables – These capture probabilistic information about parameters and quantities we do not know perfectly.  Belief state variables have to include any information we need to update beliefs given observations.
  • The exogenous information is whatever we observe or learn after making a decision.
  • The transition function gives all the equations that describe how the state variables evolve.  This might capture the change in inventory or the location of a drone, changes in the temperature of an experiment or in market prices, and the updating of beliefs from an observation.
  • The objective function captures the performance metric, and how we evaluate our policy (which we still have not designed) over time, averaging over different outcomes.
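The five elements above can be sketched directly in code.  The following is a minimal illustration using a simple inventory problem; all of the names, prices, and the demand distribution are made-up choices for illustration, not taken from any particular application:

```python
import random

# A minimal sketch of the five elements for a hypothetical inventory problem.

# (1) State variable: the current inventory level.
state = {"inventory": 5}

def policy(state, order_up_to=10):
    # (2) Decision: order enough to raise inventory to a target level.
    # The target (order_up_to) is a tunable parameter we "design later."
    return max(0, order_up_to - state["inventory"])

def exogenous_info():
    # (3) Exogenous information: the demand we observe after deciding.
    return {"demand": random.randint(0, 8)}

def transition(state, decision, info):
    # (4) Transition function: how the state evolves given the decision
    # and what we observe.
    inventory = max(0, state["inventory"] + decision - info["demand"])
    return {"inventory": inventory}

def contribution(state, decision, info, price=10, cost=4):
    # (5) Objective function: profit from sales minus ordering cost.
    sales = min(state["inventory"] + decision, info["demand"])
    return price * sales - cost * decision

# Simulate the policy over a horizon and accumulate the objective.
random.seed(0)
total = 0.0
for t in range(20):
    decision = policy(state)
    info = exogenous_info()
    total += contribution(state, decision, info)
    state = transition(state, decision, info)
print(total)
```

Even this toy version makes the roles of the five elements concrete: the policy sees only the state, the exogenous information arrives after the decision, and the transition function is the only place the state is updated.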

Next you typically want to talk about the uncertainties and what you may observe after you make a decision (the exogenous information).

Then you want to talk about ways of making decisions (the policies).  Ideally you have an idea of a simple policy just to illustrate the concept of a policy.  Often (not always) there is some tunable parameter that affects the behavior of the policy.  If this is the case, use your objective function to evaluate the performance of a particular setting of the tunable parameters, something you typically do in a computer simulator (but not always).
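As a sketch of this idea, consider a hypothetical order-up-to inventory policy with a single tunable parameter (the prices and demand distribution below are invented for illustration).  The objective function is estimated in a simple simulator, and the parameter is tuned by grid search over several sample paths:

```python
import random

def simulate(order_up_to, horizon=50, seed=0, price=10, cost=4):
    # Evaluate one setting of the tunable parameter by simulating
    # the policy over a horizon and summing the contributions.
    rng = random.Random(seed)
    inventory, total = 0, 0.0
    for t in range(horizon):
        decision = max(0, order_up_to - inventory)         # the policy
        demand = rng.randint(0, 8)                         # exogenous info
        sales = min(inventory + decision, demand)
        total += price * sales - cost * decision           # objective
        inventory = max(0, inventory + decision - demand)  # transition
    return total

def tune(thetas, n_samples=30):
    # Policy search: grid-search the tunable parameter, averaging over
    # several simulated sample paths to smooth out the noise.
    return max(thetas,
               key=lambda th: sum(simulate(th, seed=s) for s in range(n_samples)))

best_theta = tune(range(0, 13))
print(best_theta)
```

The point of the sketch is the structure, not the numbers: the simulator plays the role of the objective function, and tuning the parameter is itself a (derivative-free) stochastic search problem.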

At this point, it is useful to open up to a discussion of all four classes of policies.  Again, it is helpful to illustrate these with an example, but this depends on your ability to identify problems that are well suited to each class of policy.

You may wish to switch to a simple example such as an inventory problem, which is a kind of universal problem (we all have to solve our own inventory problems).  It is easy to illustrate solving inventory problems with all four classes of policies.  Which works best is very dependent on the characteristics of the problem.

It should be possible to cover this material in two 90-minute lectures.  I then suggest returning to this framework during the course as the need to make decisions arises.

Self-taught graduate seminars

I believe this material is so important that graduate students who do not have access to a course that teaches this material may want to run a seminar on their own.  Typically this would involve a weekly meeting of about 90 minutes.  I suggest the following topics:

  • Week 1 – Discuss different settings in which sequential decisions arise that are familiar to the group.  Discuss: decisions, metrics and exogenous information (what you learn/observe after making a decision).  Then introduce the state variable, which is all the information you need to compute your metrics, make your decision (which can reflect constraints as well as any other information needed in a policy that has not yet been designed), and model the evolution of the state variables.  Note that the design of state variables is iterative – it evolves as you model the system.  You might jump forward to my section 9.9 for an illustration using energy storage.  Mention the four classes of policies, which you will return to repeatedly.
  • Week 2 – Assign as outside reading to skim section 2.1 that covers 15 different classes of problems.  Discuss in class any that might be of special interest to the class, but otherwise I would not use class time on this.  Use week 2 to cover key elements of Chapter 3 – Online learning.  I suggest covering lookup tables (frequentist and Bayesian), correlated beliefs, a brief discussion of hierarchical models, then linear models, and then sampled beliefs for nonlinear models.  You will probably also want to touch on neural networks.
  • Week 3 – Cover chapter 4, including section 4.2 (so-called stochastic models that only need deterministic methods), section 4.3 (sampled belief models – simple idea, but this is very important), and then adaptive learning in section 4.4, which is the focus of the rest of the book.
  • Week 4 – Chapters 5 and 6 – Derivative-based stochastic search.  Cover the basic idea of a stochastic gradient algorithm, and illustrate how a stochastic gradient may be computed (e.g. for the simple newsvendor problem).  Mention that we can use numerical derivatives (I would introduce SPSA).  Then introduce the need for the stepsize, after which I would jump to chapter 6 (on stepsizes).  Introduce basic deterministic stepsizes and one or two stochastic stepsize rules.  I would not go through the optimal stepsize rules – it is just important to know this material is there.
  • Week 5 – Chapter 7 – Derivative-free stochastic search.  This is a big chapter.  You need to introduce what it means to do derivative-free stochastic search, and touch on the concept of cumulative and final reward objectives.  Now introduce the four classes of policies again.  You will find it easiest to illustrate the PFA, CFA and possibly some form of DLA such as the knowledge gradient.  You should introduce the concept of the multiarmed bandit problem.  I would skip section 7.6 on VFA-based policies, but note that this section introduces Gittins indices, which were the inspiration for the UCB policies (section 7.5), which are very popular in the tech sector.  The knowledge gradient (section 7.7.2) captures the value of the information from an experiment.  This is useful for expensive experiments and small budgets.  You may find you need two weeks for chapter 7.
  • Week 6 – The class should skim chapter 8 (a series of examples) before class.  Then use this class for chapter 9, my big modeling chapter.  Review the five elements of a sequential decision problem and its application to a simple inventory in section 9.1.  Note that for simple problems, this is all you need.  The rest of the chapter provides the foundation for modeling complex problems.  How thoroughly you cover this material depends on the interests of the class.  Be sure to touch on belief states and how you update belief states (very important for any problems that involve learning).  I would definitely try to cover the energy storage extensions in section 9.9, but then stop here.
  • Week 7 – Chapter 10 focuses on modeling uncertainty.  This is where it would help if the students pick a problem on their own, and then review the 12 classes of uncertainties in section 10.1.  Then jump to the COVID example in section 10.2.  The rest of the chapter addresses the challenges of modeling uncertainty and the technique of Monte Carlo simulation.  This is important, but students may have prior background in this.
  • Week 8 – Chapter 11 is a major chapter – this is where I review the four classes of policies in more depth, and then cover how to choose from among the classes.  This discussion is important, but is highly problem dependent.  Be sure to touch on section 11.7 on hybrid policies, where I review six examples of policies that combine two or more of the four classes into a single policy.  It might help to pick a specific problem such as the energy storage problem in section 11.9, where we illustrate all four classes of policies, and then show that any of them (plus a hybrid) can work best depending on the data.
  • Week 9 – Chapter 12 – Here we introduce the simplest policies, PFAs, and discuss parameter tuning.  Here we are just using the methods of chapters 5 and 7, but the review is useful, since it is done in the context of parameter tuning.  Parameter tuning ends up being a recurring problem, since you are almost always introducing some form of approximation in any of the four classes of policies, and approximations tend to lead to tunable parameters.
  • Week 10  – Chapter 13 – Here we cover the concept of parametric cost function approximations in considerably more detail.  This is an important class of policies since it is widely used in practice, but largely ignored by the academic literature.  See the energy storage example with time-dependent demands and rolling forecasts.  If anyone is looking for a good area of research, there are incredible opportunities, partly because it is such a useful strategy, but also because there are unresolved methodological issues.
  • Weeks 11-12 – There are five chapters on value function approximations, often equated (incorrectly) with “reinforcement learning.”  You could easily spend half of a real course (meeting twice each week) on this material.  I suggest:
    • Cover equation 14.2 in chapter 14, which is basic backward dynamic programming.  Be sure to introduce the one-step transition matrix, and note that it is almost never computable.  Note that if we could compute one-step transition matrices, we could solve every sequential decision problem with equation 14.2 (I am not a big fan of dynamic programming for steady state problems – in my entire career, I have never encountered a real application that was properly modeled as a steady state problem).
    • Then jump to chapter 15 (backward ADP) and cover section 15.1, which is how to compute equation 14.2 approximately.  
    • Now jump to section 15.3.1 which introduces the idea of using linear models for value function approximations.  Mention that you could use a deep neural network here, but it would require a large sample.  Be sure to take a quick look at the experimental work in section 15.4.  There is not much theory supporting backward ADP, but so far it has worked in every problem I have tried!
    • Next jump to sections 16.1 and 16.2 which describes how to fit a VFA using a fixed policy.
    • Then jump to sections 17.1, 17.2 (skim 17.3), 17.4, and 17.5.  Section 17.6 provides two examples that students might find useful.
    • Chapter 18 is advanced material specifically for more complex resource allocation problems.  The interest in this material is very dependent on the makeup of the class.  This material could easily require an entire session.
  • Week 13 – Chapter 19 covers direct lookahead policies.  This chapter is divided into three parts.  Sections 19.1 through 19.5 provide background material which should be skimmed.  Section 19.6 introduces deterministic lookaheads, including parameterized deterministic lookaheads (we first saw this in the energy storage problem in chapter 13).  Then, I would cover section 19.7, which introduces stochastic lookahead models, and discusses each of the four classes of policies to be used in the stochastic lookahead.  The remainder of the chapter covers Monte Carlo tree search (for problems with discrete decisions) and stochastic programming (for problems with vector-valued decisions).  This is specialized material that depends on the background and interests of the class.

There are entire books written on specific strategies for solving stochastic lookaheads within a DLA policy.  We effectively cover these in just a few sections by standing on the shoulders of all the other chapters.

  • Week 14 – Chapter 20 on Multiagent systems – While the interest in this material depends on the interests of the students, there is some very important material in this chapter.  I recommend covering sections 20.1 and 20.2.  Section 20.3 on POMDPs is advanced material, which points out the limitations of the very substantial literature on POMDPs, but again, the interest in this material depends on the makeup of the class.  Our position is that the two-agent model (environment plus controller) should be used to model problems where we cannot observe the environment perfectly.  Sections 20.4, 20.5 and 20.6 are on more specialized topics that depend on the interests of the class.
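The stochastic gradient step from Week 4 can be illustrated with the newsvendor problem.  The sketch below is my own minimal version, not the book's: the price, cost, demand distribution, and stepsize rule are all arbitrary choices for illustration.  We order a quantity x, observe demand W, collect profit F(x, W) = p·min(x, W) − c·x, and step along the sample gradient:

```python
import random

def newsvendor_gradient(x, demand, price=10, cost=4):
    # Sample gradient of the profit F(x, W) = price*min(x, W) - cost*x
    # with respect to the order quantity x, for observed demand W:
    # if demand exceeds x, one more unit would have earned price - cost;
    # otherwise one more unit would just have cost us.
    return (price - cost) if demand > x else -cost

random.seed(1)
x = 0.0
for n in range(1, 5001):
    demand = random.uniform(0, 100)   # exogenous demand sample
    alpha = 5.0 / n                   # a simple harmonic stepsize rule
    x += alpha * newsvendor_gradient(x, demand)
    x = max(0.0, x)                   # orders cannot be negative
print(x)
```

With these illustrative numbers, the optimal order quantity satisfies P(W > x) = cost/price, so x should drift toward 60; the slow crawl of the harmonic stepsize is itself a good motivation for the stepsize discussion in chapter 6.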

Non-analytical lectures for MBAs

MBA students are often working on complex problems such as supply chain management, where simply identifying the core elements of the problem can be a serious challenge.  I think the universal modeling framework provides guidance in how to think about complex problems, even if the ultimate goal is not to create a mathematical model that can be implemented.  While I still think it can be useful to introduce the basic notation of state variables, decision variables, exogenous information, transition function and objective function, the presentation should transition to the following questions:

  • What decisions need to be made (and who is, or should be, making the decisions)?
  • What performance metrics are being evaluated?
  • What are the different sources of uncertainty, both in terms of knowledge of quantities and parameters, and what information is learned or observed after a decision is made?

For simple problems, answering these questions is easy.  But for a complex problem such as a supply chain, they are not obvious and are worthy of discussion.  For example, imagine that you want to build a more robust supply chain.  You have to start with: What actions might you take?  This is the same as asking: What is the set of decisions?  Analytics can help figure out which decision to take, but a computer cannot come up with the set of decisions over which to optimize.

A very rich topic involves identifying sources of uncertainty.  Again – while advanced analytics can help with quantifying uncertainty, we still need people with domain knowledge to come up with the list of different sources of uncertainty.