Warren B. Powell
I welcome comments to this webpage at https://tinyurl.com/RLSOcomments.
For far too long, I have been struggling with the issue of defining state variables for sequential decision problems. I am finally venting to help clean up the mess in the various communities that use state variables.
Some “definitions” of state variables:
- Bellman’s seminal text [Bellman (1957), p. 81] says “… we have a physical system characterized at any stage by a small set of parameters, the state variables.” (Italics are from the original text)
- Puterman first introduces a state variable by saying [Puterman (2005), p. 18] “At each decision epoch, the system occupies a state.” (Italics are from the original text)
- From Wikipedia: “A state variable is one of the set of variables that are used to describe the mathematical ‘state’ of a dynamical system.” (The next sentence says: “Intuitively, the state of a system describes enough about the system to determine its future behaviour in the absence of any external forces affecting the system.” But, we can still define state variables in the presence of exogenous information flows, so this statement is not accurate either.)
Let me first start by asking: Didn’t we all learn in grade school that we do not use the word we are defining in its definition??!!
Now look at some definitions in books on optimal control:
- From Kirk (2004): A state variable is a set of quantities x_1(t),x_2(t),\ldots, which if known at time t = t_0, are determined for t \geq t_0 by specifying the inputs to the system for t \geq t_0.
- From Cassandras and Lafortune (2008): The state of a system at time t_0 is the information required at time t_0 such that the output [cost] y(t) for all t \geq t_0 is uniquely determined from this information and from u(t).
These are both stated as formal definitions, and both can be restated simply as:
- The state is all the information you need at time t to model the system from time t onward.
Both of the definitions above understand that to model the system moving forward, you need the controls u(t) (presumably determined by a “control law” or “policy”) as well as any exogenous (random) information. These definitions appear to be standard in optimal control.
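The control-theoretic definition can be made concrete with a minimal simulation sketch. The inventory setting, the cost coefficients, and all function names below are hypothetical illustrations (not from any of the books quoted above); the point is only that the state, a policy, and the exogenous information are together sufficient to roll the system forward:

```python
def transition(state, decision, exog_info):
    """Hypothetical inventory model of S_{t+1} = S^M(S_t, x_t, W_{t+1}).

    state     -- inventory on hand: everything we need to move forward
    decision  -- amount ordered now (chosen by the policy)
    exog_info -- random demand revealed after the decision is made
    """
    return max(0.0, state + decision - exog_info)

def simulate(policy, initial_state, demands):
    """Roll the system forward: state + policy + exogenous info suffice."""
    state, total_cost = initial_state, 0.0
    for demand in demands:
        x = policy(state)
        # Illustrative costs: $2/unit ordered, $5/unit of unmet demand
        total_cost += 2.0 * x + 5.0 * max(0.0, demand - state - x)
        state = transition(state, x, demand)
    return total_cost

# Example run with a simple order-up-to-10 policy and fixed demands
order_up_to = lambda s: max(0.0, 10.0 - s)
cost = simulate(order_up_to, 5.0, [3.0, 8.0, 6.0])
```

Nothing outside the current state is ever consulted: the loop carries `state` forward, and the policy and transition function do the rest.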
I have spoken to numerous mathematicians (in stochastic control/optimization) who will insist “but I know what a state variable is.” Consider the following anecdotes of statements made by some of the best known names in the field:
- From Probability and Stochastics by Erhan Cinlar (2011) – a former colleague at Princeton and one of the best known probabilists in the field: “The definitions of ‘time’ and ‘state’ depend on the application at hand and on the demands of mathematical tractability. Otherwise, if such practical considerations are ignored, every stochastic process can be made Markovian by enhancing its state space sufficiently.”
- From Bertsekas’ Dynamic Programming and Optimal Control: Approximate Dynamic Programming (4th edition, 2012): “… we assume that at each time k, the control is applied with knowledge of the current state x_k. Such policies are called Markov because they do not involve dependence on states beyond the current. However, what if the control were allowed to depend on the entire past history

h_k = (x_0, u_0, x_1, u_1, \ldots, u_{k-1}, x_k),

which ordinarily would be available at time k. Is it possible that better performance can be achieved in this way?” (WBP: If this were the case, then there is information from “history” that is needed to make decisions, so why isn’t this included in the state variable?)
- In Puterman’s wonderful book Markov Decision Processes, on p. 97 he presents a graph problem that involves finding the path through a network that minimizes the second highest cost on the path (rather than the sum of the costs). He then goes on to argue that Bellman’s optimality equation no longer works! This is because he changes how costs are calculated, but still assumes the state of the system is the node where a traveler is located. The problem is that with the revised cost metric, you also have to keep track of the two highest costs on the path the traveler has traversed, because this is what is needed to determine whether a cost on the next arc is one of the top two.
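Once the state is augmented with the two highest arc costs seen so far, Bellman’s recursion works again. Here is a small sketch of this idea on a toy graph (the graph and its costs are made up for illustration; this is not Puterman’s actual example), using a Dijkstra-style search over the augmented state:

```python
import heapq

# Tiny directed graph, hypothetical arc costs: node -> [(next_node, cost)]
GRAPH = {
    "s": [("a", 4), ("b", 1)],
    "a": [("t", 2)],
    "b": [("t", 9)],
    "t": [],
}

def min_second_highest(source, target):
    """Minimize the second-highest arc cost on a path from source to target.

    The node alone is NOT a valid state for this objective. The augmented
    state (node, top1, top2) carries the two highest costs traversed so
    far, which is exactly what determines whether the next arc's cost
    becomes one of the top two. With nonnegative costs the objective is
    nondecreasing along a path, so a Dijkstra-style search is valid.
    """
    # heap entries: (second-highest cost so far, node, (top1, top2))
    # zeros are placeholders for "fewer than two arcs traversed so far"
    heap = [(0, source, (0, 0))]
    seen = set()
    while heap:
        top2, node, tops = heapq.heappop(heap)
        if node == target:
            return top2
        if (node, tops) in seen:
            continue
        seen.add((node, tops))
        for nxt, cost in GRAPH[node]:
            new_tops = tuple(sorted(tops + (cost,), reverse=True)[:2])
            heapq.heappush(heap, (new_tops[1], nxt, new_tops))
    return None
```

On this graph the path s-b-t has arc costs (1, 9), so its second-highest cost is 1, beating s-a-t with costs (4, 2); note that a shortest-path recursion over nodes alone would have chosen s-a-t (total cost 6 versus 10), which is exactly why the state must be expanded.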
If we agree that a state is all the information you need to model the system from time t onward, then the system is, by definition (and by construction) Markovian. Further, you would never need information from history since again, by definition (and by construction), the state variable already has any information that may have arrived before time t (or “time” k). So, there is no need to “expand the state space sufficiently,” nor any need to depend on history.
[Side note: a talented post-doc in my lab posed the question: What if we simply do not know all the information we need? This raises subtle issues that are more than I can cover on a webpage. See note (vii) on page 483 of RLSO (following the definition of states) and section 20.2 in RLSO which uses a two-agent model of flu mitigation to illustrate the setting of when a controlling agent does not know the environment.]
I like the characterization, widely used in books on optimal control, that the state variable is all the information you need to model the system from time t onward, regardless of when the information arrived! My only complaint is that it needs to be more explicit.
In my new book RLSO [Section 9.4], I offer two definitions of state variables depending on whether a policy has been specified or not.
- Policy-dependent version – A function of history that, combined with the exogenous information (and a policy), is necessary and sufficient to compute the cost/contribution function, the decision function (the policy), and any information required by the transition function to model the information needed for the cost/contribution and decision functions.
- Optimization version – A function of history that is necessary and sufficient to compute the cost/contribution function, the constraints, and any information required by the transition function to model the information needed for the cost/contribution function and the constraints.
Both definitions are completely consistent with the “all the information you need …” definitions from optimal control. It is just that I have identified the specific places where we need to provide information: the cost/contribution function, the policy (or constraints), and then the equations used to model how this information evolves over time (this is inside the transition function).
I find it useful to note that a state variable is information which may come in three flavors:
- Physical state variables R_t – This might be inventory, the location of a vehicle on a graph, the attributes of a person, machine or patient.
- Informational variables I_t – This is any information about quantities or parameters that are not included in R_t. Examples could be market prices, weather, or traffic conditions.
- Belief variables B_t – These are statistics (frequentist) or parameters of probability distributions (Bayesian) describing any quantities or parameters that we do not know perfectly. This could be used to describe how a market responds to price, the time that a shipment might arrive, the state of a patient or complex machine.
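A small sketch may help make the three flavors concrete. The energy-storage setting, the variable names, and the numbers below are all hypothetical illustrations; the belief component uses a standard normal-normal conjugate update:

```python
from dataclasses import dataclass

@dataclass
class State:
    """S_t = (R_t, I_t, B_t) for a hypothetical energy-storage problem."""
    energy: float       # R_t -- physical state: energy in the battery (kWh)
    price: float        # I_t -- informational state: market price ($/kWh)
    demand_mean: float  # B_t -- belief state: mean of estimated demand
    demand_var: float   # B_t -- belief state: variance of that estimate

def update_belief(state, observed_demand, obs_var=4.0):
    """Bayesian (normal-normal conjugate) update of the belief B_t."""
    precision = 1.0 / state.demand_var + 1.0 / obs_var
    new_var = 1.0 / precision
    new_mean = new_var * (state.demand_mean / state.demand_var
                          + observed_demand / obs_var)
    return State(state.energy, state.price, new_mean, new_var)

s0 = State(energy=20.0, price=0.12, demand_mean=50.0, demand_var=16.0)
s1 = update_belief(s0, observed_demand=58.0)
```

The physical and informational components evolve through the transition function like any other dynamics, while the belief component evolves through a statistical updating equation; all three live together in S_t.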
While the optimal control literature is the best I have seen in terms of defining state variables, I have yet to see a control paper that recognizes that a belief can be part of a state variable.
The POMDP (partially observable Markov decision process) literature creates a special dynamic program where the belief about a quantity can be a state, but this literature does not seem to recognize that you can have physical state variables and belief state variables that combine to form the state variable. A good example arises in clinical trials where you have a physical state (how many patients are remaining in the pool, how much money you have remaining) and the belief about the efficacy of the drug.
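The clinical-trial setting can be sketched as follows. The numbers, names, and the Beta-Bernoulli belief model are illustrative assumptions, not taken from any specific trial design; the point is that one state variable combines physical and belief components:

```python
from dataclasses import dataclass

@dataclass
class TrialState:
    """Hypothetical clinical-trial state with physical + belief components.

    Physical (R_t): patients and budget remaining.
    Belief (B_t): Beta(alpha, beta) posterior on the drug's success rate.
    """
    patients_left: int
    budget_left: float
    alpha: float  # prior pseudo-count + observed successes
    beta: float   # prior pseudo-count + observed failures

def treat_patient(state, success, cost_per_patient=1000.0):
    """One transition step: the physical state shrinks while the belief
    state is updated via the Beta-Bernoulli conjugate update."""
    return TrialState(
        patients_left=state.patients_left - 1,
        budget_left=state.budget_left - cost_per_patient,
        alpha=state.alpha + (1 if success else 0),
        beta=state.beta + (0 if success else 1),
    )

s = TrialState(patients_left=100, budget_left=250_000.0, alpha=1.0, beta=1.0)
s = treat_patient(s, success=True)
s = treat_patient(s, success=False)
mean_efficacy = s.alpha / (s.alpha + s.beta)  # posterior mean of efficacy
```

A policy for this problem (e.g., when to stop the trial) would naturally need both pieces: the remaining budget and pool, and the current belief about efficacy, which is precisely why neither alone is the state.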