Optimal learning represents the problem of making observations (or measurements) in an efficient way to achieve some objective. This is our newest area of research, with a number of papers on the way.

**Yingfei Wang, K. G. Reyes, K. A. Brown, C. A. Mirkin, W. B. Powell, “Nested Batch Mode Learning and Stochastic Optimization with an Application to Sequential Multi-Stage Testing in Materials Science,” SIAM J. Scientific Computing, Vol. 37, No. 3, pp. B361-B381, DOI: 10.1137/140971117, 2015.**

*The paper develops a knowledge gradient policy for guiding an initial design decision (e.g. the size and shape of nanoparticles) followed by batch learning of a secondary tunable parameter (e.g. the density of particles) to maximize a metric (reflexivity of a surface).*

(Click here to download paper.)

**E. Barut and W. B. Powell, “Optimal Learning for Sequential Sampling with Non-Parametric Beliefs,” Journal of Global Optimization, Vol. 58, pp. 517-543 (2014). DOI 10.1007/s10898-013-0050-5.**

*This paper derives the knowledge gradient when the belief model is captured using kernel regression, a class of nonparametric statistical models. *

*(Click here to download paper)*

**Dayanik, Savas, Warren B. Powell, and Kazutoshi Yamazaki. “Asymptotically Optimal Bayesian sequential change detection and identification rules.” Annals of Operations Research (M. Katehakis, ed.) 208.1 (2013): 337-370.**

*We study the joint problem of sequential change detection and multiple hypothesis testing. Suppose that the common distribution of a sequence of i.i.d. random variables changes suddenly at some unobservable time to one of nitely many distinct alternatives, and one needs to both detect and identify the change at the earliest possible time. We propose computationally efficient sequential decision rules that are asymptotically either Bayes-optimal or optimal in a Bayesian fixed-error formulation, as the unit detection delay cost or the misdiagnosis and false alarm probabilities go to zero, respectively. Numerical examples are provided to verify the asymptotic optimality and the speed of convergence.*

*(Click here to download paper)*

**Ryzhov, I., W. B. Powell, “Information Collection for Linear Programs with Uncertain Objective Coefficients,” SIAM J. Optimization, Vol. 22(4), pp. 1344–1368 http://epubs.siam.org/doi/abs/10.1137/12086279X. (2012).**

Finding the optimal solution of a linear program assumes that you have accurate information on costs (among other things). In some settings, these costs may be approximate, and getting more information can be expensive. For example, a problem in logistics might require including costs that reflect the quality of service provided by a vendor, but it may be necessary to use the vendor for a period of time, or collect historical information from other manufacturers, to refine these costs.

This paper develops the knowledge gradient for maximizing the expected value of information when solving linear programs. The paper establishes asymptotic optimality for off-line versions of the problem and proposes a computationally tractable algorithm.

*(Click here to download paper)*

**I. Ryzhov, W. B. Powell, P. I. Frazier, “The knowledge gradient algorithm for a general class of online learning problems,” Operations Research, Vol. 60, No. 1, pp. 180-195 (2012). **

*The knowledge-gradient policy was originally derived for off-line learning problems such as ranking and selection. Considerable attention has been given to the on-line version of this problem, known popularly as the multiarmed bandit problem, for which Gittins indices are known to be optimal for discounted, infinite-horizon versions of the problem. In this paper, we derive a knowledge gradient policy for on-line problems, and show that it very closely matches the performance of Gittins indices for discounted infinite horizon problems. It actually slightly outperforms the best available approximation of Gittins indices (by Gans and Chick) on problems for which Gittins indices should be optimal. The KG policy is also effective on finite horizon problems. The only policy which is competitive with KG seems to be interval estimation, but this requires careful tuning of a parameter. The KG policy also works on problems where the beliefs about different alternatives are correlated. The KG policy with independent beliefs is extremely easy to compute (we provide closed-form expressions for the case with normal rewards), and requires a simple numerical algorithm for the case with correlated beliefs.*

*(click here to download paper)*

**Scott, Warren, P. I. Frazier, and W. B. Powell. “The Correlated Knowledge Gradient for Simulation Optimization of Continuous Parameters Using Gaussian Process Regression.” SIAM Journal on Optimization 21, No. 3 (2011): 996-1026.**

*We have previously developed the knowledge gradient with correlated beliefs for discrete alternatives. This paper extends this idea to problems with continuous alternatives. We do this by developing a continuous approximate of the knowledge gradient. This produces a nonconcave surface that we have to maximize. Local minima are located close to points that have been previously measured, so we use these points to guess at the locations of local maxima and then use a simple gradient search algorithm starting from each of these points. A proof of convergence is provided. We use the distances between local minima to perform scaling of the steepest descent algorithm. We compare the method against Huang’s adaptation of sequential kriging to problems with noisy measurements.*

*(Click here to download paper)*

**I. Ryzhov, W.B. Powell, “Information collection on a graph,” Operations Research, Vol 59, No. 1, pp. 188-201, 2011. (c) Informs**

*We derive a knowledge gradient policy for an optimal learning problem on a graph, in which we use sequential measurements to rene Bayesian estimates of individual arc costs in order to learn about the best path. This problem differs from traditional ranking and selection, in that the implementation decision (the path we choose) is distinct from the measurement decision (the edge we measure). Our decision rule is easy to compute, and performs competitively against other learning policies, including a Monte Carlo adaptation of the knowledge gradient policy for ranking and selection.*

*(click here to download paper)*

**Ryzhov, I. and W. B. Powell, “Bayesian Active Learning with Basis Functions,” IEEE Workshop on Adaptive Dynamic Programming and Reinforcement Learning, Paris, April, 2011.**

A common technique for dealing with the curse of dimensionality in approximate dynamic programming is to use a parametric value function approximation, where the value of being in a state is assumed to be a linear combination of basis functions. We propose a Bayesian strategy for resolving the exploration/exploitation dilemma in this setting. Our approach is based on the knowledge gradient concept from the optimal learning literature, which has been recently adapted for approximate dynamic programming with lookup-table approximations. The new method performs well in numerical experiments conducted on an energy storage problem.

*(click here to download paper)*

**P. Frazier and W. B. Powell, “Consistency of Sequential Bayesian Sampling Policies” SIAM J. Control and Optimization, Vol. 49, No. 2, 712-731 (2011). DOI: 10.1137/090775026**

We consider Bayesian information collection, in which a measurement policy collects information to support a future decision. This framework includes ranking and selection, continuous global optimization, and many other problems in sequential experimental design. We give a sufficient condition under which measurement policies sample each measurement type infinitely often, ensuring consistency, i.e., that a globally optimal future decision is found in the limit. This condition is useful for verifying consistency of adaptive sequential sampling policies that do not do forced random exploration, making consistency difficult to verify by other means. We demonstrate the use of this sufficient condition by showing consistency of two previously proposed ranking and selection policies: OCBA for linear loss, and the knowledge-gradient policy with independent normal priors. Consistency of the knowledge-gradient policy was shown previously, while the consistency result for

OCBA is new.

**D. Negoescu, P. Frazier and W. B. Powell, “The Knowledge Gradient Algorithm for Sequencing Experiments in Drug Discovery”, Informs Journal on Computing, Vol. 23, No. 3, pp. 346-363, 2011. (c) Informs**

*This paper describes a method for applying the knowledge gradient to a problem with a very large number of alternatives. Instead of creating a belief about each alternative (known as a “lookup table belief model”), we represent our belief about an alternative using linear regression (known as a “parametric belief model”). The method is motivated by the need to find the best molecular compound to solve a particular problem (e.g. killing cancer cells). There is a base compound with a series of sites (indexed by j) and a series of small sequences of atoms (“substituents”) indexed by i. Let X_{ij} = 1 if we put substituent i at site j, and let theta_{ij} be the impact of this combination on the performance of the compound. The goal is to choose compounds to test that allow us to estimate the parameters theta as quickly as possible.*

*(click here to download main paper) (Click here for online supplement)*

**P. Frazier, W. B. Powell, H. P. Simao, “Simulation Model Calibration with Correlated Knowledge-Gradients,” Winter Simulation Conference, December, 2009.**

*A common challenge in the calibration of simulation model is that we have to tune several continuous parameters. This is our first application of the knowledge gradient algorithm with correlated beliefs to the problem of parameter tuning for simulation models. A single run of the model (which uses adaptive learning from approximate dynamic programming) requires more than a day, so the paper also introduces methods to product results without a full run. A Bayesian model is set up to capture the uncertainty in our beliefs about the convergence of the model. The method is illustrated in the tuning of two continuous parameters, which required approximately six runs of the model.*

*(click here to download paper)*

**Frazier, P. I., and W. B. Powell, “Paradoxes in Learning: The Marginal Value of Information and the Problem of Too Many Choices,” Decision Analysis, Vol. 7, No. 4, pp. 378-403, 2010.**

*The value of information can be a concave function in the number of measurements, but for many problems it is not, and instead follows an S-curve. A little bit of information may teach you nothing, and you may have to make an investment in information beyond a certain threshold to actually have an impact. We investigate the economic implications of the S-curve effect, showing that it is possible to have too many choices. We then revisit the knowledge gradient algorithm, which allocates measurements based on the marginal value of information. The knowledge gradient can produce poor learning results in the presence of an S-curve. We propose the KG(*) algorithm, which maximizes the average value of information, and show that it produces good results when there is a significant S-curve effect.*

*(click here to download paper)*

**Mes, M., P. I. Frazier and W. B. Powell, “Hierarchical Knowledge Gradient for Sequential Sampling,” J. Machine Learning Research, Vol.12, pp. 2931-2974, 2011. **

*We develop the knowledge gradient for optimizing a function when our belief is represented by constants computed at different levels of aggregation. Our estimate of the function at any point is given by a weighted sum of estimates at different levels of aggregation. *

*(Click here to download paper)*

**Frazier, P. and W. B. Powell, “The knowledge gradient stopping rule for ranking and selection,” Proceedings of the Winter Simulation Conference, December 2008.**

*We consider the ranking and selection of normal means in a fully sequential Bayesian context. By considering the sampling and stopping problems jointly rather than separately, we derive a new composite stopping/sampling rule. The sampling component of the derived composite rule is the same as the previously introduced LL1 sampling rule, but the stopping rule is new. This new stopping rule significantly improves the performance of LL1 as compared to its performance under the best other generally known adaptive stopping rule, EOC Bonf, outperforming it in every case tested.*

*(click here to download paper)*

**Ryzhov, I. and W. B. Powell, “The Knowledge Gradient Algorithm For Online Subset Selection,” IEEE Conference on Approximate Dynamic Programming and Reinforcement Learning (part of IEEE Symposium on Computational Intelligence), March, 2009.**

*We derive a one-period look-ahead policy for online subset selection problems, where learning about one subset also gives us information about other subsets. We show that the resulting decision rule is easily computable, and present experimental evidence that the policy is competitive against other online learning policies.*

*(click here to download paper)*

**S Dayanik, W. B. Powell and K. Yamazaki “An Asymptotically Optimal Strategy in Sequential Change Detection and Identification Applied to Problems in Biosurveillance” Proceedings of the 3rd INFORMS Workshop on Data Mining and Health Informatics, (J. Li, D. Aleman, R. Sikora, eds.), 2008.**

*In some application, it is useful to have a stopping rule for an information collection problem. This paper investigates a stopping rule based on the knowledge gradient concept.*

*(click here to download paper)*

**Powell, W. B. and P. Frazier, “Optimal Learning,” TutORials in Operations Research, Chapter 10, pp. 213-246 (2008) (c) Informs.**

*This was an invited tutorial on the topic of optimal learning, and represents a fairly easy introduction to the general field of information collection. Although the page constraints limited the scope, it covers the central dimensions of information collection, along with an overview of a number of the most popular heuristic policies. Of course, we include an introduction to the knowledge gradient concept.*

*(click here to download paper)*

**P. Frazier, W. B. Powell, S. Dayanik, “The Knowledge-Gradient Policy for Correlated Normal Beliefs,” Informs Journal on Computing, Vol. 21, No. 4, pp. 585-598 (2009) (c) Informs**

*The knowledge gradient policy is a method for determining which of a discrete set of measurements we should make to determine which of a discrete set of choices we should make. Most of the applications that we have considered introduce the dimension of correlated beliefs. If we evaluate the level of contamination in one location and it measures high, we are likely to raise our belief about the level of toxin in nearby locations. If we test a machine for airport security that can sense explosives and it works poorly, we might lower our evaluation of other devices that might use similar technologies (e.g. a particular material or sensor within the device). This article shows how to compute the knowledge gradient for problems with correlated beliefs. The paper shows that just as with problems with independent beliefs, the knowledge gradient is both myopically and asymptotically optimal. Experimental work shows that it can produce a much higher rate of convergence than the knowledge gradient with independent beliefs, in addition to outperforming other more classical information collection mechanisms.*

*(click here to download paper) (Click here for online supplement)*

**Frazier, P., W. B. Powell and S. Dayanik, “A Knowledge Gradient Policy for Sequential Information Collection,” SIAM J. on Control and Optimization, Vol. 47, No. 5, pp. 2410-2439 (2008).**

*The knowledge gradient policy is introduced here as a method for solving the ranking and selection problem, which is an off-line version of the multiarmed bandit problem. Imagine that you have M choices (M is not too large) where you have a normally distributed belief about the value of each choice. You have a budget of N measurements to evaluate each choice to refine your distribution of belief. After your N measurements, you have to choose what appears to be the best based on your current belief. The knowledge gradient policy guides this search by always choosing to measure the choice which would produce the highest value if you only have one more measurement (the knowledge gradient can be viewed as a method of steepest ascent). The paper shows that this policy is myopically optimal (by construction), but is also asymptotically optimal, making it the only stationary policy that is both myopically and asymptotically optimal. The paper provides bounds for finite measurement budgets, and provides experimental work that shows that it works as well as, and often better, than other standard learning policies.*

*(click here to download paper)*

**S. Dayanik, W. Powell, and K. Yamazaki, “Index policies for discounted bandit problems with availability constraints,” Advances in Applied Probability, Vol. 40, No. 2, pp. 377-400 (2008).**

*One of the most famous problems in information collection is the multiarmed bandit problem, where make a choice (out of a discrete set of choices), observe a reward, and use this observation to update estimates of the future value of rewards. This problem can be solved by choosing the option with the highest index (known as the Gittins index). In this paper, we present an index problem for the case where not all the choices are available each time. As a result, it is sometimes important to make an observation just because the observation is available to be made.*