Optimal Learning: Efficient Data Collection in the Information Age
Industrial Engineering Research Conference, Cancun, Mexico, June 6, 2010
Warren B. Powell, with research by Peter Frazier, Ilya Ryzhov and Warren Scott
Princeton University

Energy technology
Retrofitting buildings with new energy technologies. Different combinations of technologies interact, with behaviors that depend on the characteristics of the building. Potential technologies include:
• Window tinting, insulation
• Energy-efficient lighting
• Advanced thermostats
• ... many others
We need to try different combinations of technologies to build up a knowledge base on the different interactions, in different settings.

Finding the best path
Figuring out how to get around Manhattan: walking, subway plus walking, taxi, street bus, or driving?

Finding effective compounds
Materials research: how do we find the best material for converting sunlight to electricity? What is the best battery design for storing energy? We need a method to sort through potentially thousands of experiments.

Applications: pandemic disease control
Face masks are effective at disease containment, but it is better to test people for the disease. We cannot test everyone, though. Who do we test?

Applications: finding good designs
How do we optimize the dimensions of the tubes, plates and distances in an aerosol device? Each design requires several hours to set up and execute. Five parameters determine the effectiveness of the spray.

The nomadic trucker illustration
[A sequence of five figures follows a nomadic trucker choosing among loads. Initially the estimated value of every location is zero: $\bar V^0(\text{MN}) = \bar V^0(\text{CO}) = \bar V^0(\text{NY}) = \bar V^0(\text{CA}) = 0$, and the available loads pay $350, $150, $450 and $300. Taking the $450 load gives $\bar V^1(\text{TX}) = 450$. From there the new loads pay $180, $400, $600 and $125; taking the $600 load updates $\bar V(\text{NY}) = 600$, after which a fresh set of loads ($550, $350, $150, $250) is offered.]

Applications
[Collage of application areas.]

Outline
The challenge of learning
The knowledge gradient policy
The knowledge gradient with correlated beliefs
The knowledge gradient for on-line learning
Applications

The challenge of learning
Deterministic optimization: find the choice with the highest reward (assumed known).
Choice:   1    2    3    4    5
Value:   759  722  698  653  616
Choice 1, with the highest value, is the winner.

The challenge of learning
Stochastic optimization: now assume the reward you will earn is stochastic, drawn from a normal distribution. The reward is revealed after the choice is made.
Choice:    1    2    3    4    5
Mean:     759  722  698  653  616
Std dev:  120  142  133   90  102
Choice 1, with the highest mean, is still the winner.

The challenge of learning
Optimal learning: now you have a budget of 10 measurements to determine which of the 5 choices is best. You have an estimate of the performance of each, but you are unsure, and you are willing to update your beliefs.
Choice:          1    2    3    4    5
Prior mean:     759  722  698  653  616
Prior std dev:  120   78  133   90  102
Measuring choice 1 and observing 702 pulls its mean down to 712 and shrinks its standard deviation to 96. Measuring choice 2 and observing 734 pulls its mean up to 726 and shrinks its standard deviation to 64. And so on. It is no longer obvious which choice you should try first.

The challenge of learning
At first, we believe that $\mu_x \sim N(\theta^0_x, 1/\beta^0_x)$, where $\beta^0_x$ is the precision (one over the variance) of our belief about alternative $x$. We then measure alternative $x$ and observe $\hat y^1_x \sim N(\mu_x, 1/\beta^\epsilon)$, where $\beta^\epsilon$ is the precision of a measurement. Our beliefs change:
$$\theta^1_x = \frac{\beta^0_x \theta^0_x + \beta^\epsilon \hat y^1_x}{\beta^0_x + \beta^\epsilon}, \qquad \beta^1_x = \beta^0_x + \beta^\epsilon.$$
Thus, our beliefs about the rewards are gradually improved over the course of the measurements.

The challenge of learning
Now assume we have five choices, with uncertainty in our belief about how well each one will perform. If you can make one measurement, which would you measure?
[Figure: five alternatives shown with their estimated values and uncertainty bands.]
If the measurement comes back near what we expected, there is no improvement: we make the same decision we would have made anyway. If it comes back far enough from what we expected, we get a new solution. The value of learning is that it may change your decision.

The challenge of learning
The measurement problem: we wish to design a sequential measurement policy, where each measurement depends on previous choices. We can formulate this as a dynamic program,
$$V^n(S^n) = \max_x \Big( C(S^n, x) + \mathbb{E}\big[ V^{n+1}(S^{n+1}) \mid S^n \big] \Big),$$
... but it is a little different from most dynamic programs, which focus on the physical state.

The challenge of learning
Optimal routing over a graph: here $S^n$ is a node in the network. In the equation above, $S^n$ is the current node (e.g., node 2), $x$ is the decision to go to a node (e.g., node 5), and $S^{n+1}$ is the downstream node (e.g., node 5).

The challenge of learning
Learning problems: here $S^n$ is our "state of knowledge." For example, the belief about alternative 5 is $\mu_5 \sim N(\bar\mu_5, \bar\sigma^2_5)$, so $S_5 = (\bar\mu_5, \bar\sigma^2_5)$. In the same equation, $S^n$ is now the current state of knowledge, $x$ is the decision to make a measurement, and $S^{n+1}$ is the new state of knowledge.

The challenge of learning
Heuristic measurement policies:
Pure exploitation – always make the choice that appears to be the best.
Pure exploration – make choices at random so that you are always learning more.
Hybrid (epsilon-greedy)
• Explore with probability $\epsilon$ and exploit with probability $1-\epsilon$.
• Declining exploration – explore with probability $\epsilon_n = c/n$, which goes to zero as $n \to \infty$, but not too quickly.
Boltzmann exploration
• Explore choice $x$ with probability $p^n_x = \exp(\rho\,\bar\mu^n_x) \big/ \sum_{x'} \exp(\rho\,\bar\mu^n_{x'})$, where $\rho$ is a tunable scaling parameter.
Interval estimation
• Choose the $x$ that maximizes $\bar\mu^n_x + z\,\bar\sigma^n_x$, where $\bar\mu^n_x$ is the current estimate of the mean, $\bar\sigma^n_x$ is the standard deviation of our belief about it, and $z$ is a tunable parameter.

Outline – next section: The knowledge gradient policy.

The knowledge gradient
Basic principle: assume you can make only one measurement, after which you have to make a final choice (the implementation decision). What choice would you make now to maximize the expected value of the implementation decision?
[Figure: a measurement of option 5 changes our estimate of its value; a large enough change produces a change in the decision.]
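The belief-updating step above is the engine behind all of these policies, and it is easy to put into code. The following is a minimal Python sketch (my own variable names, not code from the talk), assuming independent normal beliefs and a known measurement precision:

```python
# Minimal sketch of the normal-normal belief update described above.
# Beliefs are stored as (mean, precision) pairs, where precision = 1 / variance.

def update_belief(theta, beta, y_hat, beta_eps):
    """Update the belief (theta, beta) about one alternative after observing
    y_hat, where beta_eps is the precision of a single measurement."""
    theta_new = (beta * theta + beta_eps * y_hat) / (beta + beta_eps)
    beta_new = beta + beta_eps
    return theta_new, beta_new

# Example: prior mean 759 with prior standard deviation 120 (as in the table
# above) and a noisy observation of 702; the measurement noise standard
# deviation of 160 is an assumed value for illustration.
theta1, beta1 = update_belief(759.0, 1 / 120.0**2, 702.0, 1 / 160.0**2)
print(theta1, (1 / beta1) ** 0.5)   # updated mean and standard deviation
```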
The knowledge gradient
General model (off-line learning): we have a measurement budget of $N$ observations. After we do our measurements, we have to make an implementation decision.
Notation:
• $y$ – the implementation decision.
• $K^n$ – our state of knowledge after $n$ measurements.
• $F(y, K^n)$ – the value of making decision $y$ given knowledge $K^n$.
• $x^n$ – the measurement decision after $n$ measurements.
• $W^{n+1}_x$ – the observation resulting from measuring alternative $x^n = x$.
• $K^{n+1}(x)$ – the updated distribution of belief about the values (or costs) after observing $W^{n+1}_x$.

The knowledge gradient
The knowledge gradient is the expected marginal value of a single measurement $x$:
$$\nu^{KG,n}_x = \mathbb{E}\Big[ \max_y F\big(y, K^{n+1}(x)\big) \,\Big|\, K^n \Big] - \max_y F\big(y, K^n\big).$$
The first term is the new optimization problem over implementation decisions in the updated knowledge state, with the expectation taken over the possible measurement outcomes; the second term is the optimization problem given what we know now. The knowledge gradient policy measures the alternative with the largest marginal value,
$$X^{KG} = \arg\max_x \nu^{KG,n}_x.$$
The challenge is a computational one: how do we compute the expectation?

The knowledge gradient
Computing the knowledge gradient for Gaussian beliefs. The variance of the change in our belief from one more measurement of $x$ is
$$\tilde\sigma^{2,n}_x = \mathrm{Var}\big[\bar\mu^{n+1}_x - \bar\mu^n_x \mid S^n\big] = \sigma^{2,n}_x - \sigma^{2,n+1}_x.$$
Next compute the normalized influence
$$\zeta^n_x = -\left|\frac{\bar\mu^n_x - \max_{x' \ne x} \bar\mu^n_{x'}}{\tilde\sigma^n_x}\right|,$$
and let
$$f(\zeta) = \zeta\,\Phi(\zeta) + \phi(\zeta),$$
where $\Phi(\cdot)$ is the cumulative standard normal distribution and $\phi(\cdot)$ is the standard normal density. The knowledge gradient is then
$$\nu^{KG,n}_x = \tilde\sigma^n_x\, f(\zeta^n_x).$$

The knowledge gradient
Classic steepest ascent updates a continuous decision along the gradient, $x^{n+1} = x^n + \alpha_n \nabla f(x^n)$. The knowledge gradient policy is a type of coordinate ascent: each measurement improves our knowledge along the single coordinate (alternative) that we choose to measure.

The knowledge gradient
[Three bar charts, one per example problem, show for each of the five choices the current estimate of its value (mu), the standard deviation of our belief (Sigma), and the resulting KG index.]

The knowledge gradient
The knowledge gradient policy: $X^{KG}(S^n) = \arg\max_x \nu^{KG,n}_x$.
Properties:
• Effectively a myopic policy, but also similar to steepest ascent for nonlinear programming.
• The best single measurement you can make (by construction).
• Asymptotically optimal for off-line learning (a more difficult proof): as the measurement budget grows, we get the optimal solution.
• The knowledge gradient policy is the only stationary policy with this behavior, and it has no tunable parameters.

The knowledge gradient policy
Myopic and asymptotic optimality.
[Conceptual plots: an ideal policy tracks the optimal solution; a policy with fast initial convergence may stall and never reach the optimal solution; an asymptotically optimal policy gets there eventually. The knowledge gradient combines myopic optimality (fast initial convergence) with asymptotic optimality.]

The knowledge gradient
Myopic policy vs. three-step lookahead.
[Plot of opportunity cost: the knowledge gradient (a one-step lookahead) compared with a three-step lookahead rolling horizon policy.]

The value of information
The value of information is often concave in the number of measurements ... but not always. The marginal value of a single measurement can be small!
[Plots of the value of information as a function of the number of measurements.]

The value of information
Optimal number of choices: as the measurement noise increases, the optimal number of alternatives being evaluated decreases.
[Plot of the number of alternatives being evaluated as the measurement noise grows.]

The value of information
The KG(*) policy: maximize the average value of the measurements.

Outline – next section: The knowledge gradient with correlated beliefs.
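Before turning to correlated beliefs, here is a minimal Python sketch of the knowledge-gradient computation for independent Gaussian beliefs laid out above. It is my own rendering of the published formula (not code from the talk), and it assumes the measurement noise standard deviation sigma_w is known:

```python
import numpy as np
from scipy.stats import norm

def knowledge_gradient(mu, sigma, sigma_w):
    """KG index of each alternative under independent normal beliefs.
    mu, sigma : belief means and belief standard deviations.
    sigma_w   : standard deviation of the measurement noise (assumed known)."""
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(sigma, dtype=float) ** 2
    # sigma_tilde: std dev of the change in the belief mean from one measurement.
    sigma_tilde = np.sqrt(var - 1.0 / (1.0 / var + 1.0 / sigma_w**2))
    kg = np.zeros_like(mu)
    for x in range(len(mu)):
        best_other = np.max(np.delete(mu, x))
        zeta = -abs(mu[x] - best_other) / sigma_tilde[x]
        kg[x] = sigma_tilde[x] * (zeta * norm.cdf(zeta) + norm.pdf(zeta))
    return kg

# The five alternatives from the earlier example; sigma_w = 100 is an assumed value.
mu = [759, 722, 698, 653, 616]
sigma = [120, 78, 133, 90, 102]
nu_kg = knowledge_gradient(mu, sigma, sigma_w=100.0)
print(nu_kg, "-> measure alternative", int(np.argmax(nu_kg)) + 1)
```

With correlated beliefs the same idea applies, except that the expectation must account for the fact that a single measurement shifts the entire vector of belief means.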
The knowledge gradient with correlated beliefs
An important problem class involves correlated beliefs – measuring one alternative tells us something about other alternatives.
[Figure: five alternatives with belief distributions; if we measure here (one alternative), these beliefs (the neighboring alternatives) change too.]

The knowledge gradient with correlated beliefs
Examples:
• Finding the best price at which to sell a product: demand at a price of $8 is close to demand at a price of $9.
• Choosing a combination of drugs to treat a disease: two treatments may share common medications.
• Finding a chemical for a particular medical or industrial purpose: two chemicals sharing similar molecular structures behave similarly.
• Choosing a combination of features to include in a product: we can only evaluate sales of a complete product, and two products may have some features in common while others differ.

The knowledge gradient with correlated beliefs
Optimizing the price of a product: estimating demand at a price of $84 tells us something about the demand when we charge $86.
[Figure: the updated belief about the demand curve without correlations vs. with correlations.]
The correlated knowledge gradient procedure chooses measurements based in part on what we learn about other potential measurements. The updating of correlations is built into the decision function, not just the transition function.

Outline – next section: The knowledge gradient for on-line learning.

Major problem classes
Types of learning problems:
Off-line learning (ranking and selection / stochastic search)
• There is a phase of information collection with a finite budget, after which you make an implementation decision.
• Examples:
 – Finding the best manufacturing configuration or engineering design, evaluated using an expensive simulation.
 – Finding the best combination of designs for hydrogen production, storage and conversion.
On-line learning (multiarmed bandit problems)
• "Learn as you earn."
• Examples:
 – Finding the best path to work.
 – What is the best set of energy-saving technologies to use for your building?
 – What is the best medication to control your diabetes?

Knowledge gradient for on-line learning
Objective function for off-line problems: we wish to find the best design after $N$ measurements,
$$\max_x \bar\mu^N_x.$$
Objective function for on-line problems: we wish to maximize the total reward as we proceed,
$$\max_\pi \sum_{n=1}^{N} \mu_{x^n}, \qquad x^n = X^\pi(S^n),$$
where $X^\pi(S^n)$ is the choice made by policy $\pi$ given the state of knowledge $S^n$.

Measurement policies
Special case: the multiarmed bandit problem. Which slot machine should I try next to maximize total expected rewards?
Breakthrough (Gittins and Jones, 1974):
• You do not need to solve the high-dimensional dynamic program.
• Compute a single index (the "Gittins index") for each slot machine, and try the slot machine with the largest index,
$$\nu^{Gittins,n}_x = \bar\mu^n_x + \Gamma(n, \gamma)\,\sigma_W,$$
where $\bar\mu^n_x$ is the current estimate of the reward from machine $x$, $\sigma_W$ is the standard deviation of a measurement, and $\Gamma(n,\gamma)$ is the Gittins index for a standardized problem with measurements of mean zero and variance one.

Knowledge gradient for on-line learning
For off-line problems, $\nu^{KG,n}_x$ is the value of a measurement for a single (final) decision. For finite-horizon on-line problems, suppose we have made 3 measurements out of our budget of 20. What is the value of learning from one more measurement? $\nu^{KG,3}_x$ is the improvement in the 4th decision given what we know after the 3rd measurement, but we benefit from this information 17 more times:
$$\nu^{KG\text{-}OL,3}_x = \bar\mu^3_x + (20-3)\,\nu^{KG,3}_x = \bar\mu^3_x + 17\,\nu^{KG,3}_x.$$
The more times we can use the information, the more we are willing to take a loss now for future benefits.
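A minimal Python sketch of this on-line adjustment (my own code; it reuses the knowledge_gradient function sketched earlier, and the horizon N and discount factor gamma are inputs):

```python
import numpy as np

def online_kg_index(mu, kg, n, N=None, gamma=None):
    """On-line knowledge gradient index.
    mu : current belief means; kg : off-line KG values for each alternative;
    n  : number of measurements made so far.
    Pass N for a finite horizon, or gamma for an infinite discounted horizon."""
    mu, kg = np.asarray(mu, dtype=float), np.asarray(kg, dtype=float)
    if N is not None:
        return mu + (N - n) * kg               # information is used N - n more times
    return mu + (gamma / (1.0 - gamma)) * kg   # discounted infinite horizon

# Example: 3 of 20 measurements used, so the KG term is weighted by 17.
mu = [759, 722, 698, 653, 616]
kg = [2.1, 3.4, 2.8, 0.9, 0.5]                 # illustrative off-line KG values
idx = online_kg_index(mu, kg, n=3, N=20)
print("measure alternative", int(np.argmax(idx)) + 1)
```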
Knowledge gradient for on-line learning
The knowledge gradient policy for on-line problems:
• Finite horizon: $\nu^{KG\text{-}OL,n}_x = \bar\mu^n_x + (N-n)\,\nu^{KG,n}_x$.
• Infinite horizon with discount factor $\gamma$: $\nu^{KG\text{-}OL,n}_x = \bar\mu^n_x + \dfrac{\gamma}{1-\gamma}\,\nu^{KG,n}_x$.
Compare this to the Gittins index for bandit problems, $\nu^{Gittins,n}_x = \bar\mu^n_x + \Gamma(n,\gamma)\,\sigma_W$.

Knowledge gradient for on-line learning
Gittins indices look at one measurement at a time, over the entire future. The knowledge gradient looks across all measurements at a point in time.
[Diagram contrasting the two along the time axis.]

Knowledge gradient for on-line learning
On-line KG vs. Gittins.
[Plot vs. the number of measurements, with regions where on-line KG slightly underperforms Gittins and regions where it slightly outperforms Gittins.]

Knowledge gradient for on-line learning
KG versus Gittins indices for multiarmed bandit problems: Gittins indices are provably optimal, but computing them is hard. Chick and Gans (2009) have developed a simple and accurate approximation.
[Plots of the improvement of KG over Gittins, for an informative prior and for an uninformative prior.]

Knowledge gradient for on-line learning
But the knowledge gradient can also handle finite horizons and correlated beliefs. Comparisons: KG vs. Gittins, KG vs. upper confidence bounding, KG vs. interval estimation, KG vs. pure exploitation.

Knowledge gradient for on-line learning
KG versus interval estimation: recall that with IE, you choose the alternative with the highest
$$\nu^{IE,n}_x = \bar\mu^n_x + z\,\bar\sigma^n_x,$$
where $z$ is a tunable parameter.
[Plot of opportunity cost against the IE parameter $z$, showing the range of $z$ where IE beats KG and the range where KG beats IE.]

Knowledge gradient for on-line learning
Tuning $z$ for interval estimation: the optimal value is very sensitive to the problem parameters.
[Plots for two problem instances whose tuned values of $z$ differ widely.]

Outline – next section: Applications.

Applications
• Optimizing an energy storage problem – Warren Scott, Emre Barut, Jennifer and Christine Schoppe
• Drug discovery – Diana Negoescu, Peter Frazier
• Learning on a graph – Ilya Ryzhov

Optimal control of wind and storage
Wind varies with multiple frequencies (seconds, hours, days, seasonal) and is spatially uneven, generally not aligned with population centers. Solar shines primarily during the day (when it is needed), but not entirely reliably, and is strongest in the south/southwest.
[Plots of wind power over 30 days and over 1 year.]
Storage technologies: hydroelectric, batteries, flywheels, ultracapacitors.

Optimal control of wind and storage
Controlling the storage process: imagine that we would like to use storage to reduce demand when electricity prices are high. We use a simple policy controlled by two parameters: a price below which we store energy and a price above which we withdraw it.
[Plot of electricity price over time with the store and withdraw levels marked.]
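The talk does not spell out the storage rule, but a two-parameter store/withdraw policy of this kind can be sketched as follows (Python; the threshold names, the charge rate and the accounting are my own assumptions for illustration):

```python
def storage_decision(price, energy, capacity, rho_store, rho_withdraw, rate=1.0):
    """Assumed two-parameter rule: store when the price is below rho_store,
    withdraw when it is above rho_withdraw, otherwise hold.
    Returns the energy flow (positive = store, negative = withdraw)."""
    if price <= rho_store and energy < capacity:
        return min(rate, capacity - energy)     # buy and store cheap energy
    if price >= rho_withdraw and energy > 0.0:
        return -min(rate, energy)               # release energy when prices are high
    return 0.0

def simulate_profit(prices, rho_store, rho_withdraw, capacity=10.0):
    """Profit of the (rho_store, rho_withdraw) policy over one price path."""
    energy, profit = 0.0, 0.0
    for p in prices:
        flow = storage_decision(p, energy, capacity, rho_store, rho_withdraw)
        energy += flow
        profit -= p * flow          # pay when storing, earn when withdrawing
    return profit

print(simulate_profit([20, 18, 45, 60, 25, 15, 70], rho_store=22, rho_withdraw=50))
```

Each simulation of a (rho_store, rho_withdraw) pair gives one noisy measurement of the profit surface; the knowledge gradient then decides which pair to simulate next, as the following figures illustrate.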
Optimizing storage policy
[Figure: the two-dimensional space of store and withdraw parameters over which we search.]
Initially we think the profit is the same everywhere. We want to measure the value of the policy where the knowledge gradient is the highest; this is the measurement that teaches us the most.
[Figures: the estimated profit surface and the knowledge gradient surface before any measurements.]

Optimizing storage policy
After four measurements, the new optimum is at the same location. Whenever we measure at a point, the value of another measurement at the same point goes down; the knowledge gradient guides us to measuring areas of high uncertainty.
[Figures: the estimated profit surface, the measurements taken so far, and the value of another measurement (the knowledge gradient).]

Optimizing storage policy
[A sequence of figures shows the estimated profit surface and the knowledge gradient after five, six, seven, eight, nine and ten measurements.]
After 10 measurements, our estimate of the surface can be compared with the true surface.
[Figures: the estimated profit surface vs. the true surface.]

Outline – Applications, next: drug discovery (Diana Negoescu, Peter Frazier).

Applications
Biomedical research: how do we find the best drug to cure cancer? There are millions of combinations, with laboratory budgets that cannot test everything. We need a method for sequencing experiments.

Drug discovery
Designing molecules: X and Y are sites where we can hang substituents to change the behavior of the molecule. We express our belief using a linear, additive QSAR model,
$$Y = \theta_0 + \sum_{\text{sites } i}\; \sum_{\text{substituents } j} \theta_{ij} X_{ij}, \qquad X_{ij} = \begin{cases} 1 & \text{if substituent } j \text{ is at site } i,\\ 0 & \text{otherwise.} \end{cases}$$

Drug discovery
Knowledge gradient versus pure exploration for 99 compounds.
[Plot: performance relative to the best possible as a function of the number of molecules tested (out of 99), for the knowledge gradient and for pure exploration.]

Drug discovery
A more complex molecule, with substituent sites R1 through R5. Potential substituents include F, OH, CH3, OCH3, OCOCH3, NO, Cl and others. From this base molecule, we created problems with 10,000 compounds, and one with 87,120 compounds.

Drug discovery
Compact representation on the 10,000-compound problem: results from 15 sample paths, and a single sample path on the molecule with 87,120 combinations.
[Plots: performance relative to the best possible vs. the number of molecules tested.]

Parametric belief models
Representing beliefs using linear regression has many applications:
• How do we find the optimal price of a product sold on the internet?
• Which internet ad will generate the most ad clicks?
• How will a customer, described by a set of attributes, respond to a price for a contract?
• What parameter settings produce the best results from my business simulator?
• What are the best features that I should include in a laptop?

Major problem classes
Belief structures:
• Lookup tables (one belief for each discrete value $x$) – independent beliefs or correlated beliefs.
• Parametric beliefs: $y = \theta_0 + \theta_1 \phi_1(S) + \theta_2 \phi_2(S) + \ldots$
• Nonparametric beliefs.

Outline – Applications, next: learning on a graph (Ilya Ryzhov).
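The parametric belief models above tie back to correlated beliefs: a multivariate normal prior on the regression coefficients induces correlated beliefs across all of the compounds (or prices, or designs). A minimal Python sketch with made-up numbers shows how the induced means and covariances are computed:

```python
import numpy as np

# Assumed illustration: a normal belief about the coefficients theta of a
# linear model y = X @ theta induces correlated beliefs about every alternative.
theta_mean = np.array([1.0, 0.5, -0.2, 0.8])       # prior mean of the coefficients
theta_cov = np.diag([0.5, 0.3, 0.3, 0.4])          # prior covariance of the coefficients

# Each row encodes one alternative, e.g. which substituent sits at which site.
X = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 1, 1, 0],
              [1, 0, 0, 1]], dtype=float)

mu = X @ theta_mean            # induced belief means of the alternatives
Sigma = X @ theta_cov @ X.T    # induced covariance: shared features create correlation

print(mu)
print(Sigma)                   # off-diagonal entries show how alternatives co-vary
```

Measuring one compound updates the belief about the coefficient vector, and through it the beliefs about every other compound, which is what lets a knowledge gradient policy learn about a very large set of compounds from a limited number of experiments.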
Information collection for rapid response
The challenge: we need to plan the movement of emergency response resources to respond to an emergency. Collecting information:
• Aerial videos
• Sampling mobile phones with GPS
• Ground observations

Information collection on a graph
Optimal routing over a graph.
[A sequence of figures shows the network, the current shortest path, the evaluation of a single link, and the new shortest path that results.]
How do we decide which links to measure?

Information collection on a graph
The knowledge gradient on a graph: we can apply the knowledge gradient concept directly,
$$\nu^{KG}_x = \mathbb{E}\Big[ \max_y F\big(y, K(x)\big) \Big] - \max_y F\big(y, K\big),$$
where the first term is the expected value of the updated shortest path problem (after updating the distribution of belief about the cost of link $x$), and the second term is the shortest path problem based on what we currently believe about the link costs.
How do we compute the expected value of a stochastic shortest path problem?

Information collection on a graph
When we had finite alternatives, we had to compute the normalized distance to the best (or second best) alternative,
$$\zeta^n_x = -\left|\frac{\bar\mu^n_x - \max_{x' \ne x} \bar\mu^n_{x'}}{\tilde\sigma^n_x}\right|.$$
For problems on graphs, we instead compare the value of the best path that includes link $(i,j)$ with the value of the best path that does not,
$$\zeta^n_{ij} = -\left|\frac{V^n\big(p \ni (i,j)\big) - V^n\big(p \not\ni (i,j)\big)}{\tilde\sigma^n_{ij}}\right|.$$

Experimental results
Ten layered graphs (22 nodes, 50 edges) and ten larger layered graphs (38 nodes, 102 edges).

Special thanks to:
Peter Frazier (faculty, Cornell University)
Ilya Ryzhov (available 2011)
Warren Scott (expected availability 2012)
Emre Barut (expected availability 2013?)

Major problem classes
Measurement variable:
• Binary problems, x = (0,1) – sequential hypothesis testing.
• Discrete choice problems, x = (1, 2, …, M) – finding the best technology.
• Subset selection problems, x = (0 1 1 0 1 0 0 0 1) – R&D portfolio optimization.
• Continuous scalar parameter, x = 2.682 – what is the best temperature, density, quantity?
• Continuous vectors, x = (1.43, 12.78, 4.59, …) – tuning a design or process.
• Multiattribute problems, x = (OH, OCH3, NO, Cl, …) – drug discovery (what is the molecular compound?); what is the best set of features for a device?

The knowledge gradient
Computing the knowledge gradient: the normalized distance to the best (or second best) alternative is
$$\zeta^n_x = -\left|\frac{\bar\mu^n_x - \max_{x' \ne x} \bar\mu^n_{x'}}{\tilde\sigma^n_x}\right|.$$
[Figure: the knowledge gradient $\nu^{KG}_5$ of alternative 5 among five choices.]

Approximate dynamic programming
Learning the value of being in each state.
[Diagram: two states; staying in state 1 earns $0, moving from state 1 to state 2 earns -$5, and state 2 offers a reward of $20; our initial estimates are $\bar v_1 = 0$ and $\bar v_2 = 0$.]
Starting in state 1, given our initial estimates of the value of being in each state, we would prefer to stay in state 1 and get $0 than move to state 2 and get -$5 + 0 = -$5. To learn the value of being in state 2, we have to make an explicit decision to explore state 2.
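A small simulation makes the exploration point concrete. The sketch below (Python) uses my reading of the two-state diagram, with the $20 assumed to be the reward for remaining in state 2, and compares pure exploitation with a policy that occasionally explores:

```python
import random

# Two states, 1 and 2, as in the diagram above.  Actions: "stay" or "move".
# Staying in state 1 earns $0 and moving from 1 to 2 earns -$5; the $20 is
# assumed here to be the reward for remaining in state 2.
reward     = {1: {"stay": 0.0,  "move": -5.0}, 2: {"stay": 20.0, "move": 0.0}}
next_state = {1: {"stay": 1,    "move": 2},    2: {"stay": 2,    "move": 1}}

def run(policy, steps=50, gamma=0.9):
    v = {1: 0.0, 2: 0.0}                      # initial estimates v1 = v2 = 0
    state, total = 1, 0.0
    for _ in range(steps):
        action = policy(state, v)
        total += reward[state][action]
        # One-step Bellman update of our estimate of the current state's value.
        v[state] = max(reward[state][a] + gamma * v[next_state[state][a]]
                       for a in ("stay", "move"))
        state = next_state[state][action]
    return round(total, 1), {s: round(x, 1) for s, x in v.items()}

def greedy(state, v):                          # pure exploitation
    return max(("stay", "move"), key=lambda a: reward[state][a] + v[next_state[state][a]])

def eps_greedy(state, v, eps=0.2):             # exploit, but explore 20% of the time
    return random.choice(("stay", "move")) if random.random() < eps else greedy(state, v)

print("pure exploitation:", run(greedy))       # never leaves state 1, never learns v2
print("with exploration: ", run(eps_greedy))   # discovers the $20 reward in state 2
```

With pure exploitation the estimate of v2 never changes, so the $20 reward is never discovered; an explicit decision to explore, or a policy such as the knowledge gradient that values what a measurement teaches us, is needed.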