Transcript Document
A Tutorial on Inference and Learning in Bayesian Networks
Irina Rish and Moninder Singh, IBM T.J. Watson Research Center, rish,[email protected]

"Road map"
Introduction: Bayesian networks — what BNs are (representation, types, etc.), why use BNs (application classes), information sources, software, etc.
Probabilistic inference — exact inference, approximate inference.
Learning Bayesian networks — learning parameters, learning graph structure.
Summary.

Bayesian Networks
A Bayesian network is a pair BN = (G, Θ): a directed acyclic graph G over the variables (here the "Asia" network: Visit to Asia (A), Smoking (S), Tuberculosis (T), Lung Cancer (L), Bronchitis (B), Chest X-ray (C), Dyspnoea (D)) plus conditional probability distributions (CPDs) Θ: P(A), P(S), P(T|A), P(L|S), P(B|S), P(C|T,L), P(D|T,L,B).
Example CPD, P(D | T, L, B):
T L B | D=0  D=1
0 0 0 | 0.1  0.9
0 0 1 | 0.7  0.3
0 1 0 | 0.8  0.2
0 1 1 | 0.9  0.1
...
Conditional independencies give an efficient representation of the joint P(A, S, T, L, B, C, D) [Lauritzen & Spiegelhalter, 95].

Bayesian Networks
Structured, graphical representation of probabilistic relationships between several random variables.
Explicit representation of conditional independencies: missing arcs encode conditional independence.
Efficient representation of the joint pdf; allows arbitrary queries to be answered, e.g. P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?

Example: Printer Troubleshooting (Microsoft Windows 95) [Heckerman, 95]
A network over nodes such as Application Output OK, Print Spooling On, Spool Process OK, Local Disk Space Adequate, Network Up, Correct Printer Path, Net Cable Connected, Spooled Data OK, GDI Data Input OK, Uncorrupted Driver, Correct Driver, GDI Data Output OK, Correct Printer Selected, Print Data OK, Net Path OK, Printer On and Online, Correct Driver Settings, Net/Local Printing, PC to Printer Transport OK, Printer Data OK, Print Output OK, Local Path OK, Correct Local Port, Local Cable Connected, Paper Loaded, Printer Memory Adequate.

Example: Microsoft Pregnancy and Child Care [Heckerman, 95]

Independence Assumptions
Three basic connection types, illustrated on the Asia network: tail-to-tail (a common parent, e.g. Lung Cancer and Bronchitis given Smoking), head-to-tail (a chain, e.g. Visit to Asia, Tuberculosis, Chest X-ray), and head-to-head (a common child, e.g. Tuberculosis, Lung Cancer and Bronchitis converging on Dyspnoea).

Independence Assumptions
Nodes X and Y are d-connected by nodes in Z along a trail from X to Y if every head-to-head node along the trail is in Z or has a descendant in Z, and every other node along the trail is not in Z.
Nodes X and Y are d-separated by nodes in Z if they are not d-connected by Z along any trail from X to Y.
If nodes X and Y are d-separated by Z, then X and Y are conditionally independent given Z.

Independence Assumptions
A variable (node) is conditionally independent of its non-descendants given its parents (illustrated on the Asia network).

Independence Assumptions
Example network over Age, Gender, Exposure to Toxins, Diet, Smoking, Cancer, Serum Calcium and Lung Tumor: Cancer is independent of Diet given Exposure to Toxins and Smoking [Breese & Koller, 97].

Independence Assumptions
What this means is that the joint pdf can be represented as a product of local distributions:
P(A,S,T,L,B,C,D) = P(A) P(S|A) P(T|A,S) P(L|A,S,T) P(B|A,S,T,L) P(C|A,S,T,L,B) P(D|A,S,T,L,B,C)
                 = P(A) P(S) P(T|A) P(L|S) P(B|S) P(C|T,L) P(D|T,L,B).
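To make the factorization concrete, here is a minimal Python sketch of the Asia joint. The CPT numbers are illustrative placeholders (only the P(D|T,L,B) rows shown in the table above come from the slide), and the query at the end answers the example query by brute-force summation over the joint.

```python
# A minimal sketch of the chain-rule factorization for the Asia network:
# P(A,S,T,L,B,C,D) = P(A) P(S) P(T|A) P(L|S) P(B|S) P(C|T,L) P(D|T,L,B).
# Variables are binary (0/1); most CPT numbers below are invented placeholders.
from itertools import product

P_A = [0.99, 0.01]            # P(A=a), visit to Asia
P_S = [0.50, 0.50]            # P(S=s), smoking
P_T_given_A = [[0.99, 0.01],  # P(T=t | A=0)
               [0.95, 0.05]]  # P(T=t | A=1)
P_L_given_S = [[0.99, 0.01],  # P(L=l | S=0)
               [0.90, 0.10]]  # P(L=l | S=1)
P_B_given_S = [[0.70, 0.30],  # P(B=b | S=0)
               [0.40, 0.60]]  # P(B=b | S=1)
P_C_given_TL = {(0, 0): [0.95, 0.05], (0, 1): [0.02, 0.98],
                (1, 0): [0.02, 0.98], (1, 1): [0.02, 0.98]}
# P(D=0|T,L,B), P(D=1|T,L,B); the T=0 rows match the slide's table above.
P_D_given_TLB = {(0, 0, 0): [0.1, 0.9], (0, 0, 1): [0.7, 0.3],
                 (0, 1, 0): [0.8, 0.2], (0, 1, 1): [0.9, 0.1],
                 (1, 0, 0): [0.1, 0.9], (1, 0, 1): [0.1, 0.9],
                 (1, 1, 0): [0.1, 0.9], (1, 1, 1): [0.1, 0.9]}

def joint(a, s, t, l, b, c, d):
    """P(A,S,T,L,B,C,D) as the product of the seven local CPDs."""
    return (P_A[a] * P_S[s] * P_T_given_A[a][t] * P_L_given_S[s][l] *
            P_B_given_S[s][b] * P_C_given_TL[(t, l)][c] *
            P_D_given_TLB[(t, l, b)][d])

# Any query can be answered by summing the joint, e.g. the example query
# P(L=1 | S=0, D=1): sum over A,T,B,C with L=1,S=0,D=1, then normalize over L.
num = sum(joint(a, 0, t, 1, b, c, 1) for a, t, b, c in product((0, 1), repeat=4))
den = sum(joint(a, 0, t, l, b, c, 1) for a, t, l, b, c in product((0, 1), repeat=5))
print("P(lung cancer=yes | smoking=no, dyspnoea=yes) ~=", num / den)
```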
Independence Assumptions
Thus, the general product rule for Bayesian networks is
P(X1, X2, ..., Xn) = ∏_{i=1..n} P(Xi | Pa(Xi)),
where Pa(Xi) is the set of parents of Xi.

The Knowledge Acquisition Task
Variables: collectively exhaustive, mutually exclusive values; clarity test: a value should be knowable in principle.
Structure: can be learned if data are available, or constructed by hand (using "expert" knowledge); variable ordering matters: causal knowledge usually simplifies construction.
Probabilities: can be learned from data; the second decimal usually does not matter — relative probabilities do; sensitivity analysis.

The Knowledge Acquisition Task
Car-start example over Fuel, Battery, Gauge, TurnOver, Start: variable order is important, and causal knowledge simplifies construction.

The Knowledge Acquisition Task
Naive Bayesian classifiers [Duda & Hart; Langley 92]; Selective naive Bayesian classifiers [Langley & Sage 94]; Conditional trees [Geiger 92; Friedman et al 97].

The Knowledge Acquisition Task
Selective Bayesian networks [Singh & Provan, 95; 96].

What are BNs useful for?
Diagnosis: P(cause | symptom) = ?
Prediction: P(symptom | cause) = ?
Classification: max over classes of P(class | data) (a small naive Bayes sketch follows the application examples below).
Decision-making (given a cost function).
Data mining: induce the best model from data.
Domains: medicine, bioinformatics, speech recognition, stock market, text classification, computer troubleshooting.

What are BNs useful for?
(Diagram: predictive inference runs from known predisposing factors and causes to effects; diagnostic reasoning runs from observed effects back to unknown but important causes; decision making chooses actions by maximum expected utility under imperfect observations.)

What are BNs useful for? Value of Information
(Diagram: a troubleshooting loop — salient observations yield an assignment of belief over faults 1, 2, 3, ...; decide whether to halt and act now (Action 1, Action 2, do nothing) or to gather the next best observation, chosen by value of information.)

Why use BNs?
Explicit management of uncertainty; modularity implies maintainability; better, flexible and robust decision making (MEU, VOI); can be used to answer arbitrary queries, including multiple-fault problems; easy to incorporate prior knowledge; easy to understand.

Application Examples
Intellipath: commercial version of Pathfinder; 60 lymph-node diseases, 100 findings.
APRI system developed at AT&T Bell Labs: learns and uses Bayesian networks from data to identify customers liable to default on bill payments.
NASA Vista system: predicts failures in propulsion systems, considers time criticality, suggests the highest-utility action, and dynamically decides what information to show.

Application Examples
Answer Wizard in MS Office 95 / MS Project: a Bayesian-network-based free-text help facility; uses naive Bayesian classifiers.
Office Assistant in MS Office 97: an extension of the Answer Wizard; uses naive Bayesian networks; help is based on past experience (keyboard/mouse use) and the task the user is currently doing. This is the "smiley face" you get in your MS Office applications.

Application Examples
Microsoft Pregnancy and Child Care: available on MSN in the Health section. Frequently occurring children's symptoms are linked to expert modules that repeatedly ask parents relevant questions, ask the next best question based on the information provided, and present articles deemed relevant to that information.

Application Examples
Printer troubleshooting: HP bought a 40% stake in HUGIN; developing printer troubleshooters for HP printers. Microsoft has 70+ online troubleshooters on their web site that use Bayesian networks — multiple-fault models that incorporate utilities.
Fax machine troubleshooting: Ricoh uses Bayesian-network-based troubleshooters at call centers, which enabled Ricoh to answer twice the number of calls in half the time.
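Since the classification task above is "max over classes of P(class | data)" and the Answer Wizard slide mentions naive Bayesian classifiers, here is a minimal, self-contained naive Bayes sketch. The tiny "printer" data set and the Laplace-smoothing choice are invented for illustration; they are not from the tutorial or from any Microsoft troubleshooter.

```python
# A minimal naive Bayes classifier sketch: pick the class maximizing
# P(class | data) ∝ P(class) * prod_i P(feature_i | class).
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples, labels, alpha=1.0):
    """Estimate P(class) and P(feature_i = v | class) with Laplace smoothing."""
    n = len(labels)
    class_counts = Counter(labels)
    value_sets = defaultdict(set)          # feature index -> observed values
    feat_counts = defaultdict(Counter)     # (class, feature index) -> value counts
    for x, y in zip(examples, labels):
        for i, v in enumerate(x):
            value_sets[i].add(v)
            feat_counts[(y, i)][v] += 1

    def predict(x):
        # argmax_c log P(c) + sum_i log P(x_i | c)
        best, best_score = None, float("-inf")
        for c, n_c in class_counts.items():
            score = math.log(n_c / n)
            for i, v in enumerate(x):
                num = feat_counts[(c, i)][v] + alpha
                den = n_c + alpha * len(value_sets[i])
                score += math.log(num / den)
            if score > best_score:
                best, best_score = c, score
        return best

    return predict

# Invented toy data: (printer_power, toner_level) -> observed outcome.
X = [("on", "ok"), ("on", "low"), ("off", "ok"), ("off", "low")]
y = ["prints", "faded", "no_output", "no_output"]
classify = train_naive_bayes(X, y)
print(classify(("off", "ok")))   # -> "no_output"
```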
Online/print resources on BNs
Conferences & journals: UAI, ICML, AAAI, AISTAT, KDD; MLJ, DM&KD, JAIR, IEEE KDD, IJAR, IEEE PAMI.
Books and papers:
Bayesian Networks without Tears by Eugene Charniak, AI Magazine, Winter 1991.
Probabilistic Reasoning in Intelligent Systems by Judea Pearl, Morgan Kaufmann, 1988.
Probabilistic Reasoning in Expert Systems by Richard Neapolitan, Wiley, 1990.
CACM special issue on real-world applications of BNs, March 1995.

Online/print resources on BNs
Wealth of online information at www.auai.org: links to electronic proceedings for the UAI conferences, other sites with information on BNs and reasoning under uncertainty, several tutorials and important articles, research groups and companies working in this area, and other societies, mailing lists and conferences.

Publicly available s/w for BNs
List of BN software maintained by Russell Almond at bayes.stat.washington.edu/almond/belief.html.
Several free packages: generally research only.
Commercial packages: the most powerful (and expensive) is HUGIN; others include Netica and Dxpress.
We are working on developing a Java-based BN toolkit here at Watson; it will also work within ABLE.

"Road map"
Introduction: Bayesian networks — what BNs are (representation, types, etc.), why use BNs (application classes), information sources, software, etc.
Probabilistic inference — exact inference, approximate inference.
Learning Bayesian networks — learning parameters, learning graph structure.
Summary.

Probabilistic Inference Tasks
Belief updating: BEL(Xi) = P(Xi = xi | evidence).
Finding the most probable explanation (MPE): x* = argmax_x P(x, e).
Finding the maximum a posteriori hypothesis (MAP): (a1*, ..., ak*) = argmax_a Σ_{X\A} P(x, e), where A ⊆ X are the hypothesis variables.
Finding the maximum-expected-utility (MEU) decision: (d1*, ..., dk*) = argmax_d Σ_{X\D} P(x, e) U(x), where D ⊆ X are the decision variables and U(x) is a utility function.

Belief Updating
P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?

Belief updating: P(X | evidence) = ?
On the five-variable example with "moral" graph over A, B, C, D, E:
P(a | e=0) ∝ P(a, e=0) = Σ_{e=0, d, c, b} P(a) P(b|a) P(c|a) P(d|b,a) P(e|b,c)
           = P(a) Σ_{e=0} Σ_d Σ_c P(c|a) Σ_b P(b|a) P(d|b,a) P(e|b,c).
Variable elimination: summing out b creates a new function h^B(a, d, c, e).

Bucket elimination: algorithm elim-bel (Dechter 1996)
The elimination operator is summation.
bucket B: P(b|a), P(d|b,a), P(e|b,c)  ->  h^B(a, d, c, e)
bucket C: P(c|a), h^B(a, d, c, e)     ->  h^C(a, d, e)
bucket D: h^C(a, d, e)                ->  h^D(a, e)
bucket E: e=0, h^D(a, e)              ->  h^E(a)
bucket A: P(a), h^E(a)                ->  P(a | e=0)
W* = 4, the "induced width" (max clique size).

Finding MPE
MPE = max_x P(x). Algorithm elim-mpe (Dechter 1996): Σ is replaced by max.
MPE = max_{a,e,d,c,b} P(a) P(c|a) P(b|a) P(d|a,b) P(e|b,c).
The elimination operator is maximization.
bucket B: P(b|a), P(d|b,a), P(e|b,c)  ->  h^B(a, d, c, e) = max_b ...
bucket C: P(c|a), h^B(a, d, c, e)     ->  h^C(a, d, e)
bucket D: h^C(a, d, e)                ->  h^D(a, e)
bucket E: e=0, h^D(a, e)              ->  h^E(a)
bucket A: P(a), h^E(a)                ->  MPE
W* = 4, the "induced width" (max clique size).

Generating the MPE-tuple
Process the buckets in reverse order, A, E, D, C, B:
1. a' = argmax_a P(a) h^E(a)
2. e' = 0
3. d' = argmax_d h^C(a', d, e')
4. c' = argmax_c P(c|a') h^B(a', d', c, e')
5. b' = argmax_b P(b|a') P(d'|b,a') P(e'|b,c')
Return (a', b', c', d', e').
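Here is a minimal Python sketch of the bucket (variable) elimination scheme above, computing P(a | e=0) on the five-variable example. The CPT numbers are invented placeholders, and the factor representation (dictionaries keyed by assignments) is just one simple implementation choice, not the tutorial's.

```python
# A minimal variable (bucket) elimination sketch for belief updating on
# P(a,b,c,d,e) = P(a) P(b|a) P(c|a) P(d|b,a) P(e|b,c), with binary variables.
from itertools import product

# A factor is (scope, table): scope is a tuple of variable names, table maps
# full assignments of the scope (tuples of 0/1) to numbers.
def make_factor(scope, fn):
    return (scope, {vals: fn(dict(zip(scope, vals)))
                    for vals in product((0, 1), repeat=len(scope))})

def multiply(f, g):
    (sf, tf), (sg, tg) = f, g
    scope = sf + tuple(v for v in sg if v not in sf)
    table = {}
    for vals in product((0, 1), repeat=len(scope)):
        asg = dict(zip(scope, vals))
        table[vals] = (tf[tuple(asg[v] for v in sf)] *
                       tg[tuple(asg[v] for v in sg)])
    return (scope, table)

def sum_out(f, var):
    scope, table = f
    new_scope = tuple(v for v in scope if v != var)
    new_table = {}
    for vals, p in table.items():
        key = tuple(v for v, name in zip(vals, scope) if name != var)
        new_table[key] = new_table.get(key, 0.0) + p
    return (new_scope, new_table)

# Placeholder CPTs (each normalized over its child variable).
P_a = make_factor(('a',), lambda x: 0.6 if x['a'] else 0.4)
P_b = make_factor(('b', 'a'), lambda x: 0.7 if x['b'] == x['a'] else 0.3)
P_c = make_factor(('c', 'a'), lambda x: 0.8 if x['c'] == x['a'] else 0.2)
P_d = make_factor(('d', 'b', 'a'), lambda x: 0.9 if x['d'] == x['b'] else 0.1)
P_e = make_factor(('e', 'b', 'c'),
                  lambda x: 0.9 if x['e'] == (x['b'] and x['c']) else 0.1)
evidence = make_factor(('e',), lambda x: 1.0 if x['e'] == 0 else 0.0)  # e = 0

factors = [P_a, P_b, P_c, P_d, P_e, evidence]
for var in ('b', 'c', 'd', 'e'):             # elimination ordering; 'a' is the query
    bucket = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    combined = bucket[0]
    for f in bucket[1:]:
        combined = multiply(combined, f)
    factors = rest + [sum_out(combined, var)]

# Multiply the remaining factors over 'a' and normalize: P(a | e=0).
result = factors[0]
for f in factors[1:]:
    result = multiply(result, f)
scope, table = result
z = sum(table.values())
print({vals: p / z for vals, p in table.items()})   # {(0,): ..., (1,): ...}
```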
Complexity of inference
O(n exp(w*(d))), where w*(d) is the induced width of the moral graph along the ordering d.
The effect of the ordering: on the example's moral graph over A, B, C, D, E, one ordering gives w*(d1) = 4 while another gives w*(d2) = 2.

Other tasks and algorithms
MAP and MEU tasks: similar bucket-elimination algorithms, elim-map and elim-meu (Dechter 1996). The elimination operation is either summation or maximization, with a restriction on the variable ordering: summation must precede maximization (i.e., hypothesis or decision variables are eliminated last).
Other inference algorithms: join-tree clustering, Pearl's poly-tree propagation, conditioning, etc.

Relationship with join-tree clustering
Ordering A, B, C, D, E:
bucket(E): P(e|b,c)
bucket(D): P(d|a,b)
bucket(C): P(c|a) || h^D(a,b)
bucket(B): P(b|a) || h^C(a,b)
bucket(A): P(a)   || h^B(a)
A cluster is a set of buckets (a "super-bucket"); here the clusters correspond to BCE, ADB and ABC.

Relationship with Pearl's belief propagation in poly-trees
(Diagram: π messages, the "causal support" π(x1), flow from parents U1, U2, U3 down to X1, e.g. through P(z1|u1); λ messages, the "diagnostic support" λ_Y(x1), flow from children such as Y1 back up to X1.)
Pearl's belief propagation for a single-root query is elim-bel using a topological ordering and super-buckets for families.
Elim-bel, elim-mpe, and elim-map are linear for poly-trees.

"Road map"
Introduction: Bayesian networks.
Probabilistic inference — exact inference, approximate inference.
Learning Bayesian networks — learning parameters, learning graph structure.
Summary.

Inference is NP-hard => approximations
Exact inference is O(n exp(w*)). Approximations: local inference, stochastic simulations, variational approximations, etc.

Local Inference Idea
Bucket-elimination approximation: "mini-buckets". The local inference idea is to bound the size of recorded dependencies.
Computation in a bucket is time and space exponential in the number of variables involved; therefore, partition the functions in a bucket into "mini-buckets" over smaller numbers of variables.

Mini-bucket approximation: MPE task
Splitting a bucket into mini-buckets bounds the complexity: the maximum over the full bucket is replaced by a product of maxima over the mini-buckets, which upper-bounds it. Exponential complexity decrease: O(e^n) becomes O(e^r) + O(e^{n-r}).

Approx-mpe(i)
Input: i — the maximum number of variables allowed in a mini-bucket.
Output: [a lower bound (the probability of a sub-optimal solution), an upper bound].
Example: approx-mpe(3) versus elim-mpe (effective induced width 2 versus 4).

Properties of approx-mpe(i)
Complexity: O(exp(2i)) time and O(exp(i)) space.
Accuracy: determined by the upper/lower (U/L) bound. As i increases, both accuracy and complexity increase.
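A toy numeric check of the mini-bucket idea just described: the maximum of a product over a full bucket is upper-bounded by the product of per-mini-bucket maxima. The two functions below are invented placeholders over binary variables, not functions from the tutorial.

```python
# Toy check of the bound used by approx-mpe when a bucket over x is split:
# max_x f1(x, ...) * f2(x, ...) <= (max_x f1) * (max_x f2).
from itertools import product

def f1(x, a, d):          # would go in mini-bucket 1
    return 0.1 + 0.8 * (x == a) * (d == 1)

def f2(x, c, e):          # would go in mini-bucket 2
    return 0.2 + 0.7 * (x != c) * (e == 0)

for a, d, c, e in product((0, 1), repeat=4):
    exact = max(f1(x, a, d) * f2(x, c, e) for x in (0, 1))               # full bucket
    upper = max(f1(x, a, d) for x in (0, 1)) * max(f2(x, c, e) for x in (0, 1))
    assert exact <= upper + 1e-12
print("the product of mini-bucket maxima upper-bounds the exact bucket maximum")
```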
Possible uses of mini-bucket approximations: as anytime algorithms (Dechter and Rish, 1997), and as heuristics in best-first search (Kask and Dechter, 1999).
Other tasks: similar mini-bucket approximations exist for belief updating, MAP and MEU (Dechter and Rish, 1997).

Anytime Approximation
anytime-mpe(ε):
Initialize i = i0.
While time and space resources are available:
  i = i + i_step
  U = upper bound computed by approx-mpe(i)
  L = lower bound computed by approx-mpe(i)
  keep the best solution found so far
  if U/L ≤ 1 + ε, return the solution
end
Return the largest L and the smallest U.

Empirical Evaluation (Dechter and Rish, 1997; Rish, 1999)
Randomly generated networks (uniform random probabilities, random noisy-OR), CPCS networks, and probabilistic decoding; comparing approx-mpe and anytime-mpe versus elim-mpe.

Random networks
Uniform random: 60 nodes, 90 edges (200 instances). In 80% of cases, a 10-100 times speed-up while U/L < 2.
Noisy-OR gave even better results: P(x = 0 | y1, ..., yn) = ∏_{i: yi=1} qi, with random noise parameters qi. Exact elim-mpe was infeasible; approx-mpe took 0.1 to 80 sec.

CPCS networks — medical diagnosis (noisy-OR model)
Test case: no evidence. (Figure: U/L error versus time for anytime-mpe(0.0001) on cpcs360b and cpcs422b as the parameter i grows from 1 to 21.)
Algorithm                  Time (sec), cpcs360   Time (sec), cpcs422
elim-mpe                   115.8                 1697.6
anytime-mpe(ε = 10^-4)     70.3                  505.2
anytime-mpe(ε = 10^-1)     70.3                  110.5

Effect of evidence
More likely evidence => higher MPE => higher accuracy (why?). (Figures: log(U/L) histograms for i=10 on 1000 instances of likely evidence versus 1000 instances of random, unlikely evidence.)

Probabilistic decoding
Error-correcting linear block codes. State of the art: an approximate algorithm, iterative belief propagation (IBP) — Pearl's poly-tree algorithm applied to loopy networks.

approx-mpe vs. IBP
approx-mpe is better on low-w* codes; IBP is better on randomly generated (high-w*) codes. (Figure: bit error rate (BER) as a function of noise sigma.)

Mini-buckets: summary
Mini-buckets are a local inference approximation; the idea is to bound the size of recorded functions.
Approx-mpe(i) is the mini-bucket algorithm for MPE.
Better results for noisy-OR than for random problems; accuracy increases with decreasing noise; accuracy increases for likely evidence; sparser graphs give higher accuracy.
Coding networks: approx-mpe outperforms IBP on low induced-width codes.

"Road map"
Introduction: Bayesian networks.
Probabilistic inference — exact inference; approximate inference: local inference, stochastic simulations, variational approximations.
Learning Bayesian networks.
Summary.

Approximation via Sampling
1. Generate N samples from P(X): S = (s1, ..., sN), where si = (x1^i, x2^i, ..., xn^i).
2. Estimate probabilities by frequencies: P(Y = y) ≈ (# samples with Y = y) / N.
3. How to handle evidence E? Acceptance-rejection (e.g., forward sampling), or "clamping" evidence nodes to their values: likelihood weighing, Gibbs sampling (MCMC).

Forward Sampling (logic sampling (Henrion, 1988))
Input: E — evidence, N — # of samples, an ancestral ordering (X1, ..., Xn) of the nodes.
Output: N samples consistent with E.
1. For sample # = 1 to N:
2.   for i = 1 to n:
3.     Xi <- sample xi from P(xi | pai)
4.     if Xi ∈ E and xi ≠ ei, reject the sample:
5.       set i = 1 and go to step 2.
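A minimal sketch of forward (logic) sampling with rejection, following the algorithm above. The two-node Rain/WetGrass network and its CPTs are invented for illustration; they are not from the tutorial.

```python
# Forward sampling with acceptance-rejection on a tiny invented network:
# Rain -> WetGrass, both binary.
import random

def sample_bernoulli(p_true):
    return 1 if random.random() < p_true else 0

def forward_sample():
    """Sample all variables in ancestral order (parents before children)."""
    rain = sample_bernoulli(0.2)                       # P(Rain=1)
    wet = sample_bernoulli(0.9 if rain else 0.1)       # P(WetGrass=1 | Rain)
    return {"Rain": rain, "WetGrass": wet}

def rejection_query(evidence, query_var, n_samples=50_000):
    """Estimate P(query_var = 1 | evidence) by discarding inconsistent samples."""
    kept = []
    for _ in range(n_samples):
        s = forward_sample()
        if all(s[var] == val for var, val in evidence.items()):
            kept.append(s[query_var])
    return sum(kept) / len(kept) if kept else float("nan")

random.seed(0)
# P(Rain=1 | WetGrass=1); the exact value is 0.2*0.9 / (0.2*0.9 + 0.8*0.1) = 0.692...
print(rejection_query({"WetGrass": 1}, "Rain"))
```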
Forward sampling (example)
Network: X1 -> X2, X1 -> X3, and (X2, X3) -> X4, with CPDs P(x1), P(x2|x1), P(x3|x1), P(x4|x2,x3). Evidence: X3 = 0.
To generate sample k:
1. Sample x1 from P(x1).
2. Sample x2 from P(x2|x1).
3. Sample x3 from P(x3|x1).
4. If x3 ≠ 0, reject the sample and start from step 1; otherwise
5. sample x4 from P(x4|x2,x3).
Drawback: high rejection rate!

Likelihood Weighing (Fung and Chang, 1990; Shachter and Peot, 1990)
"Clamping" evidence + forward sampling + weighing samples by the evidence likelihood.
1. For each Xi ∈ E, assign xi = ei.
2. Find an ancestral ordering of the nodes, (X1, ..., Xn).
3. For sample # = 1 to N:
4.   for each Xi ∉ E:
5.     Xi <- sample xi from P(xi | pai)
6.   score(sample) = ∏_{Xi ∈ E} P(ei | pai)
7. Normalize the scores; then P(Y = y | E) ≈ Σ_{samples where Y = y} score(sample).
Works well for likely evidence!

Gibbs Sampling (Geman and Geman, 1984)
Markov Chain Monte Carlo (MCMC): create a Markov chain of samples.
1. For each Xi ∈ E, set xi = ei.
2. For each Xi ∉ E, set xi to a random value.
3. For sample # = 1 to N:
4.   for each Xi ∉ E:
5.     Xi <- sample xi from P(xi | X \ {Xi})
Advantage: guaranteed to converge to P(X). Disadvantage: convergence may be slow.

Gibbs Sampling (cont'd) (Pearl, 1988)
Important: P(xi | X \ {Xi}) is computed locally:
P(xi | X \ {Xi}) ∝ P(xi | pai) ∏_{Xj ∈ ch_i} P(xj | paj).
Markov blanket: M(Xi) consists of Xi's parents pai, its children ch_i, and its children's parents paj.
Given its Markov blanket (parents, children, and their parents), Xi is independent of all other nodes.

"Road map"
Introduction: Bayesian networks.
Probabilistic inference — exact inference; approximate inference: local inference, stochastic simulations, variational approximations.
Learning Bayesian networks.
Summary.

Variational Approximations
Idea: a variational transformation of the CPDs simplifies inference.
Advantages: compute upper and lower bounds on P(Y); usually faster than sampling techniques.
Disadvantages: more complex and less general — must be derived for each particular form of CPD function.

Variational bounds: example
log(x) = min_λ {λx − log λ − 1}, so log(x) ≤ λx − log λ − 1, where λ is the variational parameter.
This approach can be generalized to any concave (convex) function in order to compute its upper (lower) bounds: convex duality (Jaakkola and Jordan, 1997).

Convex duality (Jaakkola and Jordan, 1997)
1. If f(x) is concave, it has a dual function f*(λ) such that
f(x) = min_λ {λᵀx − f*(λ)} and f*(λ) = min_x {λᵀx − f(x)},
and we get the upper bounds f(x) ≤ λᵀx − f*(λ) and f*(λ) ≤ λᵀx − f(x).
2. For convex f(x), we get lower bounds.
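A small numeric check of the variational bound above, log(x) ≤ λx − log λ − 1, which becomes tight at λ = 1/x. The test points are arbitrary.

```python
# Verify log(x) <= lambda*x - log(lambda) - 1 for any lambda > 0,
# with equality at the minimizing value lambda = 1/x.
import math

for x in (0.3, 1.0, 2.5, 10.0):
    for lam in (0.05, 0.5, 1.0, 3.0, 1.0 / x):
        bound = lam * x - math.log(lam) - 1.0
        assert math.log(x) <= bound + 1e-12
    tight = (1.0 / x) * x - math.log(1.0 / x) - 1.0   # the minimizing lambda
    assert abs(tight - math.log(x)) < 1e-12
print("variational upper bound verified; tight at lambda = 1/x")
```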
Example: QMR-DT network (Quick Medical Reference — Decision-Theoretic (Shwe et al., 1991))
A two-layer network with ~600 disease nodes d1, ..., dk and ~4000 finding nodes f1, ..., fn.
Noisy-OR model:
P(fi = 0 | d) = (1 − qi0) ∏_{dj ∈ pai} (1 − qij)^{dj} = e^{−θi0 − Σ_{dj ∈ pai} θij dj},
where θij = −log(1 − qij).

Inference in QMR-DT
Inference: P(d1 | f) ∝ Σ_{d2,...,dk} P(d, f), where
P(d, f) = P(f | d) P(d) = ∏_{fi=1} P(fi | d) ∏_{fi=0} P(fi | d) ∏_j P(dj)
        = ∏_{fi=1} (1 − e^{−θi0 − Σ_{dj ∈ pai} θij dj}) ∏_{fi=0} e^{−θi0} ∏_{dj ∈ pai} [e^{−θij}]^{dj} ∏_j P(dj).
The negative findings are factorized over the diseases, but positive evidence "couples" the disease nodes.
Inference complexity: O(exp(min{p, k})), where p = # of positive findings and k = max family size (Heckerman, 1989 ("Quickscore"); Rish and Dechter, 1998).

Variational approach to QMR-DT (Jaakkola and Jordan, 1997)
f(x) = ln(1 − e^{−x}) is concave and has the dual f*(λ) = −λ ln λ + (λ + 1) ln(λ + 1).
Then P(fi = 1 | d) = 1 − e^{−θi0 − Σ_{dj ∈ pai} θij dj} can be bounded by
P(fi = 1 | d) ≤ e^{λi(θi0 + Σ_{dj ∈ pai} θij dj) − f*(λi)} = e^{λi θi0 − f*(λi)} ∏_{dj ∈ pai} [e^{λi θij}]^{dj}.
The effect of positive evidence is now factorized (the diseases are "decoupled").

Variational approximations
Bounds on the local CPDs yield a bound on the posterior. Two approaches: sequential and block.
Sequential: applies the variational transformation to (a subset of) nodes sequentially during inference, using a heuristic node ordering, and then optimizes over the variational parameters.
Block: selects in advance the nodes to be transformed, then selects the variational parameters minimizing the KL-distance between the true and approximate posteriors.

Block approach
P(Y | E): the exact posterior of Y given evidence E.
Q(Y | E, λ): the approximation after replacing some CPDs with their variational bounds.
Find λ* = argmin_λ D(Q || P), where D(Q || P) is the Kullback-Leibler (KL) distance:
D(Q || P) = Σ_S Q(S) log [Q(S) / P(S)].

Inference in BN: summary
Exact inference is often intractable => we need approximations.
Approximation principles: approximate elimination — local inference that bounds the size of dependencies among variables (cliques in the problem's graph): mini-buckets, IBP. Other approximations: stochastic simulations, variational techniques, etc.
Further research: combining "orthogonal" approximation approaches; better understanding of "what works well where", i.e., which approximation suits which problem structure; other approximation paradigms (e.g., other ways of approximating probabilities, constraints, cost functions).

"Road map"
Introduction: Bayesian networks.
Probabilistic inference — exact inference, approximate inference.
Learning Bayesian networks — learning parameters, learning graph structure.
Summary.

Why learn Bayesian networks?
Combining domain expert knowledge with data records such as <9.7 0.6 8 14 18>, <0.2 1.3 5 ?? ??>, <1.3 2.8 ?? 0 1>, <?? 5.6 0 10 ??>, ...
Efficient representation and inference.
Incremental learning: updating P(H).
Handling missing data, e.g. <1.3 2.8 ?? 0 1>.
Learning causal relationships, e.g. S -> C.

Learning Bayesian Networks
Known graph — learn the parameters (e.g. P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B)):
complete data: parameter estimation (ML, MAP); incomplete data: non-linear parametric optimization (gradient descent, EM).
Unknown graph — learn both graph and parameters, Ĝ = argmax_G Score(G):
complete data: optimization (search in the space of graphs); incomplete data: EM plus Multiple Imputation, structural EM, mixture models.

Learning Parameters: complete data
θ_{x,paX} = P(x | paX), a multinomial; maximize log P(D | Θ) — decomposable!
ML-estimate (counts): θ^ML_{x,paX} = N_{x,paX} / Σ_x N_{x,paX}.
MAP-estimate (Bayesian statistics): maximize log P(D | Θ) P(Θ), with conjugate Dirichlet priors Dir(θ_{paX} | α_{1,paX}, ..., α_{m,paX}):
θ^MAP_{x,paX} = (N_{x,paX} + α_{x,paX}) / Σ_x (N_{x,paX} + α_{x,paX}).
The α's act as an equivalent sample size (prior knowledge).
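A minimal sketch of the ML estimate above (counting), with an optional Dirichlet pseudo-count α playing the role of the equivalent sample size. The tiny data set is invented for illustration.

```python
# Estimate a CPT P(child | parents) from complete data:
# theta_{x,pa} = N_{x,pa} / sum_x N_{x,pa} (ML), or with pseudo-counts alpha
# added to each cell (Dirichlet / MAP-style smoothing).
from collections import Counter, defaultdict

def estimate_cpt(data, child, parents, alpha=0.0):
    """Return {parent values: {child value: probability}}."""
    counts = defaultdict(Counter)
    for record in data:
        pa_vals = tuple(record[p] for p in parents)
        counts[pa_vals][record[child]] += 1
    child_values = sorted({r[child] for r in data})
    cpt = {}
    for pa_vals, c in counts.items():
        total = sum(c[v] + alpha for v in child_values)
        cpt[pa_vals] = {v: (c[v] + alpha) / total for v in child_values}
    return cpt

data = [
    {"S": 1, "B": 1}, {"S": 1, "B": 1}, {"S": 1, "B": 0},
    {"S": 0, "B": 0}, {"S": 0, "B": 0}, {"S": 0, "B": 1},
]
print(estimate_cpt(data, child="B", parents=("S",)))            # ML counts
print(estimate_cpt(data, child="B", parents=("S",), alpha=1.0)) # Dirichlet(1) smoothing
```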
Learning graph structure
Find Ĝ = argmax_G Score(G) — an NP-hard optimization.
Heuristic search over graphs using local operators: add an arc (e.g. add S->B), delete an arc (delete S->B), reverse an arc (reverse S->B).
Complete data: local computations (the score decomposes). Incomplete data (the score is non-decomposable): stochastic methods.
Constraint-based methods: the data impose independence relations (constraints).

Learning BNs: incomplete data
Learning parameters: EM algorithm [Lauritzen, 95], Gibbs sampling [Heckerman, 96], gradient descent [Russell et al., 96].
Learning both structure and parameters: summing over missing values [Cooper & Herskovits, 92; Cooper, 95], Monte-Carlo approaches [Heckerman, 96], Gaussian approximation [Heckerman, 96], structural EM [Friedman, 98], EM and Multiple Imputation [Singh 97, 98, 00].

Learning Parameters: incomplete data
The marginal likelihood is non-decomposable (hidden nodes / missing values, e.g. records like <? 0 1 0 1>, <1 1 ? 0 1>, <0 0 0 ? ?>).
EM algorithm: iterate until convergence, starting from initial parameters for the current model (G, Θ).
Expectation step: run inference (e.g. P(S | X=0, D=1, C=0, B=1)) to compute expected counts.
Maximization step: update the parameters (ML, MAP) from the expected counts.

Learning Parameters: incomplete data (Lauritzen, 95)
The complete-data log-likelihood is Σ_{i=1..n} Σ_{j=1..qi} Σ_{k=1..ri} N_ijk log θ_ijk.
E step: compute E(N_ijk | Y_obs, θ^(t)).
M step: θ^(t+1)_ijk = E(N_ijk | Y_obs, θ^(t)) / E(N_ij | Y_obs, θ^(t)).

Learning structure: incomplete data
Depends on the type of missing data: missing independently of anything else (MCAR), or missing based on the values of other variables (MAR). While MCAR can be handled by decomposable scores, MAR cannot.
For likelihood-based methods, there is no need to explicitly model the missing-data mechanism. Very few attempts at MAR: stochastic methods.

Learning structure: incomplete data
Approximate EM by using Multiple Imputation, yielding an efficient Monte-Carlo method [Singh 97, 98, 00]:
trade-off between performance and quality — the learned network is almost optimal;
approximate the complete-data log-likelihood function using Multiple Imputation;
yields a decomposable score, dependent only on each node and its parents;
converges to a local maximum of the observed-data likelihood.

Learning structure: incomplete data
Q(B_S | B_S^(t)) = ∫ l(B_S | D^obs, D^mis) P(D^mis | D^obs, B_S^(t)) dD^mis
                 ≈ (1/M) Σ_{s=1..M} l(B_S | D^obs, D^mis_s)
                 = (1/M) Σ_{s=1..M} log P(D^obs, D^mis_s | B_S),
where
P(D^mis | D^obs, B_S^(t)) = ∫ P(D^mis | D^obs, B_S^(t), θ) P(θ | D^obs, B_S^(t)) dθ
                          ≈ (1/T) Σ_{r=1..T} P(D^mis | D^obs, B_S^(t), θ_r).

Scoring functions: Minimum Description Length (MDL)
Learning as data compression:
MDL(BN | D) = −log P(D | Θ, G) + (|Θ| / 2) log N = DL(Data | model) + DL(Model).
Other scores: MDL = −BIC (Bayesian Information Criterion); the Bayesian score (BDe) is asymptotically equivalent to MDL.

Learning Structure plus Parameters
p(Y | D) = Σ_M p(Y | M, D) p(M | D); the number of models is super-exponential.
Alternatives: model selection or model averaging.

Model Selection
Generally, choose a single model M*; this is equivalent to saying P(M* | D) = 1, so p(Y | D) ≈ p(Y | M*, D).
The task is now to: 1) define a metric to decide which model is best; 2) search for that model through the space of all models.
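Tying together the MDL/BIC score and the model-selection task above, here is a minimal sketch that scores two candidate structures on complete data. The data set and structures are invented, and variables are assumed binary; the score is log-likelihood minus (log N / 2) times the number of free parameters.

```python
# BIC / negative-MDL score: log P(D | ML parameters, G) - (log N / 2) * |Theta|.
import math
from collections import Counter

def log_likelihood(data, structure):
    """Sum over nodes of N_ijk * log(N_ijk / N_ij), i.e. ML log-likelihood."""
    ll = 0.0
    for child, parents in structure.items():
        joint, marg = Counter(), Counter()
        for r in data:
            pa = tuple(r[p] for p in parents)
            joint[(pa, r[child])] += 1
            marg[pa] += 1
        for (pa, x), n in joint.items():
            ll += n * math.log(n / marg[pa])
    return ll

def num_parameters(structure, arity=2):
    # (arity - 1) free parameters per parent configuration, all variables binary.
    return sum((arity - 1) * arity ** len(parents)
               for parents in structure.values())

def bic_score(data, structure):
    n = len(data)
    return log_likelihood(data, structure) - 0.5 * math.log(n) * num_parameters(structure)

data = [{"S": s, "B": b} for s, b in
        [(1, 1)] * 30 + [(1, 0)] * 10 + [(0, 1)] * 10 + [(0, 0)] * 30]
independent = {"S": (), "B": ()}        # no arc
dependent   = {"S": (), "B": ("S",)}    # S -> B
print(bic_score(data, independent), bic_score(data, dependent))  # S -> B wins here
```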
One Reasonable Score: Posterior Probability of a Structure
p(S^h | D) ∝ p(S^h) p(D | S^h) = p(S^h) ∫ p(D | θ_s, S^h) p(θ_s | S^h) dθ_s,
where p(S^h) is the structure prior, p(D | θ_s, S^h) the likelihood, and p(θ_s | S^h) the parameter prior.

Global and Local Predictive Scores [Spiegelhalter et al 93]
Bayes' factor: p(D | S^h) / p(D | S0^h).
Global: log p(D | S^h) = Σ_{l=1..m} log p(x_l | x_1, ..., x_{l-1}, S^h)
      = log p(x_1 | S^h) + log p(x_2 | x_1, S^h) + log p(x_3 | x_1, x_2, S^h) + ...
The local score is useful for diagnostic problems.

Local Predictive Score (Spiegelhalter et al., 1993)
For a diagnostic network with disease Y and symptoms X1, ..., Xn:
pred(S^h) = Σ_{l=1..m} log p(y_l | x_l, d_1, ..., d_{l-1}, S^h).

Exact computation of p(D | S^h)
Assumptions: no missing data; cases are independent given the model; uniform priors on the parameters; discrete variables. Then p(D | S^h) = ∏_{i=1..n} g(i, Pa_i) [Cooper & Herskovits, 92].

Bayesian Dirichlet Score (Cooper and Herskovits, 1991)
p(D | S^h) = ∏_{i=1..n} ∏_{j=1..qi} [ Γ(α_ij) / Γ(α_ij + N_ij) ] ∏_{k=1..ri} [ Γ(α_ijk + N_ijk) / Γ(α_ijk) ],
where N_ijk is the number of cases with Xi = x_ik and Pa_i = pa_ij, ri is the number of states of Xi, qi is the number of parent configurations of Xi, α_ij = Σ_{k=1..ri} α_ijk, and N_ij = Σ_{k=1..ri} N_ijk.
(A small numeric sketch of this score follows the summary below.)

Learning BNs without specifying an ordering
There are n! orderings, and the ordering greatly affects the quality of the network learned; use conditional independence tests and d-separation to get an ordering [Singh & Valtorta, 95].

Learning BNs via the MDL principle
Idea: the best model is the one that gives the most compact representation of the data. So encode the data using the model, plus encode the model itself, and minimize the total length [Lam & Bacchus, 93].

Learning BNs: summary
Bayesian networks are graphical probabilistic models with efficient representation and inference, combining expert knowledge with learning from data.
Learning: parameters (parameter estimation, EM) and structure (optimization with score functions, e.g. MDL).
Applications/systems: collaborative filtering (MSBN), fraud detection (AT&T), classification (AutoClass (NASA), TANBLT (SRI)).
Future directions: causality, time, model evaluation criteria, approximate inference/learning, on-line learning, etc.
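As referenced from the Bayesian Dirichlet score slide above, here is a minimal numeric sketch of that marginal likelihood, assuming uniform Dirichlet pseudo-counts α_ijk = 1 (a K2-style prior), binary variables, and an invented data set.

```python
# log p(D | S) = sum_i sum_j [ lgamma(a_ij) - lgamma(a_ij + N_ij)
#                              + sum_k ( lgamma(a_ijk + N_ijk) - lgamma(a_ijk) ) ]
# computed with log-Gamma for numerical stability.
import math
from collections import Counter, defaultdict

def log_bde(data, structure, alpha=1.0, arity=2):
    score = 0.0
    for child, parents in structure.items():
        counts = defaultdict(Counter)            # parent config -> child value counts
        for r in data:
            counts[tuple(r[p] for p in parents)][r[child]] += 1
        a_ij = alpha * arity
        for c in counts.values():
            n_ij = sum(c.values())
            score += math.lgamma(a_ij) - math.lgamma(a_ij + n_ij)
            for k in range(arity):
                score += math.lgamma(alpha + c[k]) - math.lgamma(alpha)
    return score

data = [{"S": s, "B": b} for s, b in
        [(1, 1)] * 30 + [(1, 0)] * 10 + [(0, 1)] * 10 + [(0, 0)] * 30]
print(log_bde(data, {"S": (), "B": ()}))      # independent structure
print(log_bde(data, {"S": (), "B": ("S",)}))  # S -> B; higher score on this data
```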