MANG 6554 Advanced Analytics
Libo Li, 2021-2022
Libo.li@soton.ac.uk

Slide 2: Learning objectives
Gain a basic understanding of Bayes' theorem
Get familiar with Bayesian classifiers (Naïve Bayes and Bayesian belief networks)
Develop a philosophical understanding of Bayesian inference and its relation to statistical model formulations
Reference: Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Pearson Education India, 2016.

Slide 3: Bayes Classifier
A probabilistic framework for solving classification problems
Conditional probability:
  P(Y | X) = P(X, Y) / P(X)
  P(X | Y) = P(X, Y) / P(Y)
Bayes theorem:
  P(Y | X) = P(X | Y) P(Y) / P(X)

Slide 4: Example of Bayes Theorem
Given:
  A salesperson knows that if a customer has already bought a keyboard, there is a 50% chance he/she will buy a mouse: P(m | k) = 0.5
  The prior probability of any customer buying a keyboard is P(k) = 1/50
  The prior probability of any customer buying a mouse is P(m) = 1/20
If a customer has bought a mouse, what is the probability he/she buys a keyboard?
  P(k | m) = P(m | k) P(k) / P(m) = 0.5 × (1/50) / (1/20) = 0.2

Slide 5: Using Bayes Theorem for Classification
Consider each attribute and the class label as random variables
Given a record with attributes (X1, X2, …, Xd), the goal is to predict the class Y
Specifically, we want to find the value of Y that maximizes P(Y | X1, X2, …, Xd)
Can we estimate P(Y | X1, X2, …, Xd) directly from the data?

Slide 6: Example Data
Given a test record: X = (Refund = No, Divorced, Income = 120K)
Can we estimate P(Evade = Yes | X) and P(Evade = No | X)?
In the following we abbreviate Evade = Yes as Yes and Evade = No as No.

Id  Refund  Marital Status  Taxable Income (K)  Evade (Y)
 1  Yes     Single          125                 No
 2  No      Married         100                 No
 3  No      Single           70                 No
 4  Yes     Married         120                 No
 5  No      Divorced         95                 Yes
 6  No      Married          60                 No
 7  Yes     Divorced        220                 No
 8  No      Single           85                 Yes
 9  No      Married          75                 No
10  No      Single           90                 Yes

Slide 7: Using Bayes Theorem for Classification
Approach: compute the posterior probability P(Y | X1, X2, …, Xd) using Bayes theorem
Maximum a posteriori: choose the value of Y that maximizes
P(Y | X1, X2, …, Xd)
This is equivalent to choosing the value of Y that maximizes P(X1, X2, …, Xd | Y) P(Y), since by Bayes theorem
  P(Y | X1, X2, …, Xd) = P(X1, X2, …, Xd | Y) P(Y) / P(X1, X2, …, Xd)
How do we estimate P(X1, X2, …, Xd | Y)?

Slide 8: Example Data
Given a test record: X = (Refund = No, Divorced, Income = 120K)
(Training data as on slide 6.)

Slide 9: Naïve Bayes Classifier
Assume independence among the attributes Xi when the class is given:
  P(X1, X2, …, Xd | Yj) = P(X1 | Yj) P(X2 | Yj) … P(Xd | Yj)
Now we can estimate P(Xi | Yj) for all Xi and Yj combinations from the training data
A new point is classified as Yj if P(Yj) ∏ P(Xi | Yj) is maximal

Slide 10: Naïve Bayes on Example Data
Given a test record: X = (Refund = No, Divorced, Income = 120K)
  P(X | Yes) = P(Refund = No | Yes) × P(Divorced | Yes) × P(Income = 120K | Yes)
  P(X | No) = P(Refund = No | No) × P(Divorced | No) × P(Income = 120K | No)

Slide 11: Estimate Probabilities from Data
Class prior: P(Y) = Nc / N
  e.g., P(No) = 7/10, P(Yes) = 3/10
For categorical attributes: P(Xi | Yk) = |Xik| / Nc
  where |Xik| is the number of instances having attribute value Xi and belonging to class Yk
  Examples: P(Status = Married | No) = 4/7, P(Refund = Yes | Yes) = 0

Slide 12: Estimate Probabilities from Data
For continuous attributes:
– Discretization: partition the range into bins and replace the continuous value with the bin value
(the attribute changes from continuous to ordinal)
– Probability density estimation: assume the attribute follows a normal distribution and use the data to estimate the parameters of the distribution (e.g., mean and standard deviation); once the probability distribution is known, use it to estimate the conditional probability P(Xi | Y)

Slide 13: Estimate Probabilities from Data
Normal distribution, one for each (Xi, Yj) pair:
  P(Xi | Yj) = 1 / (sqrt(2π) σij) × exp(−(Xi − μij)² / (2 σij²))
For (Income, Class = No): sample mean = 110, sample variance = 2975, so
  P(Income = 120 | No) = 1 / (sqrt(2π) × 54.54) × exp(−(120 − 110)² / (2 × 2975)) = 0.0072

Slide 14: Example of Naïve Bayes Classifier
Given a test record: X = (Refund = No, Divorced, Income = 120K)
Naïve Bayes classifier estimated from the training data:
  P(Refund = Yes | No) = 3/7, P(Refund = No | No) = 4/7
  P(Refund = Yes | Yes) = 0, P(Refund = No | Yes) = 1
  P(Marital Status = Single | No) = 2/7, P(Marital Status = Divorced | No) = 1/7, P(Marital Status = Married | No) = 4/7
  P(Marital Status = Single | Yes) = 2/3, P(Marital Status = Divorced | Yes) = 1/3, P(Marital Status = Married | Yes) = 0
  For Taxable Income: if class = No, sample mean = 110 and sample variance = 2975; if class = Yes, sample mean = 90 and sample variance = 25
  P(No) = 7/10, P(Yes) = 3/10
Then:
  P(X | No) = P(Refund = No | No) × P(Divorced | No) × P(Income = 120K | No) = 4/7 × 1/7 × 0.0072 = 0.0006
  P(X | Yes) = P(Refund = No | Yes) × P(Divorced | Yes) × P(Income = 120K | Yes) = 1 × 1/3 × 1.2 × 10^-9 = 4 × 10^-10
Since P(X | No) P(No) > P(X | Yes) P(Yes), we have P(No | X) > P(Yes | X), so Class = No
Recall: P(Yes | X) = P(X | Yes) P(Yes) / P(X), P(No | X) = P(X | No) P(No) / P(X), and P(Yes | X) + P(No | X) = 1

Slide 15: Example of Naïve Bayes Classifier
Given the same test record and the estimates from slide 14:
  P(Yes) = 3/10, P(No) = 7/10
  P(Yes | Divorced) = (1/3 × 3/10) / P(Divorced)
  P(No | Divorced) = (1/7 × 7/10) / P(Divorced)
  P(Yes | Refund = No, Divorced) = (1 × 1/3 × 3/10) /
P(Divorced, Refund = No)
  P(No | Refund = No, Divorced) = (4/7 × 1/7 × 7/10) / P(Divorced, Refund = No)

Slide 16: Issues with Naïve Bayes Classifier
Using the estimates from slide 14, with P(Yes) = 3/10 and P(No) = 7/10:
  P(Yes | Married) = (0 × 3/10) / P(Married) = 0
  P(No | Married) = (4/7 × 7/10) / P(Married)
Because P(Marital Status = Married | Yes) = 0, a record with Marital Status = Married can never be classified as Yes

Slide 17: Issues with Naïve Bayes Classifier
Consider the same training data with Tid = 7 deleted. The re-estimated probabilities are:
  P(Refund = Yes | No) = 2/6, P(Refund = No | No) = 4/6
  P(Refund = Yes | Yes) = 0, P(Refund = No | Yes) = 1
  P(Marital Status = Single | No) = 2/6, P(Marital Status = Divorced | No) = 0, P(Marital Status = Married | No) = 4/6
  P(Marital Status = Single | Yes) = 2/3, P(Marital Status = Divorced | Yes) = 1/3, P(Marital Status = Married | Yes) = 0/3
  For Taxable Income: if class = No, sample mean = 91 and sample variance = 685; if class = Yes, sample mean = 90 and sample variance = 25
Given X = (Refund = Yes, Divorced, 120K):
  P(X | No) = 2/6 × 0 × 0.0083 = 0
  P(X | Yes) = 0 × 1/3 × 1.2 × 10^-9 = 0
Naïve Bayes will not be able to classify X as Yes or No!
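The worked examples on slides 14 and 17 can be reproduced with a short script. This is an illustrative sketch written for these notes, not code from the lecture; the function names and data layout are my own. It estimates the class-conditional probabilities by simple fractions plus a fitted normal density for Income, and shows both the Class = No decision and the zero-probability failure once Tid = 7 is deleted.

```python
import math

# Training data from the slides: (Refund, Marital Status, Taxable Income, Evade)
data = [
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes"),
]

def gaussian(x, mean, var):
    """Normal density used for the continuous Income attribute (slide 13)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def score(records, refund, marital, income, cls):
    """P(X | class) * P(class), with simple-fraction estimates (no smoothing)."""
    rows = [r for r in records if r[3] == cls]
    n = len(rows)
    p_refund  = sum(r[0] == refund  for r in rows) / n
    p_marital = sum(r[1] == marital for r in rows) / n
    incomes = [r[2] for r in rows]
    mean = sum(incomes) / n
    var = sum((v - mean) ** 2 for v in incomes) / (n - 1)   # sample variance
    return p_refund * p_marital * gaussian(income, mean, var) * n / len(records)

# Slide 14: X = (Refund = No, Divorced, 120K) -> Class = No
s_no  = score(data, "No", "Divorced", 120, "No")
s_yes = score(data, "No", "Divorced", 120, "Yes")
print(s_no > s_yes)   # True: P(X | No) P(No) > P(X | Yes) P(Yes)

# Slide 17: delete Tid = 7; X = (Refund = Yes, Divorced, 120K) then scores 0 for both classes
reduced = data[:6] + data[7:]
print(score(reduced, "Yes", "Divorced", 120, "No"),
      score(reduced, "Yes", "Divorced", 120, "Yes"))
```

For the first record the No-class factors come out as 4/7, 1/7 and a density of about 0.0072, matching the slide's product of roughly 0.0006 before multiplying by the prior.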
Slide 18: Issues with Naïve Bayes Classifier
If one of the conditional probabilities is zero, the entire expression becomes zero
We therefore need estimates of the conditional probabilities other than simple fractions
Probability estimation:
  Original:   P(Ai | C) = Nic / Nc
  Laplace:    P(Ai | C) = (Nic + 1) / (Nc + c)
  m-estimate: P(Ai | C) = (Nic + m p) / (Nc + m)
where c is the number of classes, p is a prior probability, m is a parameter, Nc is the number of instances in class C, and Nic is the number of instances having attribute value Ai in class C

Slide 19: Example of Naïve Bayes Classifier

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

Test record A: Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no
With M = mammals and N = non-mammals:
  P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
  P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
  P(A | M) P(M) = 0.06 × 7/20 = 0.021
  P(A | N) P(N) = 0.004 × 13/20 = 0.0027
Since P(A | M) P(M) > P(A | N) P(N), classify A as a mammal

Slide 20: Naïve Bayes (Summary)
Robust to isolated noise points
Handles missing values by ignoring the instance during the probability estimate calculations
Robust to irrelevant attributes
The independence assumption may not
hold for some attributes
– Use other techniques such as Bayesian Belief Networks (BBN)

Slide 21: Conditional Independence
X and Y are conditionally independent given Z if P(X | Y, Z) = P(X | Z)
Example: arm length and reading skills
– A young child has a shorter arm length and more limited reading skills than an adult
– If age is fixed, there is no apparent relationship between arm length and reading skills
– Arm length and reading skills are conditionally independent given age

Slide 22: Bayesian Belief Networks
Provide a graphical representation of the probabilistic relationships among a set of random variables
Consist of:
– A directed acyclic graph (DAG), where a node corresponds to a variable and an arc corresponds to a dependence relationship between a pair of variables
– A probability table associating each node with its immediate parents
(Figure: a small example DAG over nodes A, B, C.)

Slide 23: Conditional Independence
A node in a Bayesian network is conditionally independent of all of its non-descendants if its parents are known
(Figure: an example graph over nodes A, B, C, D in which D is a parent of C, A is a child of C, B is a descendant of D, and D is an ancestor of A.)

Slide 24: Conditional Independence
The Naïve Bayes assumption as a Bayesian network: the class y is the single parent of the attributes X1, X2, X3, X4, …, Xd

Slide 25: Probability Tables
If X does not have any parents, its table contains the prior probability P(X)
If X has only one parent Y, its table contains the conditional probability P(X | Y)
If X has multiple parents (Y1, Y2, …, Yk), its table contains the conditional probability P(X | Y1, Y2, …, Yk)

Slide 26: Example of Bayesian Belief Network
DAG: Exercise and Diet are the parents of Heart Disease; Heart Disease is the parent of Chest Pain and Blood Pressure
  P(Exercise = Yes) = 0.7, P(Exercise = No) = 0.3
  P(Diet = Healthy) = 0.25, P(Diet = Unhealthy) = 0.75
  P(HD = Yes | E = Yes, D = Healthy) = 0.25, P(HD = Yes | E = Yes, D = Unhealthy) = 0.45
  P(HD = Yes | E = No, D = Healthy) = 0.55, P(HD = Yes | E = No, D = Unhealthy) = 0.75
  (P(HD = No | E, D) is one minus the corresponding value)
  P(CP = Yes | HD = Yes) = 0.8, P(CP = Yes | HD = No) = 0.01
  P(BP = High | HD = Yes) = 0.85, P(BP = High | HD = No) = 0.2

Slide 27: Example of Inferencing using BBN
Given X = (E = No, D = Healthy, CP = Yes, BP = High), compute P(HD | E, D, CP, BP)
  P(HD = Yes | E = No, D = Healthy) = 0.55
  P(CP = Yes | HD = Yes) = 0.8
  P(BP = High | HD = Yes) = 0.85
  P(HD = Yes | E = No, D = Healthy, CP = Yes, BP = High) ∝ 0.55 × 0.8 ×
0.85 = 0.374
  P(HD = No | E = No, D = Healthy) = 0.45
  P(CP = Yes | HD = No) = 0.01
  P(BP = High | HD = No) = 0.2
  P(HD = No | E = No, D = Healthy, CP = Yes, BP = High) ∝ 0.45 × 0.01 × 0.2 = 0.0009
So classify X as HD = Yes
See also: https://support.sas.com/resources/papers/proceedings14/SAS400-2014.pdf

Slide 28: Bayesian inference
From Bayes theorem, we recall:
  P(Y | X) = P(X | Y) P(Y) / P(X)
If we generalize to a parameter θ and data D, the following holds:
  P(θ | D) = P(D | θ) P(θ) / P(D) ∝ P(D | θ) P(θ)
  Posterior ∝ prior × likelihood

Slide 29: Bayesian inference
In a simple example, take a sales record D = {1, 2, 3, …}
We could estimate a simple parametric model from D; assume a normal distribution N(μ, σ²)
  P(θ | D): given the data, what are the parameters θ = (μ, σ²)?
  P(D | θ): given the parameters, what is the likelihood of the data?
Or, more generally, we could test a set of θ1, θ2, … and see how the likelihood P(D | θi) changes under the different parameter settings
This is an easy question if you know how to calculate the sample mean and variance

Slide 30: Bayesian inference
  P(θ | D): given the data, what are the parameters?
  P(D | θ): given the parameters, what is the likelihood of the data?
Or, more generally, test a set of θ1, θ2, … and see how the likelihood P(D | θi) changes
  P(θ | D) ∝ P(D | θ) P(θ)
This is a difficult question if you do not know how to calculate the posterior

Slide 31: Bayesian inference
  P(θ | D) = P(θ, D) / P(D) = P(D | θ) P(θ) / P(D) ∝ P(D | θ) P(θ)
Formulate a prior distribution P(θ) to express your beliefs about θ
Often we do not know the exact form of the posterior distribution, hence simulation techniques (Markov chain Monte Carlo, MCMC) are needed to draw samples that approximate the posterior distribution from a target distribution
Popular MCMC algorithms: Metropolis–Hastings, Gibbs sampling, Thompson sampling, …
Reference: Bolstad, W.M. and Curran, J.M., 2016. Introduction to Bayesian Statistics. John Wiley & Sons.

Slide 32: Bayesian inference
A Bayesian linear regression case:
  y = α + βx + ε,  ε ~ N(0, σ²)
Inference over α, β, and σ² allows us to construct the regression model.
Often there are hierarchies within the parametric structure, e.g.:
  P(α, β | σ²) ∝ 1,  P(σ²) ∝ 1/σ²
  yi | α, β, xi, σ² ~ N(α + β xi, σ²)
  P(yi | α, β, xi, σ²) = 1 / sqrt(2π σ²) × exp(−(yi − (α + β xi))² / (2σ²))

Slide 33: Bayesian vs frequentist
Should we trust the p-value?
Uncertainty in decision making, e.g., hypothesis testing
https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading20.pdf
Hackenberger, B.K., 2019. Bayes or not Bayes, is this the question? Croatian Medical Journal, 60(1), p.50.
https://cxl.com/blog/bayesian-frequentist-ab-testing/
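To make the MCMC idea from slides 31-32 concrete, here is a minimal Metropolis-Hastings sketch written for these notes (not lecture code). The toy "sales record", the flat prior on μ, the known σ, and the proposal width of 0.5 are all assumptions chosen for illustration; with a flat prior, the sampled posterior mean of μ should sit close to the sample mean of the data.

```python
import math
import random

random.seed(0)

# Toy "sales record" D; sigma assumed known and a flat prior on mu,
# so the posterior is proportional to the likelihood P(D | mu).
D = [1.0, 2.0, 3.0, 2.5, 1.5, 2.2, 2.8]
sigma = 1.0

def log_likelihood(mu):
    """log P(D | mu) for the normal model N(mu, sigma^2), up to a constant."""
    return sum(-((d - mu) ** 2) / (2 * sigma ** 2) for d in D)

# Metropolis-Hastings with a symmetric random-walk proposal
mu, samples = 0.0, []
for step in range(20000):
    proposal = mu + random.gauss(0, 0.5)           # proposal width is an arbitrary choice
    # accept with probability min(1, posterior(proposal) / posterior(mu))
    if math.log(random.random()) < log_likelihood(proposal) - log_likelihood(mu):
        mu = proposal
    if step >= 2000:                               # discard burn-in samples
        samples.append(mu)

posterior_mean = sum(samples) / len(samples)
# With a flat prior, the posterior mean should be close to the sample mean of D
print(posterior_mean, sum(D) / len(D))
```

Gibbs sampling would instead update each parameter from its full conditional distribution; for the hierarchical regression on slide 32 the same random-walk idea applies jointly to (α, β, σ²).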