Programming Case Study: ALY-6020

ALY-6020 Predictive Analytics: Generalized Linear Models
Ajit Appari, Ph.D., M.Tech., B.Tech.
College of Professional Studies, Northeastern University
Email: a.appari@northeastern.edu
November 10, 2021

Linear Regression: The Classic Bivariate Model
A quick review of the linear regression model before discussing GLM.
If the red line/curve is the fitted line, which one is a linear regression model?
[Figure: candidate fits, each with its fitted line/curve in red]

Multiple Linear Regression Model
For a multivariable problem (three or more variables), i.e., the response as a function of two or more predictors:
$$y_i = \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_p x_{i,p} + e_i$$
where $i = 1, \dots, n$ (sample size), $p$ is the number of predictors, and $e_i$ is the random error for the i-th observation.
- Often $x_{i,1} = 1$, implying $\beta_1$ is the constant term.
- All n equations stacked together as a matrix: $y = X\beta + e$

Linear Regression: The Classic Model
[Figure: the linear regression model with its parameters annotated]
NOTE-1: The model is linear in parameters, irrespective of the nature of the predictors.
NOTE-2: The Normal distribution assumption is for the error, NOT for Y.
Predictors can be:
- Quantitative inputs
- Transformations, e.g. ln(x), sin(x)
- Indicator [0/1] or categorical variables
- Polynomial terms, e.g. $x^2$, $x^3$
- Interactions, e.g. $X_3 = X_1 \cdot X_2$

Linear Regression: The Classic Model
- The model is linear in parameters.
  - Predictors can have a non-linear relationship with the response variable, and yet such a relationship can be estimated using linear regression.
- The error follows a Normal distribution with zero mean.
  - This does not imply that the response variable has to follow a Normal distribution (a common mistake among analytics professionals).
- Predictors are not related to the error.
  - In practice this is difficult to avoid because of missing variables.
- Predictors are not correlated with each other {no multicollinearity}.
  - In practice, most systems have correlated predictors.
- Errors are independent and identically distributed.
  - Errors across observations are uncorrelated {cross-sectional/longitudinal}.
  - Errors are homoscedastic {error variance does not vary with predictor values}.

Homoscedastic vs. Heteroscedastic Errors
[Figure: homoscedastic vs. heteroscedastic errors]

Linear Regression Estimation: The OLS Approach
Estimation approach: Ordinary Least Squares (OLS)
Bivariate linear regression model (simple version):
- Fitted line: $\hat{y}_i = \hat{\beta}_1 + \hat{\beta}_2 x_i$
- Residual or prediction error = deviation of the observed values from their predicted values on the fitted line: $e_i = y_i - \hat{y}_i = y_i - \hat{\beta}_1 - \hat{\beta}_2 x_i$
- Minimize RSS: $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
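As a rough illustration of the OLS idea above, the sketch below fits a bivariate line by the standard closed-form least-squares solution and evaluates the RSS. It is written in Python with the standard library only; the function names `fit_ols` and `rss` and the data values are made up for illustration and do not come from the slides.

```python
def fit_ols(x, y):
    """Return (intercept, slope) that minimize the residual sum of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Centered cross-product and sum of squares
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def rss(x, y, intercept, slope):
    """Residual sum of squares: sum of (observed - predicted)^2."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_ols(x, y)
print("intercept:", intercept, "slope:", slope)
print("RSS at the fitted line:", rss(x, y, intercept, slope))
# Any other line should give a larger RSS
print("RSS with a shifted intercept:", rss(x, y, intercept + 0.1, slope))
```

Moving the intercept or slope away from the fitted values can only increase the RSS, which is what "least squares" means here.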
Linear Regression Estimation: The OLS Approach
Linear regression model with two covariates
[Figure: the model equation split into its systematic component (the linear combination of covariates) and its random component (the error term)]

Generalized Linear Models

Generalized Linear Model: Why Needed
- OLS estimation is not appropriate if the random component follows a non-normal distribution.
- The maximum likelihood estimation approach is used instead to estimate the model.
Potential scenarios of non-normal regression modeling:
- Binary variable as response {0 or 1}
  - Modeled as Logistic
- Proportion of total cases as response {ranges from 0 to 1}
  - Modeled as Binomial distribution
  - If #cases = 1, same as binary
- Count variable as response {non-negative number}
  - Modeled as Poisson {variance = mean}
  - Modeled as Negative Binomial {variance > mean}
  - Poisson for rates {if the denominator/exposure variable is very large}
- Positive continuous variable, e.g. rates, service time
  - Modeled as Gamma

Generalized Linear Models
[Figure: one of the oldest agricultural research centers, established 1843; birthplace of modern statistical theory and practice]

GLM Framework
Generalized Linear Models (GLM) is a framework that:
- Extends the ordinary linear regression model of a continuous response variable to cases of categorical or discrete variables
- Uses the Maximum Likelihood Estimation approach (default)
- Uses the Iteratively Reweighted Least Squares method (Nelder & Wedderburn 1972)
A GLM has three components:
- Random component: the error component follows the exponential dispersion model family
- Systematic component: the linear predictor of the response
- Link function: links the linear predictor to the expected mean of the response {unique to the GLM}

GLM Framework
Random Component (Error Distribution): Exponential Dispersion Model Family.
The probability function $\mathcal{P}(y; \theta, \phi)$ is defined as
$$\mathcal{P}(y; \theta, \phi) = a(y, \phi) \exp\left\{ \frac{y\theta - \kappa(\theta)}{\phi} \right\}$$
- $\theta$ is called the canonical parameter.
- $\phi > 0$ is the dispersion parameter {similar to $\sigma^2$}.
  - For models whose nominal dispersion is 1 (e.g. Poisson, binomial), the estimated $\phi$ should be very close to 1: under-dispersion if $\phi \ll 1$ and over-dispersion if $\phi \gg 1$.
- $\kappa(\theta)$ is a known function called the cumulant function.
- $a(y, \phi)$ is a normalizing function that ensures $\mathcal{P}(y; \theta, \phi)$ is a probability function, i.e. $\int \mathcal{P}(y; \theta, \phi)\, dy = 1$ for continuous y or $\sum_y \mathcal{P}(y; \theta, \phi) = 1$ for discrete y.

GLM Framework
Members of the Exponential Dispersion Model Family
- Normal {e.g., sample averages when sample sizes are sufficiently large}
- Bernoulli {e.g., yes/no decisions}
- Binomial {e.g., number of successes in n trials; the sum of n Bernoulli trials}
- Categorical or Multinomial Logit {e.g., a customer's race/ethnicity}
- Multinomial {e.g., customer counts in each race/ethnicity out of n customers}
- Exponential {e.g., waiting time in a queue or interarrival time}
- Gamma {e.g., amount of rainfall in a reservoir, waiting time until the k-th customer is served, customer lifetime value, annual health expenditure}
- Poisson {e.g., number of customers in the queue}
- Negative Binomial: for an over-dispersed count variable {e.g., number of hospital visits in a year}; a generalization of the Poisson distribution

GLM Framework
Systematic Component (Linear Predictor)
$$\eta = o + X\beta$$
- Conditional expectation: $\mu = E[y \mid x_1, x_2, \dots, x_p]$; parameter vector: $\beta = (\beta_0, \beta_1, \beta_2, \dots, \beta_p)$
- $o$ is the offset, a term known a priori; it commonly occurs in Poisson GLMs but may appear in any GLM.
- The offset is a measure of the exposure (a.k.a. denominator) variable.
  - Annual birth counts across cities can be modeled as Poisson, but the expected annual count depends on the city's adult population, which is the offset or exposure.
  - The number of workers with lung diseases in various coal mines depends on the number of workers and how long they have worked. The offset or exposure would be the number of person-years.
- $X$ is the model matrix $[1, x_1, x_2, \dots, x_p]$, with the first column fixed at "1" for the intercept parameter $\beta_0$.
- The $x_j$ are explanatory variables, which can include interactions ($x_3 = x_1 x_2$), a quadratic term ($x_5 = x_4 \cdot x_4$), or polynomial terms ($x_5 = x_4 \cdot x_4 \cdot x_4$).

GLM Framework
Link Function $g(\mu)$: connects the conditional mean of the response to the linear predictor
$$g(\mu) = \eta = o + X\beta$$
- Regression parameters are estimated using Maximum Likelihood.
- We skip all the math and focus on examples/applications.

GLM Framework: Canonical Link Functions
[Table: natural link function for select probability distributions of Y. Only the "Range for Y" entries survive: 0, 1, 2, …, ∞; (0, 1) or (0, 1, …, n)/n; (−∞, ∞); (0, ∞); (0, N).]
Negative Binomial distribution: the number k of successes [or failures] before r failures [or successes] have occurred; alternatively, the number n of trials required for r successes or failures. When $p \to 0$ or 1

GLM: Multiple Linear Regression in R
Function: glm() from the stats package {part of base R}

    glm(formula, data, subset, na.action, weights, offset, family = gaussian(link = "identity"), start = NULL, control = glm.control(…), model = TRUE, y = TRUE, x = FALSE, …)

- formula: specifies the model, e.g. y ~ x1 + x2 + x3 + offset(x4)
- data: data frame
- subset: used if only a subset of the data needs to be regressed, e.g. regress the data when gender = F
- na.action: a function that says what to do with missing observations, e.g. remove them (na.omit); imputing them needs a user-supplied function
- weights: specify to perform a weighted GLM; useful when the response is observed over some type of denominator, e.g., varying time length, varying sample size, selection bias (probability of the observation being in the sample) {Heckman's correction}
- offset: same as the offset() term in the formula above
- glm.control(…): used to set the parameters that control the fitting process
  - epsilon = convergence tolerance
  - maxit = maximum number of IWLS (iteratively reweighted least squares) steps
  - trace = FALSE {set to TRUE to produce output at each iteration}
- model, y, x: logical values indicating whether they are to be returned as part of the output

Other GLM Packages
- Package glmx: Generalized Linear Models Extended
- Package glmnet: Lasso and Elastic-net Regularized Generalized Linear Models
- Package glmertree: Generalized Linear Mixed Model Trees
- Package biglm: big data analysis; Bounded Memory Linear and Generalized Linear Models
- Package fishMod: specialized GLMs on count data; Poisson-Sum-of-Gamma GLMs, Tweedie GLMs, and Delta Log-Normal GLMs
- Package hglm: Hierarchical Generalized Linear Models
- Package oglmx: Ordered Generalized Linear Models
- Package pglm: Panel Generalized Linear Models
- Package plsRglm: Partial Least Squares for Generalized Linear Models
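To make the link function, offset, and iteratively reweighted least squares ideas from the GLM framework slides more concrete, here is a minimal sketch of a Poisson GLM with a log link and an exposure offset. It is written in Python with the standard library only, so it is not what R's glm() does internally line for line. The function name `fit_poisson_irls` and the data are hypothetical, and the model is restricted to one predictor plus an intercept so the weighted least-squares step is a 2x2 solve.

```python
import math


def fit_poisson_irls(x, y, exposure, max_iter=25, tol=1e-8):
    """Fit log(mu_i) = log(exposure_i) + b0 + b1 * x_i by IRLS (illustrative only)."""
    offset = [math.log(e) for e in exposure]          # offset o_i, known a priori
    b0, b1 = math.log(sum(y) / sum(exposure)), 0.0    # start from the pooled rate
    for _ in range(max_iter):
        eta = [o + b0 + b1 * xi for o, xi in zip(offset, x)]   # linear predictor
        mu = [math.exp(e) for e in eta]                        # inverse log link
        # Working response (offset removed) and working weights for the log link
        z = [e - o + (yi - m) / m for e, o, yi, m in zip(eta, offset, y, mu)]
        w = mu
        # Weighted least squares of z on [1, x] via the 2x2 normal equations
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        new_b0 = (swxx * swz - swx * swxz) / det
        new_b1 = (sw * swxz - swx * swz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


# Hypothetical data, constructed so that count = exposure * 0.1 * 2**x exactly
x = [0.0, 1.0, 2.0, 3.0]
exposure = [100.0, 200.0, 150.0, 50.0]   # e.g. person-years
y = [10.0, 40.0, 60.0, 40.0]             # observed counts

b0, b1 = fit_poisson_irls(x, y, exposure)
print("baseline rate exp(b0):", math.exp(b0))
print("rate ratio per unit of x, exp(b1):", math.exp(b1))
```

Because the counts here were built from a baseline rate of 0.1 that doubles with each unit of x, the fit should recover approximately those values. With real data the estimates would not match any "true" rate exactly, and a production fit would also need safeguards (step halving, checks for zero weights) that this sketch leaves out.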