Logistic Regression

Description
This presentation describes logistic regression with the help of examples.

Logistic Regression

In many ways logistic regression is like ordinary regression. It requires a dependent variable, y, and one or more independent variables.


Logistic regression can be used to model situations in which the dependent variable, y, may only assume two discrete values, such as 0 and 1.
The ordinary multiple regression model is not applicable.


In logistic regression, the interest is in predicting which of two possible events is going to happen, given certain other information. For example, in political science, logistic regression could be used to analyze the factors that determine whether an individual participates in a general election or not.

Another example: Failing or Passing an exam
• Let us define a variable 'Outcome':
  Outcome = 0 if the individual fails the exam
  Outcome = 1 if the individual passes the exam
• We can reasonably assume that failing or passing an exam depends on the number of hours we spend studying.
• Note that in this case, the dependent variable takes only two possible values. We will call it a 'dichotomous' variable.

Linear Probability Models (LPM)
• Our dataset contains information about 14 students.
• Our statistical software (SPSS) will happily perform a linear regression of Outcome on the quantity of study hours.

Student id   Outcome   Quantity of Study Hours
 1           0          3
 2           1         34
 3           0         17
 4           0          6
 5           0         12
 6           1         15
 7           1         26
 8           1         29
 9           0         14
10           1         58
11           0          2
12           1         31
13           1         26
14           0         11
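The deck runs this regression in SPSS; as a quick cross-check, the same least-squares slope can be computed by hand. A minimal Python sketch using the data from the table above:

```python
# Replicating the LPM fit of Outcome on study hours with closed-form
# ordinary least squares (a sketch in Python; the deck itself uses SPSS).
hours   = [3, 34, 17, 6, 12, 15, 26, 29, 14, 58, 2, 31, 26, 11]
outcome = [0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(outcome) / n

# Simple-regression estimate: b1 = Sxy / Sxx
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, outcome))
sxx = sum((x - mean_x) ** 2 for x in hours)
b1 = sxy / sxx

print(f"slope = {b1:.6f}")  # ~0.026219, matching the SPSS coefficient for HSTUDY
```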

Linear Probability Models (LPM) – What is wrong with them?
• Let us do a scatter plot and insert the regression line:
• The probability of Outcome = 1 can take values between 0 and 1.
• But we do not observe probabilities, only the actual event happening.
• A straight line will predict values between negative and positive infinity, outside the [0,1] interval!
[Scatter plot: OUTCOME (-0.2 to 1.2) against HSTUDY (0 to 60), with the fitted regression line]

What is wrong with LPM?
Coefficients(a)
Model 1        B          Std. Error   Sig.
(Constant)     0.031861   0.161591     0.846994
HSTUDY         0.026219   0.006483     0.001627
a. Dependent Variable: OUTCOME

• Above is the SPSS output for the linear regression of 'Outcome' on Hours of Study.
• The results suggest that each additional hour of studying increases the probability of passing the exam, on average, by approx. 0.026, or 2.6%.
• So what would the model predict if we studied 100 hours for the exam?
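The question above is quickly answered with the coefficients from the SPSS table. A Python sketch:

```python
# LPM prediction for 100 hours of study, using the SPSS coefficients above.
b0, b1 = 0.031861, 0.026219
p_hat = b0 + b1 * 100
print(f"predicted 'probability' = {p_hat:.2f}")  # 2.65 -- far outside [0, 1]
```

A "probability" of 2.65 is meaningless, which is exactly the problem the next slide names.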

Linear Probability Models (LPM) – What is wrong with them?
• Basically, the linear relation we had postulated before between X and Y is not appropriate when our dependent variable is dichotomous. Predictions for the probability of the event occurring would lie outside the [0,1] interval, which is unacceptable.

Non Linear Probability Models
• We want to be able to model the probability of the event occurring with an explanatory variable X, but we want the predicted probability to remain within the [0,1] bounds.
• We will focus on the Logistic.
• The Logistic Curve will relate the explanatory variable X to the probability of the event occurring. In our example, it will relate the number of study hours to the probability of passing the exam.

The Logit Model
• A Logit Model states that:
– Prob(Yi = 1) = F(b0 + b1 Xi)
– Prob(Yi = 0) = 1 – F(b0 + b1 Xi)
– where F(.) is the 'Logistic Function'
– So, the probability of the event occurring is a logistic function of the independent variables

• Therefore, we will be interested in finding estimates for b0 and b1 so that the Logistic Function best fits the data

The logistic function

π̂i = e^(b0 + b1 Xi) / (1 + e^(b0 + b1 Xi))

• where π̂i is the estimated probability.
• Equivalently, in logit form, the logistic regression equation is linear: log[π̂ / (1 – π̂)] = b0 + b1 X
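The curve above can be written as a small helper function. A Python sketch showing that its output always stays inside (0, 1):

```python
import math

def logistic(x, b0, b1):
    """F(b0 + b1*x): maps any real number into the interval (0, 1)."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

# However extreme x is, the output stays between 0 and 1.
print(logistic(-50, 0.0, 1.0))  # close to 0
print(logistic(0.0, 0.0, 1.0))  # exactly 0.5
print(logistic(50, 0.0, 1.0))   # close to 1
```

This is precisely the property the straight line of the LPM lacked.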

Example
• Consider the data on the CGPA (up to the first semester of the second year of the MBA) of 20 MBA students, and their success in the first interview for placement. Pass is coded as 1 and Fail as 0. The data are in the file logistic_regression.sav.

Coefficients(a)
Model 1        B         Std. Error   Standardized Beta   t        Sig.
(Constant)     -5.530    2.884                            -1.917   .071
CGPA           1.870     .872         .451                2.144    .046
a. Dependent Variable:

A simple linear regression will give the model: Y = –5.530 + 1.870 x
This equation is not feasible, as it does not always yield probabilities between 0 and 1 (e.g., for x = 3.8, Y = 1.58).

We proceed with logistic regression using SPSS: Analyze – Regression – Binary Logistic. Dependent = Result; Covariate = CGPA.

Variables in the Equation
Step 1(a)    B          S.E.      Wald     df   Sig.   Exp(B)
CGPA         9.781      5.299     3.408    1    .065   1.770E4
Constant     -31.528    17.336    3.307    1    .069   .000
a. Variable(s) entered on step 1: CGPA.

A logit regression will give the model: Y = 9.781 x – 31.528
If p is the probability of passing the interview, we have:
Y = log [p / (1 – p)] = 9.781 x – 31.528
p = exp(9.781 x – 31.528) / [1 + exp(9.781 x – 31.528)]
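Solved for p, the fitted model can be evaluated directly. A Python sketch; the CGPA values below are illustrative, not observations from the .sav file:

```python
import math

def pass_probability(cgpa):
    """p = exp(9.781*x - 31.528) / (1 + exp(9.781*x - 31.528))."""
    z = 9.781 * cgpa - 31.528
    return math.exp(z) / (1 + math.exp(z))

# Illustrative CGPA values (hypothetical, not from the dataset):
for cgpa in (3.0, 3.2, 3.5):
    print(f"CGPA {cgpa}: P(pass) = {pass_probability(cgpa):.3f}")
```

Note how steep the curve is: the passing probability rises from about 10% at CGPA 3.0 to about 94% at CGPA 3.5.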

Example
• A systems analyst studied the effect of computer programming experience on the ability to complete a task within a specified time.
• Twenty-five persons were selected for the study, with varying amounts of computer experience (in months).
• Results are coded in binary fashion: Y = 1 if the task was completed successfully; Y = 0 otherwise.

Example: continued
• Results from a standard package provide the following estimates: b0 = –3.0597 and b1 = 0.1615
• Estimated logistic regression function:
  π̂ = exp(–3.0597 + 0.1615 X) / [1 + exp(–3.0597 + 0.1615 X)]
• For example, the fitted value for X = 14 is:
  π̂ = exp(–3.0597 + 0.1615(14)) / [1 + exp(–3.0597 + 0.1615(14))] = .31
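The fitted value for X = 14 can be verified directly. A quick Python check:

```python
import math

b0, b1 = -3.0597, 0.1615  # estimates reported for the task-completion study

def p_complete(x):
    """Estimated probability of completing the task after x months of experience."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

print(round(p_complete(14), 2))  # 0.31, the fitted value quoted above
```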


