Description
This is a PPT describing regression and correlation, with worked examples.
SIMPLE REGRESSION AND CORRELATION
Statistical summaries such as the average
and variability are usually adequate when
we have univariate data.
But for bivariate data these are not sufficient; it is also important to examine the relationship between the two variables.
SIMPLE REGRESSION AND CORRELATION
There are three objectives when studying the relationship between two variables (bivariate data):
1. Describing and understanding the relationship.
2. Forecasting and predicting a new observation.
3. Adjusting and controlling a process.
Two basic tools for summarizing bivariate data:
1. Regression analysis
2. Correlation analysis
SIMPLE REGRESSION AND CORRELATION
Regression analysis is used to predict or control the
value of one variable on the basis of other variables.
The technique involves developing a mathematical
equation that describes the relationship between the
variable to be forecast ( dependent variable ) and the
independent variable that is believed to be related to
the dependent variable.
It is helpful to explore the data using a scatter plot or scatter diagram.
This diagram provides two types of information:
1. Visually, the pattern indicates whether a relationship exists (if any).
2. If the variables are related, we can see what kind of line or equation describes the relationship.
Simple Linear Regression Model
The equation that describes how y is related to x and an error term is called the regression model. The simple linear regression model is:

y = a + bx + ε

where:
a and b are called parameters of the model,
ε is a random variable called the error term.
Assumptions About the Error Term ε
1. The error ε is a random variable with mean of zero.
2. The variance of ε, denoted by σ², is the same for all values of the independent variable.
3. The values of ε are independent.
4. The error ε is a normally distributed random variable.
SIMPLE REGRESSION AND CORRELATION
Estimating Using the Regression Line
First, let's look at the equation of a straight line:

Y = a + bX

where Y is the dependent variable, X is the independent variable, a is the Y-intercept, and b is the slope of the line.
SIMPLE REGRESSION AND CORRELATION
The Method of Least Squares
To estimate the straight line we use the method of least squares.
This method minimizes the sum of squared errors between the estimated points on the line and the actual observed points.
SIMPLE REGRESSION AND CORRELATION
The estimating line:

\hat{Y} = a + bX

Slope of the best-fitting regression line:

b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^{2} - \left(\sum X\right)^{2}}

Y-intercept of the best-fitting regression line:

a = \bar{Y} - b\bar{X}
SIMPLE REGRESSION AND CORRELATION
Suppose an appliance store conducts a
five-month experiment to determine
the effect of advertising on sales revenue.
The results are shown below.
Month Advertising Exp. ($100s) Sales Rev. ($1000s)
1 1 1
2 2 1
3 3 2
4 4 2
5 5 4
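A scatter diagram for these data can be drawn in a few lines of Python; a minimal sketch using matplotlib (the variable names are ours, not from the slides):

```python
import matplotlib.pyplot as plt

# Advertising expenditure ($100s) and sales revenue ($1000s) from the table above
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

plt.scatter(x, y)
plt.xlabel("Advertising expenditure ($100s)")
plt.ylabel("Sales revenue ($1000s)")
plt.title("Scatter diagram: advertising vs. sales")
plt.show()
```

The upward drift of the points suggests a positive linear relationship, which the regression below quantifies.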
SIMPLE REGRESSION AND CORRELATION
X Y X² XY
1 1 1 1
2 1 4 2
3 2 9 6
4 2 16 8
5 4 25 20

ΣX = 15, ΣY = 10, ΣX² = 55, ΣXY = 37

X̄ = 15/5 = 3, Ȳ = 10/5 = 2
SIMPLE REGRESSION AND CORRELATION
b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^{2} - \left(\sum X\right)^{2}} = \frac{5(37) - (15)(10)}{5(55) - (15)^{2}} = 0.7

a = \bar{Y} - b\bar{X} = 2 - 0.7 \times 3 = -0.1

The estimating line is therefore \hat{Y} = -0.1 + 0.7X.
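These computations can be verified with a short Python sketch (a minimal example; the array names are ours):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)  # advertising ($100s)
y = np.array([1, 1, 2, 2, 4], dtype=float)  # sales ($1000s)
n = len(x)

# Slope: b = (n*sum(XY) - sum(X)*sum(Y)) / (n*sum(X^2) - (sum(X))^2)
b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)
# Intercept: a = Ybar - b*Xbar
a = y.mean() - b * x.mean()

print(b, a)  # 0.7 and -0.1, so Y-hat = -0.1 + 0.7X
```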
SIMPLE REGRESSION AND CORRELATION
The standard error of estimate is used to
measure the reliability of the estimating
equation.
It measures the variability or scatter of
the observed values around the regression
line.
SIMPLE REGRESSION AND CORRELATION
Standard Error of Estimate
s_{e} = \sqrt{\frac{\sum\left(Y - \hat{Y}\right)^{2}}{n - 2}}

Short-cut:

s_{e} = \sqrt{\frac{\sum Y^{2} - a\sum Y - b\sum XY}{n - 2}}
SIMPLE REGRESSION AND CORRELATION
The Y² column is 1, 1, 4, 4, 16, so ΣY² = 26.

s_{e} = \sqrt{\frac{\sum Y^{2} - a\sum Y - b\sum XY}{n - 2}} = \sqrt{\frac{26 - (-0.1)(10) - 0.7(37)}{5 - 2}} = 0.6055
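Both forms of the standard error can be checked numerically; a minimal sketch continuing the NumPy example above:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 1, 2, 2, 4], dtype=float)
n, a, b = len(x), -0.1, 0.7  # fitted intercept and slope from above

# Definition: s_e = sqrt(sum((Y - Y-hat)^2) / (n - 2))
y_hat = a + b * x
se_def = np.sqrt(((y - y_hat) ** 2).sum() / (n - 2))

# Short-cut: s_e = sqrt((sum(Y^2) - a*sum(Y) - b*sum(XY)) / (n - 2))
se_short = np.sqrt(((y ** 2).sum() - a * y.sum() - b * (x * y).sum()) / (n - 2))

print(se_def, se_short)  # both are approximately 0.6055
```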
SIMPLE REGRESSION AND CORRELATION
Correlation analysis is used to describe
the degree to which one variable is
linearly related to another.
There are two measures for describing
correlation:
1. The Coefficient of Determination
2. The Coefficient of Correlation
The coefficient of correlation:

r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}

Sample Coefficient of Determination: r²

Alternate formula:

r^{2} = \frac{a\sum Y + b\sum XY - n\bar{Y}^{2}}{\sum Y^{2} - n\bar{Y}^{2}}
SIMPLE REGRESSION AND CORRELATION
r^{2} = \frac{a\sum Y + b\sum XY - n\bar{Y}^{2}}{\sum Y^{2} - n\bar{Y}^{2}} = \frac{(-0.1)(10) + 0.7(37) - 5(2)^{2}}{26 - 5(2)^{2}} = 0.8167
SIMPLE REGRESSION AND CORRELATION
Interpretation:
Therefore, we can conclude that 81.67% of the variation in sales revenue is explained by the variation in advertising expenditure.
The Coefficient of Correlation or
Karl Pearson’s Coefficient of Correlation
The coefficient of correlation is the square
root of the coefficient of determination.
The sign of r indicates the direction of the
relationship between the two variables X
and Y.
The sign of r will be the same as the
sign of the coefficient “b” in the regression
equation Y = a + b X
SIMPLE REGRESSION AND CORRELATION
If the slope of the estimating line is positive, r is the positive square root of r²; if the slope is negative, r is the negative square root:

r = \pm\sqrt{r^{2}}

Here r = \sqrt{0.8167} = 0.9037, and the relationship between the two variables is direct.
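The coefficients of determination and correlation can be verified the same way; a minimal sketch (np.corrcoef is used only as a cross-check):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 1, 2, 2, 4], dtype=float)
n, a, b = len(x), -0.1, 0.7

# Alternate formula: r^2 = (a*sum(Y) + b*sum(XY) - n*Ybar^2) / (sum(Y^2) - n*Ybar^2)
r2 = (a * y.sum() + b * (x * y).sum() - n * y.mean() ** 2) / \
     ((y ** 2).sum() - n * y.mean() ** 2)
r = np.sign(b) * np.sqrt(r2)  # r takes the sign of the slope b

print(r2, r)                    # ~0.8167 and ~0.9037
print(np.corrcoef(x, y)[0, 1])  # Pearson's r, same value
```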
Multiple Regression and Correlation Analysis
Multiple regression uses several independent variables to predict or explain the variation in a dependent variable.
The basic multiple regression model is a first-order model, containing each predictor but no nonlinear terms such as squared values.
In this model, each slope should be interpreted as a partial slope: the predicted effect of a one-unit change in a variable, holding all other variables constant.
Multiple Regression and Correlation Analysis
Estimating Equation Describing the Relationship among Three Variables

\hat{Y} = a + b_{1}X_{1} + b_{2}X_{2}
Testing for Significance: F Test
The F test is used to determine whether a significant
relationship exists between the dependent variable
and the set of all the independent variables.
The F test is referred to as the test for overall
significance.
A separate t test is conducted for each of the
independent variables in the model.
If the F test shows overall significance, the t test is used to determine whether each of the individual independent variables is significant.
Testing for Significance: t Test
We refer to each of these t tests as a test for individual
significance.
Testing for Significance: Multicollinearity
The term multicollinearity refers to the correlation
among the independent variables.
When the independent variables are highly correlated
(say, |r | > .7), it is not possible to determine the
separate effect of any particular independent variable
on the dependent variable.
Every attempt should be made to avoid including
independent variables that are highly correlated.
Autocorrelation
• Often, the data used for regression studies in business and economics are collected over time.
• It is not uncommon for the value of y at one time period to be related to the value of y at previous time periods.
• In this case, we say autocorrelation (or serial correlation) is present in the data.
• When autocorrelation is present, one of the regression assumptions is violated: the error terms are not independent.
Multiple Regression and Correlation Analysis - Example
The department is interested in knowing whether the number of field audits and the computer hours spent on tracking have yielded any results. Further, the department has introduced a reward system for tracking the culprits. Data on actual unpaid taxes for ten cases are considered for analysis. Initially, the regression of Actual Unpaid Taxes (%) on Field Audits (%) and Computer Hours was carried out; as a next step, Reward to Informants (%) was also included as a variable and analyzed. The analysis yielded the following SPSS outputs.
Multiple Regression and Correlation Analysis - Example
(The SPSS outputs appear on the original slides.)
\hat{Y} = a + b_{1}X_{1} + b_{2}X_{2}

Using two independent variables:

\hat{Y} = -13.820 + 0.564X_{1} + 1.099X_{2}

Using three independent variables:

\hat{Y} = -45.796 + 0.597X_{1} + 1.177X_{2} + 0.405X_{3}
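The ten-case tax data are not reproduced on these slides, so the following is only a sketch of how such a first-order model is fit; the arrays are hypothetical placeholders:

```python
import numpy as np

# Hypothetical placeholders for field audits, computer hours, rewards, unpaid taxes
rng = np.random.default_rng(0)
X1, X2, X3 = rng.random(10), rng.random(10), rng.random(10)
Y = rng.random(10)

# Design matrix: a column of ones (intercept) plus the three predictors
A = np.column_stack([np.ones(10), X1, X2, X3])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)  # least-squares fit

a, b1, b2, b3 = coef
print(a, b1, b2, b3)  # with the real data these would be -45.796, 0.597, 1.177, 0.405
```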
Model Assessment
• The model is assessed using three tools:
– The coefficient of determination
– The standard error of estimate
– The F-test of the analysis of variance
• From the printout, R² = 0.983: 98.3% of the variation in actual unpaid taxes is explained by the three independent variables; 1.7% remains unexplained.
• When adjusted for degrees of freedom, Adjusted R² = 1 − [SSE/(n−k−1)] / [SS(Total)/(n−1)] = 97.5%
Coefficient of Determination
• The standard deviation of the error is estimated by the Standard Error of Estimate.
Standard Error of Estimate
• From the printout, s_e = 0.29.
• Question: Can we conclude the model fits the data well?
• We pose the question: Is there at least one independent variable linearly related to the dependent variable?
• To answer the question we test the hypothesis
H₀: β₁ = β₂ = … = β_k = 0
H₁: At least one β_i is not equal to zero.
• If at least one β_i is not equal to zero, the model has some validity.
Testing the Validity of the Model
Multiple Regression and Correlation Analysis
Inferences about the Regression as a Whole
H₀: β₁ = β₂ = β₃ = 0
H₁: At least one β_i ≠ 0
Multiple Regression and Correlation Analysis
Multiple Regression and Correlation Analysis
Test Statistic

F = \frac{SSR / k}{SSE / (n - k - 1)}

Value of the test statistic:

F = \frac{29.109 / 3}{0.491 / 6} = 118.51
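A small sketch reproduces this F statistic (up to the rounding of SSR and SSE in the printout) and the corresponding critical value; scipy is assumed to be available:

```python
from scipy import stats

SSR, SSE = 29.109, 0.491  # sums of squares from the SPSS output
n, k = 10, 3

F = (SSR / k) / (SSE / (n - k - 1))       # MSR / MSE, approximately 118.5
F_crit = stats.f.ppf(0.95, k, n - k - 1)  # critical value at alpha = 0.05

print(F, F_crit, F > F_crit)  # F far exceeds the critical value: reject H0
```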
Multiple Regression and Correlation Analysis
Test of whether a variable is significant.
Test whether reward to informants is a significant
explanatory variable.
H₀: β₃ = 0
H₁: β₃ ≠ 0
Multiple Regression and Correlation Analysis
Multiple Regression and Correlation Analysis
Test statistic, with n − 2 degrees of freedom:

t = \frac{b_{i}}{s_{b_{i}}}

Rejection region:

\left|t\right| > t_{0.05/2,\,8} = 2.306

Value of the test statistic:

t = \frac{0.405}{0.042} = 9.6429
Conclusion:
The test statistic is 9.6429, which falls outside the acceptance region. Equivalently, the observed significance level is essentially zero. Therefore we reject the null hypothesis: reward to informants is a significant explanatory variable.
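A sketch of the same t test, using the coefficient and standard error from the printout and the 8 degrees of freedom quoted on the slide:

```python
from scipy import stats

b3, s_b3 = 0.405, 0.042  # coefficient and its standard error from the printout
t = b3 / s_b3            # ~9.6429

t_crit = stats.t.ppf(1 - 0.05 / 2, 8)  # two-sided critical value, 2.306
print(t, t_crit, abs(t) > t_crit)      # reject H0: the variable is significant
```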
• b₀ = −45.796. This is the intercept, the value of y when all the variables take the value zero. Since the data ranges of the independent variables do not cover the value zero, do not interpret the intercept.
• b₁ = 0.597. In this model, for each additional field audit, actual unpaid taxes increase on average by 0.597% (assuming the other variables are held constant).
Interpreting the Coefficients
• Where to locate a new motor inn?
– La Quinta Motor Inns is planning an
expansion.
– Management wishes to predict which sites are
likely to be profitable.
– Several areas where predictors of profitability
can be identified are:
• Competition
• Market awareness
• Demand generators
• Demographics
• Physical quality
Another Example
Estimating the Coefficients and Assessing the Model, Example
The model relates profitability (Operating Margin) to five groups of site characteristics:
• Competition – Rooms: number of hotel/motel rooms within 3 miles of the site.
• Market awareness – Nearest: distance to the nearest La Quinta inn.
• Demand generators (customers) – Office space and College enrollment.
• Demographics (community) – Income: median household income.
• Physical – Disttwn: distance to downtown.
• Data were collected from 100 randomly selected inns belonging to La Quinta, and the following model was run (Multiple Regr_Margin.sav):

Margin = β₀ + β₁Rooms + β₂Nearest + β₃Office + β₄College + β₅Income + β₆Disttwn
Estimating the Coefficients and
Assessing the Model, Example
Margin Number Nearest Office Space Enrollment Income Distance
55.5 3203 4.2 549 8 37 2.7
33.8 2810 2.8 496 17.5 35 14.4
49 2890 2.4 254 20 35 2.6
31.9 3422 3.3 434 15.5 38 12.1
57.4 2687 0.9 678 15.5 42 6.9
49 3759 2.9 635 19 33 10.8
Regression Analysis, SPSS Output
Margin = 38.139 − 0.008Number + 1.646Nearest + 0.020Office Space + 0.212Enrollment + 0.413Income − 0.225Distance
This is the sample regression equation (sometimes called the prediction equation).
• From the printout, s_e = 5.5121 and R² = 0.525.
• 52.5% of the variation in operating margin is explained by the six independent variables; 47.5% remains unexplained.
• When adjusted for degrees of freedom, Adjusted R² = 49.4%
Coefficient of Determination
• We pose the question: Is there at least one independent variable linearly related to the dependent variable?
• To answer the question we test the hypothesis
H₀: β₁ = β₂ = … = β_k = 0
H₁: At least one β_i is not equal to zero.
• If at least one β_i is not equal to zero, the model has some validity.
Testing the Validity of the Model
• The hypotheses are tested by an ANOVA procedure (see the SPSS output).
Testing the Validity of the La
Quinta Inns Regression Model
ANOVA
            df          SS            MS           F (MSR/MSE)  Significance F
Regression  k = 6       SSR = 3123.8  MSR = 520.6  17.14        0.0000
Residual    n−k−1 = 93  SSE = 2825.6  MSE = 30.4
Total       n−1 = 99    5949.5

where MSR = SSR/k and MSE = SSE/(n−k−1).
F_{α, k, n−k−1} = F_{0.05, 6, 93} = 2.17
F = 17.14 > 2.17
Also, the p-value (Significance F) = 0.0000.
Reject the null hypothesis.
Testing the Validity of the La Quinta Inns Regression Model
Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. At least one of the β_i is not equal to zero. Thus, at least one independent variable is linearly related to y. This linear regression model is valid.
• b₀ = 38.139. This is the intercept, the value of y when all the variables take the value zero. Since the data ranges of the independent variables do not cover the value zero, do not interpret the intercept.
• b₁ = −0.008. In this model, for each additional room within 3 miles of the La Quinta inn, the operating margin decreases on average by 0.008% (assuming the other variables are held constant).
Interpreting the Coefficients
• b₂ = 1.646. In this model, for each additional mile that the nearest competitor is from a La Quinta inn, the operating margin increases on average by 1.646% when the other variables are held constant.
• b₃ = 0.020. For each additional 1,000 sq-ft of office space, the operating margin increases on average by 0.02% when the other variables are held constant.
• b₄ = 0.212. For each additional thousand students, the operating margin increases on average by 0.212% when the other variables are held constant.
Interpreting the Coefficients
• b₅ = 0.413. For each additional $1,000 of median household income, the operating margin increases on average by 0.413% when the other variables remain constant.
• b₆ = −0.225. For each additional mile to the downtown center, the operating margin decreases on average by 0.225% when the other variables are held constant.
Interpreting the Coefficients
• The hypotheses for each β_i are
H₀: β_i = 0
H₁: β_i ≠ 0
with d.f. = n − k − 1 and test statistic

t = \frac{b_{i} - \beta_{i}}{s_{b_{i}}}

• See the SPSS printout.
Testing the Coefficients
• Predict the average operating margin of an inn at a site with the following characteristics:
– 3815 rooms within 3 miles,
– Closest competitor 0.9 miles away,
– 476,000 sq-ft of office space,
– 24,500 college students,
– $35,000 median household income,
– 11.2 miles distance to downtown center.
MARGIN = 38.14 − 0.008(3815) + 1.65(0.9) + 0.020(476) + 0.212(24.5) + 0.413(35) − 0.225(11.2) = 37.1%
La Quinta Inns, Predictions
• Interval estimate: it is predicted, with 95% confidence, that the operating margin will lie between 37.1 ± 1.96(5.5121), i.e., between 26.3% and 47.9%.
La Quinta Inns, Predictions
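The interval arithmetic is straightforward to reproduce; a minimal sketch using the point prediction and standard error from the slides:

```python
# Point prediction (37.1%) and standard error of estimate (5.5121) from above
margin, s_e = 37.1, 5.5121

half_width = 1.96 * s_e  # approximate 95% interval half-width
print(margin - half_width, margin + half_width)  # ~26.3 and ~47.9
```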
In many situations we must work with qualitative
independent variables such as gender (male, female),
method of payment (cash, check, credit card), etc.
For example, x₂ might represent gender, where x₂ = 0 indicates male and x₂ = 1 indicates female.
Qualitative Independent Variables
In this case, x₂ is called a dummy or indicator variable.
Dummy Variables
In order to carry out regression analysis
with dummy variables, the categories of
dummy variables must be coded as 0
and 1.
Example: Consider the SPSS file
HOUSE3.sav
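The 0/1 coding can be done in one line with pandas; a minimal sketch on hypothetical data (HOUSE3.sav itself is not reproduced here):

```python
import pandas as pd

# Hypothetical observations standing in for the SPSS file
df = pd.DataFrame({"y": [12.0, 9.5, 11.2, 8.7],
                   "gender": ["male", "female", "male", "female"]})

# Code the dummy as 0/1, matching the slides: 0 = male, 1 = female
df["x2"] = (df["gender"] == "female").astype(int)
print(df[["y", "x2"]])  # x2 now enters the regression like any numeric predictor
```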
Using SPSS for Interactions
• Create a new variable – Interaction – and proceed.
• Consider the example HOUSE3.sav
Using SPSS for Quadratic Regression
• Create a new variable that is the square of the X variable and proceed (a sketch covering both the interaction and quadratic terms follows below).
• Consider the example FLYASH.sav
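Creating the interaction and quadratic variables mentioned above is a simple column operation; a minimal sketch on a hypothetical frame standing in for HOUSE3.sav / FLYASH.sav:

```python
import pandas as pd

# Hypothetical predictors
df = pd.DataFrame({"y": [3.1, 4.0, 5.2, 7.9],
                   "x1": [1.0, 2.0, 3.0, 4.0],
                   "x2": [0, 1, 0, 1]})

df["x1_x2"] = df["x1"] * df["x2"]  # interaction term: product of two predictors
df["x1_sq"] = df["x1"] ** 2        # quadratic term: square of x1

# Both new columns then enter the regression like any other predictor.
print(df)
```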