Correlation and Regression

Description
A comprehensive presentation on correlation and regression for management students.


A correlation measures a relationship between two variables. The data can be represented by ordered pairs (x, y).
• The data must be metric (ratio or interval).
• It is an index to determine whether a linear relationship exists between the two variables.
• Examples of variables that may be correlated:
  • Household size and oil consumption
  • Class attentiveness and attendance
  • Traditional vs modern attitude


• A correlation coefficient indicates the extent to which two variables are related. It can range from -1.0 to +1.0.
• A positive correlation coefficient indicates a positive relationship; a negative coefficient indicates an inverse relationship.
• Correlation CANNOT be equated with causality.

Scatter Plots and Types of Correlation
[Scatter plot: household size (y) versus oil consumed (x)]

Positive Correlation – as x increases, y increases

Scatter Plots and Types of Correlation
[Scatter plot: Traditional attitude score (y) versus Modern attitude score (x)]

Negative Correlation – as x increases, y decreases

Scatter Plots and Types of Correlation
[Scatter plot: salt consumption (y) versus income (x)]

No linear correlation

Coefficient Range    Strength of Relationship
0.00 – 0.20          Very Low
0.20 – 0.40          Low
0.40 – 0.60          Moderate
0.60 – 0.80          High
0.80 – 1.00          Very High

A correlation tells you that a relationship exists between two variables (aside from the third-variable problem), but it tells you absolutely nothing about cause and effect.

Three possibilities when A and B are correlated:
• Direct causation: variable A actually causes the change in B.
• Coincidence: variables A and B really do NOT have anything to do with each other but happen to go up or down simultaneously.
• Third variable: variable A is correlated with variable B, but there is a third factor C (the common underlying cause) that causes the changes in both A and B.

Pearson’s Correlation Coefficient (bivariate correlation)
• r = COVxy / (Sx · Sy)
• Dividing both the numerator and the denominator by n − 1 gives standardized scores; thus r varies between +1 and −1.
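The formula above can be sketched from scratch. This is an illustrative example, not from the slides: the household-size and oil-consumption figures below are made up.

```python
# Pearson's r as the covariance divided by the product of the two
# standard deviations, each computed with an n-1 denominator.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sx * sy)

hh_size = [2, 3, 4, 5, 6]          # hypothetical household sizes
oil = [350, 420, 500, 640, 720]    # hypothetical oil consumption
print(round(pearson_r(hh_size, oil), 3))   # → 0.993 (strong positive)
```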

“r” indicates…
• strength of relationship (strong, weak, or none)
• direction of relationship
  • positive (direct): variables move in the same direction
  • negative (inverse): variables move in opposite directions
• r ranges in value from −1.0 (strong negative) through 0.0 (no relationship) to +1.0 (strong positive)

Coefficient of Determination
• r²
• The amount of variance in y accounted for by x
• The percentage increase in accuracy you gain by using the regression line to make predictions
• Ranges from 0% to 100%

• r: the correlation for a sample, based on the observations we have
• ρ (rho): the actual correlation in the population – the true correlation

Beware sampling error!!
• Even if ρ = 0 (there’s no actual correlation), you might get r = .08 or r = −.26 just by chance.
• We look at r, but we want to know about ρ.

• H0: ρ = 0
• H1: ρ ≠ 0
• Test statistic: t = r [(n − 2) / (1 − r²)]^½, with n − 2 degrees of freedom
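A minimal sketch of this test statistic; the values of r and n below are made up for illustration.

```python
# Significance test for a sample correlation:
# t = r * sqrt((n - 2) / (1 - r^2)), compared with a t distribution
# on n - 2 degrees of freedom.
import math

def corr_t_stat(r, n):
    return r * math.sqrt((n - 2) / (1 - r ** 2))

t = corr_t_stat(0.60, 27)   # hypothetical: r = .60 from n = 27 pairs
print(round(t, 2))          # → 3.75, compare with critical t on 25 d.f.
```

Since 3.75 exceeds the usual critical value (about 2.06 at α = .05, two-tailed), H0: ρ = 0 would be rejected in this made-up case.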

Limitations of correlation:
• Linearity: r cannot describe non-linear relationships (e.g., the relation between anxiety and performance).
• Truncation of range: you underestimate the strength of the relationship if you cannot see the full range of x values.
• No proof of causation:
  • Third-variable problem: a third variable could be causing the change in both variables.
  • Directionality: you cannot be sure which way causality “flows”.

• Simple correlation describes the linear association between two variables.
• Partial r describes the linear association between two variables after the effect of additional variables has been taken out (e.g., brand purchase / ad seen / distribution).
• With three variables x, y, z:
  • Correlate x and z to predict a new x; original x − new x = adjusted x.
  • Correlate y and z to predict a new y and an adjusted y.
  • Now correlate adjusted x and adjusted y: this is the partial correlation.

• r_xy.z = (r_xy − r_xz · r_yz) / (√(1 − r²_xz) · √(1 − r²_yz))
• Various orders: first order, second order (controls for two variables), … up to n.
• Useful for detecting spurious relationships: after z has been taken out, the correlation between x and y may be 0.
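The first-order formula above translates directly into code. The three pairwise correlations below are made-up illustrative values, loosely echoing the brand purchase / ad seen / distribution example.

```python
# First-order partial correlation from the three pairwise correlations:
# r_xy.z = (r_xy - r_xz * r_yz) / (sqrt(1 - r_xz^2) * sqrt(1 - r_yz^2))
import math

def partial_corr(r_xy, r_xz, r_yz):
    return (r_xy - r_xz * r_yz) / (
        math.sqrt(1 - r_xz ** 2) * math.sqrt(1 - r_yz ** 2))

# hypothetical: purchase vs ad seen (.50), each correlated .60 with distribution
print(round(partial_corr(0.50, 0.60, 0.60), 3))   # → 0.219
```

The raw correlation of .50 shrinks to about .22 once the shared third variable is partialled out, which is exactly the kind of spurious-relationship check the slide describes.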

Spearman’s Rank Correlation
• To test a hypothesis of a relationship between two ordinal variables with many ranks and few ties.
• Referred to as rs.
• Used for a single representative sample from one population.
• Ranges from −1.0 through zero to +1.0:
  • +1.00 means that the rankings are in perfect agreement
  • −1.00 means that they are in perfect disagreement
  • 0 signifies that there is no relationship
• Square rs and multiply by 100 to find the percent of variation shared between the two ranks.
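A small sketch of rs using the no-ties shortcut formula; the two rankings below are hypothetical judge ratings of five brands.

```python
# Spearman's r_s for untied ranks: r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
# where d is the difference between the two ranks of each item.
def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rs(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(spearman_rs([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))   # → 0.8
```

Squaring 0.8 and multiplying by 100 gives 64%, the shared variation between the two rankings, as described above.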

• Whether a relationship exists between variables
• How much of the variation in the dependent variable is explained by the independent variables: the strength of the relationship
• Determine the form of the relationship (a mathematical equation)
• Predict values of the dependent variable
• Control for other variables when evaluating the contribution of a single variable or set of variables

Bivariate Regression (Simple Linear Regression, OLS): a statistical technique to predict one variable from another or to examine the relationship between two variables.

Mathematical formula for a straight line:
y = bx + a, or with error, y = bx + a + e

Assumptions
• x and y are normally distributed
• Homoscedasticity: the variance of y is the same for each value of x
• There is an error in prediction

• The objective in simple regression is to generate the best line between the two variables, i.e., the line that best fits the data points.
• The best line, or fitted line, is the one that minimizes the distances of the points from the line.
• Since distances can be positive or negative, the distances are squared: best line = lowest sum of squares, hence “least squares”.
• The fitted line gives the ratio (slope) for the correspondence between x and y.
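The least-squares line can be computed from scratch with the usual closed form. This is a sketch with made-up data, not an example from the slides.

```python
# Least-squares fit of y = a + b*x: b = Sxy / Sxx, a = ybar - b*xbar,
# which minimizes the sum of squared vertical distances to the line.
def fit_line(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

x = [1, 2, 3, 4, 5]                 # hypothetical predictor values
y = [2.1, 3.9, 6.2, 7.8, 10.1]      # hypothetical outcomes
a, b = fit_line(x, y)
print(round(a, 3), round(b, 3))     # → 0.05 1.99
```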

• X deviations: x = X − X̄
• Y deviations: y = Y − Ȳ
• xy is the cross product of deviations.
• The objective is to keep the error to the minimum.

[Diagram: error = the gap between the true y value and the predicted y value]

[Diagram: for each observation, Total Variation (Y observed − Y mean) = Explained Variation (Y predicted − Y mean) + Unexplained/Error (Y observed − Y predicted)]

• Total Sum of Squares: TSS = RSS + ESS (regression + error)
• Coefficient of Determination: r² = RSS / TSS
• Therefore r² = Explained Variance / Total Variance
• If r² = .95, 95% of the variance in Y is explained by variance in X; .05 is the error, as that is not explained by variation in x (like r, the maximum of r² is 1)
• RSS + ESS = TSS, so RSS = TSS − ESS, and
  r² = RSS/TSS = (TSS − ESS)/TSS = 1 − ESS/TSS
• ESS/TSS = 1 − r²: Wilks’ Lambda, or unexplained variance (Tolerance)
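The decomposition above can be verified numerically around a fitted line. A sketch with hypothetical data (the fit itself is the standard closed-form least squares, repeated here so the block is self-contained):

```python
# Variance decomposition around a least-squares line:
# TSS = RSS + ESS, and r^2 = RSS/TSS = 1 - ESS/TSS.
def fit_line(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = sum((a - xbar) * (c - ybar) for a, c in zip(x, y)) / \
        sum((a - xbar) ** 2 for a in x)
    return ybar - b * xbar, b

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
a, b = fit_line(x, y)
ybar = sum(y) / len(y)
pred = [a + b * xi for xi in x]
tss = sum((yi - ybar) ** 2 for yi in y)                 # total
rss = sum((pi - ybar) ** 2 for pi in pred)              # explained (regression)
ess = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))    # unexplained (error)
assert abs(tss - (rss + ess)) < 1e-9                    # TSS = RSS + ESS
print(round(rss / tss, 4), round(1 - ess / tss, 4))     # the two r^2 forms agree
```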

Mantra!!
• As variability decreases, prediction accuracy ___
  (if we can account for variance, we can make better predictions)
• As r increases:
  • r² increases
  • “variance accounted for” increases
  • prediction accuracy increases
  • prediction error decreases (the distance between y′ and y)
• We like big r’s!!!

Procedure for Regression Diagnostics
1. Develop a model that has a theoretical basis.
2. Gather data for the two variables in the model.
3. Draw the scatter diagram to determine whether a linear model appears to be appropriate.
4. Determine the regression equation.
5. Check the required conditions for the errors.
6. Check for the existence of outliers and influential observations.
7. Assess the model fit.
8. Interpret the regression equation.

1. Is there a relationship between stated loyalty towards a brand and consumers’ ratings on product, packaging, price/VFM, type of outlet purchased from, and advertising response?
2. Assess the relationship between recall of advertising and response to the ad and its content: enjoyable, stands out, comprehension, news, humour, celebrity…
3. Intention to recommend a laptop brand depends on current levels of satisfaction. Which attributes have the greater impact on satisfaction?
4. Intention to buy electroform jewellery is impacted by which variables? Age, working status, affluence, desire to be well dressed?...
5. Intention to visit Pizza Hut? Which product attributes determine CDM’s preference over other brands?

1. The relationship between 1 dependent and 2 or more independent variables is a linear function:

   Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

   where Yi is the dependent (response) variable, the Xki are the independent (explanatory) variables, β0 is the population Y-intercept, the βk are the population slopes, and εi is the random error.

   Estimated as (partial regression coefficients):

   Ŷ = a + b1X1 + b2X2 + b3X3 + … + bkXk

Model equation: Yi = a + b1X1i + b2X2i + … + bpXpi + ei

Parameters to be estimated: a, b1, b2, …, bp

Error: ei = Yi − a − b1X1i − b2X2i − … − bpXpi

Method: minimize Σi ei², the error sum of squares; hence the name “Ordinary Least Squares Regression”.
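Minimizing the error sum of squares for several predictors can be sketched with a standard linear-algebra least-squares solver. The data below are hypothetical and constructed as an exact linear function, so the estimates recover the true coefficients.

```python
# OLS with two predictors: stack a column of ones with the predictor
# columns and solve min ||y - A*coef||^2 for coef = (a, b1, b2).
import numpy as np

X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical predictor 1
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])   # hypothetical predictor 2
y = 1.0 + 2.0 * X1 + 0.5 * X2              # true: a = 1, b1 = 2, b2 = 0.5

A = np.column_stack([np.ones_like(X1), X1, X2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b1, b2 = coef
print(np.round(coef, 2))                   # intercept and partial slopes
```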

PARTIAL REGRESSION COEFFICIENTS
• Compare Ŷ = a + b1X1 + b2X2 vs Ŷ = a + bX1: b and b1 will be different, even though X1 is the same variable. Why?
• X1 and X2 are usually correlated. In the bivariate regression, the effects of both X1 and X2 on Y were attributed to X1.
• Interpretation of b1: it measures the change in Y with a unit change in X1, with X2 held constant.
• Suppose we regress X1 on X2 and estimate the equation X1 = a + bX2; residual X1 = observed X1 − predicted X1.
• Now regress Y on the residual X1. This slope is equal to b1. Thus b1 is the effect on Y after the effect of X2 has been removed.
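The residualizing argument above (a Frisch–Waugh-style result) can be checked numerically. The data are simulated purely for illustration, with X1 deliberately correlated with X2.

```python
# Check: the coefficient on X1 in the full regression of y on (X1, X2)
# equals the coefficient from regressing y on the residual of X1 after
# X1 has been regressed on X2.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X2 = rng.normal(size=n)
X1 = 0.8 * X2 + rng.normal(size=n)          # X1 and X2 are correlated
y = 1.0 + 2.0 * X1 - 1.5 * X2 + rng.normal(size=n)

# b1 from the full regression of y on [1, X1, X2]
full = np.column_stack([np.ones(n), X1, X2])
b1_full = np.linalg.lstsq(full, y, rcond=None)[0][1]

# residualize X1 on X2, then regress y on that residual
aux = np.column_stack([np.ones(n), X2])
resid_x1 = X1 - aux @ np.linalg.lstsq(aux, X1, rcond=None)[0]
b1_resid = np.linalg.lstsq(np.column_stack([np.ones(n), resid_x1]),
                           y, rcond=None)[0][1]
print(abs(b1_full - b1_resid) < 1e-8)       # → True
```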

• When the variables are standardized (mean 0 and variance 1):

  B1 = b1 (Sx1 / Sy), …, Bk = bk (Sxk / Sy)

• Standardized B is unit-independent; useful to compare the B’s (relative impact on Y).
• For estimation, use b (you need units).
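A tiny sketch of the rescaling Bk = bk (Sxk / Sy); the slopes and standard deviations below are made-up numbers chosen to show how the ranking of predictors can flip once units are removed.

```python
# Standardized coefficient: the slope the regression would have if both
# the predictor and Y were scaled to unit variance.
def standardize_coef(b, s_x, s_y):
    return b * (s_x / s_y)

s_y = 4.0                 # hypothetical SD of Y
b1, s_x1 = 0.40, 5.0      # predictor 1: small slope, large spread
b2, s_x2 = 3.00, 0.5      # predictor 2: large slope, small spread

print(standardize_coef(b1, s_x1, s_y))   # → 0.5
print(standardize_coef(b2, s_x2, s_y))   # → 0.375
```

Here predictor 2 has the much larger raw b, but predictor 1 has the larger standardized B, i.e., the larger relative impact on Y.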

Residuals… the left-overs from least-squares regression

Deviations from the overall pattern are important. In regression, the deviations are the “scatter” of points about the line. The vertical distances from the line to the points are called residuals, and they are the “left-over” variation after a regression line is fit.

Residual = observed y − predicted y:  Yres = Yi − Ŷi

Residual Plots
(the difference between Y (observed) and Y (predicted))
We want normally distributed error terms.

Things to look for:
1. A curved pattern means the relationship is not linear.
2. Increasing/decreasing spread about the line.
3. Individual points with large residuals.
4. Individual points that are extreme in the x direction.

[Residual plot: residuals versus alcohol (response: heart deaths). Do we have any influential points here?]

[Example residual patterns: an ideal residual pattern; curvature (a linear fit is not appropriate); increasing variation]

Select, from a large predictor set, a small subset that accounts for most of the variation in the data.

1. Forward Inclusion: predictor variables are entered one at a time, depending on explained variance (F test); the variable with the largest explained variance enters first.
2. Backward Elimination: all variables are entered first, then removed depending on the F ratio.
3. Stepwise: forward inclusion, with backward elimination at each step.
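Forward inclusion can be sketched as a greedy loop. Note the simplification: the slides describe entry decisions via an F test, while this sketch uses a raw R² gain threshold instead; the data and the threshold are hypothetical.

```python
# Forward-inclusion sketch: at each step enter the candidate predictor
# whose addition most increases R^2; stop when the best gain is too small.
import numpy as np

def r_squared(cols, y):
    A = np.column_stack([np.ones(len(y))] + cols)
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def forward_select(X, y, min_gain=0.05):
    chosen, remaining, best = [], list(range(X.shape[1])), 0.0
    while remaining:
        gains = {j: r_squared([X[:, i] for i in chosen] + [X[:, j]], y) - best
                 for j in remaining}
        j = max(gains, key=gains.get)
        if gains[j] < min_gain:          # no candidate helps enough: stop
            break
        chosen.append(j)
        remaining.remove(j)
        best += gains[j]
    return chosen

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))                       # four candidate predictors
y = 2.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(size=100)
print(forward_select(X, y))                         # → [0, 2]
```

Columns 0 and 2 (the ones that actually drive y) enter, largest contributor first, and the two pure-noise columns are left out.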


c. Model: the number of models.
d. Variables Entered.
f. Method: stepwise or the usual method.

c. R: the correlation between the observed and predicted values of the dependent variable.
d. R-square: an overall measure of the strength of association.
e. Adjusted R-square: an adjustment of the R-square that penalizes the addition of extraneous predictors to the model.
f. Std. Error of the Estimate: the root mean squared error, i.e., the standard deviation of the error term.

d. Regression, Residual, Total: partitions the variance into that which can be explained by the model and that which cannot.
e. Sum of Squares: the sums of squares associated with the three sources of variance.
f. df: the Total variance has N − 1 degrees of freedom; the Regression d.f. is the number of predictor coefficients; the Error d.f. is the Total d.f. minus the Regression d.f. (here 199 − 4 = 195).
g. Mean Square: the Sum of Squares divided by the respective d.f.
h. F and Sig.: the F-statistic and the p-value associated with it. The p-value is compared to some alpha level in testing the null hypothesis that all of the model coefficients are 0.
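The ANOVA-table arithmetic above is simple enough to check by hand. The degrees of freedom follow the example in the text (N = 200 cases, 4 predictors); the sums of squares are made-up numbers.

```python
# Regression ANOVA table: mean squares are sums of squares divided by
# their degrees of freedom, and F = MS_regression / MS_error.
n, k = 200, 4
df_total = n - 1                       # 199
df_regression = k                      # 4 (number of predictors)
df_error = df_total - df_regression    # 199 - 4 = 195

ss_regression, ss_error = 400.0, 600.0           # hypothetical SS values
ms_regression = ss_regression / df_regression    # 100.0
ms_error = ss_error / df_error                   # ~3.08
F = ms_regression / ms_error
print(round(F, 1))                     # → 32.5, compare p-value with alpha
```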

d. B: the partial regression coefficients, which form the equation
   Ypredicted = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4
e. Std. Error: the standard errors associated with the coefficients.
f. Beta: the standardized coefficients, which make the Betas comparable.
g. t and Sig.: at a significance level of .05, the coefficient for math is significantly different from 0 (p < .05); socst (p = .050) is not significant (p > .05).
