Day 7 Class outline. BUSN 212. Wednesday Week 3.

11-30-2011

Preliminaries:   

(1) Quiz 1 and SAS assignment #1 are both moved to next Monday, December 5th.

(2) We started "regression" analysis on Monday.

(3) Note the handout on the project posted to Moodle this morning. Please take some time to read it. I should be meeting with your group soon to discuss project ideas with you.

(4) The midterm exam is scheduled for Friday of week five (12-16-2011).

 


Recap of introduction to regression:

When we reject the null hypothesis in a test of significance on rho, this indicates that there is a significant linear (line-like) relationship. That means the relationship between the two variables can be captured by a straight line equation.

What sort of information does a straight line equation yield?   (1) slope, (2) y intercept, (3) predictions of y based on x, and (4) prediction of x based on y.

How do we get such estimates?

Technique of OLS (ordinary least squares) or for short, just LS (least squares). 
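
For reference, with a single explanatory variable the LS estimates have a closed form. The block below is the standard textbook result, not something read off the JMP output. The notation is assumed here: b0 for the intercept, x-bar and y-bar for the sample means, n for the number of observations, and b1 for the slope (the same symbol used in the interpretation template later in this outline).

```latex
\min_{b_0,\,b_1} \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2
\quad\Longrightarrow\quad
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```

In words: LS picks the line that makes the sum of squared vertical distances between the data points and the line as small as possible.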

An example? LS estimates of the slope and intercept for the calculator example, quantity of calculators sold versus calculator price (JMP output).

(1) Slope: positive or negative? Large or small in absolute value?

If positive and large, a relatively small increase in x (a one unit increase) leads to a pretty substantial increase in y (the increase in y is equal to the slope).

If positive and small, a one unit increase in x leads to a pretty modest increase in y (again, the increase is equal to the slope).

If negative and large in absolute value, a one unit increase in x leads to a pretty substantial drop in y (the decrease is equal to the absolute value of the slope).

If negative and small in absolute value, a one unit increase in x leads to a pretty modest drop in y (the decrease is equal to the absolute value of the slope).

In general this is how you interpret the estimated slope:

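We estimate that on average an increase in ________ (x) of one ________ (units x is measured in) will lead to a ________ (ABSOLUTE VALUE OF b1) ________ (units y is measured in) increase/decrease (increase if b1 > 0, decrease if b1 < 0) in ________ (y).

For Monday's calculator example (b1 = -1.017978, y in 1000s of calculators per store per quarter, x in dollars): We estimate that on average an increase in price of one dollar will lead to a 1.018 thousand calculator decrease (about 1,018 calculators) in quarterly sales per store.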

(2) y intercept: our estimate of what y would be, on average, if the explanatory variable were equal to zero.

(3) If the relationship between y and x follows the estimated linear equation, then our estimate of what we would expect y to be at a particular value of x comes straight from the equation itself. In our example on Monday, y^ = 27.273 - 1.017978*x, where y = quarterly sales of calculators per store (measured in 1000s of calculators) and x = the price charged per calculator. We forecasted that this company should expect sales of about 10,476 calculators per store (the equation gives this figure at a price of about $16.50).

(4) Going the other way, if they are targeting quarterly sales of 8,000 calculators per store, they should charge no more than $18.93 per calculator: set y^ = 8 and solve for x, which gives x = (27.273 - 8)/1.017978 = 18.93.
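
The two calculations in the calculator example can be checked with a short DATA step. This is only a sketch: the variable names (B0, B1, PRICE, YHAT, TARGET, MAXPRICE) are made up for illustration, and the $16.50 price is an assumption, chosen because it is the price at which the estimated equation returns roughly 10,476 calculators.

```sas
DATA _NULL_;
  B0 = 27.273;                    /* estimated intercept (1000s of calculators) */
  B1 = -1.017978;                 /* estimated slope (1000s per dollar of price) */

  PRICE = 16.50;                  /* ASSUMED price, not stated in the notes */
  YHAT = B0 + B1*PRICE;           /* predicted y at that x */

  TARGET = 8;                     /* 8,000 calculators, in 1000s */
  MAXPRICE = (TARGET - B0)/B1;    /* solve y^ = TARGET for x */

  PUT YHAT= 8.3 MAXPRICE= 8.2;    /* results are written to the SAS log */
RUN;
```

The log should show values close to 10.476 and 18.93, in line with the figures quoted for this example.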

Another example? Performance time and training time.  JMP output.


Outlier analysis. (SAS output for PT and TT)

Like the sample correlation coefficient, LS estimates are sensitive to outliers. With simple regression you can more often than not identify outliers visually. A technique for identifying outliers statistically is introduced in problem set 2. The technique is called least median of squares (LMS). LMS is an extremely powerful technique we can use to identify outliers.

The idea is that, as you know, medians are far less sensitive to outliers than means are. LS minimizes the sum (equivalently, the mean) of the squared residuals; LMS instead minimizes the median of the squared residuals.
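
To see the point about medians and means, here is a small sketch using made-up numbers (the data set DEMO and the variables CLEAN and DIRTY are hypothetical and not part of any class data). The two columns are identical except that the last value of DIRTY is a gross outlier.

```sas
DATA DEMO;
  INPUT CLEAN DIRTY;      /* hypothetical values for illustration only */
  DATALINES;
13 13
14 14
15 15
16 16
17 170
;

PROC MEANS DATA=DEMO MEAN MEDIAN;
  VAR CLEAN DIRTY;        /* compare mean and median for each column */
RUN;
```

For CLEAN the mean and median are both 15. For DIRTY the mean jumps to 45.6 while the median stays at 15, which is the property LMS exploits.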



Writing SAS code to generate LMS statistics:

As an example take the following program:

TITLE 'first example of LMS';
TITLE2 'PERFORMANCE TIME';
DATA TIME;
LABEL TT='TRAINING TIME (HR)'
      PT='PERFORMANCE TIME (MIN)';
INPUT TT PT;
DATALINES;
27 15
28 14
22 18
22 17
15 22
29 13
24 15
20 16
26 14
15 21
22 16
25 18
23 17
28 13
20 15
25 15
18 20
;
proc iml;
use time;
read all var {TT} into x;   /*you would list the explanatory variables here*/
read all var {PT} into y;   /*you would list the dependent variable here*/
optn=j(8,1,.);
optn[2]=2;
optn[3]=1;
optn[8]=0;
call lms(sc,coef,wgt,optn,y,x);
run;
/*ignore the red and leave the six lines above as they are here*/

SECOND EXAMPLE (quarterly sales calculators and price example)



 

The coefficient of determination, R squared (R^2)

Note where R squared appears on the JMP output (see SUMMARY OF FIT).

R squared is a measure of how well the "line of best fit" (the LS sample regression equation) fits the data.

It is calculated as the ratio of the "explained" variation in y to the total variation in y, so it lies between 0 and 1.

Total variation in y is equal to the numerator of the sample variance formula, the sum of squared deviations of y from its mean. The amount of "explained" variation in y is equal to the total variation in y that you start with minus the amount of variation left unaccounted for by the estimated regression line, SSE or the error sum of squares (sum of squared residuals).
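
As a quick check of this definition, the sketch below recomputes R squared from the sums of squares reported in the Analysis of Variance table of the SAS output in these notes (Corrected Total = 109.42857, Error = 5.73134). The variable names SST, SSE, SSR and RSQ are just labels chosen for this illustration.

```sas
DATA _NULL_;
  SST = 109.42857;        /* total variation in y (Corrected Total) */
  SSE = 5.73134;          /* unexplained variation (Error) */
  SSR = SST - SSE;        /* "explained" variation (Model) */
  RSQ = SSR/SST;          /* coefficient of determination */
  PUT SSR= 10.5 RSQ= 6.4; /* results are written to the SAS log */
RUN;
```

The explained variation comes out at about 103.697 and R squared at about 0.9476, which agrees with the Model sum of squares and the R-Square value on the output.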

ON SAS OUTPUT:

first example of LMS
PERFORMANCE TIME

 
The REG Procedure
Model: MODEL1
Dependent Variable: PT PERFORMANCE TIME (MIN)

Number of Observations Read   14
Number of Observations Used   14

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1        103.69724     103.69724    217.12   <.0001
Error              12          5.73134       0.47761
Corrected Total    13        109.42857

Root MSE           0.69109    R-Square   0.9476
Dependent Mean    16.42857    Adj R-Sq   0.9433
Coeff Var          4.20666

Parameter Estimates

Variable    Label                 DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept              1             30.72555          0.98771     31.11     <.0001
TT          TRAINING TIME (HR)     1             -0.61777          0.04193    -14.73     <.0001
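
Output in this layout comes from PROC REG. A minimal sketch of such a step, assuming the TIME data set and the variables PT and TT from the LMS program earlier in these notes, might look like this:

```sas
PROC REG DATA=TIME;
  MODEL PT = TT;    /* dependent variable = explanatory variable */
RUN;
QUIT;
```

Note that the output above reports 14 observations, while the DATA step in the LMS program lists 17 rows, so the regression shown was presumably run on a reduced data set. Which observations were left out is not stated here, so running this sketch on all 17 rows will not reproduce the numbers above exactly.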