Class outline Day 9: 12-07-2011.  Wednesday week 4.  Winter 2011-2012.

Preliminaries:  (1) I am working on Quiz 1. Once the quizzes are graded, the scores tallied, recorded, and posted to Moodle, the results analyzed, and a key made, I will work on the SAS assignment.  The goal is to have both graded and returned to you by Friday, but expect Monday. (2) We start looking at problems on practice set 3 today. (3) You should be seriously thinking about your projects this week.  (4) Also, if you had significant difficulty on Monday, you and I should discuss how effectively you are using your time outside of class. The answer key I am preparing for Quiz 1 will focus on what, I must emphasize, many of you obviously did do and what all of you should have been doing to prepare for the quiz and as you worked on the SAS assignment. Over and above the content of the course, it is very important to teach all of you what so many of you already know: it is never a good idea to cut corners and cram.  Learning this brings all of you closer to the adult world of work and responsibility.  As I asked you to on the first day of class, have a look at Leamnson's "Learning (Your First Job)".  Pay particularly close attention to the sections entitled "Studying" and "Between Classes".

WHERE WE HAVE BEEN:

Correlation and introduction to regression:

Sections 7, and 1-3 of chapter 11. We estimate the strength and direction of the linear relationship between two variables with the sample correlation coefficient, r.  The interpretation of this number, the fact that it is sensitive to outliers, the fact that it cannot help us identify strong NON-linear relationships, and the fact that correlation does not imply causation are all important.  With r in hand, we often want to see what we can infer about the actual (population) correlation, ρ.  When doing such a test (a test on ρ), we assume that both variables are normally distributed. To check this assumption (that both x and y are normally distributed), we use the Shapiro-Wilk test.
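As a rough sketch of how these two steps might look in SAS, the program below computes r with a test on ρ and then requests normality tests for both variables. It reuses the training-time data from the LMS example in these notes. The choice of PROC CORR and PROC UNIVARIATE here is an assumption for illustration, not necessarily the procedures used in class.

```sas
/* Training-time data copied from the LMS example in these notes. */
data time;
   input TT PT @@;          /* @@ reads several observations per line */
   datalines;
27 15  28 14  22 18  22 17  15 22  29 13
24 15  20 16  26 14  15 21  22 16  25 18
23 17  28 13  20 15  25 15  18 20
;
run;

/* Sample correlation r and the p-value for the test of H0: rho = 0 */
proc corr data=time pearson;
   var TT PT;
run;

/* NORMAL requests tests for normality, including Shapiro-Wilk */
proc univariate data=time normal;
   var TT PT;
run;
```

In the PROC UNIVARIATE output, the Shapiro-Wilk result appears in the "Tests for Normality" table for each variable.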

When we reject the null hypothesis in a test of significance on ρ, this indicates that there is a significant linear (line-like) relationship.  This means that the relationship between the two variables can be captured by a line and a linear equation (a straight-line equation).

What sort of information does a straight line equation yield?   (1) slope, (2) y intercept and (3) predictions of y based on x.

Technique of OLS (ordinary least squares) or for short, just LS (least squares). 

Important terms: dependent variable and independent variable (simple regression), or independent variables in the case of multiple regression.
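For reference, the straight-line equation and the least squares idea can be written out as follows. The symbols b_0 (y intercept), b_1 (slope), and the hat on y (predicted value) are notation assumed here for illustration; the text may use different letters.

```latex
\begin{align*}
\hat{y} &= b_0 + b_1 x
  && \text{(prediction of } y \text{ from } x\text{)} \\
(b_0, b_1) &= \arg\min_{b_0,\, b_1} \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2
  && \text{(least squares criterion)} \\
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
  && \text{(resulting estimates)}
\end{align*}
```

Here x is the independent variable and y is the dependent variable.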


 

The coefficient of determination, R squared

Section 4 of chapter 11.

Note on where to find the coefficient of determination in SAS output: R squared (labeled "R-Square" in PROC REG output).

R squared is a measure of how well the "line of best fit" (the LS sample regression equation) fits the data.

Calculated by taking the ratio of total "explained" variation in y to the total variation in y.

Total variation in y is equal to the numerator of the sample variance formula.   The amount of "explained" variation in y is equal to the total variation in y that you start with  minus the amount of variation left unaccounted for by the estimated regression line, SSE or the error sum of squares (sum of squared residuals).
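Written as formulas, the description above amounts to the following. SSE is the name used in these notes; SST (total sum of squares) is a label assumed here for the total variation in y.

```latex
\begin{align*}
SST &= \sum_{i=1}^{n} (y_i - \bar{y})^2
  && \text{(total variation in } y\text{)} \\
SSE &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
  && \text{(variation left unexplained by the line)} \\
R^2 &= \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}
  && \text{(explained share of the total variation)}
\end{align*}
```

In simple regression, R squared equals the square of the sample correlation coefficient r.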


 

Outlier analysis. 

Notes.  Not discussed in text. Like the sample correlation coefficient, LS estimates are sensitive to outliers.  With simple regression we can, more often than not, identify outliers visually.  A technique for identifying outliers statistically was introduced in problem set 2.  The technique is called least median of squares (LMS).  LMS is a robust regression technique that we can use to identify outliers.

The idea is that, as you know, medians are far less sensitive to outliers than means are.
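One way to see the contrast with ordinary least squares is to set the two criteria side by side. The notation (b_0, b_1, and "med" for the median over the observations) is assumed here for illustration.

```latex
\begin{align*}
\text{LS:}  \quad & \min_{b_0,\, b_1} \; \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2 \\
\text{LMS:} \quad & \min_{b_0,\, b_1} \; \operatorname*{med}_{i} \left( y_i - b_0 - b_1 x_i \right)^2
\end{align*}
```

Because a sum is replaced by a median, a few very large squared residuals have little effect on the LMS fit, and observations with unusually large residuals from that fit can be flagged as possible outliers.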



 

Writing SAS code to generate LMS statistics:

As an example take the following program:

TITLE 'first example of LMS';
TITLE2 'PERFORMANCE TIME';
DATA TIME;
   LABEL TT='TRAINING TIME (HR)'
         PT='PERFORMANCE TIME (MIN)';
   INPUT TT PT;
   DATALINES;
27 15
28 14
22 18
22 17
15 22
29 13
24 15
20 16
26 14
15 21
22 16
25 18
23 17
28 13
20 15
25 15
18 20
;
proc iml;
use time;
read all var {TT} into x;
/*you would list the explanatory variables here*/
read all var {PT} into y;
/*you would list the dependent variable here*/
optn=j(8,1,.);
optn[2]=2;
optn[3]=1;
optn[8]=0;
call lms(sc,coef,wgt,optn,y,x);
run;
/*ignore the red and leave the six lines above as they are here*/

 (Second example.  Calculator price and sales example)



 

 

MULTIPLE REGRESSION (PROBLEM SET 3  SECTIONS 1-3 OF CHAPTER 12)

 

WHERE WE STILL HAVE TO GO BEFORE CHRISTMAS BREAK : INTRODUCTION TO MULTIPLE REGRESSION.

"Multiple regression". STARTING WITH R SQUARED.

Hours of internet use example:  Simple regression: Hours of internet use and number of kids. R squared (i.e. coefficient of determination)

MLR (multiple linear regression): estimation of regression planes versus regression lines.

        (a) Running MLR with SAS:
                proc reg; model H=C I E N; run;

        (b) Interpretation of results. (powerpoint)

        (c) Forecasts using estimated regression equation.

        (d) Using JMP to get MLR results.

        (e) Reading SAS output.

        (f)  Additional examples.
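As a sketch of items (a) and (c), the model statement above corresponds to an estimated equation of the following form. The coefficient names b_0 through b_4 are assumed here for illustration, and the starred values stand for whatever values of C, I, E, and N you want a forecast for. The meanings of the variables are as defined in the class example and are not restated here.

```latex
\begin{align*}
\hat{H} &= b_0 + b_1 C + b_2 I + b_3 E + b_4 N
  && \text{(estimated regression equation)} \\
\hat{H}^{*} &= b_0 + b_1 C^{*} + b_2 I^{*} + b_3 E^{*} + b_4 N^{*}
  && \text{(forecast at chosen values } C^{*}, I^{*}, E^{*}, N^{*}\text{)}
\end{align*}
```

The estimates b_0 through b_4 are read from the "Parameter Estimates" table in the PROC REG output, and a forecast is obtained by plugging the chosen values into the equation.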

 

 

 
