Preliminaries:
(1) Still working on the quiz 1 papers. I will have them for you on Monday.
(2) Quiz 2 will be posted to Moodle Wednesday morning and is due Friday at class time.
(3) Your group project proposals (see Moodle) are due Wednesday. If for some reason your group needs more time, Friday is okay, but Wednesday is better.
(4) The midterm on Friday the 16th will cover all of the material on sets 1 and 2 and most of what is covered on set 3. These three sets, their keys, the two quizzes, and the key for quiz 1 should leave little room for doubt about what to expect next Friday.
Study guide: Review the three practice sets and their keys and, when you get them, quiz 1 and its key, and make sure you carefully work through the take-home quiz. Use your notes, handouts, these outlines, and the assigned reading for reference. Work on all the problems on these practice sets with notes in hand but without the answer keys. Note the questions you have trouble with. The material in such questions is your weak spot, and these are therefore the topics where YOU need extra help. Please make sure that you get the help you need.
Note that you will be given copies of the formula sheet and t tables you used for Monday's quiz. You cannot bring the formula sheets and tables you have been working with as you study: Friday's exam is a closed-note test.
We started looking at problems on practice set 3 on Wednesday. Today we continue to work through this problem set.
We estimate the strength and direction of the linear relationship between two variables with the sample correlation coefficient, r. The interpretation of this number, the fact that it is sensitive to outliers, the fact that it cannot help us identify strong NON-linear relationships, and the fact that correlation does not imply causation are all important. With r in hand we often want to see what we can infer about the actual (population) correlation, ρ. When doing such a test (a test on ρ), we assume that both variables are normally distributed. To check this assumption (that both x and y are normally distributed), we use the Shapiro-Wilk test.
When we reject the null in a test of significance on ρ, this indicates that there is a significant linear (line-like) relationship. It means that the relationship between the two variables can be captured by a line and a linear (straight-line) equation.
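For reference, the sample correlation and the usual test statistic for ρ can be written out as below. This is a sketch in standard textbook notation, assuming n paired observations (x_i, y_i) and the null hypothesis ρ = 0; the symbols may differ from those on the formula sheet.

```latex
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \,\sum_{i=1}^{n}(y_i-\bar{y})^2}},
\qquad
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}
\quad \text{with } n-2 \text{ degrees of freedom under } H_0:\ \rho = 0
```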
What sort of information does a straight line equation yield? (1) slope, (2) y intercept and (3) predictions of y based on x.
Technique of OLS (ordinary least squares) or for short, just LS (least squares).
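As a reminder of what least squares does in the simple (one explanatory variable) case, here is a sketch in standard notation. The symbols b_0 and b_1 for the estimated intercept and slope are assumptions and may not match the formula sheet.

```latex
\hat{y} = b_0 + b_1 x, \qquad
(b_0, b_1) \ \text{chosen to minimize} \ \sum_{i=1}^{n}\left(y_i - b_0 - b_1 x_i\right)^2
\\[6pt]
b_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2},
\qquad
b_0 = \bar{y} - b_1\bar{x}
```

The slope b_1 is the estimated change in y for a one-unit change in x, b_0 is the y intercept, and plugging a value of x into the equation gives the predicted y.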
Important terms: dependent variable and independent variable (simple regression), or independent variables in the case of multiple regression.
Section 4 of chapter 11.
Note on SAS output where you find the coefficient of determination: R²
R squared is a measure of how well the "line of best fit" (the LS sample regression equation) fits the data.
Calculated by taking the ratio of total "explained" variation in y to the total variation in y.
Total variation in y is equal to the numerator of the sample variance formula. The amount of "explained" variation in y is equal to the total variation in y that you start with minus the amount of variation left unaccounted for by the estimated regression line, SSE or the error sum of squares (sum of squared residuals).
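The verbal definition above can be written compactly as follows. This is a sketch in standard notation; SST (total sum of squares) is a label introduced here for the total variation in y, while SSE is the error sum of squares named in the notes.

```latex
SST = \sum_{i=1}^{n}(y_i-\bar{y})^2, \qquad
SSE = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2
\\[6pt]
R^2 = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}
```

In simple regression, R² is equal to the square of the sample correlation coefficient r.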
Like the sample correlation coefficient, LS estimates are sensitive to outliers. With simple regression we can more often than not identify outliers visually. A technique for identifying outliers statistically was introduced in problem set 2. The technique, called least median of squares (LMS), is an extremely powerful way to identify outliers.
The idea is that, as you know, medians are far less sensitive to outliers than means are.
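To make the contrast concrete, the two fitting criteria can be set side by side. This is a sketch of the standard definitions for the one-variable case, using the assumed symbols b_0 and b_1 for intercept and slope.

```latex
\text{LS:}\quad \min_{b_0,\,b_1} \ \sum_{i=1}^{n}\left(y_i - b_0 - b_1 x_i\right)^2
\qquad\qquad
\text{LMS:}\quad \min_{b_0,\,b_1} \ \operatorname*{median}_{i}\ \left(y_i - b_0 - b_1 x_i\right)^2
```

Because LMS looks at the median squared residual rather than the sum, a few extreme points cannot pull the fitted line toward themselves, and they show up as observations with large residuals from the LMS fit.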
For more background on LMS click here or here.
As an example take the following program:
TITLE 'first example of LMS';
TITLE2 'PERFORMANCE TIME';
DATA TIME;
LABEL TT='TRAINING TIME (HR)'
      PT='PERFORMANCE TIME (MIN)';
INPUT TT PT;
DATALINES;
27 15
28 14
22 18
22 17
15 22
29 13
24 15
20 16
26 14
15 21
22 16
25 18
23 17
28 13
20 15
25 15
18 20
;
proc iml;
use time;
read all var {TT} into x; /*you would list the explanatory variables here*/
read all var {PT} into y; /*you would list the dependent variable here*/
optn=j(8,1,.);
optn[2]=2;
optn[3]=1;
optn[8]=0;
call lms(sc,coef,wgt,optn,y,x);
run;
/*ignore the red and leave the six lines above as they are here*/
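To see what LMS buys you, it can help to run an ordinary least squares fit on the same data and compare the coefficients. The step below is a minimal sketch and is not part of the original program; it assumes the TIME data set created above is still available in the session.

```sas
/* OLS fit on the TIME data set, for comparison with the LMS coefficients */
proc reg data=time;
  model PT = TT;   /* dependent variable = explanatory variable */
run;
quit;
```

If the LS and LMS slopes and intercepts are close, outliers are probably not driving the LS results; a large gap is a signal to look at the observations LMS flags.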
(Second example. Calculator price and sales example)
MULTIPLE REGRESSION (PROBLEM SET 3 SECTIONS 1-3 OF CHAPTER 12)
Using just one explanatory variable will often generate disappointingly low R² values. We add variables (either quantitative or, with the help of "dummy variables", qualitative factors) to improve the fit of the estimated equation to the data.
Appliance sales example:
MLR (multiple linear regression): estimation of regression planes versus regression lines.
(a) Running MLR with SAS:
proc reg; model S = A C D W; run;
(b) Interpretation of results.
(c) Forecasts using the estimated regression equation. (Say, for a period where we spend $5,000 on advertising, face two competitors, offer zero discounts or other such deals, but do offer a three-year warranty.)
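The forecast in (c) amounts to plugging the scenario into the estimated equation. The sketch below assumes that S is sales, A advertising, C the number of competitors, D the discount, and W the warranty length in years, which is inferred from the scenario rather than stated in the model statement. The b's stand for the coefficient estimates read off the SAS output.

```latex
\hat{S} = b_0 + b_1 A + b_2 C + b_3 D + b_4 W
\\[6pt]
\hat{S}_{\text{forecast}} = b_0 + b_1 A^{*} + b_2(2) + b_3(0) + b_4(3)
```

Here A* is the advertising figure expressed in the same units as the data (for example 5000 if A is recorded in dollars, or 5 if it is recorded in thousands of dollars).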
(d) Using JMP to get MLR results (click here.)
(e) Reading SAS output (click here)
Can you put qualitative factors into a regression analysis (for example the location of a property, or the ethnicity or political affiliation of a person)? Yes! We define a variable, called a dummy or indicator variable, that represents that characteristic numerically. Example: problem 1 of problem set 3 (employee salary based on an employee's gender and years of experience).
More than two categories to a qualitative factor? Suppose you are not looking at something like gender, coded here as two categories (male or female), but at a qualitative factor that breaks down into three or more categories. Follow the dummy variable rule. What is this?
Dummy variable rule. If you have two categories for a qualitative factor, it would be silly to use two dummy variables (e.g. for gender, a dummy variable for male, which is 1 if a person is male and 0 if female, along with a dummy variable for female, defined as 1 if female and 0 if male. What is the one variable telling you that the other one isn't?) Similarly, if you have three categories for a qualitative factor you should use two dummy variables; with four categories, three dummies; with five, four; and so on. The category you do NOT create a dummy variable for is referred to as the omitted category or the base case. The estimated coefficients are then interpreted relative to that omitted category or base case (problems 3 and 5 on problem set 3). Forgetting this rule leads to what is often referred to as the "dummy variable trap" (see Newbold, Carlson and Thorne, p. 568).
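As an illustration of the rule with three categories, here is a minimal SAS sketch. The data set HOMES and the variables REGION, PRICE, and SQFT are hypothetical names made up for this example; they do not come from the problem sets.

```sas
/* Hypothetical data set HOMES: REGION takes the values 'North', 'South', 'West' */
data homes2;
  set homes;
  north = (region = 'North');  /* 1 if North, 0 otherwise */
  south = (region = 'South');  /* 1 if South, 0 otherwise */
  /* no dummy for 'West': it is the omitted category (base case) */
run;

proc reg data=homes2;
  model price = sqft north south;  /* three categories, two dummies */
run;
quit;
```

In this sketch the coefficient on north would be read as the estimated difference in price between a North property and an otherwise identical West property, and likewise for south.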
Examples: Problems 1 and 2 of set 3.