*(Not counting the midterm, quiz days, and the Friday after Thanksgiving.)
Schedule for the rest of the term:
(1) You have this weekend and until noon on Thursday of finals week to work on your projects. Note that a write-up of your results (along with electronic copies of the data files and SAS programs used) is due no later than noon on Thursday of finals week, 2-16-2012. Also remember your group member evaluations.
(2) The final is scheduled for Wednesday of finals week, 3:00-5:00, in rooms 17 and 18 of Evald. On the final you will have to demonstrate that you are still in command of the material covered during the first half of the term and that you are comfortable with the material covered since then, so the final exam is, of necessity, comprehensive. Think about it. We can't talk about severe multicollinearity (second half of term) without looking at correlation coefficients (first half), tests of significance (second half) without looking at beta estimates (first half), VIFs (second half) without looking at R squared (first half), or specification error (second half) without talking about the distinction between simple and multiple regression (first half), and so on.
(3) SAS assignment 3 will be combined with Quiz 4. SAS assignment 3/Quiz 4 will be due at the final. It will be posted to Moodle as soon as possible.
(4) Note that today we meet in the computer lab (room 110) on the first floor of Olin. I can look at your programs, help debug them, and discuss your results with you during class time.
I am, of course, available throughout finals week. I will send you an email with, among other things, my schedule for finals week.
As you prep for our final exam, please review your notes, the readings, problem sets 1-5, quizzes 1, 2, and 3, the midterm exam, and the keys posted to Moodle for all three quizzes, the problem sets, and the midterm. In the meantime I will be busy grading SAS assignment 2. I plan to have it all graded and ready for you to pick up by Tuesday morning (my Valentine's Day present to you all). Once it is graded, you should review it along with the key. Along with all this, you need to work on quiz 4/SAS assignment 3 (posted to Moodle Thursday AM), the review problem (white ain't right/think pink), and the key to this review problem. Note that on the final you will NOT have to calculate least squares estimates of the betas, you will NOT have to calculate sample correlation coefficients, and you will NOT have to perform tests of significance using the sample correlation coefficient. SAS and JMP can generate the beta estimates and sample correlation coefficients for us, and it is better to perform tests of significance using multiple regression results than to rely on "pair-wise" correlations.
PLEASE NOTE THAT AN ANSWER KEY TO THE REVIEW PROBLEM HAS BEEN POSTED TO MOODLE. ALSO PLEASE NOTE THAT YOU SHOULD WORK WITH THE PINK COPIES OF THE REVIEW PROBLEM AND SAS OUTPUTS DISTRIBUTED IN CLASS TODAY. PLEASE RECYCLE THE COPIES DISTRIBUTED IN CLASS ON MONDAY.
Look at the following as you work on your programs.
Sample SAS program to identify outliers:
title 'sample LMS program for group regression project';
footnote 'base for location dummies is bettendorf or moline';
/*Your group will need to run this before getting any other results. When you find outliers
make sure that you have checked and that your group has not made any mistakes entering data.
If you haven't made any mistakes entering data AND cannot explain why your outliers are
outliers then you should simply eliminate the outliers using a condition such as the one found
in the sample program*/
data price label;
label
  p='price (in thousands)'
  b='Bedrooms'
  size='size (square feet)'
  age='age'
  RI='dummy for Rock Island'
  DAV='dummy for Davenport';
input p b size age RI DAV;
price=p*1000; /*added this to convert figures in $1,000s to just plain old dollars.*/
agesq=age**2; /*added squared term because of nonlinear relationship between price and age*/
if RI=1 then City='Rock_island';
if DAV=1 then City='Davenport';
if RI=0 and DAV=0 then City='BettMol';
if RI=0 and DAV=0 then BM=1; else BM=0;
datalines;
38.5 3 1091 123 0 0
75 2 723 66 0 0
89.9 4 1908 91 0 0
119.9 3 1313 48 0 0
163.9 4 1890 89 0 0
384 4 3404 43 0 0
1275 5 5806 7 0 0
134.9 3 1132 26 0 1
218.9 3 2229 10 0 1
114.8 3 940 26 0 1
149.5 3 1308 85 0 1
289.9 4 2430 1 0 1
319.9 4 1961 1 0 1
349 4 3084 10 0 1
149.9 3 1232 45 1 0
149 5 1323 36 1 0
147 3 1180 48 1 0
146.9 2 1661 36 1 0
845.9 5 4354 9 1 0
429.5 5 4014 18 1 0
159.9 3 1472 35 1 0
;
ods html;
proc iml;
use price;
read all var {size age agesq dav ri} into x;
read all var {price} into y;
optn=j(8,1,.);
optn[2]=2;
optn[3]=1;
optn[8]=0;
call lms(sc,coef,wgt,optn,y,x);
run;
ods html close;
/*ignore the red*/
Sample SAS program to generate results needed for write-up:
/*IN THE FOLLOWING LMS HAS ALREADY BEEN RUN AND FOUR OUTLIERS HAVE BEEN IDENTIFIED*/
title 'sample program for group regression project';
footnote 'base for location dummies is bettendorf or moline';
data price label;
label
  p='price (in thousands)'
  b='Bedrooms'
  size='size (square feet)'
  age='age'
  RI='dummy for Rock Island'
  DAV='dummy for Davenport';
input p b size age RI DAV;
price=p*1000;
agesq=age**2; /*added squared term because of nonlinear relationship between price and age*/
if RI=1 then City='Rock_island';
if DAV=1 then City='Davenport';
if RI=0 and DAV=0 then City='BettMol';
if RI=0 and DAV=0 then BM=1; else BM=0;
if _n_=3 or _n_=7 or _n_=13 or _n_=19 then delete; /*ran proc iml/LMS and found 4 outliers in the data. Observations 3, 7, 13 and 19*/
datalines;
38.5 3 1091 123 0 0
75 2 723 66 0 0
89.9 4 1908 91 0 0
119.9 3 1313 48 0 0
163.9 4 1890 89 0 0
384 4 3404 43 0 0
1275 5 5806 7 0 0
134.9 3 1132 26 0 1
218.9 3 2229 10 0 1
114.8 3 940 26 0 1
149.5 3 1308 85 0 1
289.9 4 2430 1 0 1
319.9 4 1961 1 0 1
349 4 3084 10 0 1
149.9 3 1232 45 1 0
149 5 1323 36 1 0
147 3 1180 48 1 0
146.9 2 1661 36 1 0
845.9 5 4354 9 1 0
429.5 5 4014 18 1 0
159.9 3 1472 35 1 0
;
/*proc iml;
use price;
read all var {size age agesq dav ri} into x;
read all var {price} into y;
optn=j(8,1,.);
optn[2]=2;
optn[3]=1;
optn[8]=0;
call lms(sc,coef,wgt,optn,y,x);
run;*/
ods html;
proc print; var price size age agesq DAV RI;
title2 'print out of data being analyzed in the program';
* The following command is the first step needed
to check for multicollinearity;
proc corr; var price size age agesq DAV RI;
* We might like to look at the following to see if there is any evidence
that the straight line model does not match the data. Can also do this
with JMP!;
proc plot hpercent=50 vpercent=50;
plot (Price)*(size age);
title2 'plots that can be used to see if straight line approach is the right approach';
* We need the following to get the results of the general model with
Durbin Watson and variance inflation factors;
proc reg;
model price=size age agesq DAV RI/vif dw;
title2 'regression model being estimated';
output out=new p=yhat r=resids;
proc plot data=new vpercent=50 hpercent=50; plot (resids)*(size age agesq DAV Ri)/vref=0;
title2 'residual plots to help identify heteroscedasticity';
*the above generates plots to
help identify if heteroscedasticity is present;
data hetero;
set new;
rssq=resids**2;
proc reg data=hetero; model rssq=size age agesq dav ri;
title2 'R squared from this part of program is the one used in the';
title3 'B-P/ChiSquare test for hetero';
proc reg data=hetero; model rssq=yhat;
title2 'R squared from this part of the program is the one used in the';
title3 'Yhat/Chisquare test for hetero';
* We need the following two steps to see if the error terms are normal;
proc univariate plot normal; var resids;
title2 'Note this step gives you the Shapiro-Wilk stat';
title3 'and p-value';
proc chart; vbar resids;
title2 'vertical bar chart to assess for small samples if the error term follows a';
title3 'normal distribution';
title4 'note you can also do this with JMP';
run;
ods html close;
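The titles in the program say that the R squared from each regression of the squared residuals (rssq) is "the one used in" the chi-square tests for heteroscedasticity. As a sketch only, assuming the tests take the common n times R squared form (check this against your notes), the statistic can be written as below. The symbols LM, n, k and R^2_aux are labels introduced here for illustration; they do not appear in the program.

```latex
% Auxiliary regression: squared residuals (rssq) regressed on the chosen regressors.
% R^2_aux is the R squared reported by that auxiliary regression.
\[
  LM = n \cdot R^2_{\text{aux}} \;\sim\; \chi^2_{k}
  \quad \text{under } H_0\text{: the error variance is constant}
\]
% B-P/ChiSquare version: rssq on size, age, agesq, dav, ri, so k = 5.
% Yhat/ChiSquare version: rssq on yhat, so k = 1.
% n = 21 - 4 = 17 observations remain after the four outliers are deleted.
```

Under this assumption, you would reject the hypothesis of constant error variance when LM exceeds the chi-square critical value for k degrees of freedom at your chosen significance level.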