(1) Quiz 1 and SAS assignment #1 are both moved to next Monday, December 4th. (2) We started "regression" analysis Monday. (3) Note handout on project posted to Moodle this morning. Please take some time to read this. I should be meeting with your group soon to discuss project ideas with you. (4) The midterm exam is scheduled for Friday of week five (12-16-2011).
When we reject the null hypothesis in a test of significance on rho, this indicates that there is a significant linear (line-like) relationship. That is, the relationship between the two variables can be captured by a line and a linear equation (a straight line equation).
What sort of information does a straight line equation yield? (1) slope, (2) y intercept, (3) predictions of y based on x, and (4) prediction of x based on y.
How do we get such estimates?
Technique of OLS (ordinary least squares) or for short, just LS (least squares).
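For reference, the least squares idea for simple regression can be written out as below. This is the standard textbook form; the symbols b0 (intercept), b1 (slope), and n (number of observations) are notation assumed here rather than taken from the handout.

```latex
% Least squares picks the intercept b_0 and slope b_1 that minimize
% the sum of squared vertical distances between the data and the line.
\[
\min_{b_0,\,b_1}\ \sum_{i=1}^{n}\bigl(y_i - b_0 - b_1 x_i\bigr)^2
\]
% The solution for simple regression:
\[
b_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2},
\qquad
b_0 = \bar{y} - b_1\,\bar{x}
\]
```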
Important terms: dependent variable and independent variable (simple regression), or independent variables in the case of multiple regression.
An example? LS estimates of slope and intercept for the quantity of calculators sold and calculator price example (JMP output).
(1) Slope: positive or negative? Large or small in absolute value? In every case the size of the change in y for a one unit increase in x is equal to the absolute value of the slope.
If positive and large, a relatively small increase in x (a one unit increase) leads to a pretty substantial increase in y.
If positive and small, a relatively small increase in x (a one unit increase) leads to a pretty modest increase in y.
If negative and large in absolute value, a relatively small increase in x (a one unit increase) leads to a pretty substantial drop in y.
If negative and small in absolute value, a relatively small increase in x (a one unit increase) leads to a pretty modest drop in y.
In general this is how you interpret the estimated slope:
We estimate that on average an increase in (x) of one (units x is measured in) will lead to a (ABSOLUTE VALUE OF b1) (units y is measured in) increase/decrease (increase if b1 > 0 or decrease if b1 < 0) in (y).
(2) y intercept: What we estimate we would expect y on average to be if the explanatory variable were equal to zero.
(3) If the relationship between y and x follows the estimated linear equation, then our estimate of what we would expect y to be given a particular value of x can be generated straight from the equation itself. In our example on Monday the estimated equation was ŷ = 27.273 - 1.017978*x, where y = quarterly sales of calculators per store (measured in 1000s of calculators) and x = the price charged per calculator. We forecasted that this company should expect sales of about 10,476 calculators per store.
(4) Going the other way, from a target value of y back to x: if they are targeting quarterly sales of 8,000 calculators per store they should charge no more than $18.93 per calculator.
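The two calculations in the calculator example can be sketched in a few lines of Python (used here instead of SAS or JMP only so the arithmetic is easy to rerun). The function names are made up for illustration. The price of $16.50 is an assumption: the notes do not state the price behind the 10,476 forecast, but it is the value that reproduces that figure from the estimated equation.

```python
# Estimated equation from the notes: y-hat = 27.273 - 1.017978 * x
# y = quarterly sales per store (1000s of calculators), x = price per calculator ($)
B0 = 27.273      # estimated intercept
B1 = -1.017978   # estimated slope


def predict_sales(price):
    """Predicted quarterly sales per store, in 1000s of calculators."""
    return B0 + B1 * price


def price_for_sales(target_sales):
    """Price at which predicted sales equal target_sales (in 1000s)."""
    return (target_sales - B0) / B1


# ASSUMPTION: $16.50 is inferred from the 10,476 forecast, not stated in the notes.
sales = predict_sales(16.50)
print(f"Predicted sales at $16.50: {sales * 1000:,.0f} calculators per store")

# Target of 8,000 calculators per store is 8.0 in units of 1000s.
price = price_for_sales(8.0)
print(f"Price for 8,000 calculators per store: ${price:.2f}")
```

Because the slope is negative, any price above the second figure pushes predicted sales below the 8,000 target, which is why it is a ceiling on price.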
Another example? Performance time and training time. JMP output.
Like the sample correlation coefficient, LS estimates are sensitive to outliers. With simple regression we can more often than not identify outliers visually. A technique for identifying outliers statistically is introduced in problem set 2. The technique is called least median of squares (LMS). LMS is an extremely powerful technique we can use to identify outliers.
The idea is that, as you know, medians are far less sensitive to outliers than means are.
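A quick illustration of that idea, sketched in Python. The numbers are hypothetical and chosen only to make the point; they are not from any of the course data sets.

```python
from statistics import mean, median

# Hypothetical illustration values, not data from the course examples.
clean = [14, 15, 15, 16, 17]
with_outlier = [14, 15, 15, 16, 70]   # largest value replaced by an outlier

# The mean is dragged toward the outlier; the median does not move.
print("mean:  ", mean(clean), "->", mean(with_outlier))
print("median:", median(clean), "->", median(with_outlier))
```

LMS applies the same logic to regression: it judges a candidate line by the median of its squared residuals rather than by their sum.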
For more background on LMS click here or here.
As an example take the following program:
TITLE 'first example of LMS';
TITLE2 'PERFORMANCE TIME';
DATA TIME;
LABEL TT='TRAINING TIME (HR)'
      PT='PERFORMANCE TIME (MIN)';
INPUT TT PT;
DATALINES;
27 15
28 14
22 18
22 17
15 22
29 13
24 15
20 16
26 14
15 21
22 16
25 18
23 17
28 13
20 15
25 15
18 20
;
proc iml;
use time;
read all var {TT} into x; /*you would list the explanatory variables here*/
read all var {PT} into y; /*you would list the dependent variable here*/
optn=j(8,1,.);
optn[2]=2;
optn[3]=1;
optn[8]=0;
call lms(sc,coef,wgt,optn,y,x);
run;
/*ignore the red and leave the six lines above as they are here*/
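To see what LMS is doing, here is a rough sketch in Python, not SAS. It is a simplified stand-in, not the algorithm behind the SAS lms call: it tries the line through every pair of data points and keeps the one with the smallest median squared residual. The function names are made up for illustration, and it uses the plain median where formal LMS definitions use a specific order statistic.

```python
from itertools import combinations
from statistics import median

# (TT, PT) pairs copied from the DATALINES block above
DATA = [(27, 15), (28, 14), (22, 18), (22, 17), (15, 22), (29, 13),
        (24, 15), (20, 16), (26, 14), (15, 21), (22, 16), (25, 18),
        (23, 17), (28, 13), (20, 15), (25, 15), (18, 20)]


def median_sq_resid(b0, b1, data):
    """Median of the squared residuals for the line y = b0 + b1 * x."""
    return median((y - b0 - b1 * x) ** 2 for x, y in data)


def lms_pairs(data):
    """Approximate LMS fit: try the line through every pair of points."""
    best = None
    for (x1, y1), (x2, y2) in combinations(data, 2):
        if x1 == x2:
            continue  # same x value: the line would be vertical, skip it
        b1 = (y2 - y1) / (x2 - x1)
        b0 = y1 - b1 * x1
        crit = median_sq_resid(b0, b1, data)
        if best is None or crit < best[0]:
            best = (crit, b0, b1)
    return best


crit, b0, b1 = lms_pairs(DATA)
print(f"approximate LMS line: PT = {b0:.3f} + ({b1:.3f}) * TT")
print(f"median squared residual: {crit:.3f}")

# Rank points by absolute residual from the LMS line; the top few are
# the candidates to look at as possible outliers.
ranked = sorted(DATA, key=lambda p: abs(p[1] - b0 - b1 * p[0]), reverse=True)
for x, y in ranked[:3]:
    print(f"TT={x}, PT={y}, residual={y - b0 - b1 * x:+.2f}")
```

Because the line is chosen to fit the bulk of the data, points that do not follow the main pattern show up with large residuals instead of pulling the line toward themselves as they would under LS.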
SECOND EXAMPLE (quarterly sales calculators and price example)
Note on JMP output where you find R2 (see SUMMARY OF FIT).
R squared is a measure of how well the "line of best fit" (the LS sample regression equation) fits the data.
Calculated by taking the ratio of total "explained" variation in y to the total variation in y.
Total variation in y is equal to the numerator of the sample variance formula. The amount of "explained" variation in y is equal to the total variation in y that you start with minus the amount of variation left unaccounted for by the estimated regression line, SSE or the error sum of squares (sum of squared residuals).
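Written as a formula, the description above looks like this. SSE is the notes' own label; SST for total variation is a label assumed here.

```latex
% Total variation, unexplained variation, and R squared
\[
SST = \sum_{i=1}^{n}(y_i-\bar{y})^2,
\qquad
SSE = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2,
\qquad
R^2 = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}
\]
```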
ON SAS OUTPUT:
first example of LMS
PERFORMANCE TIME

| Number of Observations Read | 14 |
|---|---|
| Number of Observations Used | 14 |

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 1 | 103.69724 | 103.69724 | 217.12 | <.0001 |
| Error | 12 | 5.73134 | 0.47761 | | |
| Corrected Total | 13 | 109.42857 | | | |

| Root MSE | 0.69109 | R-Square | 0.9476 |
|---|---|---|---|
| Dependent Mean | 16.42857 | Adj R-Sq | 0.9433 |
| Coeff Var | 4.20666 | | |

Parameter Estimates

| Variable | Label | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|---|
| Intercept | Intercept | 1 | 30.72555 | 0.98771 | 31.11 | <.0001 |
| TT | TRAINING TIME (HR) | 1 | -0.61777 | 0.04193 | -14.73 | <.0001 |
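The SAS output above reports 14 observations, while the program lists 17 data lines, and the notes do not say which rows were left out. One guess, labeled as an assumption in the code, is that the points (25, 18), (20, 16), and (20, 15) were dropped as outliers. The Python sketch below (not SAS, and with made-up function names) fits LS by hand under that assumption; it appears to reproduce the printed intercept, slope, and R-Square.

```python
# (TT, PT) pairs copied from the DATALINES block in the program
DATA = [(27, 15), (28, 14), (22, 18), (22, 17), (15, 22), (29, 13),
        (24, 15), (20, 16), (26, 14), (15, 21), (22, 16), (25, 18),
        (23, 17), (28, 13), (20, 15), (25, 15), (18, 20)]

# ASSUMPTION: these are the three rows excluded before the regression was run.
# The notes do not say so; the choice is inferred from the printed output.
DROPPED = [(25, 18), (20, 16), (20, 15)]
kept = [p for p in DATA if p not in DROPPED]


def least_squares(data):
    """Return (intercept, slope, R squared) for a simple LS regression."""
    n = len(data)
    xbar = sum(x for x, _ in data) / n
    ybar = sum(y for _, y in data) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in data)
    sxx = sum((x - xbar) ** 2 for x, _ in data)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sst = sum((y - ybar) ** 2 for _, y in data)            # total variation
    sse = sum((y - b0 - b1 * x) ** 2 for x, y in data)     # unexplained variation
    return b0, b1, 1 - sse / sst


b0, b1, r2 = least_squares(kept)
print(f"14 rows: intercept={b0:.5f}, slope={b1:.5f}, R-Square={r2:.4f}")

# For comparison, the same fit on all 17 rows.
b0_all, b1_all, r2_all = least_squares(DATA)
print(f"17 rows: intercept={b0_all:.5f}, slope={b1_all:.5f}, R-Square={r2_all:.4f}")
```

Comparing the two printed lines shows how much three observations can move the LS estimates and the fit.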
R squared. SAS output and the coefficient of determination: SAS reports R2 as R-Square (0.9476 in the output above). It is a measure of how well the "line of best fit" (the LS sample regression equation) fits the data, calculated by taking the ratio of total "explained" variation in y to the total variation in y. Total variation in y is equal to the numerator of the sample variance formula (the Corrected Total sum of squares on the SAS output). The amount of "explained" variation in y (the Model sum of squares) is equal to the total variation in y that you start with minus the amount of variation left unaccounted for by the estimated regression line, SSE or the error sum of squares (sum of squared residuals, the Error sum of squares on the SAS output).
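As a check on how the pieces of the SAS output fit together, the summary statistics can be recomputed from the sums of squares. This is a sketch in Python, not SAS; the variable names are made up, and the formulas for adjusted R squared, F, and t are the standard textbook ones rather than anything stated in these notes.

```python
import math

# Values copied from the SAS output above
ss_model, ss_error, ss_total = 103.69724, 5.73134, 109.42857
df_model, df_error, df_total = 1, 12, 13
dep_mean = 16.42857
slope, slope_se = -0.61777, 0.04193

r_square = ss_model / ss_total                       # explained / total variation
adj_r_square = 1 - (ss_error / df_error) / (ss_total / df_total)
mse = ss_error / df_error                            # Mean Square for Error
f_value = (ss_model / df_model) / mse
root_mse = math.sqrt(mse)
coeff_var = 100 * root_mse / dep_mean
t_slope = slope / slope_se

print(f"R-Square  = {r_square:.4f}")
print(f"Adj R-Sq  = {adj_r_square:.4f}")
print(f"F Value   = {f_value:.2f}")
print(f"Root MSE  = {root_mse:.5f}")
print(f"Coeff Var = {coeff_var:.5f}")
print(f"t (TT)    = {t_slope:.2f}")
```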