1. Understand the linear relationship between two variables
Measures of linear relationship: sample covariance and the Pearson coefficient of correlation (scale-free)
Data > Data Analysis > Correlation
2. Perform parametric and non-parametric tests for the coefficient of correlation
Parametric Test: T-test, Data Analysis Plus > Correlation (Pearson)
Non-Parametric Test: Z-test, Data Analysis Plus > Correlation (Spearman)
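The tests above are run through the Data Analysis add-ins; as a rough cross-check, here is a minimal Python sketch of the same two tests using scipy (the data are made up for illustration; note that scipy's spearmanr reports a p-value from a t approximation rather than the large-sample z-test used by Data Analysis Plus):

    import numpy as np
    from scipy import stats

    # made-up data; replace with the two variables being analysed
    x = np.array([1.2, 2.4, 3.1, 4.8, 5.0, 6.3, 7.7])
    y = np.array([2.1, 2.9, 3.8, 5.2, 5.1, 6.9, 8.0])

    # parametric test: Pearson correlation (p-value from a t-test)
    r, p_pearson = stats.pearsonr(x, y)

    # non-parametric test: Spearman rank correlation
    rho, p_spearman = stats.spearmanr(x, y)

    print(f"Pearson r = {r:.3f}, p = {p_pearson:.4f}")
    print(f"Spearman rho = {rho:.3f}, p = {p_spearman:.4f}")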
3. Understand the concept of simple linear regression analysis
A statistical process to construct a relationship between two variables
The model involves regression coefficients and a random error term; the coefficients are estimated using the method of least squares, which minimises the error sum of squares (SSE)
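A minimal sketch of what the least-squares calculation does, with made-up data (b0 and b1 are the estimated intercept and slope):

    import numpy as np

    # made-up data
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    # least squares minimises SSE = sum of (y - b0 - b1*x)^2
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    sse = np.sum((y - (b0 + b1 * x)) ** 2)
    print(f"estimated equation: y-hat = {b0:.3f} + {b1:.3f} x   (SSE = {sse:.3f})")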
4. Obtain the estimated regression equation of a SLR model
Assumptions:
1) The random errors are uncorrelated
2) The random errors are normally distributed with constant variance
3) The independent variable x is measured without error
4) The regression coefficients are independent of the random error
Data > Data Analysis > Regression
How to check the validity of the model? Use the F-test statistic in the ANOVA table.
How much variation of the dependent variable y can be explained by the regression equation? Use the coefficient of determination, R square: the proportion of the variation of y that can be explained by the regression equation (interpretation: "this regression model can explain (R square x 100)% of the total variation of y").
How to check the assumptions of the regression model? Use a residual plot.
5. Perform overall significance test for the SLR model
F-test result in the ANOVA table
6. Perform significance test on an individual independent variable for the SLR model
T-test result in the Parameter estimate table
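Items 4-6 all read off the same regression output. A minimal Python sketch of the equivalent output using statsmodels, with made-up data (the actual analysis uses Data > Data Analysis > Regression):

    import numpy as np
    import statsmodels.api as sm

    # made-up data
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.3, 3.8, 6.1, 7.9, 9.8, 12.2])

    model = sm.OLS(y, sm.add_constant(x)).fit()

    print(model.params)                  # estimated regression coefficients (item 4)
    print(model.fvalue, model.f_pvalue)  # overall F-test from the ANOVA table (item 5)
    print(model.tvalues, model.pvalues)  # t-test for the individual coefficient (item 6)
    print(model.rsquared)                # coefficient of determination R square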
7. Predict future y-value with confidence interval
Data Analysis > Prediction Interval, or do it manually
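A sketch of the manual calculation of a 95% prediction interval, using the standard SLR formula y-hat +/- t(alpha/2, n-2) * s * sqrt(1 + 1/n + (x0 - x-bar)^2 / Sxx), with made-up data (x0 = 4.5 is an arbitrary choice):

    import numpy as np
    from scipy import stats

    # made-up data
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.3, 3.8, 6.1, 7.9, 9.8, 12.2])
    n = len(x)

    # least-squares fit and standard error of estimate
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

    # 95% prediction interval for a future y at x0
    x0 = 4.5
    y_hat = b0 + b1 * x0
    t = stats.t.ppf(0.975, df=n - 2)
    margin = t * s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))
    print(f"prediction interval: {y_hat:.3f} +/- {margin:.3f}")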
8. Interpret the residual plots and assess model assumptions
Model assumptions of the regression analysis require that the random errors are (i) normally distributed, (ii) uncorrelated, and (iii) homoscedastic (i.e. constant variance)
To check normality, we can use a Q-Q plot (the points should lie close to a straight line) or a formal statistical test (Chi-square test on the residuals: Data Analysis Plus).
To show that the errors are uncorrelated, we can plot the residuals vs their lagged values. The errors are uncorrelated if no pattern is observed.
To show whether the errors are homoscedastic or heteroscedastic, we can use a residual plot, which is a plot of the residuals vs the predicted y-values. A residual plot can also be used to identify outlying y-values, influential observations, nonlinear relationships between the variables, and violations of the uncorrelated-errors assumption. It is desirable to have a residual plot that shows no pattern.
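A minimal Python sketch of the three residual checks described above (Q-Q plot, residuals vs lagged residuals, residuals vs predicted y), with made-up data:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats
    import statsmodels.api as sm

    # made-up data
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([2.3, 3.8, 6.1, 7.9, 9.8, 12.2, 13.9, 16.1])
    model = sm.OLS(y, sm.add_constant(x)).fit()
    resid, fitted = model.resid, model.fittedvalues

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))

    # (i) normality: points on the Q-Q plot should lie close to a straight line
    stats.probplot(resid, dist="norm", plot=axes[0])

    # (ii) uncorrelated errors: residuals vs lagged residuals should show no pattern
    axes[1].scatter(resid[:-1], resid[1:])
    axes[1].set_xlabel("residual (lagged)")
    axes[1].set_ylabel("residual")

    # (iii) constant variance: residuals vs predicted y should show no pattern
    axes[2].scatter(fitted, resid)
    axes[2].set_xlabel("predicted y")
    axes[2].set_ylabel("residual")

    plt.tight_layout()
    plt.show()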
Checklist:
1. Understand the relationship between one dependent variable and many independent variables
Data Analysis > Regression
1) Estimate the regression coefficients
2) Test the validity of the MLR model
3) Test the significance of an individual independent variable
4) Estimate the error variance
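A minimal Python sketch of the four outputs listed above for an MLR model, using statsmodels (x1 and x2 are made-up predictors, not from the course data):

    import numpy as np
    import statsmodels.api as sm

    # made-up data with two independent variables x1 and x2
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
    y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9, 15.2, 15.8])

    X = sm.add_constant(np.column_stack([x1, x2]))
    model = sm.OLS(y, X).fit()

    print(model.params)                   # 1) estimated regression coefficients
    print(model.fvalue, model.f_pvalue)   # 2) overall F-test: validity of the MLR model
    print(model.tvalues, model.pvalues)   # 3) t-test for each individual independent variable
    print(model.mse_resid)                # 4) estimate of the error variance, SSE / (n - k - 1)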
2. Perform overall significance test for the model and significance test on an individual independent variable
Test 1: overall significance test for the regression model – F-test result in the ANOVA table
Test 2: significance test for a particular independent variable – T-test result in the Parameter estimate table
Basic regression output does not tell which model gives the best fit to the data.
But it can tell which variables are important and which are unimportant (based on their p-values).
3. Use R square and adjusted R square
R square never decreases (and typically increases) as more independent variables are added to the model
So it is not suitable for comparing models with different numbers of independent variables
Therefore, to compare different regression models, adjusted R square is used.
The formulas for R square and adjusted R square are given below.
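The standard formulas (n = sample size, k = number of independent variables):
R square = SSR / SST = 1 - SSE / SST
Adjusted R square = 1 - (1 - R square) * (n - 1) / (n - k - 1)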
4. Learn how to automatically select the best subset of independent variables to be included in the multiple linear regression model
The stepwise regression method is an automated procedure that adds or removes one independent variable at a time, based on the significance of each variable, until no further variable can be added or removed
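A minimal forward-selection sketch (one flavour of stepwise regression, not necessarily the exact procedure used by the course software); forward_select, the alpha threshold and the data are illustrative choices:

    import numpy as np
    import statsmodels.api as sm

    def forward_select(y, candidates, alpha=0.05):
        """Forward stepwise selection: at each step add the candidate variable
        with the smallest p-value, as long as that p-value is below alpha."""
        selected = []
        remaining = dict(candidates)          # variable name -> column of data
        while remaining:
            best_name, best_p = None, alpha
            for name, col in remaining.items():
                cols = [candidates[s] for s in selected] + [col]
                X = sm.add_constant(np.column_stack(cols))
                p = sm.OLS(y, X).fit().pvalues[-1]   # p-value of the newly added variable
                if p < best_p:
                    best_name, best_p = name, p
            if best_name is None:
                break
            selected.append(best_name)
            del remaining[best_name]
        return selected

    # made-up data: y depends on x1 and x2 but not on x3
    rng = np.random.default_rng(0)
    n = 40
    x1, x2, x3 = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
    y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(scale=0.5, size=n)

    print(forward_select(y, {"x1": x1, "x2": x2, "x3": x3}))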