1. In this data set, the dependent variable is the sale price of the house. There are nine explanatory (independent) variables: SQFT, BEDROOMS, BATHS, AGE, OWNER, POOL, TRADITIONAL, FIREPLACE and WATERFRONT. Of these, OWNER, POOL, TRADITIONAL and FIREPLACE are dummy variables.
The data set covers 864 houses sold over the last year. The sample mean of the price is $155,792.2 and the median is $130,000. The maximum is $1,580,000 and the minimum is $22,000. The standard deviation is $130,250.
2. The normal distribution is important because of the central limit theorem, which states that the average of a large number of independent random variables with well-defined expected values and variances is approximately normally distributed. The normal distribution is symmetric about its mean and has non-zero density over the entire real line. It is often used to represent real-valued random variables whose distributions are not known.
According to the histogram of the sample data, the distribution of price is clearly positively skewed. The median price is $130,000, far below the maximum price of $1,580,000.
Next, test whether the sale prices are normally distributed at the 5% level of significance. The p-value reported in the diagram is zero, which is less than 0.05, so normality is rejected: the sale prices are not normally distributed at the 5% level of significance.
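The normality test used here is the Jarque-Bera (JB) test, which combines sample skewness S and kurtosis K as JB = (N/6)(S^2 + (K-3)^2/4) and compares it with a chi-square(2) critical value of 5.99 at the 5% level. The following is a minimal sketch of that calculation; the price list is a small hypothetical sample, not the 864-house data set, and the function name is illustrative.

```python
import math

def jarque_bera(data):
    """Return (JB statistic, p-value, skewness, kurtosis) for a sample."""
    n = len(data)
    mean = sum(data) / n
    # Central moments (divided by n, as in the usual JB formula)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    # For a chi-square with 2 degrees of freedom, P(X > x) = exp(-x / 2)
    p_value = math.exp(-jb / 2)
    return jb, p_value, skew, kurt

CRITICAL_5PCT = 5.99  # chi-square(2) critical value at the 5% level

# Hypothetical, right-skewed prices for illustration only
prices = [22000, 60000, 90000, 130000, 150000, 200000, 400000, 1580000]
jb, p, s, k = jarque_bera(prices)
print(f"skewness={s:.3f} kurtosis={k:.3f} JB={jb:.3f} p={p:.4f}")
print("reject normality" if jb > CRITICAL_5PCT else "do not reject normality")
```

Rejecting when JB > 5.99 is equivalent to rejecting when the p-value is below 0.05.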
3. From the diagram in question 2, we can see that sale prices are positive and have a distribution that is positively skewed, with a long tail to the right. For such a variable, a logarithmic transformation is more suitable than leaving the variable in levels. Taking the log of the sale price makes the larger values less extreme, so the log of price is closer to a normal distribution.
From the diagram, the skewness is 0.4210, which is much smaller than before, so the distribution is more symmetric. The JB value is 375, which is still larger than the critical value of 5.99, and the p-value is zero, so normality is rejected at the 5% level of significance. The log sale prices are not normally distributed, but they are closer to normal than the sale prices in levels.
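To illustrate why the log transformation reduces skewness, the sketch below compares the sample skewness of a small hypothetical set of prices with the skewness of their logs. The numbers are made up for illustration and will not reproduce the 0.4210 reported for the actual data.

```python
import math

def skewness(data):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

# Hypothetical, right-skewed prices for illustration only
prices = [22000, 60000, 90000, 130000, 150000, 200000, 400000, 1580000]
log_prices = [math.log(p) for p in prices]

print(f"skewness of price:      {skewness(prices):.3f}")
print(f"skewness of log(price): {skewness(log_prices):.3f}")
```

The log compresses the long right tail, so the largest price pulls the mean far less in logs than in levels.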
4. Estimated regression of log(PRICE) on SQFT and AGE (standard errors in parentheses):
(SE) (0.0282) (0.0000097) (0.000581)
The intercept, 11.03271, is the expected value of log(price) when SQFT and AGE are both zero.
A one-square-foot increase in house size leads to an approximately 0.0396% increase in price, keeping all other variables constant. (Significant at the 1% level)
A one-year increase in house age leads to an approximately 8.062% decrease in sale price, keeping all other variables constant. (Significant at the 1% level)
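In a log-linear model, 100 times the coefficient is only an approximation to the percentage change in price; the exact change is 100*(exp(b) - 1). The sketch below compares the two using the SQFT coefficient of 0.000396 implied by the 0.0396% figure above. The function names and the 100-square-foot scenario are illustrative assumptions.

```python
import math

def approx_pct_change(b, delta=1.0):
    """Approximate % change in price for a change of `delta` in the regressor."""
    return 100 * b * delta

def exact_pct_change(b, delta=1.0):
    """Exact % change in price implied by a log-linear model."""
    return 100 * (math.exp(b * delta) - 1)

b_sqft = 0.000396  # coefficient implied by the 0.0396% effect stated in the text

# One extra square foot: approximation and exact value are almost identical
print(approx_pct_change(b_sqft), exact_pct_change(b_sqft))

# 100 extra square feet (hypothetical scenario): the gap becomes visible
print(approx_pct_change(b_sqft, 100), exact_pct_change(b_sqft, 100))
```

The approximation is accurate when the coefficient times the change in the regressor is small, and it understates the exact effect as that product grows.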
From the residual diagram, the JB value is 377.2571, which is larger than the critical value of 5.99, and the p-value is zero, so normality is rejected at the 5% level of significance. The residuals of the regression are not normally distributed. This is consistent with (2) and (3).
From the estimated regression equation, a one-year increase in age decreases the sale price by 8.062%.
5. The model in (4) considers only two factors, SQFT and AGE, which is not enough to estimate the sale price of a house. If relevant variables are omitted from a multiple regression model, the model is mis-specified and the OLS estimator may be biased and inconsistent. Omitted-variable bias occurs when a model incorrectly leaves out one or more important causal factors.
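Omitted-variable bias can be seen in a small noise-free example. The sketch below builds y from two regressors, then regresses y on only one of them; the resulting slope equals the true coefficient plus the omitted coefficient times the slope of the omitted variable on the included one. All numbers and names are hypothetical and unrelated to the house data.

```python
def slope(x, y):
    """OLS slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical data: x is included, z is the omitted variable correlated with x
x = [1, 2, 3, 4, 5]
z = [1, 1, 2, 2, 3]
beta_x, beta_z = 2.0, 3.0
y = [1 + beta_x * a + beta_z * b for a, b in zip(x, z)]  # true model, no noise

short_slope = slope(x, y)         # regression that omits z
bias = beta_z * slope(x, z)       # omitted coefficient times slope of z on x
print(f"true beta_x={beta_x}, estimated={short_slope}, bias={bias}")
```

The bias disappears only if the omitted variable has a zero coefficient or is uncorrelated with the included regressor.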
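The adequacy of the model can be tested with Ramsey's RESET (regression specification error test).
Hypotheses: H0: the model is correctly specified (the coefficients on the added powers of the fitted values are zero); H1: the model is mis-specified.
Decision rule: reject H0 if F > F(1-0.05, J, N-K). The test statistic is F = 1.3877.
Decision: do not reject H0.
Conclusion: There is insufficient evidence to suggest that the model is mis-specified at the 5% level of significance.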
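The RESET F statistic compares the sum of squared errors of the original (restricted) model with that of the model augmented by powers of the fitted values (unrestricted). The sketch below shows the calculation and the decision rule. The SSE values, J = 1 and K = 4 are hypothetical assumptions for illustration, and the critical value of about 3.85 is an approximate table value for F(0.95, 1, 860).

```python
def f_statistic(sse_restricted, sse_unrestricted, j, n, k):
    """F = ((SSE_R - SSE_U) / J) / (SSE_U / (N - K)).

    j: number of restrictions (added powers of the fitted values)
    n: number of observations
    k: number of coefficients in the unrestricted model
    """
    return ((sse_restricted - sse_unrestricted) / j) / (sse_unrestricted / (n - k))

# Hypothetical values: one added term (yhat^2), 864 observations,
# unrestricted model with intercept, SQFT, AGE and yhat^2
f_stat = f_statistic(sse_restricted=100.0, sse_unrestricted=99.8, j=1, n=864, k=4)
f_critical = 3.85  # approximate 5% critical value for F(1, 860), from tables

print(f"F = {f_stat:.4f}")
print("reject H0" if f_stat > f_critical else "do not reject H0")
```

With the reported F = 1.3877, the same rule gives "do not reject H0" for any conventional choice of J, since the 5% critical values are well above that figure.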
6. The previous regression may suffer from omitted-variable bias because it leaves out one or more important causal factors. BEDROOMS, POOL and FIREPLACE also influence the sale price of a house, but they were not included in the previous regression model.
When the number of bedrooms increases by one, the expected change in sale price is 100*(exp(-0.0077)-1) = -0.767%, that is, a decrease of about 0.767%. In my opinion, it is not