Group Members: Wenli Hu, Joyce Jiang, Xi Tian, Ye Yu
April 19th, 2012
Is it possible to collect data from the entire population?
-If so, we can talk about what is true for the entire population
-Often we cannot (time/cost)
-If not, we can use a smaller subset: a SAMPLE
Research Background Introduction
Sampling Methods
1. simple random sampling
2. post-stratification
3. regression
4. stratified sampling
Conclusion
Pima Indians are American Indians who live today in the Gila River Indian Community (Arizona). They have a rate of type II diabetes much higher than the “normal” rate in the US. They are said to be genetically susceptible to diabetes and obesity, and are taken as an example of how genetics can cause diabetes. Pima women seem to have a higher rate than men.
Done by: National Institute of Diabetes and Digestive and Kidney Diseases
Data received: 9 May 1990
Population: 768 Pima Indian women
Tested positive instances: 268
Tested negative instances: 500
Attributes we observed:
-Plasma glucose concentration at 2 hours in an oral glucose tolerance test
-Age
-Class variable (0 or 1)
Simplest method: establish a sample size, then randomly select units until we reach that sample size.
• Data set:
We have a list of 532 patients and randomly select 50 of them from this list (without replacement). N=532 n=50
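The draw described here can be sketched in Python. This is only an illustration: the patient IDs 1 to 532 are a hypothetical stand-in for the real list, and the seed is arbitrary.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame (without replacement)."""
    rng = random.Random(seed)
    return rng.sample(frame, n)


# Hypothetical sampling frame: IDs 1..532 stand in for the real patient list.
N, n = 532, 50
frame = list(range(1, N + 1))
sample_ids = simple_random_sample(frame, n, seed=2012)
```

Because `random.sample` never returns the same unit twice, every subset of 50 patients is equally likely to be chosen.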
...
• Data analysis
Advantages
-Simple and unbiased
Disadvantages
-Requires an accurate list of the whole population
-Expensive to conduct
Post-stratification: stratification after selection of the sample.
The sample is not balanced with respect to diabetes status (yes/no).
...
Diabetes        Yes       No
Sample Size     177       355
Glucose Mean    142.69    114.08
Variance        824.40    632.69
= 26.43
Advantages
-Makes weighted estimates to ensure proportional representation
Disadvantages
-Requires more information about the population being sampled
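As a rough illustration of the weighting idea, the sketch below computes a post-stratified mean from the group summaries in the table above. It assumes the counts 177 and 355 act as stratum sizes (they sum to N = 532) and that each glucose mean is the within-group mean; the function name is hypothetical, and this is not necessarily the calculation behind the 26.43 figure.

```python
# Group summaries taken from the table: (count, glucose mean).
# Assumption: the counts are used as stratum sizes N_h, with N = 177 + 355 = 532.
strata = {"yes": (177, 142.69), "no": (355, 114.08)}


def poststratified_mean(strata):
    """Weight each stratum mean by its share N_h / N of the population."""
    total = sum(count for count, _ in strata.values())
    return sum(count / total * mean for count, mean in strata.values())


estimate = poststratified_mean(strata)
print(round(estimate, 2))
```

Each group contributes in proportion to its size, so a sample that happens to over-represent one group does not pull the estimate toward that group's mean.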
Regression estimator: age as auxiliary variable
[Scatter plot: z$glu (glucose) against z$age (age)]
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   87.3574     13.2850    6.576  3.29e-08 ***
z$age          1.0855      0.3939    2.756   0.00826 **
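Y: glucose, X: age, with mean age $\bar{X} = 31.61466 \approx 31.6$

$$\hat{\mu}_L = a + b\,\bar{X} = 87.3574 + 1.0855 \times 31.61466 = 121.67$$

$$\widehat{\mathrm{Var}}(\hat{\mu}_L) = \frac{(N-n)\,\mathrm{MSE}}{N\,n} = \frac{(532-50) \times 1076.4}{532 \times 50} = 19.50$$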
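The arithmetic above can be checked with a short Python sketch. The function names are hypothetical; the inputs are the fitted coefficients, the mean age 31.61466, and the MSE of 1076.4 quoted on the slide.

```python
def regression_estimate(intercept, slope, x_mean):
    """Regression estimate of the mean of Y: a + b * (mean of X)."""
    return intercept + slope * x_mean


def regression_variance(N, n, mse):
    """Estimated variance (N - n) * MSE / (N * n), as in the formula above."""
    return (N - n) * mse / (N * n)


mean_glucose = regression_estimate(87.3574, 1.0855, 31.61466)
var_glucose = regression_variance(532, 50, 1076.4)
print(round(mean_glucose, 2), round(var_glucose, 2))
```

The factor (N - n) / N is the finite population correction: it shrinks the variance because 50 of the 532 units have actually been observed.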
PROC SURVEYREG (SAS):
-Performs regression analysis for sample survey data
-Handles survey sample designs, including designs with stratification, clustering, and unequal weighting
-With ESTIMATE statements, you can specify a regression estimator
proc surveyreg data=Municipalities total=50;
   cluster Cluster;
   model Population85=Population75;
   estimate '1985 population' Intercept 284 Population75 8200;
run;
Cited from: http://www.math.montana.edu/~jobo/thai/4ratreg.pdf
Stratified Sampling
Proportional allocation:

$$n_h = n \cdot \frac{N_h}{N}, \qquad n_1 = 17, \quad n_2 = 33$$
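Proportional allocation gives each stratum a share of the sample equal to its share of the population. A minimal Python sketch, assuming the stratum sizes 177 (diabetes) and 355 (no diabetes) from the post-stratification table and the overall sample size n = 50; the function name and stratum labels are hypothetical.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate n sample units across strata in proportion to N_h / N.

    Rounding each share separately is a simplification: in general the
    rounded sizes may not add up to exactly n and may need adjusting.
    """
    N = sum(stratum_sizes.values())
    return {h: round(n * N_h / N) for h, N_h in stratum_sizes.items()}


allocation = proportional_allocation({"diabetes": 177, "no diabetes": 355}, 50)
print(allocation)
```

With these inputs the unrounded shares are about 16.6 and 33.4, which round to the stratum sample sizes 17 and 33 shown on the slide.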