Covariance and Correlation
Numerical Summary of Data
Pan Chao
November 17, 2014
Measures of Center
1. Mean: arithmetic average
\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
Example: 1, 2, 2, 3, 4, 7, 9
\[ \bar{x} = \frac{1 + 2 + 2 + 3 + 4 + 7 + 9}{7} = 4 \]
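The mean formula can be checked with a few lines of Python. This is a minimal sketch; the function name `mean` and the variable `x_bar` are illustrative choices, not part of the slides.

```python
# Data from the example above.
data = [1, 2, 2, 3, 4, 7, 9]


def mean(values):
    """Arithmetic average: the sum of the values divided by their count."""
    return sum(values) / len(values)


x_bar = mean(data)
print(x_bar)  # 4.0
```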
2. Mode: most frequent value in a data set, highest peak.
Example: 2 is the mode in the previous example.
Remark: a data set can have more than one mode.
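One way to find every mode, including ties, is to count occurrences and keep the values with the highest count. The sketch below assumes a helper named `modes` (hypothetical); the second data set is made up only to show a tie.

```python
from collections import Counter


def modes(values):
    """Return all values that occur most often, in increasing order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 2, 2, 3, 4, 7, 9]))  # [2]
print(modes([1, 1, 2, 3, 3]))        # [1, 3]  (illustrative data with two modes)
```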
3. Median: midpoint of the data such that half of the values are smaller and half of the values are larger.
How to find the median:
1. arrange the data in increasing order (from smallest to largest)
2. count the number of observations, n.
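3a. If n is odd, the median is the middle ordered value:
\[ M = \left(\frac{n+1}{2}\right)\text{th ordered value} \]
3b. If n is even, the median is the average of the two middle ordered values:
\[ M = \text{average of the } \left(\frac{n}{2}\right)\text{th and } \left(\frac{n}{2} + 1\right)\text{th ordered values} \]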
Example: observations 7, 9, 10, 12, 14 (the sample median is 10)
Example: observations 3, 4, 9, 12, 14, 19 (the sample median is (9 + 12)/2 = 10.5)
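The steps above can be sketched directly in Python. The function name `median` is an illustrative choice; note that the slides count positions from 1, while Python lists are indexed from 0.

```python
def median(values):
    """Median following the slide's steps: sort, count, pick the middle."""
    ordered = sorted(values)          # step 1: increasing order
    n = len(ordered)                  # step 2: number of observations
    if n % 2 == 1:
        # step 3a: the ((n + 1) / 2)th ordered value (1-based position)
        return ordered[(n + 1) // 2 - 1]
    # step 3b: average of the (n / 2)th and (n / 2 + 1)th ordered values
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


print(median([7, 9, 10, 12, 14]))      # 10
print(median([3, 4, 9, 12, 14, 19]))   # 10.5
```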
Example
Bob’s last 20 golf scores, beginning with his last score
69, 76, 77, 76, 73, 75, 81, 83, 77, 77,
82, 77, 77, 78, 75, 80, 80, 78, 79, 84
1. What is the mode for this data set? (77)
Ordered data:
69, 73, 75, 75, 76, 76, 77, 77, 77, 77, 77,
78, 78, 79, 80, 80, 81, 82, 83, 84
2. Determine the median. (77)
3. Calculate Bob’s mean golf score. (77.7)
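The three answers can be verified with Python's standard `statistics` module. This is only a check of the arithmetic; the variable name `scores` is an assumption, and the scores are listed in the order they appear in the example.

```python
import statistics

# Bob's 20 golf scores, as listed in the example.
scores = [69, 76, 77, 76, 73, 75, 81, 83, 77, 77,
          82, 77, 77, 78, 75, 80, 80, 78, 79, 84]

print(statistics.mode(scores))    # 77
print(statistics.median(scores))  # 77.0
print(statistics.mean(scores))    # 77.7
```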
Measures of Variability
1. Range = max - min
(simplest, but not always useful)
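A short sketch of the range, with one possible illustration of why it is not always useful: a single extreme value changes it completely. The function name `data_range` and the second data set are hypothetical.

```python
def data_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)


print(data_range([1, 2, 2, 3, 4, 7, 9]))   # 8
# Illustrative data: replacing 9 with 90 changes only one observation,
# yet the range jumps from 8 to 89.
print(data_range([1, 2, 2, 3, 4, 7, 90]))  # 89
```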
2. Variance: based on the difference between each observation and the mean.
Population variance:
\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \]
Sample variance (almost always):
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
Remarks:
Variance is always non-negative (≥ 0).
A variance of 0 means there is no variation, i.e., every observation in the data set has the same value.
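The two variance formulas differ only in the divisor (N versus n − 1). A minimal sketch, assuming the function names `population_variance` and `sample_variance`; the first data set is reused from the mean example, and the constant data set is made up to illustrate the zero-variance remark.

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


data = [1, 2, 2, 3, 4, 7, 9]          # mean is 4, squared deviations sum to 52
print(population_variance(data))      # 52 / 7
print(sample_variance(data))          # 52 / 6
print(sample_variance([5, 5, 5, 5]))  # 0.0  (illustrative: no variation)
```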
3. Standard deviation: most commonly used for measuring how far observations are from the mean.
Population version:
\[ \sigma = \sqrt{\sigma^2} \]
Sample version (almost always):
\[ s = \sqrt{s^2} \]
Example
Compute the standard deviation of the data set 0, 2, 4.

i    x_i    x_i − x̄    (x_i − x̄)^2
1    0      −2          4
2    2      0           0
3    4      2           4

Mean: x̄ = 2
Variance: s^2 = (4 + 0 + 4)/(3 − 1) = 4
Standard deviation: s = √4 = 2
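The worked example can be reproduced step by step, mirroring the columns of the table. This is a sketch with illustrative variable names; the final line cross-checks against `statistics.stdev`, which computes the sample (n − 1) version.

```python
import math
import statistics

data = [0, 2, 4]
n = len(data)

x_bar = sum(data) / n                            # mean: 2.0
squared_devs = [(x - x_bar) ** 2 for x in data]  # [4.0, 0.0, 4.0]
s2 = sum(squared_devs) / (n - 1)                 # sample variance: 4.0
s = math.sqrt(s2)                                # sample standard deviation: 2.0

print(x_bar, s2, s)  # 2.0 4.0 2.0

# Cross-check with the standard library's sample standard deviation.
print(statistics.stdev(data))
```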
4. pth percentile: value such that p% of the observations fall at or below it
Median: M = 50th percentile
First quartile: Q1 = 25th percentile
Third quartile: Q3 = 75th percentile
How to find a percentile for data?
1. Order the data in increasing order.
2. Calculate i = np/100, where n is the sample size and p is the percentile.
3a. If i is not an integer, round i up to the next integer. Then take the ith value.
3b. If i is an integer, take the average of the ith and (i + 1)th values.
Example: -20, 1, 23, 25, 32.5, 33, 67
Median = 25
First quartile = 1
Third quartile = 33
Example: 1, 2, 4, 6, 8, 9, 12, 13
Median = 7
First quartile = 3
Third quartile = 10.5
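The percentile rule above can be sketched as follows. The function name `percentile` is an assumption, and the sketch restricts p to values strictly between 0 and 100 so that the ith and (i + 1)th values always exist. Library routines such as `numpy.percentile` interpolate differently by default, so they may return slightly different quartiles than this rule.

```python
def percentile(values, p):
    """pth percentile using the rule: i = n*p/100, round up or average."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)                   # step 1: increasing order
    n = len(ordered)
    whole, remainder = divmod(n * p, 100)      # step 2: i = n*p/100
    whole = int(whole)
    if remainder != 0:
        # step 3a: i is not an integer, round up; the 1-based position
        # whole + 1 corresponds to the 0-based index `whole`.
        return ordered[whole]
    # step 3b: i is an integer, average the ith and (i + 1)th values.
    return (ordered[whole - 1] + ordered[whole]) / 2


a = [-20, 1, 23, 25, 32.5, 33, 67]
print(percentile(a, 50), percentile(a, 25), percentile(a, 75))  # 25 1 33

b = [1, 2, 4, 6, 8, 9, 12, 13]
print(percentile(b, 50), percentile(b, 25), percentile(b, 75))  # 7.0 3.0 10.5
```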
5. Interquartile range (IQR): IQR = Q3 − Q1
Outliers: an observation is a suspected outlier if it is
> Q3 + 1.5 × IQR
or
< Q1 − 1.5 × IQR
Example: 1, 2, 3, 4, 5, 6, 11
M = 4, Q1 = 2, Q3 = 6, IQR = 4, [Q1 − 1.5 IQR, Q3 + 1.5 IQR] = [−4, 12]
Every observation lies inside [−4, 12], so none is a suspected outlier (not even 11).
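The 1.5 × IQR rule can be sketched in a few lines. The function names `outlier_fences` and `suspected_outliers` are hypothetical, and the quartiles are passed in as arguments (here Q1 = 2 and Q3 = 6 from the example) rather than computed.

```python
def outlier_fences(q1, q3):
    """Lower and upper fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def suspected_outliers(values, q1, q3):
    """Observations strictly below the lower fence or above the upper fence."""
    low, high = outlier_fences(q1, q3)
    return [x for x in values if x < low or x > high]


data = [1, 2, 3, 4, 5, 6, 11]
print(outlier_fences(2, 6))            # (-4.0, 12.0)
print(suspected_outliers(data, 2, 6))  # []
```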
Five-number summary and boxplot
Five-number Summary
Min, Q1 , Median, Q3 , Max
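A minimal sketch of the five-number summary, assuming quartiles are computed with the i = np/100 rule from the percentile slide. The function names are illustrative, and the data set is the one from the IQR example.

```python
def percentile(values, p):
    """pth percentile (0 < p < 100) using the i = n*p/100 rule."""
    ordered = sorted(values)
    whole, remainder = divmod(len(ordered) * p, 100)
    whole = int(whole)
    if remainder != 0:
        return ordered[whole]              # round i up, take that value
    return (ordered[whole - 1] + ordered[whole]) / 2  # average ith, (i+1)th


def five_number_summary(values):
    """Return (Min, Q1, Median, Q3, Max)."""
    return (
        min(values),
        percentile(values, 25),
        percentile(values, 50),
        percentile(values, 75),
        max(values),
    )


print(five_number_summary([1, 2, 3, 4, 5, 6, 11]))  # (1, 2, 4, 6, 11)
```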
Remark: Divide our