September 12, 2018

Reproducible research

  • Motivation for software like R
  • What if your data changes?
  • What if you need to change a graph?
  • What if someone else wants to verify your results?

Reminder

What’s a numerical variable?

What are the types?

Measures of Center

Means

How do we compute it?

For what kind of data?

Mean formula

\[\bar{x} = \dfrac{x_1+x_2+x_3+\ldots+x_n}{n}\]

  • \(\bar{x}\) vs \(\mu\)

Compared to…

Median - What does it measure?

Mode - What does it measure?

(For the median: what if there is an even number of values?)
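
For an even number of values there is no single middle value. The usual convention, and the one R's `median()` follows, is to average the two middle values. A minimal sketch, using a made-up four-value data set (the name `x_even` and its values are assumptions for illustration, not course data):

```r
# Hypothetical data with an even number of values (not from the course data)
x_even <- c(1, 2, 9, 20)

# Sorted: 1, 2, 9, 20 -> the two middle values are 2 and 9
med_even <- median(x_even)  # averages the two middle values: (2 + 9) / 2

med_even
```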

Example

Data set: 1, 2, 9, 20, 5

Mean? Median? SD?

Why is the mean larger than the median?
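
One way to check the hand calculations for this data set is to run them in R. This is only a sketch; the variable names are arbitrary.

```r
# Data set from the example above
x <- c(1, 2, 9, 20, 5)

x_mean   <- mean(x)    # (1 + 2 + 9 + 20 + 5) / 5 = 7.4
x_median <- median(x)  # sorted: 1, 2, 5, 9, 20 -> middle value is 5
x_sd     <- sd(x)      # sample SD (n - 1 denominator), roughly 7.7

c(mean = x_mean, median = x_median, sd = x_sd)
```

The single large value (20) pulls the mean upward, while the median only depends on which value sits in the middle.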

CDC

library(dplyr)  # provides %>% and summarize()
cdc %>% 
  summarize(avg.height = mean(height), sd.height = sd(height), median.height = median(height))
##   avg.height sd.height median.height
## 1    67.1829  4.125954            67

CDC: more data…

What are Q1 and Q3?

cdc %>% 
  summary()
##       genhlth        exerany          hlthplan         smoke100     
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  fair     :2019   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  good     :5675   Median :1.0000   Median :1.0000   Median :0.0000  
##  poor     : 677   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721  
##  very good:6972   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      height          weight         wtdesire          age        gender   
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00   f:10431  
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   m: 9569  
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00            
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07            
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00            
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00

Why the choices?

Why do you think we might have different “measures of center”?

Measures of Spread

Measuring the spread of data: Standard deviation vs IQR

  • How is each computed?
  • Let’s build the formula for standard deviation…
  • SD: Why do we take the square root?
  • \(\sigma\) vs \(s\)
  • Why denominator of \(n-1\)? (it’s complicated)
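
The sample standard deviation can be written as

\[s = \sqrt{\dfrac{(x_1-\bar{x})^2+(x_2-\bar{x})^2+\ldots+(x_n-\bar{x})^2}{n-1}}\]

A possible way to build this up step by step in R, using the small example data set from earlier in these notes, is sketched below. The intermediate variable names are assumptions chosen for readability.

```r
x <- c(1, 2, 9, 20, 5)  # example data set from earlier in these notes
n <- length(x)

deviations  <- x - mean(x)               # distance of each value from the mean
squared_dev <- deviations^2              # squaring removes the signs
variance    <- sum(squared_dev) / (n - 1)  # "average" squared deviation, n - 1 denominator
sd_by_hand  <- sqrt(variance)            # square root returns to the original units

sd_by_hand
sd(x)  # built-in function; should agree with the hand-built version
```

Without the square root, the result (the variance) is in squared units, which is one reason the SD is usually reported instead.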

Outliers

Rule for medians/IQR: a value is an outlier if it is more than 1.5 * IQR below Q1 or more than 1.5 * IQR above Q3
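
As a rough sketch of the 1.5 * IQR rule, the fences can be computed directly. This uses the small example data set from earlier in these notes and R's default quantile method; other textbooks and software may compute quartiles slightly differently. The names `lower_fence` and `upper_fence` are just illustrative.

```r
x <- c(1, 2, 9, 20, 5)  # example data set from earlier in these notes

q1  <- quantile(x, 0.25, names = FALSE)  # first quartile (R's default method)
q3  <- quantile(x, 0.75, names = FALSE)  # third quartile
iqr <- q3 - q1                           # same value as IQR(x)

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

With these five values, Q1 is 2 and Q3 is 9, so the fences are -8.5 and 19.5 and only the value 20 is flagged.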

Sensitivity

  • Which of the statistics are sensitive vs robust?
  • How can we tell?
  • In the # of death penalty cases, what happens if Texas is removed?
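
One way to see sensitivity versus robustness is to change a single extreme value and compare the statistics before and after. The sketch below does not use the death penalty data; it reuses the small example data set from earlier in these notes and, as an assumption for illustration, replaces its largest value (20) with 200.

```r
x         <- c(1, 2, 9, 20, 5)   # example data set from earlier in these notes
x_extreme <- c(1, 2, 9, 200, 5)  # hypothetical: largest value made far more extreme

# Small helper (illustrative name) that collects the four statistics
summarize_vec <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v), IQR = IQR(v))
}

rbind(original = summarize_vec(x), extreme = summarize_vec(x_extreme))
```

In this sketch the mean and SD change a great deal, while the median and IQR do not move at all.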

Graphing

1 Variable: Histograms

library(ggplot2)  # provides ggplot(), aes(), geom_histogram()
cdc %>% 
  ggplot(aes(height))+
  geom_histogram()

We can change the number of breaks, the number of bins, etc., but we typically won’t worry too much about those.
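
For reference, a minimal sketch of how the binning could be changed. It uses the `bins` and `binwidth` arguments of `geom_histogram()`. The data frame `heights_df` is simulated here as a stand-in (an assumption, so the example runs on its own); with the course data you would pipe `cdc` in as above.

```r
library(ggplot2)

# Hypothetical stand-in for the height column (simulated, not the real cdc data)
set.seed(1)
heights_df <- data.frame(height = round(rnorm(500, mean = 67, sd = 4)))

# Option 1: fix the number of bins
p_bins <- ggplot(heights_df, aes(height)) +
  geom_histogram(bins = 20)

# Option 2: fix the width of each bin (here, 2 units of height per bar)
p_width <- ggplot(heights_df, aes(height)) +
  geom_histogram(binwidth = 2)

p_bins
p_width
```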

On a histogram…

  • Mean is a balance point
  • Median has half the data on either side
  • SD is the “average distance from the mean”

Skew

What does skew measure?

Box and whisker plots

Boxplots

Tell us:

  • median
  • Q1
  • Q3
  • min
  • max
  • outliers (if there are any)
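
The numbers behind a boxplot can also be inspected without drawing it. A small sketch using base R's `fivenum()` and `boxplot.stats()` on the example data set from earlier in these notes (these functions use hinges, which can differ slightly from the quartiles reported by `summary()`):

```r
x <- c(1, 2, 9, 20, 5)  # example data set from earlier in these notes

five <- fivenum(x)      # min, lower hinge, median, upper hinge, max
five

bp <- boxplot.stats(x)  # the statistics a boxplot is drawn from
bp$stats                # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out                  # points drawn individually as outliers
```

When outliers are present, the whiskers stop at the most extreme values that are not outliers, so the plotted whisker ends need not be the overall min and max.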

CDC: Age

cdc %>% 
  summary()
##       genhlth        exerany          hlthplan         smoke100     
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  fair     :2019   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  good     :5675   Median :1.0000   Median :1.0000   Median :0.0000  
##  poor     : 677   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721  
##  very good:6972   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      height          weight         wtdesire          age        gender   
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00   f:10431  
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   m: 9569  
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00            
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07            
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00            
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00

cdc %>% 
  summarize(mean.age = mean(age), median.age = median(age), sd.age = sd(age), IQR.age = IQR(age))
##   mean.age median.age   sd.age IQR.age
## 1 45.06825         43 17.19269      26

Boxplot

cdc %>% 
  ggplot(aes(y = age))+
  geom_boxplot()

CDC: Height

cdc %>% 
  summary()
##       genhlth        exerany          hlthplan         smoke100     
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  fair     :2019   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  good     :5675   Median :1.0000   Median :1.0000   Median :0.0000  
##  poor     : 677   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721  
##  very good:6972   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      height          weight         wtdesire          age        gender   
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00   f:10431  
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   m: 9569  
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00            
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07            
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00            
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00

Height Boxplot

cdc %>% 
  ggplot(aes(y = height))+
  geom_boxplot()

Notation

  • Sample mean: \(\bar{x}\)
  • Population mean: \(\mu\)
  • Sample proportion: \(\hat{p}\)
  • Population proportion: \(p\)
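
For a variable coded as 0/1, the sample proportion \(\hat{p}\) is simply the mean of that variable. A small sketch, using the `smoke100` values from the first six rows of `cdc` as printed by `head(cdc)` in these notes (six rows is far too few to estimate anything; this only illustrates the calculation):

```r
# smoke100 values from the first six rows of cdc, as printed by head(cdc)
smoke100_first6 <- c(0, 1, 1, 0, 0, 0)

# Sample proportion of 1s: 2 out of 6
p_hat <- mean(smoke100_first6)
p_hat
```

The same idea explains the `Mean` row for `smoke100` in the `summary()` output: 0.4721 is \(\hat{p}\) for the full sample.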

2 Variables: Scatterplots

How do we use explanatory vs response variables? (When relevant)

CDC

head(cdc)
##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1      good       0        1        0     70    175      175  77      m
## 2      good       0        1        1     64    125      115  33      f
## 3      good       1        1        1     60    105      105  49      f
## 4      good       1        1        0     66    132      124  42      f
## 5 very good       0        1        0     61    150      130  55      f
## 6 very good       1        1        0     64    114      114  55      f

Scatterplot

cdc %>% 
  ggplot(aes(x = weight, y = wtdesire))+
  geom_point()

Scatterplot

What does the scatterplot tell us?

What does the scatterplot not tell us?