September 12, 2018

## Reproducible research

• Motivation for software like R
• What if your data changes?
• What if you need to change a graph?
• What if someone else wants to verify your results?

## Reminder

What’s a numerical variable?

What are the types?

## Means

How do we compute it?

For what kind of data?

## Mean formula

$\bar{x} = \dfrac{x_1+x_2+x_3+\ldots+x_n}{n}$

• $$\bar{x}$$ vs $$\mu$$
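As a minimal sketch of the formula above, the sample mean can be computed by hand in R and compared with the built-in `mean()`. The four values in `x` are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Hypothetical sample of n = 4 observations (not from the CDC data)
x <- c(4, 8, 6, 10)

n <- length(x)        # number of observations
x_bar <- sum(x) / n   # (x_1 + x_2 + ... + x_n) / n

x_bar     # 7
mean(x)   # the built-in function gives the same value
```

Here `x_bar` is a sample mean; $\mu$ would be the mean of the whole population the sample came from, which we usually cannot compute directly.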

## Compared to…

Median - What does it measure?

Mode - What does it measure?

(For the median: what if there is an even number of values?)
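A short sketch of the even-count question for the median, using two hypothetical vectors: with an odd number of values the median is the middle sorted value, and with an even number R's `median()` averages the two middle sorted values.

```r
# Hypothetical data, chosen so the sorted values are easy to read
odd  <- c(3, 7, 1, 9, 5)       # sorted: 1 3 5 7 9
even <- c(3, 7, 1, 9, 5, 11)   # sorted: 1 3 5 7 9 11

median(odd)    # 5, the single middle value
median(even)   # 6, the average of the two middle values 5 and 7
```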

## Example

Data set: 1, 2, 9, 20, 5

Mean? Median? SD?

Why is the mean larger than the median?
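One way to check the hand calculations for this data set is to run the built-in functions in R. This is only a sketch; `sd()` uses the $n-1$ denominator.

```r
x <- c(1, 2, 9, 20, 5)   # the data set from the slide

mean(x)     # 7.4
median(x)   # 5 (sorted values: 1 2 5 9 20)
sd(x)       # about 7.70, the square root of 237.2 / 4
```

The single large value, 20, pulls the mean upward, while the median only depends on which value sits in the middle.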

## CDC

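```r
library(dplyr)  # provides %>% and summarize(); assumes the cdc data frame is already loaded

cdc %>%
  summarize(avg.height = mean(height), sd.height = sd(height), median.height = median(height))
##   avg.height sd.height median.height
## 1    67.1829  4.125954            67
```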

## CDC: more data…

What are Q1 and Q3?

```r
cdc %>%
  summary()
##       genhlth        exerany          hlthplan         smoke100
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
##  fair     :2019   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000
##  good     :5675   Median :1.0000   Median :1.0000   Median :0.0000
##  poor     : 677   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721
##  very good:6972   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
##      height          weight         wtdesire          age        gender
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00   f:10431
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   m: 9569
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00
```

## Why the choices?

Why do you think we might have different “measures of center”?

## Measuring the spread of data: Standard deviation vs IQR

• How is each computed?
• Let’s build the formula for standard deviation…
• SD: Why do we take the square root?
• $$\sigma$$ vs $$s$$
• Why denominator of $$n-1$$? (it’s complicated)
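One common way to build the standard deviation step by step, using the same $x_1, \ldots, x_n$, $\bar{x}$, and $n$ as the mean formula. The population size $N$ is notation introduced here for illustration.

$$
\begin{aligned}
\text{deviation of each value:}\quad & x_i - \bar{x} \\
\text{sample variance:}\quad & s^2 = \dfrac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \ldots + (x_n-\bar{x})^2}{n-1} \\
\text{sample SD:}\quad & s = \sqrt{s^2} \\
\text{population SD:}\quad & \sigma = \sqrt{\dfrac{(x_1-\mu)^2 + (x_2-\mu)^2 + \ldots + (x_N-\mu)^2}{N}}
\end{aligned}
$$

Squaring keeps positive and negative deviations from canceling, but it leaves the variance in squared units. Taking the square root returns the SD to the original units of the data.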

## Outliers

Rule for median/IQR: a value is flagged as an outlier if it lies more than 1.5 * IQR below Q1 or above Q3.
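A minimal sketch of the 1.5 * IQR rule applied to `age`, with the quartiles, minimum, and maximum typed in by hand from the `summary()` output shown earlier. The variable names are chosen here for illustration.

```r
# Values copied from the summary() output for age
q1      <- 31
q3      <- 57
age_min <- 18
age_max <- 99

iqr <- q3 - q1                    # 26
lower_fence <- q1 - 1.5 * iqr     # -8
upper_fence <- q3 + 1.5 * iqr     # 96

age_min < lower_fence   # FALSE: no outliers on the low end
age_max > upper_fence   # TRUE: at least one age above 96 is flagged
```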

## Sensitivity

• Which of the statistics are sensitive vs robust?
• How can we tell?
• In the # of death penalty cases, what happens if Texas is removed?
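One way to see sensitivity is to make a single extreme value more extreme and compare the statistics before and after. The sketch below reuses the small example data set rather than the death penalty data, and the helper `describe()` is a hypothetical name introduced here.

```r
x     <- c(1, 2, 9, 20, 5)    # example data set from earlier
x_big <- c(1, 2, 9, 200, 5)   # same data, with 20 replaced by 200

# Hypothetical helper: the four statistics side by side
describe <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v), IQR = IQR(v))
}

rbind(original = describe(x), extreme = describe(x_big))
```

In this sketch the median and IQR are identical in both rows, while the mean and SD change substantially, which is what "robust" versus "sensitive" refers to.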

## 1 Variable: Histograms

```r
library(ggplot2)

cdc %>%
  ggplot(aes(height)) +
  geom_histogram()
```

We can change the number of breaks, the number of bins, etc., but we typically won’t worry too much about those.

## On a histogram…
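For reference, a hedged sketch of two ways to control the bins in `geom_histogram()`: `bins` sets how many bins to use, and `binwidth` sets how wide each bin is. The data frame `heights` is simulated here as a stand-in, since the `cdc` data is not loaded in this snippet.

```r
library(ggplot2)

# Simulated stand-in for a height variable (not the real cdc data)
set.seed(1)
heights <- data.frame(height = round(rnorm(500, mean = 67, sd = 4)))

# Ask for a specific number of bins
p_bins <- ggplot(heights, aes(height)) +
  geom_histogram(bins = 20)

# Or ask for a specific bin width, in the units of the variable (inches here)
p_width <- ggplot(heights, aes(height)) +
  geom_histogram(binwidth = 2)

p_bins
p_width
```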

• Mean is a balance point
• Median has half the data on either side
• SD is the “average distance from the mean”
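These three descriptions can be checked numerically. The sketch below uses the small example data set: deviations from the mean sum to zero (the balance point), equal counts fall on each side of the median, and the SD is close to, but not exactly, the average absolute distance from the mean.

```r
x <- c(1, 2, 9, 20, 5)

deviations <- x - mean(x)
sum(deviations)            # 0 up to rounding error: the mean balances the data

sum(x < median(x))         # 2 values below the median
sum(x > median(x))         # 2 values above the median

mean(abs(deviations))      # 5.68, the literal average distance from the mean
sd(x)                      # about 7.70, a related but larger measure
```

This is why "average distance from the mean" appears in quotation marks: it is an interpretation of the SD, not its exact definition.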

## Skew

What does skew measure?

## Boxplots

Tell us:

• median
• Q1
• Q3
• min
• max
• outliers (if there are any)
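The numbers behind a boxplot can be inspected without drawing one. As a sketch on the small example data set, base R's `fivenum()` returns the minimum, lower hinge, median, upper hinge, and maximum, and `boxplot.stats()` additionally separates out points beyond 1.5 * IQR.

```r
x <- c(1, 2, 9, 20, 5)

fivenum(x)            # 1 2 5 9 20: min, Q1, median, Q3, max

b <- boxplot.stats(x)
b$stats               # 1 2 5 9 9: the upper whisker stops at 9
b$out                 # 20 is reported separately as an outlier
```

When outliers are present, the whiskers end at the most extreme values that are not outliers, so the true min or max may appear as a separate point. Note that `geom_boxplot()` computes its quartiles with `quantile()`, which can differ slightly from the hinges used by `fivenum()` on some data sets.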

## CDC: Age

```r
cdc %>%
  summary()
##       genhlth        exerany          hlthplan         smoke100
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
##  fair     :2019   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000
##  good     :5675   Median :1.0000   Median :1.0000   Median :0.0000
##  poor     : 677   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721
##  very good:6972   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
##      height          weight         wtdesire          age        gender
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00   f:10431
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   m: 9569
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00

cdc %>%
  summarize(mean.age = mean(age), median.age = median(age), sd.age = sd(age), IQR.age = IQR(age))
##   mean.age median.age   sd.age IQR.age
## 1 45.06825         43 17.19269      26
```

## Boxplot

```r
cdc %>%
  ggplot(aes(y = age)) +
  geom_boxplot()
```

## CDC: Height

```r
cdc %>%
  summary()
##       genhlth        exerany          hlthplan         smoke100
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
##  fair     :2019   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000
##  good     :5675   Median :1.0000   Median :1.0000   Median :0.0000
##  poor     : 677   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721
##  very good:6972   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
##      height          weight         wtdesire          age        gender
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00   f:10431
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   m: 9569
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00
```

## Height Boxplot

```r
cdc %>%
  ggplot(aes(y = height)) +
  geom_boxplot()
```

## Notation

• Sample mean: $$\bar{x}$$
• Population mean: $$\mu$$
• Sample proportion: $$\hat{p}$$
• Population proportion: $$p$$
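As a small illustration of $\hat{p}$, the sample proportion of women can be computed from the gender counts in the `summary()` output (f: 10431, m: 9569). The second part uses a hypothetical 0/1 vector to show that the mean of a 0/1 variable is itself a sample proportion, which is how a column such as `smoke100` can be read.

```r
# Counts copied from the summary() output for gender
n_f <- 10431
n_m <- 9569

p_hat_f <- n_f / (n_f + n_m)   # sample proportion of women
p_hat_f                        # 0.52155

# Hypothetical 0/1 variable: 1 = yes, 0 = no
smoker <- c(0, 1, 1, 0, 1)
mean(smoker)                   # 0.6, the proportion of 1s
```

Both values are sample proportions $\hat{p}$; the population proportion $p$ is the corresponding value for everyone in the population, which the sample is used to estimate.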

## 2 Variables: Scatterplots

How do we use explanatory vs response variables? (When relevant)

## CDC

```r
head(cdc)
##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1      good       0        1        0     70    175      175  77      m
## 2      good       0        1        1     64    125      115  33      f
## 3      good       1        1        1     60    105      105  49      f
## 4      good       1        1        0     66    132      124  42      f
## 5 very good       0        1        0     61    150      130  55      f
## 6 very good       1        1        0     64    114      114  55      f
```

## Scatterplot

```r
cdc %>%
  ggplot(aes(x = weight, y = wtdesire)) +
  geom_point()
```

## Scatterplot

What does the scatterplot tell us?

What does the scatterplot not tell us?