September 12, 2018

- Motivation for software like R
- What if your data changes?
- What if you need to change a graph?
- *What if someone else wants to verify your results?*

What's a numerical variable?

What are the types?

How do we compute it?

For what kind of data?

\[\bar{x} = \dfrac{x_1+x_2+x_3+\ldots+x_n}{n}\]

- \(\bar{x}\) vs \(\mu\)

Median - What does it measure?

Mode - What does it measure?

(For the median: what if there is an even number of values?)
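For the even-count question, here is a minimal sketch in R. The two vectors are made up for illustration and are not course data. With an even number of values there is no single middle value, so `median()` averages the two middle ones.

```r
# Hypothetical values chosen only for illustration
odd_values  <- c(3, 7, 8, 12, 15)   # five values: the middle one is the median
even_values <- c(3, 7, 8, 12)       # four values: no single middle value

median(odd_values)    # 8
median(even_values)   # (7 + 8) / 2 = 7.5
```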

Data set: 1, 2, 9, 20, 5

Mean? Median? SD?

Why is the mean larger than the median?
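The small data set above can be checked in R. This is a minimal sketch using only base functions; the variable name `x` is just a label chosen here.

```r
x <- c(1, 2, 9, 20, 5)

mean(x)     # 37 / 5 = 7.4
median(x)   # sorted: 1, 2, 5, 9, 20, so the middle value is 5
sd(x)       # sqrt(237.2 / 4), roughly 7.70

# The single large value (20) pulls the mean up but does not move the median
mean(x) > median(x)   # TRUE
```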

cdc %>% summarize(avg.height = mean(height), sd.height = sd(height), median.height = median(height))

##   avg.height sd.height median.height
## 1    67.1829  4.125954            67

What are Q1 and Q3?

cdc %>% summary()

##       genhlth        exerany          hlthplan         smoke100
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
##  fair     :2019   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000
##  good     :5675   Median :1.0000   Median :1.0000   Median :0.0000
##  poor     : 677   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721
##  very good:6972   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
##      height          weight         wtdesire          age        gender
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00   f:10431
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   m: 9569
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00

Why do you think we might have different “measures of center”?

- How is each computed?
- Let's build the formula for standard deviation…
- SD: Why do we take the square root?
- \(\sigma\) vs \(s\)
- Why denominator of \(n-1\)? (it's complicated)
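
One common way to write where that build-up ends, in the same style as the formula for \(\bar{x}\) above. The symbol \(N\) for the population size is a notational assumption made here, not something fixed by these notes.

\[s = \sqrt{\dfrac{(x_1-\bar{x})^2+(x_2-\bar{x})^2+\ldots+(x_n-\bar{x})^2}{n-1}}\]

\[\sigma = \sqrt{\dfrac{(x_1-\mu)^2+(x_2-\mu)^2+\ldots+(x_N-\mu)^2}{N}}\]

For the data set 1, 2, 9, 20, 5 with \(\bar{x} = 7.4\), the squared deviations are 40.96, 29.16, 2.56, 158.76 and 5.76. They sum to 237.2, so \(s = \sqrt{237.2/4} = \sqrt{59.3} \approx 7.70\). Squaring puts the deviations in squared units; the square root brings the answer back to the units of the data.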

Rule of thumb for outliers with the median/IQR: flag values more than 1.5 * IQR below Q1 or above Q3
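
A minimal sketch of the 1.5 * IQR rule in R, assuming it refers to the usual boxplot convention of "fences" placed 1.5 * IQR below Q1 and above Q3. It reuses the small data set from earlier in these notes; the names `lower_fence` and `upper_fence` are chosen here for illustration.

```r
x <- c(1, 2, 9, 20, 5)   # the small data set from earlier in these notes

q1  <- quantile(x, 0.25, names = FALSE)   # 2
q3  <- quantile(x, 0.75, names = FALSE)   # 9
iqr <- q3 - q1                            # 7, same as IQR(x)

lower_fence <- q1 - 1.5 * iqr   # -8.5
upper_fence <- q3 + 1.5 * iqr   # 19.5

# Values outside the fences are flagged as possible outliers
x[x < lower_fence | x > upper_fence]   # 20
```

Note that `quantile()` supports several quartile definitions, so hand calculations with a different convention can give slightly different fences.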

- Which of the statistics are sensitive vs robust?
- How can we tell?
- In the # of death penalty cases, what happens if Texas is removed?
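
One way to tell which statistics are sensitive is to move a single extreme value and see what changes. The sketch below does not use the death penalty data (it is not included in these notes). It reuses the small data set from earlier and a made-up variant in which the largest value is pushed from 20 to 200.

```r
x        <- c(1, 2, 9, 20, 5)    # data set from earlier in these notes
x_pushed <- c(1, 2, 9, 200, 5)   # made-up variant: largest value far more extreme

compare <- data.frame(
  data   = c("original", "extreme value pushed out"),
  mean   = c(mean(x),   mean(x_pushed)),
  sd     = c(sd(x),     sd(x_pushed)),
  median = c(median(x), median(x_pushed)),
  iqr    = c(IQR(x),    IQR(x_pushed))
)
compare
# mean and sd change a lot; median and IQR stay at 5 and 7
```

Removing an observation entirely, as in the Texas question, can be explored the same way by dropping the value instead of changing it.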

cdc %>% ggplot(aes(height)) + geom_histogram()

We can change the breaks, the number of bins, etc., but we typically won't worry too much about those.
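
For reference, a minimal sketch of how the binning can be changed in ggplot2. The `cdc` data is not bundled with these notes, so the data frame `demo` below is a simulated stand-in and not the real heights.

```r
library(ggplot2)

# Simulated stand-in for the height column; NOT the real cdc data
set.seed(1)
demo <- data.frame(height = rnorm(500, mean = 67, sd = 4))

# Default binning (ggplot2 prints a message suggesting you choose a binwidth)
p_default <- ggplot(demo, aes(x = height)) + geom_histogram()

# Choose the bin width in data units (here, 2 units per bar)
p_width <- ggplot(demo, aes(x = height)) + geom_histogram(binwidth = 2)

# Or choose the number of bins directly
p_bins <- ggplot(demo, aes(x = height)) + geom_histogram(bins = 15)
```

Printing any of the three objects (for example `p_width`) draws the plot.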

- Mean is a balance point
- Median has half the data on either side
- SD is the “average distance from the mean”
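
These three descriptions can be checked numerically. The sketch below reuses the small data set from earlier in these notes. It also shows why "average distance from the mean" is in quotes: the literal average distance is not the same number as the SD.

```r
x <- c(1, 2, 9, 20, 5)   # data set from earlier in these notes

# Balance point: deviations from the mean cancel out (up to rounding error)
deviations <- x - mean(x)
sum(deviations)

# Median: as many values below it as above it
sum(x < median(x))   # 2
sum(x > median(x))   # 2

# "Average distance from the mean" is a loose description of the SD
mean(abs(deviations))   # literal average distance: 5.68
sd(x)                   # standard deviation: roughly 7.70
```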

What does skew measure?

Tell us:

- median
- Q1
- Q3
- min
- max
- outliers (if there are any)
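
Base R can report the same list of values without drawing anything. This is a minimal sketch on the small data set from earlier in these notes. `fivenum()` uses Tukey's hinges, which match Q1 and Q3 for this data set but can differ slightly from `quantile()` on other data.

```r
x <- c(1, 2, 9, 20, 5)

# min, lower hinge (about Q1), median, upper hinge (about Q3), max
fivenum(x)             # 1 2 5 9 20

# Points a default base R boxplot would draw individually as outliers
boxplot.stats(x)$out   # 20
```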

cdc %>% summary()

##       genhlth        exerany          hlthplan         smoke100
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
##  fair     :2019   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000
##  good     :5675   Median :1.0000   Median :1.0000   Median :0.0000
##  poor     : 677   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721
##  very good:6972   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
##      height          weight         wtdesire          age        gender
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00   f:10431
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   m: 9569
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00

cdc %>% summarize(mean.age = mean(age), median.age = median(age), sd.age = sd(age), IQR.age = IQR(age))

##   mean.age median.age   sd.age IQR.age
## 1 45.06825         43 17.19269      26

cdc %>% ggplot(aes(y = age)) + geom_boxplot()

cdc %>% ggplot(aes(y = height)) + geom_boxplot()

- Sample mean: \(\bar{x}\)
- Population mean: \(\mu\)
- Sample proportion: \(\hat{p}\)
- Population proportion: \(p\)
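
For a 0/1 variable, the sample proportion \(\hat{p}\) is just the sample mean of the zeros and ones. The sketch below uses the `exerany` values of the first six respondents, copied from the `head(cdc)` output in these notes; six rows are far too few to say anything about the population, so this only illustrates the calculation.

```r
# exerany for the first six respondents, copied from the head(cdc) output
exerany <- c(0, 0, 1, 1, 0, 1)

# For a 0/1 variable the mean is the proportion of 1s
p_hat <- mean(exerany)
p_hat   # 3 / 6 = 0.5
```

The `Mean` entries for `exerany`, `hlthplan` and `smoke100` in the `summary()` output can be read the same way, as sample proportions over all respondents.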

How do we use explanatory vs response variables? (When relevant)

head(cdc)

##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1      good       0        1        0     70    175      175  77      m
## 2      good       0        1        1     64    125      115  33      f
## 3      good       1        1        1     60    105      105  49      f
## 4      good       1        1        0     66    132      124  42      f
## 5 very good       0        1        0     61    150      130  55      f
## 6 very good       1        1        0     64    114      114  55      f

cdc %>% ggplot(aes(x = weight, y = wtdesire)) + geom_point()

What does the scatterplot tell us?

What does the scatterplot not tell us?
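
One possible aid for reading this scatterplot is a reference line where desired weight equals current weight. This is a sketch of the idea and not part of the original slides. It uses only the six rows shown by `head(cdc)`, so it says nothing about the full data set; the names `first_six` and `diff` are chosen here.

```r
library(ggplot2)

# weight and wtdesire for the first six respondents, copied from head(cdc)
first_six <- data.frame(
  weight   = c(175, 125, 105, 132, 150, 114),
  wtdesire = c(175, 115, 105, 124, 130, 114)
)

# Positive values mean the person would like to weigh less than they do
first_six$diff <- first_six$weight - first_six$wtdesire
first_six$diff   # 0 10 0 8 20 0

# Points on the dashed line want to stay at their current weight;
# points below it want to weigh less
p <- ggplot(first_six, aes(x = weight, y = wtdesire)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed")
```

The plot shows the association between the two variables; it does not show why people want to weigh more or less, or how the other variables in `cdc` are involved.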