class: title-slide

<br>
<br>

.right-panel[
<br>

# Model Evaluation

## Dr. Mine Dogucu
]

---

class: inverse middle

.font75[If there were no variance, there would be no statistics.]

---

## What if?

- We want to understand the average amount of sleep Irvine residents get. What if everyone in Irvine slept 8 hours every night? (`sleep` = {8, 8,..., 8})

--

- We want to predict who will graduate college. What if everyone graduated college? (`graduate` = {TRUE, TRUE,..., TRUE})

--

- We want to understand whether Android users spend more time on their phones than iOS users do. What if everyone spent 3 hours per day on their phones? (`time` = {3, 3,..., 3}, `os` = {Android, Android, .... iOS})

--

- We want to understand if birth weight and height are positively associated in babies. What if every baby weighed 7.5 lbs? (`weight` = {7.5, 7.5,..., 7.5}, `height` = {20, 22,..., 18})

---

class: middle

In all these fake scenarios there would be no **variance** in `sleep`, `graduate`, `time`, or `weight`. These variables would all be constants and thus would not even be **vari**ables.

---

class: middle

.font50[Things vary. We use statistics to understand **how** things vary, and often we want to know **why** they vary.]

---


```r
glimpse(babies)
```

```
## Rows: 1,236
## Columns: 8
## $ case      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ bwt       <int> 120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 140, 144, …
## $ gestation <int> 284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351, 282, 2…
## $ parity    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ age       <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, …
## $ height    <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68, 64, 63, 61, 63, …
## $ weight    <int> 100, 135, 115, 190, 125, 93, 178, 140, 125, 136, 120, 124, 1…
## $ smoke     <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, …
```

---


```r
model_g <- lm(bwt ~ gestation, data = babies)
tidy(model_g)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  -10.1      8.32       -1.21 2.27e- 1
## 2 gestation      0.464    0.0297     15.6  3.22e-50
```

---


```r
glance(model_g)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.166         0.166  16.7      244. 3.22e-50     1 -5175. 10356. 10371.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

---


```r
glance(model_g)$r.squared
```

```
## [1] 0.1663449
```

About 16% of the variation in birth weight is explained by gestation. Higher values of R-squared are preferred.

<img src="15d-model-eval_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

<img src="15d-model-eval_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---


```r
model_gsa <- lm(bwt ~ gestation + smoke + age, data = babies)
```

---

## Adjusted R-Squared

.pull-left[

```r
glance(model_g)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.166         0.166  16.7      244. 3.22e-50     1 -5175. 10356. 10371.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
]

.pull-right[

```r
glance(model_gsa)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.212         0.210  16.2      108. 3.45e-62     3 -5089. 10187. 10213.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
]
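---

The R-squared values on the previous slide can be reproduced from sums of squares. A minimal sketch, assuming `model_g` as fit earlier (the same recipe applies to `model_gsa`):

```r
y <- model.frame(model_g)$bwt           # responses actually used in the fit
ss_resid <- sum(residuals(model_g)^2)   # unexplained variation
ss_total <- sum((y - mean(y))^2)        # total variation in bwt
r2 <- 1 - ss_resid / ss_total           # should match glance(model_g)$r.squared

n <- length(y)
p <- length(coef(model_g)) - 1          # number of predictors
1 - (1 - r2) * (n - 1) / (n - p - 1)    # should match adj.r.squared
```

Adjusted R-squared penalizes additional predictors, which makes it the fairer comparison between `model_g` and `model_gsa`.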
---


```r
babies %>% 
  add_predictions(model_g) %>% 
  add_residuals(model_g) %>% 
  select(bwt, pred, resid)
```

```
## # A tibble: 1,236 x 3
##      bwt  pred  resid
##    <int> <dbl>  <dbl>
##  1   120  122.  -1.79
##  2   113  121.  -7.86
##  3   128  119.   8.53
##  4   123   NA   NA
##  5   108  121. -12.9
##  6   136  123.  13.3
##  7   138  103.  34.8
##  8   132  104.  28.3
##  9   120  124.  -4.11
## 10   143  129.  14.2
## # … with 1,226 more rows
```

---


```r
babies %>% 
  add_predictions(model_gsa) %>% 
  add_residuals(model_gsa) %>% 
  select(bwt, pred, resid)
```

```
## # A tibble: 1,236 x 3
##      bwt  pred  resid
##    <int> <dbl>  <dbl>
##  1   120  125.  -4.80
##  2   113  125. -11.5
##  3   128  115.  13.3
##  4   123   NA   NA
##  5   108  115.  -7.47
##  6   136  125.  10.5
##  7   138  108.  30.4
##  8   132  107.  25.0
##  9   120  127.  -6.81
## 10   143  124.  19.2
## # … with 1,226 more rows
```

---

class: middle

## Root Mean Square Error

`\(RMSE = \sqrt{\frac{\sum_{i=1}^n(y_i-\hat y_i)^2}{n}}\)`

---


```r
modelr::rmse(model_gsa, babies)
```

```
## [1] 16.1687
```

```r
modelr::rmse(model_g, babies)
```

```
## [1] 16.6512
```
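---

The same quantity can be computed "by hand" from the residuals. A minimal sketch, assuming `model_gsa` and `babies` as above (rows with missing values drop out of the mean):

```r
babies %>% 
  add_residuals(model_gsa) %>% 
  summarize(rmse = sqrt(mean(resid^2, na.rm = TRUE)))  # should agree with modelr::rmse()
```

Lower RMSE is preferred: it is roughly the typical size of a prediction error, in the units of `bwt`.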
---


```r
model_full <- lm(bwt ~ gestation + parity + age + height + weight + smoke, 
                 data = babies)
```

```r
modelr::rmse(model_full, babies)
```

```
## [1] 15.78198
```

```r
glance(model_full)$r.squared
```

```
## [1] 0.2579535
```

Can we keep adding all the variables and try to get an EXCELLENT model fit?

---

## Overfitting

- We are fitting the model to sample data.
- Our goal is to understand the population data.
- If we make our model fit our sample data too perfectly, the model may not perform as well with other sample data from the population.
- In that case we would be overfitting the data.
- We can use **model validation** techniques.

---

## Splitting the Data (Train vs. Test)


```r
set.seed(12345)
babies_split <- rsample::initial_split(babies) ## 75% to 25% split
```

--


```r
babies_train <- rsample::training(babies_split)
dim(babies_train)
```

```
## [1] 927   8
```

--


```r
babies_test <- rsample::testing(babies_split)
dim(babies_test)
```

```
## [1] 309   8
```

---


```r
model_gsa_train <- lm(bwt ~ gestation + smoke + age, data = babies_train)
model_gsa_test <- lm(bwt ~ gestation + smoke + age, data = babies_test)
```

---

.pull-left[

```r
glance(model_gsa_train)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.204         0.201  16.4      76.8 2.87e-44     3 -3815. 7641. 7665.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
]

.pull-right[

```r
glance(model_gsa_test)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.243         0.236  15.8      32.2 4.21e-18     3 -1272. 2553. 2572.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
]

---

.pull-left[

```r
modelr::rmse(model_gsa_train, babies_train)
```

```
## [1] 16.31741
```
]

.pull-right[

```r
modelr::rmse(model_gsa_test, babies_test)
```

```
## [1] 15.64897
```
]

---

class: middle inverse

.font50[Evaluation of Logistic Regression Models]

---

class: middle


```r
babies <- babies %>% 
  mutate(low_bwt = case_when(
    bwt < 88 ~ TRUE,
    bwt >= 88 ~ FALSE)) %>% 
  drop_na(gestation)
```

---

class: middle


```r
model_g <- glm(low_bwt ~ gestation, data = babies, family = "binomial")
```

---

class: middle


```r
tidy(model_g)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  17.5      2.24         7.82 5.11e-15
## 2 gestation    -0.0758   0.00846     -8.96 3.27e-19
```

`\(\hat p = \frac{\exp(b_0 + b_1x)}{1+\exp(b_0 + b_1x)}\)`

`\(\hat p\)` when gestation is 284: `\(\frac{\exp(17.5 - 0.0758 \cdot 284)}{1+\exp(17.5 - 0.0758 \cdot 284)} = \frac{\exp(-4.0272)}{1+ \exp(-4.0272)} = 0.01751203\)`

---


```r
babies <- babies %>% 
  add_predictions(model_g)

select(babies, pred)
```

```
## # A tibble: 1,223 x 1
##      pred
##     <dbl>
##  1 -4.02
##  2 -3.87
##  3 -3.64
##  4 -3.87
##  5 -4.17
##  6 -0.986
##  7 -1.06
##  8 -4.40
##  9 -5.15
## 10 -9.10
## # … with 1,213 more rows
```

---


```r
babies <- babies %>% 
  mutate(pred_p = exp(pred) / (1 + exp(pred)))

select(babies, pred, pred_p)
```

```
## # A tibble: 1,223 x 2
##      pred   pred_p
##     <dbl>    <dbl>
##  1 -4.02  0.0177
##  2 -3.87  0.0205
##  3 -3.64  0.0256
##  4 -3.87  0.0205
##  5 -4.17  0.0152
##  6 -0.986 0.272
##  7 -1.06  0.257
##  8 -4.40  0.0122
##  9 -5.15  0.00574
## 10 -9.10  0.000112
## # … with 1,213 more rows
```

---

## Cutoff


```r
babies <- babies %>% 
  mutate(pred_y = case_when(
    pred_p < 0.5 ~ FALSE,
    pred_p >= 0.5 ~ TRUE))

select(babies, pred, pred_p, pred_y)
```

```
## # A tibble: 1,223 x 3
##      pred   pred_p pred_y
##     <dbl>    <dbl> <lgl> 
##  1 -4.02  0.0177   FALSE
##  2 -3.87  0.0205   FALSE
##  3 -3.64  0.0256   FALSE
##  4 -3.87  0.0205   FALSE
##  5 -4.17  0.0152   FALSE
##  6 -0.986 0.272    FALSE
##  7 -1.06  0.257    FALSE
##  8 -4.40  0.0122   FALSE
##  9 -5.15  0.00574  FALSE
## 10 -9.10  0.000112 FALSE
## # … with 1,213 more rows
```

---

### Confusion Matrix


```r
janitor::tabyl(babies, low_bwt, pred_y) %>% 
  janitor::adorn_totals(c("row", "col"))
```

```
##  low_bwt FALSE TRUE Total
##    FALSE  1161    5  1166
##     TRUE    53    4    57
##    Total  1214    9  1223
```

**Sensitivity** (true-positive rate): 4/57 = 0.0702

**Specificity** (true-negative rate): 1161/1166 = 0.9957

**Accuracy**: (1161 + 4)/1223 = 0.9526
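---

These rates can also be computed directly in R rather than by hand. A minimal sketch, assuming the logical `low_bwt` and `pred_y` columns created above:

```r
babies %>% 
  summarize(
    sensitivity = sum(pred_y & low_bwt) / sum(low_bwt),     # true positives / actual positives
    specificity = sum(!pred_y & !low_bwt) / sum(!low_bwt),  # true negatives / actual negatives
    accuracy    = mean(pred_y == low_bwt)                   # correct predictions / all predictions
  )
```

Changing the 0.5 cutoff trades sensitivity against specificity; with so few low-birth-weight babies, accuracy can look high even when sensitivity is poor.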