class: title-slide

<br>
<br>

.right-panel[
<br>

# Model Evaluation

## Dr. Mine Dogucu
]

---

class: inverse middle

.font75[If there were no variance, there would be no statistics.]

---

## What if?

- We want to understand the average amount of sleep Irvine residents get. What if everyone in Irvine slept 8 hours every night? (`sleep` = {8, 8,..., 8})

--

- We want to predict who will graduate college. What if everyone graduated college? (`graduate` = {TRUE, TRUE,..., TRUE})

--

- We want to understand whether Android users spend more time on their phones than iOS users do. What if everyone spent 3 hours per day on their phones? (`time` = {3, 3,..., 3}, `os` = {Android, Android, .... iOS})

--

- We want to understand if birth weight and height are positively associated in babies. What if every baby weighed 7.5 lbs? (`weight` = {7.5, 7.5,..., 7.5}, `height` = {20, 22,..., 18})

---

class: middle

In all these fake scenarios there would be no **variance** in `sleep`, `graduate`, `time`, or `weight`. These variables would all be constants and thus would not even be **vari**ables.

---

class: middle

.font50[Things vary. We use statistics to understand **how** things vary, and often we want to know **why** they vary.]

---


```r
glimpse(babies)
```

```
## Rows: 1,236
## Columns: 8
## $ case      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ bwt       <int> 120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 140, 144, …
## $ gestation <int> 284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351, 282, 2…
## $ parity    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ age       <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, …
## $ height    <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68, 64, 63, 61, 63, …
## $ weight    <int> 100, 135, 115, 190, 125, 93, 178, 140, 125, 136, 120, 124, 1…
## $ smoke     <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, …
```

---


```r
model_g <- lm(bwt ~ gestation, data = babies)
tidy(model_g)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  -10.1      8.32       -1.21 2.27e- 1
## 2 gestation      0.464    0.0297     15.6  3.22e-50
```

---


```r
glance(model_g)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.166         0.166  16.7      244. 3.22e-50     1 -5175. 10356. 10371.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

---


```r
glance(model_g)$r.squared
```

```
## [1] 0.1663449
```

About 16% of the variation in birth weight is explained by gestation. Higher values of R-squared are preferred.

<img src="15d-model-eval_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

<img src="15d-model-eval_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---


```r
model_gsa <- lm(bwt ~ gestation + smoke + age, data = babies)
```

---

## Adjusted R-Squared

.pull-left[

```r
glance(model_g)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.166         0.166  16.7      244. 3.22e-50     1 -5175. 10356. 10371.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
]

.pull-right[

```r
glance(model_gsa)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.212         0.210  16.2      108. 3.45e-62     3 -5089. 10187. 10213.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
]
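---

The R-squared values on the previous slide can be reproduced from sums of squares. A minimal sketch, assuming `model_g` as fit earlier (the same recipe applies to `model_gsa`):

```r
y <- model.frame(model_g)$bwt           # responses actually used in the fit
ss_resid <- sum(residuals(model_g)^2)   # unexplained variation
ss_total <- sum((y - mean(y))^2)        # total variation in bwt
r2 <- 1 - ss_resid / ss_total           # should match glance(model_g)$r.squared

n <- length(y)
p <- length(coef(model_g)) - 1          # number of predictors
1 - (1 - r2) * (n - 1) / (n - p - 1)    # should match adj.r.squared
```

Adjusted R-squared penalizes additional predictors, which makes it the fairer comparison between `model_g` and `model_gsa`.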
---


```r
babies %>% 
  add_predictions(model_g) %>% 
  add_residuals(model_g) %>% 
  select(bwt, pred, resid)
```

```
## # A tibble: 1,236 x 3
##      bwt  pred  resid
##    <int> <dbl>  <dbl>
##  1   120  122.  -1.79
##  2   113  121.  -7.86
##  3   128  119.   8.53
##  4   123   NA   NA
##  5   108  121. -12.9
##  6   136  123.  13.3
##  7   138  103.  34.8
##  8   132  104.  28.3
##  9   120  124.  -4.11
## 10   143  129.  14.2
## # … with 1,226 more rows
```

---


```r
babies %>% 
  add_predictions(model_gsa) %>% 
  add_residuals(model_gsa) %>% 
  select(bwt, pred, resid)
```

```
## # A tibble: 1,236 x 3
##      bwt  pred  resid
##    <int> <dbl>  <dbl>
##  1   120  125.  -4.80
##  2   113  125. -11.5
##  3   128  115.  13.3
##  4   123   NA   NA
##  5   108  115.  -7.47
##  6   136  125.  10.5
##  7   138  108.  30.4
##  8   132  107.  25.0
##  9   120  127.  -6.81
## 10   143  124.  19.2
## # … with 1,226 more rows
```

---

class: middle

## Root Mean Square Error

`\(RMSE = \sqrt{\frac{\sum_{i=1}^n(y_i-\hat y_i)^2}{n}}\)`

---


```r
modelr::rmse(model_gsa, babies)
```

```
## [1] 16.1687
```

```r
modelr::rmse(model_g, babies)
```

```
## [1] 16.6512
```
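---

The same quantity can be computed "by hand" from the residuals. A minimal sketch, assuming `model_gsa` and `babies` as above (rows with missing values drop out of the mean):

```r
babies %>% 
  add_residuals(model_gsa) %>% 
  summarize(rmse = sqrt(mean(resid^2, na.rm = TRUE)))  # should agree with modelr::rmse()
```

Lower RMSE is preferred: it is roughly the typical size of a prediction error, in the units of `bwt`.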
---


```r
model_full <- lm(bwt ~ gestation + parity + age + height + weight + smoke, 
                 data = babies)
```

```r
modelr::rmse(model_full, babies)
```

```
## [1] 15.78198
```

```r
glance(model_full)$r.squared
```

```
## [1] 0.2579535
```

Can we keep adding all the variables and try to get an EXCELLENT model fit?

---

## Overfitting

- We are fitting the model to sample data.
- Our goal is to understand the population data.
- If we make our model fit our sample data too perfectly, the model may not perform as well with other sample data from the population.
- In that case we would be overfitting the data.
- We can use **model validation** techniques.

---

## Splitting the Data (Train vs. Test)


```r
set.seed(12345)
babies_split <- rsample::initial_split(babies) ## 75% to 25% split
```

--


```r
babies_train <- rsample::training(babies_split)
dim(babies_train)
```

```
## [1] 927   8
```

--


```r
babies_test <- rsample::testing(babies_split)
dim(babies_test)
```

```
## [1] 309   8
```

---


```r
model_gsa_train <- lm(bwt ~ gestation + smoke + age, data = babies_train)
model_gsa_test <- lm(bwt ~ gestation + smoke + age, data = babies_test)
```

---

.pull-left[

```r
glance(model_gsa_train)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.204         0.201  16.4      76.8 2.87e-44     3 -3815. 7641. 7665.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
]

.pull-right[

```r
glance(model_gsa_test)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.243         0.236  15.8      32.2 4.21e-18     3 -1272. 2553. 2572.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
]

---

.pull-left[

```r
modelr::rmse(model_gsa_train, babies_train)
```

```
## [1] 16.31741
```
]

.pull-right[

```r
modelr::rmse(model_gsa_test, babies_test)
```

```
## [1] 15.64897
```
]

---

class: middle inverse

.font50[Evaluation of Logistic Regression Models]

---

class: middle


```r
babies <- babies %>% 
  mutate(low_bwt = case_when(
    bwt < 88 ~ TRUE,
    bwt >= 88 ~ FALSE)) %>% 
  drop_na(gestation)
```

---

class: middle


```r
model_g <- glm(low_bwt ~ gestation, data = babies, family = "binomial")
```

---

class: middle


```r
tidy(model_g)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  17.5      2.24         7.82 5.11e-15
## 2 gestation    -0.0758   0.00846     -8.96 3.27e-19
```

`\(\hat p = \frac{\exp(b_0 + b_1x)}{1+\exp(b_0 + b_1x)}\)`

`\(\hat p\)` when gestation is 284: `\(\frac{\exp(17.5 - 0.0758 \cdot 284)}{1+\exp(17.5 - 0.0758 \cdot 284)} = \frac{\exp(-4.0272)}{1+ \exp(-4.0272)} = 0.01751203\)`

---


```r
babies <- babies %>% 
  add_predictions(model_g)

select(babies, pred)
```

```
## # A tibble: 1,223 x 1
##      pred
##     <dbl>
##  1 -4.02
##  2 -3.87
##  3 -3.64
##  4 -3.87
##  5 -4.17
##  6 -0.986
##  7 -1.06
##  8 -4.40
##  9 -5.15
## 10 -9.10
## # … with 1,213 more rows
```

---


```r
babies <- babies %>% 
  mutate(pred_p = exp(pred) / (1 + exp(pred)))

select(babies, pred, pred_p)
```

```
## # A tibble: 1,223 x 2
##      pred   pred_p
##     <dbl>    <dbl>
##  1 -4.02  0.0177
##  2 -3.87  0.0205
##  3 -3.64  0.0256
##  4 -3.87  0.0205
##  5 -4.17  0.0152
##  6 -0.986 0.272
##  7 -1.06  0.257
##  8 -4.40  0.0122
##  9 -5.15  0.00574
## 10 -9.10  0.000112
## # … with 1,213 more rows
```

---

## Cutoff


```r
babies <- babies %>% 
  mutate(pred_y = case_when(
    pred_p < 0.5 ~ FALSE,
    pred_p >= 0.5 ~ TRUE))

select(babies, pred, pred_p, pred_y)
```

```
## # A tibble: 1,223 x 3
##      pred   pred_p pred_y
##     <dbl>    <dbl> <lgl> 
##  1 -4.02  0.0177   FALSE
##  2 -3.87  0.0205   FALSE
##  3 -3.64  0.0256   FALSE
##  4 -3.87  0.0205   FALSE
##  5 -4.17  0.0152   FALSE
##  6 -0.986 0.272    FALSE
##  7 -1.06  0.257    FALSE
##  8 -4.40  0.0122   FALSE
##  9 -5.15  0.00574  FALSE
## 10 -9.10  0.000112 FALSE
## # … with 1,213 more rows
```

---

### Confusion Matrix


```r
janitor::tabyl(babies, low_bwt, pred_y) %>% 
  janitor::adorn_totals(c("row", "col"))
```

```
##  low_bwt FALSE TRUE Total
##    FALSE  1161    5  1166
##     TRUE    53    4    57
##    Total  1214    9  1223
```

**Sensitivity** (true-positive rate): 4/57 = 0.0702

**Specificity** (true-negative rate): 1161/1166 = 0.9957

**Accuracy**: (1161 + 4)/1223 = 0.9526
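---

These rates can also be computed directly in R rather than by hand. A minimal sketch, assuming the logical `low_bwt` and `pred_y` columns created above:

```r
babies %>% 
  summarize(
    sensitivity = sum(pred_y & low_bwt) / sum(low_bwt),     # true positives / actual positives
    specificity = sum(!pred_y & !low_bwt) / sum(!low_bwt),  # true negatives / actual negatives
    accuracy    = mean(pred_y == low_bwt)                   # correct predictions / all predictions
  )
```

Changing the 0.5 cutoff trades sensitivity against specificity; with so few low-birth-weight babies, accuracy can look high even when sensitivity is poor.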