class: title-slide <br> <br> .right-panel[ <br> # Model Assumptions ## Dr. Mine Dogucu ] --- #### Data `babies` in `openintro` package ``` ## Rows: 1,236 ## Columns: 8 ## $ case <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,… ## $ bwt <int> 120, 113, 128, 123, 108, 136, 138, 132, 12… ## $ gestation <int> 284, 282, 279, NA, 282, 286, 244, 245, 289… ## $ parity <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ age <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27… ## $ height <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68… ## $ weight <int> 100, 135, 115, 190, 125, 93, 178, 140, 125… ## $ smoke <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, … ``` --- ## Baby Weights .pull-left[ ```r ggplot(babies, aes(x = gestation, y = bwt)) + geom_point() ``` ] .pull-right[ <img src="15c-model-assumptions_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> ] --- ## Baby Weights .pull-left[ ```r ggplot(babies, aes(x = gestation, y = bwt)) + geom_point() + geom_smooth(method = "lm", se = FALSE) ``` `lm` stands for linear model `se` stands for standard error ] .pull-right[ <img src="15c-model-assumptions_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> ] --- class: middle <div align = "center"> | y | Response | Birth weight | Numeric | |---|-------------|-----------------|---------| | x | Explanatory | Gestation | Numeric | --- ## Linear Equations Review .pull-left[ Recall from your previous math classes `\(y = mx + b\)` where `\(m\)` is the slope and `\(b\)` is the y-intercept e.g. `\(y = 2x -1\)` ] -- .pull-right[ ![](15c-model-assumptions_files/figure-html/unnamed-chunk-7-1.png)<!-- --> Notice anything different between baby weights plot and this one? ] --- class: middle .pull-left[ **Math** class `\(y = b + mx\)` `\(b\)` is y-intercept `\(m\)` is slope ] .pull-left[ **Stats** class `\(y_i = \beta_0 +\beta_1x_i + \epsilon_i\)` `\(\beta_0\)` is y-intercept `\(\beta_1\)` is slope `\(\epsilon_i\)` is error/residual `\(i = 1, 2, ...n\)` identifier for each point ] --- class: middle ```r model_g <- lm(bwt ~ gestation, data = babies) ``` `lm` stands for linear model. We are fitting a linear regression model. Note that the variables are entered in y ~ x order. --- ```r broom::tidy(model_g) ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -10.1 8.32 -1.21 2.27e- 1 ## 2 gestation 0.464 0.0297 15.6 3.22e-50 ``` -- `\(\hat {y}_i = b_0 + b_1 x_i\)` `\(\hat {\text{bwt}_i} = b_0 + b_1 \text{ gestation}_i\)` `\(\hat {\text{bwt}_i} = -10.1 + 0.464\text{ gestation}_i\)` --- class: middle ## Expected bwt for a baby with 300 days of gestation `\(\hat {\text{bwt}_i} = -10.1 + 0.464\text{ gestation}_i\)` `\(\hat {\text{bwt}} = -10.1 + 0.464 \times 300\)` `\(\hat {\text{bwt}} =\)` 129.1 For a baby with 300 days of gestation the expected birth weight is 129.1 ounces. --- ## Interpretation of estimates .pull-left[ <img src="15c-model-assumptions_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> `\(b_1 = 0.464\)` which means for one unit(day) increase in gestation period the expected increase in birth weight is 0.464 ounces. ] -- .pull-right[ <img src="15c-model-assumptions_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> `\(b_0 = -10.1\)` which means for gestation period of 0 days the expected birth weight is -10.1 ounces!!!!!!!! (does NOT make sense) ] --- class: middle ## Extrapolation - There is no such thing as 0 days of gestation. -- - Birth weight cannot possibly be -10.1 ounces. -- - Extrapolation happens when we use a model outside the range of the x-values that are observed. After all, we cannot really know how the model behaves (e.g. may be non-linear) outside of the scope of what we have observed. --- ## Baby number 148 .pull-left[ ```r babies %>% filter(case == 148) %>% select(bwt, gestation) ``` ``` ## # A tibble: 1 x 2 ## bwt gestation ## <int> <int> ## 1 160 300 ``` ] .pull-right[ ![](15c-model-assumptions_files/figure-html/unnamed-chunk-13-1.png)<!-- --> ] --- ## Baby #148 .pull-left[ **Expected** `\(\hat y_{148} = b_0 +b_1x_{148}\)` `\(\hat y_{148} = -10.1 + 0.464\times300\)` `\(\hat y_{148}\)` = 129.1 ] .pull-left[ **Observed** `\(y_{148} =\)` 160 ] --- ## Residual for `i = 148` .pull-left[ <img src="15c-model-assumptions_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> ] .pull-right[ `\(y_{148} = 160\)` <hr> `\(\hat y_{148}\)` = 129.1 <hr> `\(e_{148} = y_{148} - \hat y_{148}\)` `\(e_{148} =\)` 30.9 ] --- ## Least Squares Regression The goal is to minimize `$$e_1^2 + e_2^2 + ... + e_n^2$$` -- which can be rewritten as `$$\sum_{i = 1}^n e_i^2$$` --- class: middle ## Conditions for Least Squares Regression - Linearity - Normality of Residuals - Constant Variance - Independence --- .pull-left[ .center[**Linear**] ![](15c-model-assumptions_files/figure-html/unnamed-chunk-15-1.png)<!-- --> ] .pull-right[ .center[**Non-linear**] ![](15c-model-assumptions_files/figure-html/unnamed-chunk-16-1.png)<!-- --> ] --- .pull-left[ .center[**Nearly normal**] ![](15c-model-assumptions_files/figure-html/unnamed-chunk-17-1.png)<!-- --> ] .pull-right[ .center[**Not normal**] ![](15c-model-assumptions_files/figure-html/unnamed-chunk-18-1.png)<!-- --> ] --- .pull-left[ .center[**Constant Variance**] ![](15c-model-assumptions_files/figure-html/unnamed-chunk-19-1.png)<!-- --> ] .pull-right[ .center[**Non-constant variance**] ![](15c-model-assumptions_files/figure-html/unnamed-chunk-20-1.png)<!-- --> ] --- class: middle ## Independence Harder to check because we need to know how the data were collected. -- In the description of the dataset it says _[a study]considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area._ -- It is possible that babies born in the same hospital may have similar birth weight. -- Correlated data examples: patients within hospitals, students within schools, people within neighborhoods, time-series data. --- class: middle ### Inference: Confidence Interval ```r confint(model_g) ``` ``` ## 2.5 % 97.5 % ## (Intercept) -26.3915884 6.2632199 ## gestation 0.4059083 0.5226169 ``` Note that the confidence interval for the slope does not contain zero and all the values in the interval are positive indicating a possible positive relationship between gestation and birth weight. --- class: inverse middle center .font75[Transformations] --- ```r library(AmesHousing) ames_raw <- janitor::clean_names(ames_raw) glimpse(ames_raw) ``` ``` ## Rows: 2,930 ## Columns: 82 ## $ order <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1… ## $ pid <chr> "0526301100", "0526350040", "0526351… ## $ ms_sub_class <chr> "020", "020", "020", "020", "060", "… ## $ ms_zoning <chr> "RL", "RH", "RL", "RL", "RL", "RL", … ## $ lot_frontage <int> 141, 80, 81, 93, 74, 78, 41, 43, 39,… ## $ lot_area <int> 31770, 11622, 14267, 11160, 13830, 9… ## $ street <chr> "Pave", "Pave", "Pave", "Pave", "Pav… ## $ alley <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ lot_shape <chr> "IR1", "Reg", "IR1", "Reg", "IR1", "… ## $ land_contour <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "… ## $ utilities <chr> "AllPub", "AllPub", "AllPub", "AllPu… ## $ lot_config <chr> "Corner", "Inside", "Corner", "Corne… ## $ land_slope <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "… ## $ neighborhood <chr> "NAmes", "NAmes", "NAmes", "NAmes", … ## $ condition_1 <chr> "Norm", "Feedr", "Norm", "Norm", "No… ## $ condition_2 <chr> "Norm", "Norm", "Norm", "Norm", "Nor… ## $ bldg_type <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fa… ## $ house_style <chr> "1Story", "1Story", "1Story", "1Stor… ## $ overall_qual <int> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, … ## $ overall_cond <int> 5, 6, 6, 5, 5, 6, 5, 5, 5, 5, 5, 7, … ## $ year_built <int> 1960, 1961, 1958, 1968, 1997, 1998, … ## $ year_remod_add <int> 1960, 1961, 1958, 1968, 1998, 1998, … ## $ roof_style <chr> "Hip", "Gable", "Hip", "Hip", "Gable… ## $ roof_matl <chr> "CompShg", "CompShg", "CompShg", "Co… ## $ exterior_1st <chr> "BrkFace", "VinylSd", "Wd Sdng", "Br… ## $ exterior_2nd <chr> "Plywood", "VinylSd", "Wd Sdng", "Br… ## $ mas_vnr_type <chr> "Stone", "None", "BrkFace", "None", … ## $ mas_vnr_area <int> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0… ## $ exter_qual <chr> "TA", "TA", "TA", "Gd", "TA", "TA", … ## $ exter_cond <chr> "TA", "TA", "TA", "TA", "TA", "TA", … ## $ foundation <chr> "CBlock", "CBlock", "CBlock", "CBloc… ## $ bsmt_qual <chr> "TA", "TA", "TA", "TA", "Gd", "TA", … ## $ bsmt_cond <chr> "Gd", "TA", "TA", "TA", "TA", "TA", … ## $ bsmt_exposure <chr> "Gd", "No", "No", "No", "No", "No", … ## $ bsmt_fin_type_1 <chr> "BLQ", "Rec", "ALQ", "ALQ", "GLQ", "… ## $ bsmt_fin_sf_1 <int> 639, 468, 923, 1065, 791, 602, 616, … ## $ bsmt_fin_type_2 <chr> "Unf", "LwQ", "Unf", "Unf", "Unf", "… ## $ bsmt_fin_sf_2 <int> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ bsmt_unf_sf <int> 441, 270, 406, 1045, 137, 324, 722, … ## $ total_bsmt_sf <int> 1080, 882, 1329, 2110, 928, 926, 133… ## $ heating <chr> "GasA", "GasA", "GasA", "GasA", "Gas… ## $ heating_qc <chr> "Fa", "TA", "TA", "Ex", "Gd", "Ex", … ## $ central_air <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "… ## $ electrical <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", … ## $ x1st_flr_sf <int> 1656, 896, 1329, 2110, 928, 926, 133… ## $ x2nd_flr_sf <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, … ## $ low_qual_fin_sf <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ gr_liv_area <int> 1656, 896, 1329, 2110, 1629, 1604, 1… ## $ bsmt_full_bath <int> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, … ## $ bsmt_half_bath <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ full_bath <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, … ## $ half_bath <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, … ## $ bedroom_abv_gr <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, … ## $ kitchen_abv_gr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … ## $ kitchen_qual <chr> "TA", "TA", "Gd", "Ex", "TA", "Gd", … ## $ tot_rms_abv_grd <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, … ## $ functional <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "… ## $ fireplaces <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, … ## $ fireplace_qu <chr> "Gd", NA, NA, "TA", "TA", "Gd", NA, … ## $ garage_type <chr> "Attchd", "Attchd", "Attchd", "Attch… ## $ garage_yr_blt <int> 1960, 1961, 1958, 1968, 1997, 1998, … ## $ garage_finish <chr> "Fin", "Unf", "Unf", "Fin", "Fin", "… ## $ garage_cars <int> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, … ## $ garage_area <int> 528, 730, 312, 522, 482, 470, 582, 5… ## $ garage_qual <chr> "TA", "TA", "TA", "TA", "TA", "TA", … ## $ garage_cond <chr> "TA", "TA", "TA", "TA", "TA", "TA", … ## $ paved_drive <chr> "P", "Y", "Y", "Y", "Y", "Y", "Y", "… ## $ wood_deck_sf <int> 210, 140, 393, 0, 212, 360, 0, 0, 23… ## $ open_porch_sf <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60… ## $ enclosed_porch <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0… ## $ x3ssn_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ screen_porch <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0,… ## $ pool_area <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ pool_qc <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ fence <chr> NA, "MnPrv", NA, NA, "MnPrv", NA, NA… ## $ misc_feature <chr> NA, NA, "Gar2", NA, NA, NA, NA, NA, … ## $ misc_val <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ mo_sold <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, … ## $ yr_sold <int> 2010, 2010, 2010, 2010, 2010, 2010, … ## $ sale_type <chr> "WD ", "WD ", "WD ", "WD ", "WD ", "… ## $ sale_condition <chr> "Normal", "Normal", "Normal", "Norma… ## $ sale_price <int> 215000, 105000, 172000, 244000, 1899… ``` --- <img src="15c-model-assumptions_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> --- <img src="15c-model-assumptions_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" /> Note that log is natural log in R. --- ```r model_y <- lm(log(sale_price) ~ year_built, data = ames_raw) tidy(model_y) ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -4.33 0.387 -11.2 1.73e- 28 ## 2 year_built 0.00829 0.000196 42.3 4.45e-305 ``` `\({log(\hat y_i)} = b_0 + b_1x_{1i}\)` `\({log(\hat y_i)} = -4.33 + 0.00829x_{1i}\)` --- class: middle Estimated sale price of a house built in 1980 `\({log(\hat y_i)} = -4.33 + 0.00829 \times 1980\)` -- `\(e^{log(\hat y_i)} = e^{-4.33 + 0.00829 \times 1980}\)` -- `\(\hat y_i = e^{-4.33} \times e^ {0.00829 \times 1980} = 177052.2\)` -- For one-unit (year) increase in x, the y is multiplied by `\(e^{b_1}\)`.