library(tidyverse)
library(MASS)
data(Pima.tr)

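To perform the \(z\)-test in R, we can use the function pnorm() to find the \(p\)-value. For the body temperature example discussed at the beginning of this chapter, the \(z\)-score was -1. For a one-sided alternative hypothesis of the form \(H_{A}: \mu < \mu_{0}\), we find the lower tail probability of -1. Below, we pass the observed sample mean (98.4) along with the null mean (98.6) and the standard deviation of the sample mean (1/5), which is equivalent to calling pnorm(-1):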

pnorm(98.4, mean=98.6, sd=1/5, lower.tail=TRUE)
## [1] 0.1586553

Remember to specify the option lower.tail=FALSE to get the upper tail probability. For a two-sided hypothesis, we multiply the above probability by 2. A similar approach is used for testing one-sided or two-sided hypotheses regarding a population proportion.
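The tail choices described above can be collected in one place. The following is a minimal sketch rather than part of the original example: the helper z_pvalue() and its argument names are hypothetical, and the proportion figures (60 out of 200, null value 0.25) are made up purely for illustration. For the proportion, we assume the usual large-sample standard error based on the null value.

```r
# Hypothetical helper: p-value of a z-test given an estimate, the null
# value, and the standard error of the estimate.
z_pvalue <- function(estimate, null.value, se,
                     alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  z <- (estimate - null.value) / se
  switch(alternative,
         less      = pnorm(z, lower.tail = TRUE),    # lower tail
         greater   = pnorm(z, lower.tail = FALSE),   # upper tail
         two.sided = 2 * pnorm(-abs(z)))             # both tails
}

# Body temperature example: sample mean 98.4, null mean 98.6, se 1/5
p.lower <- z_pvalue(98.4, 98.6, se = 1/5, alternative = "less")
p.two   <- z_pvalue(98.4, 98.6, se = 1/5, alternative = "two.sided")

# Illustrative proportion test (made-up numbers, not from the text)
n      <- 200
p0     <- 0.25
p.hat  <- 60 / n
p.prop <- z_pvalue(p.hat, p0, se = sqrt(p0 * (1 - p0) / n),
                   alternative = "greater")
```

Here p.lower reproduces the lower tail probability shown above, and p.two is twice that value.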

When \(\sigma^{2}\) is unknown and we need to estimate it from the data, we use the \(t\)-test to evaluate hypotheses regarding the mean of a normal distribution. For the BMI example, we found that the \(t\)-score was \(t = 5.33\). For a one-sided alternative hypothesis of the form \(H_{A}: \mu > \mu_{0}\), we need to find the upper tail probability of 5.33 from a \(t\) distribution with \(n-1\) degrees of freedom, where \(n=200\) in this example. We use the pt() function:

pt(5.33, df=199, lower.tail=FALSE)
## [1] 1.324778e-07
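The \(t\)-score used above can also be recomputed directly from the data. This is a minimal sketch, assuming the null value \(\mu_{0} = 30\) (the value passed to t.test() later in this section) and the bmi column of Pima.tr; the variable names are illustrative.

```r
library(MASS)   # provides the Pima.tr data set

bmi <- Pima.tr$bmi
n   <- length(bmi)                  # sample size
se  <- sd(bmi) / sqrt(n)            # estimated standard error of the mean
t.score <- (mean(bmi) - 30) / se    # assumed null value mu0 = 30

# Upper tail probability from a t distribution with n - 1 degrees of freedom
p.upper <- pt(t.score, df = n - 1, lower.tail = FALSE)
```

Because t.score is not rounded to 5.33, p.upper differs slightly from the value printed above.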

Alternatively, instead of calculating the \(t\)-score and finding the appropriate tail probabilities to obtain the \(p\)-value, we can use the function t.test(). For the BMI example, we use this function as follows:

t.test(x=Pima.tr$bmi, mu = 30)
## 
##  One Sample t-test
## 
## data:  Pima.tr$bmi
## t = 5.3291, df = 199, p-value = 2.661e-07
## alternative hypothesis: true mean is not equal to 30
## 95 percent confidence interval:
##  31.45521 33.16479
## sample estimates:
## mean of x 
##     32.31

Here, the argument x is a (non-empty) numeric vector of data values, and mu is the population mean according to the null hypothesis. Notice that the output provides the \(t\)-score (t), the degrees of freedom (df), and the \(p\)-value. Additionally, it provides the sample mean, \(\bar{x} = 32.31\), and the 95% confidence interval for the population mean, \([31.46, 33.16]\). We can estimate the interval at other confidence levels (instead of 0.95) by using the option conf.level:

t.test(x=Pima.tr$bmi, mu = 30, conf.level=0.9)
## 
##  One Sample t-test
## 
## data:  Pima.tr$bmi
## t = 5.3291, df = 199, p-value = 2.661e-07
## alternative hypothesis: true mean is not equal to 30
## 90 percent confidence interval:
##  31.59367 33.02633
## sample estimates:
## mean of x 
##     32.31

Note that only the confidence interval estimate changes; the parts that are related to hypothesis testing remain as before.
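By default, t.test() reports a two-sided test. As a hedged sketch, the one-sided version for \(H_{A}: \mu > 30\) can be requested through the alternative argument, and individual pieces of the result can be extracted by name; the object name res is illustrative.

```r
library(MASS)   # provides the Pima.tr data set

# One-sided test with alternative hypothesis "true mean is greater than 30"
res <- t.test(x = Pima.tr$bmi, mu = 30, alternative = "greater")

res$statistic   # t-score
res$parameter   # degrees of freedom
res$p.value     # upper tail probability
res$conf.int    # one-sided interval: a lower bound and Inf
```

With alternative = "greater" and a positive \(t\)-score, the reported \(p\)-value is half of the two-sided value.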

Activity 1

Obtain the 95% confidence interval for the population mean of BMI among Pima Indian women, assuming the population variance is 36.

Obtain the 95% confidence interval for the population mean of BMI among Pima Indian women, assuming we do not know the population variance.
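As a reminder, and assuming the standard normal-theory intervals are the ones intended here, the two cases differ only in the multiplier and in whether \(\sigma\) or the sample standard deviation \(s\) is used:

```latex
\[
\bar{x} \pm z_{\mathrm{crit}} \, \frac{\sigma}{\sqrt{n}}
\quad (\sigma^{2} \text{ known}),
\qquad
\bar{x} \pm t_{\mathrm{crit}} \, \frac{s}{\sqrt{n}}
\quad (\sigma^{2} \text{ unknown})
\]
```

For a 95% interval, \(z_{\mathrm{crit}}\) is the 0.975 quantile of the standard normal distribution, qnorm(0.975), and \(t_{\mathrm{crit}}\) is the 0.975 quantile of the \(t\) distribution with \(n-1\) degrees of freedom, qt(0.975, df = n - 1). The symbols \(z_{\mathrm{crit}}\) and \(t_{\mathrm{crit}}\) are notation introduced here for illustration.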

Activity 2

Suppose the population mean of systolic blood pressure in the US is 115. We hypothesize that mean systolic blood pressure is lower than 115 among people who consume a small amount (e.g., around 3.5 ounces) of dark chocolate every day. Assume that systolic blood pressure, \(X\), in this population has a \(N(\mu, \sigma^{2})\) distribution. To evaluate our hypothesis, we randomly selected 25 people who include a small amount of dark chocolate in their daily diet and measured their blood pressure.

Can we reject the null hypothesis at the 0.1 significance level

  • If the sample mean is \(\bar{x} = 113\) and the population variance is \(\sigma^2 = 25\)?
pnorm(113, mean=115, sd=1, lower.tail = TRUE)
## [1] 0.02275013
  • Simulate 100000 sample means from the null distribution and find the proportion of them whose value is below 113.
x.bar <- rnorm(100000, mean=115, sd=1)
mean(x.bar < 113)
## [1] 0.02298
  • If the sample mean is \(\bar{x} = 113\) and we do not know the population variance but we know that the sample variance is \(s^2 = 25\)?
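# t-score: (sample mean - null mean) / (s / sqrt(n)); named t.score to avoid masking t()
t.score <- (113 - 115) / (5 / sqrt(25))
pt(t.score, df = 24, lower.tail = TRUE)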
## [1] 0.02846992
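The simulation idea from the second bullet carries over to the unknown-variance case. The sketch below is an illustration under stated assumptions: it draws samples of size 25 from a normal distribution with the null mean 115 and an arbitrary standard deviation of 5 (the null distribution of the \(t\)-score does not depend on this choice), and it uses 20000 replications and a fixed seed for convenience.

```r
set.seed(1)        # arbitrary seed, for reproducibility
n.sim <- 20000     # number of simulated samples (assumed)
n     <- 25        # sample size from the activity

# Each row is one simulated sample under the null hypothesis
x <- matrix(rnorm(n.sim * n, mean = 115, sd = 5), nrow = n.sim)

x.bar <- rowMeans(x)                               # sample means
s     <- sqrt(rowSums((x - x.bar)^2) / (n - 1))    # sample standard deviations
t.sim <- (x.bar - 115) / (s / sqrt(n))             # simulated t-scores

# Proportion of simulated t-scores at or below the observed t-score of -2
p.sim <- mean(t.sim <= -2)
p.sim
```

The resulting proportion should be close to the value returned by pt(-2, df = 24).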

Activity 3

Use the Pima.tr data set to evaluate the hypothesis that the population mean of diastolic blood pressure for Pima Indian women is not 70.


Activity 4

Consider the problem of estimating the proportion of people who regularly smoke. We use \(X\) to denote smoking status, and \(\mu\) to denote the population proportion of people who smoke. We hypothesize that the population proportion is less than 0.2. Write down the null and alternative hypotheses. Suppose we interview 150 people and find that 27 of them smoke regularly. Evaluate the null hypothesis.


Activity 5

We believe that the population mean of normal body temperature is less than the widely accepted value of \(98.6\,^{\circ}\mathrm{F}\). Write down the null hypothesis and evaluate it using the ‘BodyTemperature.txt’ data.


Activity 6

Download the ‘BodyTemperature.txt’ data set from the course website. For the heart rate variable, we want to evaluate the following hypotheses. We set the significance level (cutoff) to 0.01.

  • Evaluate the hypothesis that the population mean is less than 75. (Write down the null and alternative hypotheses. Discuss your findings.)

  • Evaluate the hypothesis that the population mean is different from 75. (Write down the null and alternative hypotheses. Discuss your findings.)


Activity 7

  • We hypothesize that more than 5% of pregnant women have a history of hypertension. Write down the null and alternative hypotheses. Use the birthwt data set (available from the MASS package) to evaluate this hypothesis (with discussion). We set the significance level (cutoff) to 0.05. (In the birthwt data set, the variable ht shows the hypertension history: ht=1 when a woman has a history of hypertension, ht=0 otherwise.)