Testing relationships

library(tidyverse)
library(ggplot2)
library(MASS)
library(mfp)
data("birthwt")
data("Pima.tr")
Platelet <- read.table("data/Platelet.txt", header=TRUE, sep="")
data(bodyfat, package="mfp")

Two sample t-test

For a two-sample $t$-test, we use the function t.test(). For example, using the {birthwt} data set, we can examine whether smoking during pregnancy and birthweight are related.

t.test(bwt~smoke, mu=0, alternative='two.sided', data=birthwt)

## 
##  Welch Two Sample t-test
## 
## data:  bwt by smoke
## t = 2.7299, df = 170.1, p-value = 0.007003
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##   78.57486 488.97860
## sample estimates:
## mean in group 0 mean in group 1 
##        3055.696        2771.919

The first argument to the t.test() function is the “formula” specifying the response variable and the factor (explanatory) variable in the form {response $\sim$ factor}. In this case, the response variable is {bwt} and the factor is {smoke}. We use the {data=birthwt} option to avoid having to write {birthwt$bwt $\sim$ birthwt$smoke}. The {mu} option specifies the difference between the population means under the null hypothesis, which is 0 here.
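As a minimal sketch (the object names {fit} and {fit_pooled} are ours, not part of the original example), the result of t.test() can be stored and its components extracted by name. By default t.test() runs Welch's test, which does not assume equal variances; setting {var.equal=TRUE} requests the classical pooled-variance test instead.

```r
library(MASS)
data("birthwt")

# Store the Welch test result instead of only printing it
fit <- t.test(bwt ~ smoke, mu = 0, alternative = "two.sided", data = birthwt)

fit$statistic   # t statistic
fit$parameter   # approximate (Welch) degrees of freedom
fit$p.value     # p-value
fit$conf.int    # 95% confidence interval for the difference in means
fit$estimate    # the two group means

# Pooled-variance version, which assumes equal variances in both groups
fit_pooled <- t.test(bwt ~ smoke, data = birthwt, var.equal = TRUE)
fit_pooled$parameter  # n - 2 degrees of freedom
```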

When the observations in the two groups are related (paired), we need to use the paired $t$-test. For example, suppose our alternative hypothesis is that platelet aggregation is lower before smoking than after, $H_{A}: \mu < 0$ versus $H_0: \mu = 0$, where $\mu$ is the population mean of the differences (before minus after). In R we still use the function {t.test()} to examine the support for these hypotheses, but this time, we set the argument {paired} to {TRUE}.

t.test(Platelet$Before, Platelet$After, alternative='less', paired=TRUE)

## 
##  Paired t-test
## 
## data:  Platelet$Before and Platelet$After
## t = -4.2716, df = 10, p-value = 0.0008164
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -5.913967
## sample estimates:
## mean of the differences 
##               -10.27273

The first argument to the function provides the first group of observations, and the second argument provides the second group of observations.

Removing the {paired=TRUE} option would ignore the dependence between the observations in the two groups. (In other words, we would use the independent two-sample $t$-test.)

t.test(Platelet$Before, Platelet$After, alternative='less')

## 
##  Welch Two Sample t-test
## 
## data:  Platelet$Before and Platelet$After
## t = -1.4164, df = 19.516, p-value = 0.08621
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf 2.251395
## sample estimates:
## mean of x mean of y 
##  42.18182  52.45455

The results are very different: the paired test gives a p-value of 0.0008, whereas the independent two-sample test gives 0.086. Ignoring the dependence between observations is inappropriate and might result in the wrong conclusions.
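One way to see what the paired test does: it is the same as a one-sample $t$-test on the within-pair differences. The sketch below checks this on simulated data. The vectors {before} and {after} are made-up values generated for illustration, not the Platelet data, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulated paired measurements (illustrative values only)
before <- rnorm(12, mean = 40, sd = 15)
after  <- before + rnorm(12, mean = 8, sd = 6)

# Paired t-test on the two vectors
paired <- t.test(before, after, alternative = "less", paired = TRUE)

# One-sample t-test on the differences (before minus after)
one_samp <- t.test(before - after, mu = 0, alternative = "less")

# Both comparisons should return TRUE
all.equal(unname(paired$statistic), unname(one_samp$statistic))
all.equal(paired$p.value, one_samp$p.value)
```

Because the test works on the differences, variation between subjects cancels out, which is why the paired analysis can detect an effect that the independent-samples analysis misses.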

Correlation test

To test hypotheses about a linear relationship between two numeric variables, we use Pearson’s correlation coefficient and the cor.test() function in R. The following code examines whether percent body fat and abdomen circumference from the {bodyfat} data set are positively correlated, $H_{A}: \rho > 0$ versus $H_{0}: \rho = 0$.

cor.test(bodyfat$siri, bodyfat$abdomen, alternative="greater")

## 
##  Pearson's product-moment correlation
## 
## data:  bodyfat$siri and bodyfat$abdomen
## t = 22.112, df = 250, p-value < 2.2e-16
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
##  0.77505 1.00000
## sample estimates:
##       cor 
## 0.8134323

The arguments to the cor.test() function are the two numeric variables, and the {alternative=“greater”} option specifies the alternative hypothesis $H_{A}: \rho > 0$. As before, the other options are “two.sided” and “less”.
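The reported $t$ statistic follows from the sample correlation $r$ and the number of pairs $n$ through $t = r\sqrt{n-2}/\sqrt{1-r^{2}}$, with $n-2$ degrees of freedom. A brief sketch that checks this against cor.test() is given below. The object names are ours, and the code assumes the {mfp} package is installed and that the two columns have no missing values.

```r
data(bodyfat, package = "mfp")

ct <- cor.test(bodyfat$siri, bodyfat$abdomen, alternative = "greater")

r <- cor(bodyfat$siri, bodyfat$abdomen)  # sample correlation
n <- nrow(bodyfat)                       # number of pairs (assumes no missing values)

# t statistic and one-sided p-value computed by hand
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_manual <- pt(t_manual, df = n - 2, lower.tail = FALSE)  # H_A: rho > 0

# Compare with the values stored in the cor.test() result
c(manual = t_manual, cor.test = unname(ct$statistic))
c(manual = p_manual, cor.test = ct$p.value)
```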

Chi-squared test

To test the relationship between two binary random variables, we use the $\chi^{2}$ test to compare the observed frequencies to the expected frequencies based on the null hypothesis. We use chisq.test() for this purpose. We can first create the contingency table using the table() function, and pass the resulting contingency table to chisq.test().

For example, the following code creates the contingency table for {smoke} by {low} from the {birthwt} data set, then it performs the $\chi^{2}$ test to examine their relationship:

birthwt.tab <- table(birthwt$smoke, birthwt$low)
birthwt.tab

##    
##      0  1
##   0 86 29
##   1 44 30

chisq.test(birthwt.tab)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  birthwt.tab
## X-squared = 4.2359, df = 1, p-value = 0.03958

Alternatively, we can pass the two variables directly to chisq.test():

chisq.test(birthwt$smoke, birthwt$low)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  birthwt$smoke and birthwt$low
## X-squared = 4.2359, df = 1, p-value = 0.03958
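For 2-by-2 tables, chisq.test() applies Yates' continuity correction by default, which is why the output header mentions it. The following sketch compares the default with the uncorrected test, requested through the {correct} argument (the object names {with_yates} and {without_yates} are ours):

```r
library(MASS)
data("birthwt")

birthwt.tab <- table(birthwt$smoke, birthwt$low)

with_yates    <- chisq.test(birthwt.tab)                   # default: correct = TRUE
without_yates <- chisq.test(birthwt.tab, correct = FALSE)  # plain Pearson statistic

# The correction reduces |observed - expected| by up to 0.5 in each cell,
# so the corrected statistic is the smaller of the two
c(corrected   = unname(with_yates$statistic),
  uncorrected = unname(without_yates$statistic))

c(corrected   = with_yates$p.value,
  uncorrected = without_yates$p.value)
```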

Use the rchisq() function to simulate 10000 samples from the $\chi^2(1)$ distribution and find the proportion of these samples with values greater than 4.23.
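One possible approach to this exercise is sketched below; the seed and the object names are arbitrary choices. The simulated proportion approximates the upper-tail probability that pchisq() computes exactly, and both should be close to the p-value reported by the test above.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

sims <- rchisq(10000, df = 1)   # 10000 draws from the chi-squared(1) distribution
prop_sim <- mean(sims > 4.23)   # proportion of draws greater than 4.23

# Exact upper-tail probability for comparison
prop_exact <- pchisq(4.23, df = 1, lower.tail = FALSE)

c(simulated = prop_sim, exact = prop_exact)
```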

If we only have a summary of the data in the form of a contingency table, as opposed to individual observations, we can enter the contingency table in R and perform the $\chi^{2}$ test as before. For example, consider the study investigating the relationship between aspirin intake and the risk of a heart attack. We can enter the given contingency table directly in R.

contTable <- matrix(c(189, 10845, 104, 10933), nrow=2, ncol=2, byrow=TRUE)
rownames(contTable) <- c('Placebo', 'Aspirin')
colnames(contTable) <- c('Heart attack', 'No heart attack')
contTable

##         Heart attack No heart attack
## Placebo          189           10845
## Aspirin          104           10933

output <- chisq.test(contTable, correct=FALSE)
output

## 
##  Pearson's Chi-squared test
## 
## data:  contTable
## X-squared = 25.014, df = 1, p-value = 5.692e-07

The first argument to the chisq.test() function is the contingency table of observed values, and {correct=FALSE} turns off the continuity correction. We have assigned the output of the function to a new object called {output}. From this object, we can obtain the observed and expected frequencies with the “$” operator.

output$observed

##         Heart attack No heart attack
## Placebo          189           10845
## Aspirin          104           10933

output$expected

##         Heart attack No heart attack
## Placebo     146.4801        10887.52
## Aspirin     146.5199        10890.48
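The expected frequencies are the counts we would see if the row and column variables were independent: row total times column total, divided by the grand total. A minimal sketch that reproduces them, and the test statistic, by hand is given below. The object names are ours, and the row and column labels are omitted because they do not affect the calculation.

```r
# Observed counts: rows are Placebo and Aspirin
obs <- matrix(c(189, 10845, 104, 10933), nrow = 2, byrow = TRUE)

# Expected count in each cell under independence
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
expected

# Pearson chi-squared statistic without continuity correction
x2 <- sum((obs - expected)^2 / expected)
x2

# Matching p-value, using (2 - 1) * (2 - 1) = 1 degree of freedom
pchisq(x2, df = 1, lower.tail = FALSE)
```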

Activity 1

We hypothesize that mean birthweight differs between babies whose mothers smoked during pregnancy and babies whose mothers did not. Use the {birthwt} data set to evaluate this hypothesis (with discussion). Set the significance level (cutoff) to 0.1. Also, use a boxplot to visualize the data. (In this data set, {smoke} shows the smoking status of mothers, and {bwt} shows birthweight in grams.)

Activity 2

Use the {Pima.tr} data set to find the difference between the sample means of diastolic blood pressure for diabetic and non-diabetic Pima Indian women. Is the difference between the means of diastolic blood pressure statistically significant at the 0.01 level?

Answer the above question for the number of pregnancies ({npreg}) and BMI.

Use boxplots to visualize the data.

Activity 3

Use the data set {cabbages} from the {MASS} package to examine the relationship between the vitamin C content and cultivars.

Use a boxplot to visualize the data.

Activity 4

Charles Darwin (1809-1882), author of The Origin of Species (1859), investigated the effect of cross-fertilization on the size of plants. The Data and Story Library has the results of one of his experiments (given by R.A. Fisher). In this experiment, pairs of plants, one cross- and one self-fertilized, were planted and grown in the same plot. The following table gives the difference in height (in eighths of an inch) for 15 pairs of plants (cross-fertilized minus self-fertilized Zea mays) raised by Charles Darwin. Use these data to evaluate the null hypothesis that the two methods are not different.

Difference: 49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48

Activity 5

Consider the following contingency table from a study conducted to investigate whether taking aspirin reduces the risk of heart attack. Use the log odds ratio test to evaluate the null hypothesis that there is no relationship between taking aspirin and the risk of heart attack.

	Heart attack	No heart attack
Placebo	189	10845
Aspirin	104	10933

Activity 6

Use the {birthwt} data set to examine the relationship between hypertension history ({ht}) and the risk of having a low-birthweight baby ({low}).

Activity 7

Use the {GBSG} (German Breast Cancer Study Group) data set from the {mfp} package to create a new variable called {rfs} (recurrence-free survival) such that {rfs=‘No’} if the patient had at least one recurrence or died (i.e., {cens=1}), and {rfs=‘Yes’} otherwise. Use the data to investigate whether recurrence-free survival is related to hormonal therapy. Note that in {GBSG}, the variable {htreat} indicates whether a patient has received hormonal therapy or not.

Activity 8

For the Pima Indian women population, find the sample correlation coefficient between BMI and diastolic blood pressure. Is the correlation between these two variables statistically significant at the 0.01 level?

Activity 9

Use the ‘BodyTemperature.txt’ data file to estimate the correlation coefficient between normal body temperature and heart rate. Is the correlation between these two variables statistically significant at the 0.01 level? How about the correlation between age and normal body temperature?

Activity 10 (Extra)

Read the article ‘Caloric restriction improves memory in elderly humans’. This paper is available online. What was their estimate of the correlation coefficient between memory score and insulin level? Was the correlation statistically significant at the 0.1 level?

Activity 11 (Extra)

Read the paper ‘A Critical Appraisal of $98.6\,^{\circ}\mathrm{F}$, the Upper Limit of the Normal Body Temperature, and Other Legacies of Carl Reinhold August Wunderlich’ by Mackowiak et al. The paper is available online. What method did they use to evaluate the relationship between gender and body temperature? What did they find? What was their conclusion about the relationship between race and body temperature?

Activity 12 (Extra)

Read the paper by Kettunen et al. on the effect of arthroscopy in patients with chronic patellofemoral pain syndrome. This paper is available online.

+ What are the point estimate and 95% confidence interval for the mean improvement in the Kujala score for each treatment group?
+ Was the difference between the two groups in terms of mean improvement in the Kujala score statistically significant?
+ Based on the results published in this paper, create a contingency table, where the row variable is the treatment group, and the column variable is an indicator that is equal to 1 if the patient reports at least moderate improvement at the end of the follow-up period, and 0 otherwise. Use the log odds ratio and $\chi^{2}$ tests to investigate whether there is a relationship between the type of treatment and reporting at least moderate improvement.