Random variables

Activity 1

What would be the most appropriate probability distribution for each of the following random variables?

Whether a tumor is benign or malignant
Number of people with a malignant tumor out of 10 patients with a tumor
Size of tumors
Number of people diagnosed with malignant tumor in California every year

Activity 2

Consider the following plots. Write down these probabilities:

$P(X < 3)$
$P(1< X \le 4)$

Activity 3

Consider the following plots. Write down these probabilities:

$P(Y > 5)$

Activity 4

You can use rbinom to sample from a binomial distribution.

Suppose the probability of a specific disease is 0.2 and we want to know the probability of observing 3 out of 10 people affected by the disease: $P(Y=3)$. We can use dbinom, which returns the probability of a specific value.

dbinom(x=3, size = 10, prob = 0.2)

## [1] 0.2013266
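As a quick check, the same value can be reproduced from the binomial pmf formula $P(Y=k) = \binom{n}{k} p^k (1-p)^{n-k}$. The sketch below is only an illustration; the variable names n, k, p, by_hand, and by_dbinom are chosen here and do not appear in the original activity.

```r
# Binomial pmf computed by hand: choose(n, k) * p^k * (1 - p)^(n - k)
n <- 10   # number of people
k <- 3    # number affected
p <- 0.2  # probability of the disease

by_hand   <- choose(n, k) * p^k * (1 - p)^(n - k)
by_dbinom <- dbinom(x = k, size = n, prob = p)

# Both approaches should agree up to floating-point error
all.equal(by_hand, by_dbinom)
```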

Find the probability of observing 3 or fewer affected patients.

Note that Binomial(1, 0.2) is the same as Bernoulli(0.2):

rbinom(n=10, size = 1, prob = 0.2)

##  [1] 1 0 0 1 0 0 0 1 0 0

rbinom(n=10, size = 10, prob = 0.2)

##  [1] 4 2 2 0 3 4 3 3 2 4
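The Bernoulli equivalence noted above can also be seen on the pmf rather than on random draws. The following is a minimal sketch; the names bern_pmf and one_binom_draw and the seed value are illustrative assumptions, not part of the original handout.

```r
# pmf of Binomial(1, 0.2) evaluated at 0 and 1:
# this is the Bernoulli(0.2) pmf, P(Y = 0) = 0.8 and P(Y = 1) = 0.2
bern_pmf <- dbinom(0:1, size = 1, prob = 0.2)
bern_pmf

# Summing 10 independent Bernoulli(0.2) draws yields one draw
# from Binomial(10, 0.2)
set.seed(1)  # arbitrary seed, used only to make the draw reproducible
one_binom_draw <- sum(rbinom(n = 10, size = 1, prob = 0.2))
one_binom_draw
```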

We can plot the probability mass function (pmf).

x <- 0:10
pmf <- dbinom(x, size=10, prob=0.2)
plot(x, pmf, type="h", xlab="Number of Successes", ylab="Probability Mass", main="Binomial(10, 0.2)")
points(x,pmf, pch=16)
abline(h=0, col="gray")

Or, we can use ggplot:

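library(ggplot2)  # provides ggplot(), geom_point(), geom_segment()

df <- data.frame(x = x, y = pmf)
# Draw each probability as a point on top of a vertical segment starting at 0
ggplot(data = df, aes(x = x, y = y, xend = x, yend = 0)) +
  geom_point() + geom_segment() +
  xlab("Number of Successes") + ylab("Probability Mass") +
  scale_x_continuous(breaks = x)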

Now generate 1000 samples from the Binomial(10, 0.2) distribution and plot the distribution of the resulting data.

Again, suppose that we are interested in the probability of observing 3 or fewer affected people in a group of 10. We could of course sum the values of the pmf: $P(Y \leq 3) = P(Y=0) + P(Y=1) + P(Y=2) + P(Y=3)$. However, it is easier to use pbinom, the cumulative distribution function for a binomial random variable, to obtain the lower tail probability:

pbinom(3, size=10, prob=0.2, lower.tail=TRUE)

## [1] 0.8791261
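To confirm that the two routes agree, the sketch below sums the pmf over 0 to 3 and compares the result with pbinom. The names tail_by_sum and tail_by_cdf are illustrative only.

```r
# Lower tail probability P(Y <= 3) for Y ~ Binomial(10, 0.2), two ways
tail_by_sum <- sum(dbinom(0:3, size = 10, prob = 0.2))  # add up the pmf
tail_by_cdf <- pbinom(3, size = 10, prob = 0.2)         # use the cdf directly

# The two values should match up to floating-point error
all.equal(tail_by_sum, tail_by_cdf)
```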

By changing the lower.tail option to FALSE, we can find the upper tail probability $P(Y>3)$.
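A minimal sketch of the upper tail calculation follows; the variable names are illustrative. Note that for a discrete variable the upper tail returned by pbinom is strict, so $P(Y > 3)$ and $P(Y \ge 3)$ differ by $P(Y = 3)$.

```r
# Upper tail P(Y > 3) for Y ~ Binomial(10, 0.2)
upper_tail <- pbinom(3, size = 10, prob = 0.2, lower.tail = FALSE)
lower_tail <- pbinom(3, size = 10, prob = 0.2, lower.tail = TRUE)

# The two tails are complements of each other
all.equal(upper_tail, 1 - lower_tail)

# P(Y >= 3) includes the value 3, so shift the cutoff down by one
at_least_3 <- pbinom(2, size = 10, prob = 0.2, lower.tail = FALSE)
```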

Activity 5

Suppose BMI in a specific population has a normal distribution with mean of 25 and variance of 16: $X \sim N(25, 16)$. Then we can simulate 5 values from this distribution using the rnorm function.

rnorm(n=5, mean=25, sd=4)

## [1] 21.97157 25.66485 22.40555 20.73714 22.70862

These numbers can be regarded as BMI values for 5 randomly selected people from this population. In the rnorm function, the first parameter is the number of samples, the second parameter is the mean, and the third parameter is the standard deviation (not the variance).
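Because the distribution is written in terms of its variance, one way to avoid mixing up the two is to compute the standard deviation explicitly. This is a sketch only; sigma2, sigma, bmi, the sample size, and the seed are assumptions made for illustration.

```r
# N(25, 16) is specified by its variance, but rnorm expects the sd
sigma2 <- 16
sigma  <- sqrt(sigma2)  # standard deviation = 4

set.seed(123)  # arbitrary seed, used only for reproducibility
bmi <- rnorm(n = 1e5, mean = 25, sd = sigma)

# With a large sample, the sample mean and variance should be
# close to 25 and 16, respectively
mean(bmi)
var(bmi)
```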

You can also plot the pdf:

x <- seq(from=10, to=40, length=100)
fx<- dnorm(x, mean=25, sd=4)
plot(x, fx, type="l", xlab="BMI", ylab="Density", main="N(25, 16)")
abline(h=0, col="gray")

Or, we can use ggplot:

df <- data.frame(x=x, y=fx)
ggplot(data = df, aes(x =x)) + 
  geom_function(fun = dnorm, args = list(mean = 25, sd = 4))+
  xlab("BMI") + ylab("Density")

Now generate 1000 samples from $N(25, 16)$ and plot the distribution of the resulting data.

Remember that for continuous variables the probability of a specific value is always zero. Instead, for continuous variables, we are interested in the probability of observing a value in a given interval. For instance, the probability of observing a BMI less than or equal to 18.5 is the area under the density curve to the left of 18.5. In R, we find this probability with the cumulative distribution function pnorm:

pnorm(18.5, mean=25, sd=4, lower.tail=TRUE)

## [1] 0.05208128
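The "area under the density curve" interpretation can be checked numerically. The sketch below integrates the density with the base R function integrate and also standardizes the cutoff to a z-score; the variable names are illustrative.

```r
# Area under the N(25, 16) density to the left of 18.5, by numerical integration
area <- integrate(dnorm, lower = -Inf, upper = 18.5, mean = 25, sd = 4)$value

# The same probability from the cumulative distribution function
cdf <- pnorm(18.5, mean = 25, sd = 4)

# Equivalent calculation on the standard normal scale
z   <- (18.5 - 25) / 4  # z-score of 18.5
std <- pnorm(z)         # uses the defaults mean = 0, sd = 1

c(area = area, cdf = cdf, std = std)
```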

Once again, we can find the upper tail probability $P(X > 22)$ by setting the option lower.tail=FALSE.
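A short sketch of this upper tail calculation follows, with illustrative variable names. It also uses the symmetry of the normal density around its mean: since 22 is 3 units below 25, $P(X > 22)$ equals $P(X \le 28)$.

```r
# Upper tail P(X > 22) for X ~ N(25, 16)
upper_tail <- pnorm(22, mean = 25, sd = 4, lower.tail = FALSE)
lower_tail <- pnorm(22, mean = 25, sd = 4, lower.tail = TRUE)

# Complement rule: P(X > 22) = 1 - P(X <= 22)
all.equal(upper_tail, 1 - lower_tail)

# Symmetry around the mean: P(X > 25 - 3) = P(X <= 25 + 3)
mirror <- pnorm(28, mean = 25, sd = 4)
all.equal(upper_tail, mirror)
```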

The qnorm function returns quantiles of a normal distribution. For example, the 0.05 quantile of the above distribution is

qnorm(0.05, mean=25, sd=4, lower.tail=TRUE)

## [1] 18.42059
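Since qnorm is the inverse of pnorm, plugging the quantile back into pnorm should recover the original probability. The sketch below illustrates this and the link to the standard normal quantile; the names q05, p_back, and q05_std are illustrative.

```r
# 0.05 quantile of N(25, 16)
q05 <- qnorm(0.05, mean = 25, sd = 4)

# pnorm undoes qnorm: this should return 0.05
p_back <- pnorm(q05, mean = 25, sd = 4)

# Same quantile obtained by rescaling the standard normal quantile
q05_std <- 25 + 4 * qnorm(0.05)

all.equal(q05, q05_std)
```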

Now find $P(25 < X \le 30)$.

Activity 6 (Extra)

Consider the Binomial(20, 0.3) distribution. Do the following tasks:

Use R to plot the probability mass function and cumulative distribution function.
Write down the mean and standard deviation of this distribution.
Find the lower tail probability of 4.
What is the probability that the value of the random variable is 2?
What is the probability that the value of the random variable is bigger than 2 and less than or equal to 4?

Activity 7 (Extra)

Consider the $N(3, 2.1)$ distribution. Do the following tasks:

Use R to plot the probability density function and cumulative distribution function.
Write down the mean and standard deviation of this distribution.
Find the lower tail probability of 4.
What is the probability that the value of the random variable is 2?
What is the probability that the value of the random variable is bigger than 2 and less than or equal to 4?

Activity 8 (Extra)

For the probability distributions Binomial(100, 0.3) and $N(30, 21)$, find the lower tail probability of 35 and the upper tail probability of 27. Compare the results based on the two distributions.

Activity 9 (Extra)

Suppose $X$ has a $t$-distribution with 6 degrees of freedom.

Find the lower tail probabilities of $-1$ and $1.5$.
Find the 0.95 and 0.9 quantiles.

Activity 10 (Extra)

The National Heart, Lung, and Blood Institute defines the following categories based on systolic blood pressure ($SBP$):

Normal: $SBP \le 120$
Prehypertension: $120 < SBP \le 140$
High blood pressure: $SBP > 140$

If $SBP$ in the US has a normal distribution such that $SBP \sim N(125, 15^{2})$,

Use R to find the probability of each group.
Find the intervals that include 68, 95, and 99.7% of the population.
What are the lower and upper tail probabilities for $SBP$ equal to 115?

Activity 11 (Extra)

Assume that BMI in the US has a $N(27, 6^2)$ distribution. Following the recommendation by the National Heart, Lung, and Blood Institute, we define the following BMI categories:

Underweight: $BMI \le 18.5$
Normal weight: $18.5 < BMI \le 25$
Overweight: $25 < BMI \le 30$
Obesity: $BMI > 30$

Use R to find the probability of each group.
Find the intervals that include 68, 95, and 99.7% of the population.
What is the probability of being either underweight OR obese (i.e., the union of the two intervals)?
What are the lower and upper tail probabilities for BMI equal to 29.2?

Activity 12 (Extra)

For the above question, we denote BMI as $X$. Find the value $x$ such that $P(X \le x) = 0.2$. Next, find the value $x$ such that $P(X > x) = 0.2$.

Activity 13 (Extra)

If the height (in inches) of newborn babies has the $N(18, 1)$ distribution, what is the probability that the height of a newborn baby is between 17 and 20 inches? What is the distribution of height in centimeters (1 inch = 2.54 cm)? Using this distribution, what is the probability that the height of a newborn baby is between 43.18 cm (17 inches) and 50.80 cm (20 inches)?

Activity 14 (Extra)

Suppose the distribution of systolic blood pressure, $X$, among people suffering from hypertension is $N(153, 4^{2})$. Further, suppose that researchers have found a new treatment that drops systolic blood pressure by 4 points on average. The effect of the drug, $Y$, varies randomly among patients and does not depend on their current blood pressure level. If the variance of $Y$ is 1, what are the mean (expectation) and variance of systolic blood pressure if every person in the population starts using the drug? What is the distribution of systolic blood pressure in this case if we assume $Y$ has a normal distribution?