In this section, we will discuss a popular and versatile approach to hypothesis testing on continuous data, the z-test, which makes use of the Central Limit Theorem (CLT). We will apply this test to the sleeping drug study.

Afterwards, we will see how the z-test is also helpful as an approximation when the data is discrete, such as in the mammography study.

Modeling choice for the sleeping drug study

When our data is binary, we are typically limited to the Bernoulli model and the corresponding binomial model for the number of targeted observations. When our data can take on continuous values, we have more choices. Depending on the application, we can use one of several well-known distributions, including the uniform, exponential, and normal distributions.

Recall the data collected for the sleeping drug study:

Suppose our candidate models for the difference in number of hours slept are the uniform and the Gaussian models. Both the support and the distribution are important considerations:

The support of a model is the set of values that the observations can take in the model. In the sleeping drug study, the number of hours slept in a day is bounded above, so the difference is also bounded. This points in favor of the uniform model, as it has a bounded support, while a Gaussian model always has unbounded support.

The distribution of a continuous model is based on the shape of the pdf. In model selection, this can be decided based on solving a theoretical model, looking at the empirical distribution of observations, or common knowledge. The number of hours slept by an adult is known to be centered around 8 hours, and outliers tend to be rare, so this points towards the Gaussian model for the sleeping drug study.

Weighing these two considerations, in the sleeping drug study, we select the normal distribution and then ensure that the variance parameter is sufficiently small, so that the probability of falling outside the realistic boundary is negligible.
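As a rough illustration of this trade-off, the sketch below computes how much probability a normal model places outside a bounded range. The bound of 24 hours on the absolute difference and the two standard deviations tried are assumptions chosen for illustration, not values from the study.

```python
from statistics import NormalDist

# Assumption: the difference in hours slept cannot exceed 24 in absolute value.
BOUND = 24.0

def mass_outside(mu, sigma, bound=BOUND):
    """Probability that N(mu, sigma^2) falls outside [-bound, bound]."""
    model = NormalDist(mu, sigma)
    return model.cdf(-bound) + (1.0 - model.cdf(bound))

# Hypothetical parameter choices, for illustration only.
small_sigma = mass_outside(mu=0.0, sigma=5.0)
large_sigma = mass_outside(mu=0.0, sigma=12.0)

print(f"sigma = 5:  {small_sigma:.2e}")
print(f"sigma = 12: {large_sigma:.2e}")
```

With the smaller standard deviation the mass outside the range is tiny, whereas with the larger one it is no longer negligible.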

Furthermore, we can argue towards a normal distribution by reasoning that the number of hours slept is a cumulative effect of a large number of biological and lifestyle variables. As a lot of these variables are unrelated to one another, the cumulative effect can be approximated by a normal distribution. This is justified by the Central Limit Theorem (CLT), which is covered in more detail below, and is the important result that establishes the z-test.

Central limit theorem (CLT) and the z-test statistic

Suppose that we have observations X_1, \ldots , X_n, which are independent and identically distributed according to a probability model. Under a few regularity assumptions (such as the model having a finite second moment), the distribution of the sample mean \overline{X} approaches a normal distribution as the sample size becomes sufficiently large (typically n \geq 30).

The central limit theorem (CLT) states that: When sampling random variables X_1,\ldots , X_{n} from a population with mean \mu and variance \sigma ^2, \bar{X} is approximately normally distributed with mean \mu and variance \sigma ^2/n when n is large:

\overline{X} := \frac{X_1 + X_2 + \ldots + X_n}{n} \sim \mathcal{N}\left(\mu , \frac{\sigma ^2}{n}\right) \qquad \text{for } n \text{ large}.
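The statement can be checked empirically with a small simulation. The sketch below is only an illustration: the choice of an Exponential(1) population (mean 1, variance 1), the sample size, the number of repetitions, and the seed are all assumptions. It compares the mean and variance of the simulated sample means with \mu and \sigma ^2/n, and checks how often the standardized sample mean falls within \pm 1.96, which a standard normal does about 95% of the time.

```python
import random
from math import sqrt
from statistics import fmean, pvariance

random.seed(0)  # fixed seed so the run is reproducible

N = 50        # sample size per experiment (assumed)
REPS = 5000   # number of repeated experiments (assumed)
MU = 1.0      # mean of the Exponential(rate=1) population
SIGMA2 = 1.0  # variance of the Exponential(rate=1) population

# Draw REPS samples of size N and record each sample mean.
sample_means = [
    fmean(random.expovariate(1.0) for _ in range(N))
    for _ in range(REPS)
]

mean_of_means = fmean(sample_means)
var_of_means = pvariance(sample_means)

# Standardize each sample mean and count how many land within +/- 1.96.
se = sqrt(SIGMA2 / N)
frac_within = sum(abs((m - MU) / se) <= 1.96 for m in sample_means) / REPS

print(f"mean of sample means:     {mean_of_means:.4f} (expected {MU})")
print(f"variance of sample means: {var_of_means:.4f} (expected {SIGMA2 / N})")
print(f"fraction with |z| <= 1.96: {frac_within:.3f} (normal: about 0.95)")
```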


Hence, we can define a test statistic \displaystyle z = \frac{\overline{X} - \mu }{\sigma /\sqrt{n}}, which approximately follows a standard normal distribution when n is large:

z = \frac{\bar{X} - \mu }{\sigma /\sqrt{n}} \sim \mathcal{N}(0,1).

The test statistic z is called an (approximate) pivotal quantity, since its (approximate) distribution does not depend on the parameters \mu or \sigma. We can use the cdf of a pivotal quantity to compute the p-value (the probability, under the null hypothesis, that the test statistic takes a value at least as extreme as the one observed), and compare the p-value with the significance level \alpha to decide whether to reject the null hypothesis H_0.
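A minimal sketch of this step is given below, using the standard normal cdf from Python's standard library. The function name p_value, the labels for the alternatives, the significance level 0.05, and the observed value 1.2 are all illustrative assumptions.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def p_value(z, alternative="less"):
    """p-value of an observed z under N(0, 1).

    alternative: "less", "greater", or "two-sided" (labels are illustrative).
    """
    if alternative == "less":
        return STD_NORMAL.cdf(z)
    if alternative == "greater":
        return 1.0 - STD_NORMAL.cdf(z)
    if alternative == "two-sided":
        return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))
    raise ValueError(f"unknown alternative: {alternative!r}")

ALPHA = 0.05       # assumed significance level
z_observed = 1.2   # hypothetical observed statistic
p = p_value(z_observed, alternative="greater")
print(f"p = {p:.4f}, reject H_0: {p < ALPHA}")
```

Which tail is used depends on the alternative hypothesis: the right tail for \mu > \mu_0, the left tail for \mu < \mu_0, and both tails for a two-sided alternative.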

Z-test in the sleeping drug study?

We are interested in testing the efficacy of a sleeping drug. The data collection process recorded the hours of sleep of 10 patients under the drug and under the placebo:

Now, we want to answer the question:

"Does the drug increase hours of sleep enough to matter?"

We model the difference of hours of sleep between the drug and the placebo for each patient as a normal random variable:

Model: X_1,..., X_{10} \sim \mathcal{N}(\mu , \sigma ^2) (X_1, for example, would be: 6.1 - 5.2 = 0.9).

From this, we state the hypotheses for a one-sided test:

Null hypothesis (H_0): \mu = 0

Alternative hypothesis (H_ A): \mu > 0.

Since the data X_i are modeled as independent Gaussians, the z-test statistic described above has a standard normal distribution under the null hypothesis H_0, even without using the central limit theorem.

We consider using z as the test statistic. However, to calculate z=\frac{\bar{X}}{\sigma /\sqrt{n}} (with \mu = 0 under H_0), we need to know the true value of the standard deviation \sigma. Since we do not know the population variance in this experiment, we cannot use the z-test.
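To make the dependence on \sigma concrete, the sketch below evaluates the z-statistic for several assumed values of \sigma. The list of differences is a made-up placeholder, not the study's data, and the function name z_statistic and the candidate values of \sigma are likewise assumptions.

```python
from math import sqrt
from statistics import fmean

def z_statistic(sample, mu0, sigma):
    """z = (sample mean - mu0) / (sigma / sqrt(n)); sigma must be supplied."""
    n = len(sample)
    return (fmean(sample) - mu0) / (sigma / sqrt(n))

# Placeholder differences for illustration only, NOT the study's data.
placeholder_diffs = [1.0, 0.0, 0.5, 1.5, -0.5, 0.5, 1.0, 0.0, 0.5, 0.5]

for assumed_sigma in (0.5, 1.0, 2.0):
    z = z_statistic(placeholder_diffs, mu0=0.0, sigma=assumed_sigma)
    print(f"assumed sigma = {assumed_sigma}: z = {z:.3f}")
```

The same sample yields a different z-value for each assumed \sigma, so the conclusion of the test would rest on a number the experiment does not provide.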

In general, if samples cannot be modeled as Gaussian variables, then the sample size also needs to be large in order to use the standard normal to approximate z using the CLT.

The t-test resolves both issues of the unknown true variance and the required large sample size.

Application to the mammography study

We conduct the z-test for the mammography study with the following model and hypotheses:

Model: X_1, \ldots , X_{31000} \stackrel{i.i.d.}{\sim } \text {Bernoulli}(\pi ) each indicating whether a patient in the treatment group dies of breast cancer.

Null hypothesis H_0: \pi = 63/31000; Alternative hypothesis H_ A: \pi < 63/31000.

As done in lecture 1, we have assumed in the null hypothesis that \pi = 63/31000\, \approx \, 0.00203 is the true reference value for the death rate without treatment. Hence, we will assume the true standard deviation of X to be the corresponding value \sigma = \sqrt{\pi (1-\pi )}\, \approx \, 0.045. The z-test statistic is:

z = \frac{\bar{X} - \pi }{\sigma /\sqrt{n}} = \frac{39/31000 - 63/31000}{\sqrt{(63/31000)(1-63/31000)}/\sqrt{31000}} \approx -3.0268. \qquad (3.1)

The p-value can be calculated from the area under the pdf of the standard normal distribution to the left of the z-value above.
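The computation in (3.1) can be reproduced with a short script. This is a sketch, not a reference solution: it uses NormalDist().cdf from Python's standard library as a stand-in for scipy.stats.norm.cdf (both evaluate the standard normal cdf), and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n = 31000             # patients in the treatment group
deaths_observed = 39  # observed deaths, taken from the numerator of (3.1)
pi_0 = 63 / n         # death rate under H_0

x_bar = deaths_observed / n
sigma = sqrt(pi_0 * (1 - pi_0))  # Bernoulli standard deviation under H_0
z = (x_bar - pi_0) / (sigma / sqrt(n))

# Left-tail area under the standard normal pdf, since H_A is pi < pi_0.
p_left = NormalDist().cdf(z)

print(f"sigma = {sigma:.3f}, z = {z:.4f}")
print(f"p-value = {p_left:.4f}")
```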

Z Test Exercise
Calculate the p-value for the mammography study using the z-test described above.

(Please enter the value with a precision of 4 digits after the decimal point. Hint: you could use the norm.cdf function in the scipy.stats package in Python.)

Let X_ i be the difference of hours of sleep between drug and placebo. What is a reasonable data generation model?

X_ i \sim \mathcal{N}(\mu ,\sigma ^2), X_ i independent of each other

X_ i \sim Poisson(\lambda ), X_ i independent of each other

X_ i \sim Binomial (n,p), X_ i independent of each other

Based on the given information, a reasonable data generation model for the difference in hours of sleep between drug and placebo (denoted as X_i) would be:

X_i \sim \mathcal{N}(\mu , \sigma ^2), where the X_i are independent of each other.

This means that each difference in hours of sleep is normally distributed with mean \mu and variance \sigma ^2.