The Two-Sample Location Problem: Don't Do A Preliminary Test For Equal Variances

Ah... the two-sample t-test. It's old, it's classic, and it's still used thousands of times every day.

There's a lot we can learn from the ol' two-sample t-test. Yes, it can tell us, at Type I error level α, whether the means of two independent populations are the same or different, based on a random sample from each. But how we use the t-test has some lessons for how we conduct data analysis in general.

In fact, this story hints at the impossibility of truly "objective" analysis and suggests that, despite the robo-hysteria, an algorithmic approach to data analysis is unlikely to replace statisticians anytime soon. Or so I hope.

The story goes something like this:

The problem

The two-sample location problem in the typical setup (i.e., ignoring the non-parametric tests) involves two means from independent populations.

Formally, the populations should be normally distributed so that the sample means are exactly normally distributed, but in practice, we often invoke asymptotic arguments that tell us, for large sample sizes, the sample means are approximately normally distributed. 

To characterize a normal distribution, we need to know the mean and the variance (or standard deviation). Most of the time, we don't know the variance (and we obviously don't know the mean), so we must estimate both. When we're dealing with a single population, this is easy and leads to the one-sample t-test. When we're dealing with the difference of sample means, it's a bit more complicated.

The fork in the road

The variance question leads to a fork in the road, with one path leading to Welch's test and the other to the pooled two-sample t-test. The simple heuristic taught in intro stats classes is: 

  • If you know the variances in the two approximately normal populations are equal, use the pooled two-sample t-test.
  • If you know the variances in the two populations are not equal, use Welch’s test.
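
For a concrete picture of the fork, here is a minimal Python sketch, assuming SciPy is available. The data are simulated and purely illustrative (the sample sizes and standard deviations are my own choices, not from any real study). In `scipy.stats.ttest_ind`, the only difference between the two paths is the `equal_var` argument.

```python
import numpy as np
from scipy import stats

# Hypothetical data: two normal samples with unequal sizes and unequal spreads.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=30)
y = rng.normal(loc=0.0, scale=2.0, size=10)

# Path 1: pooled two-sample t-test (assumes equal population variances).
pooled = stats.ttest_ind(x, y, equal_var=True)

# Path 2: Welch's test (does not assume equal population variances).
welch = stats.ttest_ind(x, y, equal_var=False)

print("pooled: t = %.3f, p = %.3f" % (pooled.statistic, pooled.pvalue))
print("Welch:  t = %.3f, p = %.3f" % (welch.statistic, welch.pvalue))
```

Both tests use the same difference in sample means. They differ in how they estimate its standard error and in the degrees of freedom, which is why the two results generally disagree when sample sizes are unbalanced.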

But if you don't know if the variances are equal (the "Behrens-Fisher problem"), you're up a tree. What procedure should a good statistician use?

A well-intentioned but misleading solution

In the 80s or 90s, some researchers decided that a simple solution to this problem is to create a conditional procedure, so named because the choice of the second test is conditional on the results of the first:

  • First, conduct a test for equality of variances, using either an F-test or Levene’s test.
  • Second, we can use the simple heuristic: use the pooled two-sample t-test if we fail to reject the null in the first test, and use Welch's test if we reject the null in the first test.
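
To make the two steps concrete, here is a rough Python sketch of the conditional procedure, assuming SciPy. The function name `conditional_t_test` and the `alpha_pre` parameter are hypothetical names of mine, and `scipy.stats.levene` defaults to the median-centered variant of Levene's test, which may differ from what a given package uses. This is shown to illustrate the procedure being criticized, not to recommend it.

```python
import numpy as np
from scipy import stats

def conditional_t_test(x, y, alpha_pre=0.05):
    """Hypothetical helper: Levene's test first, then pooled or Welch's t-test."""
    # Step 1: preliminary test for equal variances
    # (SciPy's default is the median-centered Levene variant).
    p_levene = stats.levene(x, y).pvalue

    # Step 2: fail to reject -> pooled t-test; reject -> Welch's test.
    equal_var = bool(p_levene >= alpha_pre)
    result = stats.ttest_ind(x, y, equal_var=equal_var)
    return result.pvalue, ("pooled" if equal_var else "welch")

# Illustrative usage on simulated data with very different spreads.
rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=200)
b = rng.normal(0.0, 10.0, size=200)
p_value, chosen = conditional_t_test(a, b)
print(chosen, p_value)
```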

Implicit in this reasoning is a conjecture that this procedure leads to greater power, since the pooled t-test has greater power than Welch’s test when the variances of the populations are equal. Another potential advantage of this procedure is the appearance of the “unbiased statistician,” since the procedure requires no prior knowledge or intuition in selecting tests.

This procedure has gained sufficient traction to become incorporated in statistical packages such as SPSS, noted on Wikipedia, and described on “how-to” blogs. As shown below, SPSS provides the results of Levene’s test alongside outputs from the pooled t-test and Welch’s test. This presentation of results, while not forcing the user to select one test, certainly implies the user would be foolish to select Welch’s test if Levene’s test fails to reject the null hypothesis of equal variances, and vice versa.

 What kind of dummy would ignore the Levene's test output when selecting which t-test to interpret?


It's not just software. Wikipedia states the following on the page for Levene’s test:

Levene’s test is often used before a comparison of means. When Levene’s test shows significance, one should switch to more generalized tests that is [sic] free from homoscedasticity assumptions (sometimes even non-parametric tests).

And if you were to google "how to conduct a two-sample t-test in R"? You might get pointed to a blog post that states:

Before proceeding with the t-test, it is necessary to evaluate the sample variances of the two groups, using a Fisher’s F-test to verify the homoskedasticity (homogeneity of variances).

Wow, pretty convincing! How could so many sources be wrong?

Tests of Homoscedasticity Break Down Too

The major problem with the conditional procedure is that the test for equality of variances (homoscedasticity) breaks down when variances are close, but not actually equal. It has trouble detecting this difference (statistically, this is called "low power") and then frequently selects the "equal variances" pooled t-test, which is the incorrect choice. The pooled t-test then produces Type I errors at a rate up to 50% higher than the specified alpha level.

Unbalanced sample sizes, where one sample is much larger than the other, compound this problem. The theory behind the pooled t-test only holds if (1) variances are equal, or (2) sample sizes are equal in a balanced experiment. If one of these conditions is exactly true, we can do without the other, but approximately true won't cut it.

If variances are slightly unequal and our sample sizes are very unbalanced, we get the worst-case scenario, where the Type I error for the conditional procedure can be 50% higher than the specified alpha, shown below. This figure compares Type I error for the conditional procedure (with Levene's test as the first step) to the unconditional procedure (just Welch's test) using results from a simulation study (10,000 simulations per data point). The unconditional procedure does well across the board. The conditional procedure is only okay when sample sizes are close to balanced.
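
If you want to poke at this yourself, here is a small Python sketch of this kind of simulation, assuming SciPy. It is not the code behind the figure: the function name `type1_error_rates`, the sample sizes (50 and 10), the standard deviations (1.0 and 1.5), and the 2,000 replications are all illustrative choices of mine. Both populations have the same mean, so every rejection is a Type I error.

```python
import numpy as np
from scipy import stats

def type1_error_rates(n1=50, n2=10, sd1=1.0, sd2=1.5,
                      alpha=0.05, n_sims=2000, seed=0):
    """Hypothetical helper: estimate Type I error for the conditional
    procedure (Levene's test, then pooled or Welch) and for Welch alone."""
    rng = np.random.default_rng(seed)
    reject_conditional = 0
    reject_welch = 0
    for _ in range(n_sims):
        # The null hypothesis is true: both population means are 0.
        x = rng.normal(0.0, sd1, n1)
        y = rng.normal(0.0, sd2, n2)
        p_pooled = stats.ttest_ind(x, y, equal_var=True).pvalue
        p_welch = stats.ttest_ind(x, y, equal_var=False).pvalue
        p_levene = stats.levene(x, y).pvalue
        # Conditional procedure: pooled if Levene fails to reject, else Welch.
        p_conditional = p_pooled if p_levene >= alpha else p_welch
        reject_conditional += p_conditional < alpha
        reject_welch += p_welch < alpha
    return reject_conditional / n_sims, reject_welch / n_sims

conditional_rate, welch_rate = type1_error_rates()
print("conditional:", conditional_rate)
print("Welch only: ", welch_rate)
```

In this setup the smaller sample comes from the population with the larger variance, which is the direction in which the pooled t-test rejects too often. Try moving `n2` toward `n1`, or `sd2` toward `sd1`, to see how the gap between the two rates changes.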

 Recall we want the Type I error to be 0.05 for a level alpha=0.05 test. Any deviations over 0.05 are bad news!


Next, we'll fix the sample size ratio at 0.2 and vary the variances. Here we see that the conditional procedure is okay when the variances are very different—when Levene's test has high power—but breaks down in the middle area where the variances are slightly different. The unconditional procedure is again fine across the board.



So we have seen that the conditional procedure has some pretty serious issues! The following Q&A addresses what we can take away from this.

Question: Are there any scenarios when we should use the conditional procedure?

Answer: No.

It turns out the argument that the conditional procedure has greater power does not hold water. While it does have greater power in some circumstances, it has higher rates of Type I error in those same circumstances! You can email me for those results or consult the literature.

In the borderline scenario where the variances are close, one should definitely not use the conditional procedure. Welch's test performs well in such scenarios.

Question: Why is the conditional procedure so prevalent?

Answer: At a high level, it's because the people recommending it don't understand the statistical theory of the tests.

I also think the following factors contribute: the appearance of the "unbiased statistician," the idea that math is mechanical, an engineering approach to statistics, and the idea that we can extract a deterministic knowledge from data (not subject to randomness).

Question: What lessons should we take away from this?

Answer: I learned that relying on tests to assess assumptions can lead to unintended consequences. The conditional procedure is perhaps akin to an overfitting problem where researchers may attempt to extract more information from the data than is available.

Lastly, mechanical procedures don't necessarily lead to better outcomes. It's best to use your own knowledge of the data, the data-generating process, and the context of the data to select methods.

Which is why we need statisticians!