The Bayesian t-Test

In January I started my internship as a Data Analyst at Shopify. In the first few weeks of working I was tasked with my first Bayesian adventure: to implement the t-test in the context of the Bayesian framework.

First, here's a simple version of Bayes Theorem:

P(A|B) = \frac{P(B|A)\cdot P(A)}{P(B)}


  • $P(A)$ is the prior, our initial idea of what A should be.
  • $P(B)$ is the evidence
  • $P(A|B)$ is the posterior probability - what our belief of $A$ is given $B$.
  • $P(B|A)$ is the likelihood.

But where does everything come together? Where do I go to get these probabilities in the real world? How does all this Bayesian stuff even work?

Understanding the Problem

With some suggestions, I followed the ideas of John K. Kruschke of Indiana University's model outlined in the paper Bayesian estimation supersedes the t test (or the BEST paper). I think Kruschke's paper does an excellent job motivating the test and setting out how to do it.

What we want to do is for two given populations, estimate the means ($\mu_1$ and $\mu_2$), variances ($\sigma_1$ and $\sigma_2$), and normality ($\nu$) of the populations. Mathematically we want to find the posterior distribution, $P(\mu_1, \sigma_1, \mu_2, \sigma_2, \nu | D)$.

Invoking Bayes Theorem we know:

P(\mu_1, \sigma_1, \mu_2, \sigma_2, \nu | D) & = \frac{Likelihood \times Prior}{ Evidence }
\\ & = \frac{P(D | \mu_1, \sigma_1, \mu_2, \sigma_2, \nu) \cdot P(\mu_1, \sigma_1, \mu_2, \sigma_2, \nu)}{ P(D) }

where $D$ is the data we have observed. With the posterior distribution we can compute the probability of $\mu_1 > \mu_2$ and other measures of interest like effect size (a normalized measure of difference in populations) and the differences in variance of the groups.

This looks like it could get pretty ugly. Luckily, there are some great tools on our side.

The Mathematical Tools for the Job

I've spent a lot of time reading my textbooks and then reading papers trying to figure out a nice way to do all this Bayesian stuff above. What you end up with is this complicated sum of integrals and complicated distribution functions. Fortunately, Kruschke suggests using Markov chain Monte Carlo (MCMC), a method of random sampling and moving data between different states of our model. For this we used the Python library PyMC to do the heavy work.

MCMC is a method with the goal of generating an accurate representation of the posterior distribution of a model. Monte Carlo methods are ways of using random number generators and deterministic computations to solve mathematically complicated equations. MCMC then is the process of creating a Markov chain that has an equilibrium that is the same as the distribution we wish to study.

PyMC isn't the friendliest library when you aren't familiar with it, so I also consulted with Bayesian Methods for Hackers by Cam Davidson-Pilon. Chapters 1 and 2 should be enough to understand what comes next.

Building the t-Test

Preparing for the test

If you aren't using IPython at this point, I strongly recommend you get it because it's a fantastic tool. This isn't a tutorial on IPython, so I'll skip right to the fun stuff.

If you don't have data that you want to test (the Bayesian t is really nice for comparing population data), you can generate some using numpy:

import pymc as pm
import numpy as np
import pandas as pd

# Generate some random normal data
group1 = np.random.normal(15,2,1000)
group2 = np.random.normal(15.7,2,1000)

# Generate Pooled Data
pooled = np.concatenate((group1,group2))

Simply adjust the code above to create new samples to test. Here I want to test two distributions that are different, but might be similar enough to get picked up by the tradition t-test as being insignificantly different.


PyMC works by specifying the model you want to work with, and then you run MCMC on that model with your observed data fed into it. It's not overly complicated when you know your priors (if you don't, John D. Cook has a great page about them).

The BEST paper outlines some good starting values for our prior distributions. The general idea is that we don't want to weight our priors too heavily or else we might introduce bias into the test, but we also don't want fixed values because the data we are using could be extremely large or small. Luckily some nice starting values have been given.

Here's how I built our model:

# Setup our priors
mu1 = pm.Normal("mu_1",mu=pooled.mean(), tau=1.0/pooled.var()/1000.0)
mu2 = pm.Normal("mu_2",mu=pooled.mean(), tau=1.0/pooled.var()/1000.0)

sig1 = pm.Uniform("sigma_1",lower=pooled.var()/1000.0,upper=pooled.var()*1000)
sig2 = pm.Uniform("sigma_2",lower=pooled.var()/1000.0,upper=pooled.var()*1000)

v = pm.Exponential("nu",beta=1.0/29)

Now we want to setup our posterior distributions. Since this is where the data we already have comes in, they're a little bit different than the priors. We inform these distributions with our priors and our given observations and this will be used by the MCMC.

# Include our observed data into the model
t1 = pm.NoncentralT("t_1",mu=mu1, lam=1.0/sig1, nu=v, value=group1[:], observed=True)
t2 = pm.NoncentralT("t_2",mu=mu2, lam=1.0/sig2, nu=v, value=group2[:], observed=True)

Generating the Results

We've finished constructing our model, now we just have to put it together and run MCMC on it. There are several different algorithms for MCMC, but PyMC's default algorithm works perfectly for this application.

# Push our priors into a model
model = pm.Model( [t1, mu1, sig1, t2, mu2, sig2, v] )

# Generate our MCMC object
mcmc = pm.MCMC(model)

Now we run the MCMC to generate our resulting posterior distributions. When running MCMC longer chain lengths result in more precise outcomes. Unfortunately we also have to worry about autocorrelation, which is clumping up of results that will bias our outcome. So here sampling 40000 times, we throw out (burn) the first 10000 results and keep every 2nd result (optionally we sometimes choose every k'th result, but this really depends on the tradeoff you wish to make between computing time and precision). This means that our sampling might take a long time.

# Run MCMC sampler

Checking for convergence

When testing one might want to test whether or not the sampler converged to some value. PyMC lets us plot a trace for each parameter and examine convergence.

from pymc.Matplot import plot as mcplot

The output of this should look more or less like the following:


Looking at the lower-left panel you can see the histogram is relatively flat. This is what we are looking for when checking convergence. If you find that your MCMC isn't converging you can trade additional running time for lower autocorrelation by upping the number of samples, burns and changing the iteration rate of your sampler. It isn't necessary to check convergence every time you run the test, but when debugging this is a powerful tool, especially when you think you're getting questionable results.

Getting the Results

So now that we've got a working model that converges nicely, we want to actually get the data out of it. The reason that we set string representations of each object in our model was so that we can pull out the results from our sampler like this:

mus1 = mcmc.trace('mu_1')[:]
mus2 = mcmc.trace('mu_2')[:]
sigmas1 = mcmc.trace('sigma_1')[:]
sigmas2 = mcmc.trace('sigma_2')[:]
nus = mcmc.trace('nu')[:]

The results from MCMC aren't a singular value, but rather a collection of numbers. This is because we are sampling the distribution of our parameters, not computing them exactly. The result is that we have a measure of confidence on where we think a parameter should be. This allows us to do things like compute the distributions of parameters and the distributions of their differences.

The most interesting metrics for this test are the following:

diff_mus = mus1-mus2
diff_sigmas = sigmas1-sigmas2
normality = np.log(nus)
effect_size = (mus1-mus2)/np.sqrt((sigmas1**2+sigmas2**2)/2.)

Estimates of $\mu_1$ and $\mu_2$

We can calculate estimates of parameters using the expectation of their posteriors. So for estimating our means we can just take the means of our traces above:

print "mu_1", mus1.mean()
print "mu_2", mus2.mean()


mu_1 14.9868612418
mu_2 15.6532483445

which is very close to the actual values of those parameters. For most estimators, we can simply take the mean of our sampler distribution.

How confidence are we in these values? We can plot the kernel densities of our parameters to see exactly what the probability of a given value is. In IPython we just do the following:

from scipy.stats import gaussian_kde

# prepare plotting area to fit both graphs
minx = min(min(mus1),min(mus2))
maxx = max(max(mus1),max(mus2))
# x values to plot on
xs = np.linspace(minx,maxx,1000)

# generate density estimates
gkde1 = gaussian_kde(mus1)
gkde2 = gaussian_kde(mus2)

# draw plots

Distribution Plot

Bayesian Hypothesis Testing and p-Values

We have our estimates and we have distributions for them, let's see what questions we can answer and how certain we are about our answers.

In statistics the common way to go about answering a question is hypothesis testing, and since we're working in the Bayesian framework, we will be doing Bayesian Testing.

Suppose we want to answer the question "do these groups have a different mean"? We test the hypothesis $H_0: \mu_1=\mu_2$ vs $H_1: \mu_1 > \mu_2$. To do this we are computing $P(\mu_1-\mu_2 > 0 |D)$, which is easily computed because we've already computed the distribution of $\mu_1-\mu_2$. All we have left to do is compute the probability from our samples that $H_0$ is true, i.e. our 'p-value' is generated by:

p = (diff_mus > 0).mean()

Similarly we do the complement of this to test whether $H_1$ is true. In this case, $H_1$ has a higher probability.

We can visualize this probability through the following:


minx = min(min(mus1),min(mus2))
maxx = max(max(mus1),max(mus2))
xs = np.linspace(minx,maxx,1000)

gkde1 = gaussian_kde(mus1)
gkde2 = gaussian_kde(mus2)



minx = min(diff_mus)
maxx = max(diff_mus)
xs = np.linspace(minx,maxx,1000)
gkde = gaussian_kde(diff_mus)


plt.axvline(0, color='#000000',alpha=0.3,linestyle='--')

Since this is a Bayesian procedure, what we have in fact tested is:

P(H_0| D) \text{ and } P(H_1 | D)

We accept the hypothesis with the highest probability associated with it (here, $H_1$). As a side note, we can test any number of hypothesis together and simply accept the one with the greatest probability, which can save a lot of time and work when we want to compare lots of different groups.

There are lots of other tests you can easily do using this framework. Hopefully this has shed some light on how to get into Bayesian testing and models.