I've recently wrapped up these snippets into an easy-to-use Python package you can import anywhere. That means no more cutting and pasting or modifying your IPython/Jupyter config files.
Check it out on github.com/HHammond/PrettyPandas and prettypandas.readthedocs.org.
I love IPython and Pandas, but using them to build reports requires lots of little tricks. Here I've compiled a few things that make building reports much nicer:
The code is also available as a gist here.
There are a few ways to update the CSS of the IPython Notebook. (You can follow my instructions here for a guide on how to format your notebook). If you don't already have a css theme which modifies your table formats, you can use just the following code to beautify your tables.
Open ~/.ipython/profile_default/static/custom/custom.css
and add the following lines:
/* Pretty Pandas Dataframes */
.dataframe * {border-color: #c0c0c0 !important;}
.dataframe th{background: #eee;}
.dataframe td{
background: #fff;
text-align: right;
min-width:5em;
}
/* Format summary rows */
.dataframe-summary-row tr:last-child,
.dataframe-summary-col td:last-child{
background: #eee;
font-weight: 500;
}
which restyles the plain default DataFrame rendering into a much cleaner table.
A common request I get when generating reports with IPython and Pandas is to have a subtotal or summary row at the bottom.
Google yielded lots of StackOverflow questions and some messy answers, so I ended up writing my own (which you can use however you want):
import numpy as np
import pandas as pd
from functools import partial
def summary(df, fn=np.sum, axis=0, name='Total',
table_class_prefix='dataframe-summary'):
"""Append a summary row or column to DataFrame.
Input:
------
df : Dataframe to be summarized
fn : Summary function applied over each column
axis : Axis to summarize on (1: by row, 0: by column)
name : Index or column label for summary
table_class_prefix : Custom css class for dataframe
Returns:
--------
Dataframe with applied summary.
"""
total = df.apply(fn, axis=axis).to_frame(name)
table_class = ""
if axis == 0:
total = total.T
table_class = "{}-row".format(table_class_prefix)
elif axis == 1:
table_class = "{}-col".format(table_class_prefix)
out = pd.concat([df, total], axis=axis)
# Patch to_html function to use custom css class
out.to_html = partial(out.to_html, classes=table_class)
return out
summary(df, axis=0)
This snippet also uses the above CSS so you don't need to edit anything to get table formatting. If you want to use your own CSS you just need to edit the .dataframe-summary-row tr:last-child
and .dataframe-summary-col td:last-child
selectors.
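To see the whole thing end to end, here's a tiny worked example (the data is made up, and the two lines after the DataFrame reproduce the core of `summary(df, axis=0)` inline so the snippet runs standalone):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Core of summary(df, axis=0): compute column sums, transpose the
# resulting frame into a row labelled 'Total', and append it
total = df.apply(np.sum, axis=0).to_frame('Total').T
out = pd.concat([df, total])
print(out)
```

With the CSS above, the appended `Total` row picks up the summary-row styling automatically.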
Another common request is for a column to be represented as percentages. Again, StackOverflow answers suggest setting the DataFrame's float format or other workarounds. Here I just use a format string and convert values to percentage strings (which breaks usability but looks pretty). This works with Python's standard numeric types and NumPy types.
from numbers import Number
def as_percent(v, precision='0.2'):
"""Convert number to percentage string."""
if isinstance(v, Number):
return "{{:{}%}}".format(precision).format(v)
else:
raise TypeError("Numeric type required")
>>> as_percent(0.5)
'50.00%'
df['Observation Proportion'] = df['Observation Proportion'].apply(as_percent)
df
I hope these snippets are as helpful to you as they have been to me! Feel free to contact me if you have any formatting tips that you want to add to this list.
The Dirichlet-Multinomial model is the generalization of the Beta-Binomial model to multiple classes of a categorical or multinomial distribution. We use it to do analysis on data from rates of multiple distinct outcomes.
The model has a wide scope of applications. It can be used to analyze how much profit different traffic sources bring into a website, or how on-sale items in a store affect total profit. The Dirichlet-Multinomial also applies to document classification as it can be used to model which topics or corpus a document belongs to.
Suppose we are doing a test with $K$ distinct categories of events that may be observed. The count of observations from each category, denoted $c_1,\dots,c_K$, form a multinomial distribution. The goal of the model is to estimate the distributions of the true proportions of observations, denoted $p_1,\dots,p_K$, from the multinomial distribution.
The conjugate prior for the multinomial distribution is the Dirichlet distribution, the application of which is the Dirichlet-multinomial model. The model estimates the posterior distribution of $p_1,\dots,p_K$ given our current data and prior beliefs. Our prior beliefs are encoded in the model through the hyperparameter $\alpha$, which represents pseudocounts of what we believe the data should look like – typically set as 1's for weak uniform beliefs. Then the following describes the distribution of our probability estimates:
$$
p_1,\dots,p_K \ | \ \alpha, c \sim \text{Dir}\left(\alpha + c\right)
$$
Further, since the marginals of the Dirichlet distribution are Betas, the distribution of each probability estimate is given by:
$$
p_i \ | \ \alpha, c \sim \text{Beta}\left(\beta_i, \sum_{k=1}^{K}{\beta_k} -\beta_i \right)
$$
where $\beta_i = \alpha_i+c_i$.
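Since each marginal is just a Beta, getting a credible interval for one category is a one-liner with SciPy (the counts below are made up for illustration):

```python
import numpy as np
from scipy import stats

counts = np.array([600., 150., 40.])   # hypothetical observed counts
alpha = np.ones_like(counts)           # weak uniform prior
beta = alpha + counts                  # posterior Dirichlet parameters

# Marginal for category 0 is Beta(beta_0, sum(beta) - beta_0)
marginal = stats.beta(beta[0], beta.sum() - beta[0])
lower, upper = marginal.ppf([0.025, 0.975])
print(lower, upper)   # 95% credible interval for p_0
```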
We can calculate the expected value of each of our parameters as:
$$
\text{E}[p_i \ | \ \alpha, c ] = \frac{\beta_i}{\sum_{k=1}^{K}{\beta_k}}
$$
where $\beta_i = \alpha_i+c_i$.
This is the posterior mean and is the estimate which minimizes mean square error of our posterior. For most models this is the estimate which is most useful.
The posterior mode, or maximum a posteriori estimate (MAP), of $\left(p_1,\dots,p_K \ | \ \alpha, c\right)$ is the vector $\left(x_1,\dots,x_K \right)$ where each $x_i$ is defined as:
$$
x_i = \frac{\beta_i-1}{\sum_{k=1}^{K}{\beta_k} - K}
$$
MAP estimates are best when a model requires us to be wrong the least often. This is in contrast to the posterior mean which minimizes mean squared error.
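Both estimates fall out of the counts directly. A quick sketch with NumPy (the counts are made up):

```python
import numpy as np

counts = np.array([600., 150., 40.])   # hypothetical observed counts
alpha = np.ones_like(counts)           # weak uniform prior
beta = alpha + counts

posterior_mean = beta / beta.sum()                      # minimizes MSE
posterior_mode = (beta - 1) / (beta.sum() - len(beta))  # MAP estimate
print(posterior_mean)
print(posterior_mode)
```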
A useful application of the model is to compare the sum of weighted categories, such as the distribution of expected profits from a categorical model.
Suppose we have a weighting function, $\Omega$, parameterized by $\omega_i(p)$, for each category. If we want to estimate the distribution of weighted values, $W$, we can use a Monte Carlo simulation to approximate the distribution of expected weighted values:
$$
W(P)=\text{E}[\Omega(P)] \approx \sum_{i=1}^{K}{\omega_i(p_i) \cdot p_i}
$$
where $P \sim \text{Dir}(\alpha+c)$.
In the case of linear weights we set $\omega_i(p)=w_i$ for fixed weight $w_i$.
We can use Monte Carlo to estimate the differences and means of our weighted distributions.
In Python this looks like the following:
import numpy
# Define weighting function
Omega = lambda row: sum([row[0] * 10, row[1] * 50, row[2] * 100 - 14])
observations = numpy.array([600.,150.,40.], dtype=float)
alphas = numpy.ones(observations.shape)
# Sample from Dirichlet posterior
samples = numpy.random.dirichlet(observations + alphas, 100000)
# apply sum to sample draws
W_samples = numpy.apply_along_axis(Omega, 1, samples)
#Compute P(W > 10)
(W_samples > 10).mean()
In Julia it looks like this:
using Distributions: Dirichlet
Omega(A,B,C) = sum([A * 10, B * 50, C * 100 - 14])
observations = [600.,150.,40.]
alpha = ones(length(observations))
posterior = Dirichlet(observations + alpha)
N = 100000
samples = rand(posterior, N)
W_samples = [Omega(samples[:,col]...) for col=1:N]
#Compute P(W > 10)
mean(map((x)-> x > 10, W_samples))
KCBO is a toolkit for anyone who wants to do Bayesian data analysis without worrying about the implementation details for a certain test.
The Bayesian school of statistics offers a lot of incredible benefits. Bayesian analysis yields rich, meaningful results and provides intuitive answers that should be available to anyone.
Currently KCBO is very much alpha/pre-alpha software and only implements three tests. Here is a list of future objectives for the project. I have high hopes that soon KCBO will implement all the usual tests and that it will grow into something very powerful for analysts.
You can check out the Github repo here. Feature requests, issues, criticisms and contributions are welcome.
KCBO is available through PyPI and on Github here.
pip install kcbo
git clone https://github.com/HHammond/kcbo
cd kcbo
python setup.py sdist install
If any of this fails, you may need to install numpy (pip install numpy
) in order to install some dependencies of KCBO, then retry installing it.
There are currently three tests implemented in the KCBO library:
Lognormal-Difference of medians: used to compare medians of log-normal distributed data with the same variance. The one caveat is that, since KCBO uses a conjugate prior for the lognormal, it assumes that all data has the same variance.
Bayesian t-Test: an implementation of Kruschke's t-Test used to compare differences in mean and standard deviation for normally distributed samples.
Conversion Test or Beta-Binomial difference test: test of conversion to success using the Beta-Binomial model. Popular in A/B testing and estimation.
Documentation for these tests will be available on Github soon.
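To get a feel for what the conversion test is doing, here's a hand-rolled Beta-Binomial comparison by Monte Carlo. The group names, counts, and the uniform Beta(1, 1) prior are all illustrative assumptions, not KCBO's internals:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical successes and trials for two groups
successes = {'A': 5000, 'B': 5150}
trials = {'A': 10000, 'B': 10000}

# Sample each group's Beta(1 + successes, 1 + failures) posterior
samples = {
    g: rng.beta(1 + successes[g], 1 + trials[g] - successes[g], 100_000)
    for g in successes
}

# Probability that B's conversion rate beats A's
p_b_better = (samples['B'] > samples['A']).mean()
print(p_b_better)
```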
from kcbo import lognormal_comparison_test, t_test, conversion_test
Because this test uses an MC simulation on the lognormal distribution conjugate prior, it assumes that both distributions have the same variance.
df = pd.DataFrame([A,B])
Run the test:
summary, data = conversion_test(df, groupcol='group',successcol='successes',totalcol='trials')
Beta-Binomial Conversion Rate Test

Groups: A, B

Estimates:

| Group | Estimate | 95% CI Lower | 95% CI Upper |
|:------|---------:|-------------:|-------------:|
| A     | 0.49998  | 0.490183     | 0.509817     |
| B     | 0.511237 | 0.500258     | 0.522202     |

Comparisons:

| Hypothesis | Difference | P.Value | 95% CI Lower | 95% CI Upper |
|:-----------|-----------:|--------:|-------------:|-------------:|
| A < B      | 0.0112434  | 0.93361 | -0.0033844   | 0.0259017    |
And an example of what you can do with the raw data returned:
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
A = data['A']['distribution']
B = data['B']['distribution']
diff = data[('A','B')]['distribution']
f, axes = plt.subplots(1,2, figsize=(12, 7))
sns.set(style="white", palette="muted")
sns.despine(left=True)
sns.distplot(A, ax=axes[0], label='Density Estimate for A')
sns.distplot(B, ax=axes[0], label='Density Estimate for B')
sns.distplot(diff, ax=axes[1], label='Difference of Densities (B-A)')
axes[0].legend()
axes[1].legend()
plt.show()
A more complete API will be coming to the github repo soon.
I'm looking forward to any feedback and criticisms :)
See the IPython NBViewer version of this post
IPython is an awesome tool, but I've found that figuring out how to customize the look and feel of the notebook isn't easy. My goal here is to demonstrate how you can customize the look of your notebooks using a single .css
file and use that theme wherever your notebooks go.
First, to make sure that the required CSS files are created, follow the instructions here.
The interactive notebook has a CSS override file hidden inside of your IPython profile (if you don't know how to locate it, run ipython locate
) located at:
IPYTHONDIR/profile_default/static/custom/custom.css
or on unix-based systems:
~/.ipython/profile_default/static/custom/custom.css
Anything inside this file gets priority over the default IPython theme and any CSS modifications made in this tutorial are saved inside of that file.
There's a great tutorial on updating the IPython typography here.
I used this tutorial in creating my own CSS theme with a few tweaks (mostly fonts and muting a few interface elements). Here's the code that I used for styling the typography of my own notebook:
@import url('http://fonts.googleapis.com/css?family=Crimson+Text');
@import url('http://fonts.googleapis.com/css?family=Kameron');
@import url('http://fonts.googleapis.com/css?family=Lato:200,300,400');
@import url('http://fonts.googleapis.com/css?family=Source+Code+Pro');
/* Change code font */
.CodeMirror pre {
font-family: 'Source Code Pro', Consolas, monocco, monospace;
}
div.input_area {
border-color: rgba(0,0,0,0.10);
background: rgba(0,0,0,0.05);
}
div.text_cell {
max-width: 105ex; /* instead of 100%, */
}
div.text_cell_render {
font-family: "Crimson Text";
font-size: 12pt;
line-height: 145%; /* added for some line spacing of text. */
}
div.text_cell_render h1,
div.text_cell_render h2,
div.text_cell_render h3,
div.text_cell_render h4,
div.text_cell_render h5,
div.text_cell_render h6 {
font-family: 'Kameron';
font-weight: 300;
}
div.text_cell_render h1 {
font-size: 24pt;
}
div.text_cell_render h2 {
font-size: 18pt;
}
div.text_cell_render h3 {
font-size: 14pt;
}
.rendered_html pre,
.rendered_html code {
font-size: medium;
}
.rendered_html ol {
list-style:decimal;
margin: 1em 2em;
}
Next, I wanted to flatten out the look of the interface. Knowing that IPython uses Bootstrap makes modifying things easier, and the rest can be done just using the Google Chrome developer tools.
Here are the tweaks that I thought were important for making my notebook mine:
Here's the remainder of my customization code:
.prompt.input_prompt {
color: rgba(0,0,0,0.5);
}
.cell.command_mode.selected {
border-color: rgba(0,0,0,0.1);
}
.cell.edit_mode.selected {
border-color: rgba(0,0,0,0.15);
box-shadow: 0px 0px 5px #f0f0f0;
-webkit-box-shadow: 0px 0px 5px #f0f0f0;
}
div.output_scroll {
-webkit-box-shadow: inset 0 2px 8px rgba(0,0,0,0.1);
box-shadow: inset 0 2px 8px rgba(0,0,0,0.1);
border-radius: 2px;
}
#menubar .navbar-inner {
background: #fff;
-webkit-box-shadow: none;
box-shadow: none;
border-radius: 0;
border: none;
font-family: lato;
font-weight: 400;
}
.navbar-fixed-top .navbar-inner,
.navbar-static-top .navbar-inner {
box-shadow: none;
-webkit-box-shadow: none;
border: none;
}
div#notebook_panel {
box-shadow: none;
-webkit-box-shadow: none;
border-top: none;
}
div#notebook {
border-top: 1px solid rgba(0,0,0,0.15);
}
#menubar .navbar .navbar-inner,
.toolbar-inner {
padding-left: 0;
padding-right: 0;
}
#checkpoint_status,
#autosave_status {
color: rgba(0,0,0,0.5);
}
#header {
font-family: lato;
}
#notebook_name {
font-weight: 200;
}
/*
This is a lazy fix, we *should* fix the
background for each Bootstrap button type
*/
#site * .btn {
background: #fafafa;
-webkit-box-shadow: none;
box-shadow: none;
}
In NBViewer (and notebooks) I haven't yet found an elegant way to customize CSS. There is a popular but ugly/dangerous way which involves injecting custom CSS into each cell. Even the solution I'm posting here is poor for the reasons given here; it's likely to break in future versions and is tedious to insert.
My current working solution was borrowed from Cam DP and his book Bayesian Methods for Hackers. The idea is that if there is a <style>
tag in IPython, the notebook renders it and NBViewer also renders it. I modified Cam's code to use the default IPython profile's custom.css
file.
Just run the following code snippet in the bottom of any IPython notebook:
from IPython import utils
from IPython.core.display import HTML
import os
def css_styling():
"""Load default custom.css file from ipython profile"""
base = utils.path.get_ipython_dir()
styles = "<style>\n%s\n</style>" % (open(os.path.join(base,'profile_default/static/custom/custom.css'),'r').read())
return HTML(styles)
css_styling()
There is one curious feature in the code for NBViewer. This code actually checks the notebook metadata (which is nice and clean) and looks for a CSS theme (a specified CSS file, but only in the server's /static/css/theme/
directory). The unfortunate part of this is that you can only use Cam's theme from Bayesian Methods or one other linear algebra notebook theme (cdp_1.css
or css_linalg.css
respectively, and you need to remove the .css
bit when specifying a theme).
Making NBConvert do what you want is one of the trickiest bits about theming IPython. Again, the above code snippet makes NBConvert do exactly what we wanted. You can also specify a custom.css
file in the directory that you export your notebooks in and that file will automatically overwrite default styles.
Creating an NBConvert theme is a little more involved. NBConvert uses Jinja2 for templates. I'm not going to get into creating NBConvert templates right now because I think the defaults are pretty good for most cases. Minrk has some great examples of how to create a template here. When creating a template you don't always need the config file each time, just the .tpl
file and specifying the output format.
The command needed to generate from a custom NBConvert is (from the directory containing custom_template):
ipython nbconvert --template custom_template.tpl <notebook>
followed by:
ln -s ~/.ipython/profile_default/static/custom/custom.css ./custom.css
This will convert your notebook and create a symbolic link (again, unix-based only) so that your nbconverted notebooks will change when you modify your profile's custom.css
file.
In January I started my internship as a Data Analyst at Shopify. In the first few weeks of working I was tasked with my first Bayesian adventure: to implement the t-test in the context of the Bayesian framework.
First, here's a simple version of Bayes Theorem:
$$
P(A|B) = \frac{P(B|A)\cdot P(A)}{P(B)}
$$
where $P(A|B)$ is the posterior, $P(B|A)$ is the likelihood, $P(A)$ is the prior, and $P(B)$ is the evidence.
But where does everything come together? Where do I go to get these probabilities in the real world? How does all this Bayesian stuff even work?
With some suggestions, I followed the ideas of John K. Kruschke of Indiana University's model outlined in the paper Bayesian estimation supersedes the t test (or the BEST paper). I think Kruschke's paper does an excellent job motivating the test and setting out how to do it.
What we want to do is for two given populations, estimate the means ($\mu_1$ and $\mu_2$), variances ($\sigma_1$ and $\sigma_2$), and normality ($\nu$) of the populations. Mathematically we want to find the posterior distribution, $P(\mu_1, \sigma_1, \mu_2, \sigma_2, \nu | D)$.
Invoking Bayes Theorem we know:
$$
\begin{equation}
\begin{split}
P(\mu_1, \sigma_1, \mu_2, \sigma_2, \nu | D) & = \frac{Likelihood \times Prior}{ Evidence }
\\ & = \frac{P(D | \mu_1, \sigma_1, \mu_2, \sigma_2, \nu) \cdot P(\mu_1, \sigma_1, \mu_2, \sigma_2, \nu)}{ P(D) }
\end{split}
\end{equation}
$$
where $D$ is the data we have observed. With the posterior distribution we can compute the probability of $\mu_1 > \mu_2$ and other measures of interest like effect size (a normalized measure of difference in populations) and the differences in variance of the groups.
This looks like it could get pretty ugly. Luckily, there are some great tools on our side.
I've spent a lot of time reading my textbooks and then reading papers trying to figure out a nice way to do all this Bayesian stuff above. What you end up with is this complicated sum of integrals and complicated distribution functions. Fortunately, Kruschke suggests using Markov chain Monte Carlo (MCMC), a method of random sampling and moving data between different states of our model. For this we used the Python library PyMC to do the heavy work.
MCMC is a method with the goal of generating an accurate representation of the posterior distribution of a model. Monte Carlo methods are ways of using random number generators and deterministic computations to solve mathematically complicated equations. MCMC then is the process of creating a Markov chain that has an equilibrium that is the same as the distribution we wish to study.
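To demystify MCMC a little, here is a toy random-walk Metropolis sampler for the mean of a normal with known variance. Everything here (the data, the flat prior, the step size) is an illustrative assumption and not what PyMC does internally:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(5.0, 1.0, 200)   # fake observations, true mean 5

def log_post(mu):
    # Flat prior on mu, so the log posterior is just the
    # log likelihood of N(mu, 1) data (up to a constant)
    return -0.5 * np.sum((data - mu) ** 2)

mu, trace = 0.0, []
for _ in range(5000):
    proposal = mu + rng.normal(0, 0.5)   # random-walk proposal
    # Accept with probability min(1, posterior ratio)
    if np.log(rng.uniform()) < log_post(proposal) - log_post(mu):
        mu = proposal
    trace.append(mu)

# After burn-in the chain samples the posterior around the data mean
print(np.mean(trace[1000:]))
```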
PyMC isn't the friendliest library when you aren't familiar with it, so I also consulted with Bayesian Methods for Hackers by Cam Davidson-Pilon. Chapters 1 and 2 should be enough to understand what comes next.
If you aren't using IPython at this point, I strongly recommend you get it because it's a fantastic tool. This isn't a tutorial on IPython, so I'll skip right to the fun stuff.
If you don't have data that you want to test (the Bayesian t is really nice for comparing population data), you can generate some using numpy:
import pymc as pm
import numpy as np
import pandas as pd
# Generate some random normal data
group1 = np.random.normal(15,2,1000)
group2 = np.random.normal(15.7,2,1000)
# Generate Pooled Data
pooled = np.concatenate((group1,group2))
Simply adjust the code above to create new samples to test. Here I want to test two distributions that are different, but might be similar enough to get picked up by the traditional t-test as being insignificantly different.
PyMC works by specifying the model you want to work with, and then you run MCMC on that model with your observed data fed into it. It's not overly complicated when you know your priors (if you don't, John D. Cook has a great page about them).
The BEST paper outlines some good starting values for our prior distributions. The general idea is that we don't want to weight our priors too heavily or else we might introduce bias into the test, but we also don't want fixed values because the data we are using could be extremely large or small. Luckily some nice starting values have been given.
Here's how I built our model:
# Setup our priors
mu1 = pm.Normal("mu_1",mu=pooled.mean(), tau=1.0/pooled.var()/1000.0)
mu2 = pm.Normal("mu_2",mu=pooled.mean(), tau=1.0/pooled.var()/1000.0)
sig1 = pm.Uniform("sigma_1",lower=pooled.var()/1000.0,upper=pooled.var()*1000)
sig2 = pm.Uniform("sigma_2",lower=pooled.var()/1000.0,upper=pooled.var()*1000)
v = pm.Exponential("nu",beta=1.0/29)
Now we want to set up our posterior distributions. Since this is where the data we already have comes in, they're a little different from the priors. We inform these distributions with our priors and our given observations, and this will be used by the MCMC.
# Include our observed data into the model
t1 = pm.NoncentralT("t_1",mu=mu1, lam=1.0/sig1, nu=v, value=group1[:], observed=True)
t2 = pm.NoncentralT("t_2",mu=mu2, lam=1.0/sig2, nu=v, value=group2[:], observed=True)
We've finished constructing our model, now we just have to put it together and run MCMC on it. There are several different algorithms for MCMC, but PyMC's default algorithm works perfectly for this application.
# Push our priors into a model
model = pm.Model( [t1, mu1, sig1, t2, mu2, sig2, v] )
# Generate our MCMC object
mcmc = pm.MCMC(model)
Now we run the MCMC to generate our resulting posterior distributions. When running MCMC, longer chain lengths result in more precise outcomes. Unfortunately we also have to worry about autocorrelation, a clumping up of results that will bias our outcome. So here, sampling 40,000 times, we throw out (burn) the first 10,000 results and keep every 2nd result (optionally we sometimes keep every k'th result, but this really depends on the tradeoff you wish to make between computing time and precision). This means that our sampling might take a long time.
# Run MCMC sampler
mcmc.sample(40000,10000,2)
When testing, one might want to check whether or not the sampler converged to some value. PyMC lets us plot a trace for each parameter and examine convergence.
from pymc.Matplot import plot as mcplot
mcplot(mcmc)
The output of this should look more or less like the following:
Looking at the lower-left panel you can see the histogram is relatively flat. This is what we are looking for when checking convergence. If you find that your MCMC isn't converging you can trade additional running time for lower autocorrelation by upping the number of samples, burns and changing the iteration rate of your sampler. It isn't necessary to check convergence every time you run the test, but when debugging this is a powerful tool, especially when you think you're getting questionable results.
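Alongside the visual check, you can quantify autocorrelation directly with NumPy. The trace below is a simulated stand-in for a real `mcmc.trace(...)` array; for a well-mixed chain these lagged correlations should be near zero:

```python
import numpy as np

rng = np.random.default_rng(2)
trace = rng.normal(size=10_000)   # stand-in for a well-mixed trace

def autocorr(x, lag):
    # Sample autocorrelation at the given lag
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

print([round(autocorr(trace, k), 3) for k in (1, 5, 10)])
```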
So now that we've got a working model that converges nicely, we want to actually get the data out of it. The reason that we set string representations of each object in our model was so that we can pull out the results from our sampler like this:
mus1 = mcmc.trace('mu_1')[:]
mus2 = mcmc.trace('mu_2')[:]
sigmas1 = mcmc.trace('sigma_1')[:]
sigmas2 = mcmc.trace('sigma_2')[:]
nus = mcmc.trace('nu')[:]
The results from MCMC aren't a singular value, but rather a collection of numbers. This is because we are sampling the distribution of our parameters, not computing them exactly. The result is that we have a measure of confidence on where we think a parameter should be. This allows us to do things like compute the distributions of parameters and the distributions of their differences.
The most interesting metrics for this test are the following:
diff_mus = mus1-mus2
diff_sigmas = sigmas1-sigmas2
normality = np.log(nus)
effect_size = (mus1-mus2)/np.sqrt((sigmas1**2+sigmas2**2)/2.)
We can calculate estimates of parameters using the expectation of their posteriors. So for estimating our means we can just take the means of our traces above:
print "mu_1", mus1.mean()
print "mu_2", mus2.mean()
yields:
mu_1 14.9868612418
mu_2 15.6532483445
which is very close to the actual values of those parameters. For most estimators, we can simply take the mean of our sampler distribution.
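The same traces give credible intervals almost for free. The array below is a simulated stand-in for `mcmc.trace('mu_1')[:]`:

```python
import numpy as np

rng = np.random.default_rng(0)
mus1 = rng.normal(15.0, 0.06, 15_000)   # stand-in for a posterior trace

# Central 95% credible interval straight from the samples
lower, upper = np.percentile(mus1, [2.5, 97.5])
print(lower, upper)
```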
How confident are we in these values? We can plot the kernel densities of our parameters to see exactly what the probability of a given value is. In IPython we just do the following:
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
plt.figure(figsize=(16, 9))
# prepare plotting area to fit both graphs
minx = min(min(mus1),min(mus2))
maxx = max(max(mus1),max(mus2))
# x values to plot on
xs = np.linspace(minx,maxx,1000)
# generate density estimates
gkde1 = gaussian_kde(mus1)
gkde2 = gaussian_kde(mus2)
# draw plots
plt.plot(xs,gkde1(xs),label='$\mu_1$')
plt.plot(xs,gkde2(xs),label='$\mu_2$')
plt.legend()
plt.show()
We have our estimates and we have distributions for them, let's see what questions we can answer and how certain we are about our answers.
In statistics the common way to go about answering a question is hypothesis testing, and since we're working in the Bayesian framework, we will be doing Bayesian Testing.
Suppose we want to answer the question "do these groups have different means?" We can test $H_0: \mu_1 \geq \mu_2$ against $H_1: \mu_2 > \mu_1$. To do this we compute $P(\mu_1-\mu_2 > 0 \ | \ D)$, which is easily done because we've already computed the distribution of $\mu_1-\mu_2$. The probability from our samples that $H_0$ is true is generated by:
p = (diff_mus > 0).mean()
Similarly we do the complement of this to test whether $H_1$ is true. In this case, $H_1$ has a higher probability.
We can visualize this probability through the following:
plt.subplot(121)
minx = min(min(mus1),min(mus2))
maxx = max(max(mus1),max(mus2))
xs = np.linspace(minx,maxx,1000)
gkde1 = gaussian_kde(mus1)
gkde2 = gaussian_kde(mus2)
plt.plot(xs,gkde1(xs),label='$\mu_1$')
plt.plot(xs,gkde2(xs),label='$\mu_2$')
plt.legend()
plt.subplot(122)
minx = min(diff_mus)
maxx = max(diff_mus)
xs = np.linspace(minx,maxx,1000)
gkde = gaussian_kde(diff_mus)
plt.plot(xs,gkde(xs),label='$\mu_1-\mu_2$')
plt.legend()
plt.axvline(0, color='#000000',alpha=0.3,linestyle='--')
plt.show()
Since this is a Bayesian procedure, what we have in fact tested is:
$$
P(H_0| D) \text{ and } P(H_1 | D)
$$
We accept the hypothesis with the highest probability associated with it (here, $H_1$). As a side note, we can test any number of hypotheses together and simply accept the one with the greatest probability, which can save a lot of time and work when we want to compare lots of different groups.
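Counting over the same traces handles several hypotheses at once. The traces below are simulated stand-ins for three groups' posterior means:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated posterior traces for three group means
mu_a = rng.normal(15.0, 0.06, 20_000)
mu_b = rng.normal(15.7, 0.06, 20_000)
mu_c = rng.normal(15.4, 0.06, 20_000)

# P(group g has the largest mean) for each group
stacked = np.vstack([mu_a, mu_b, mu_c])
winners = stacked.argmax(axis=0)
probs = {g: (winners == i).mean() for i, g in enumerate('ABC')}
print(probs)   # accept the hypothesis with the highest probability
```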
There are lots of other tests you can easily do using this framework. Hopefully this has shed some light on how to get into Bayesian testing and models.