In these exercises, you’ll build a Bayesian model from scratch. The path there will twist and turn, but you’ll get there in the end! :)
You are an analyst at a company selling a zombieism cure online. You know that a previous analyst analyzed an ad campaign, but his code is a mess! Let’s start over.
You’ve been tinkering with a model that can simulate how many people will click on an ad, given the number of ads shown (n_ads_shown) and the underlying proportion of people who would click (proportion_clicks). Here it is:
# A generative model for the number of clicks our
# zombie cure ad will generate
# Parameters
proportion_clicks <- 0.1
n_ads_shown <- 100
# Simulating data
clicked_on_ad <- c()
for(nth_ad_shown in 1:n_ads_shown) {
  clicked_on_ad[nth_ad_shown] <- runif(1, min = 0, max = 1) < proportion_clicks
}
n_clicks <- sum(clicked_on_ad)
n_clicks
>> Run the generative model above a couple of times and take a look at the output. <<
It turns out that this generative model already has a name: it’s called the binomial distribution. In R you can use the rbinom function to simulate data from a binomial distribution. The rbinom function takes three arguments:
n: The number of times you want to run the generative model.
size: The number of trials. (For example, the number of zombies you’re giving the drug.)
prob: The underlying proportion of success, as a number between 0.0 and 1.0.
>> Replicate the result from task 1A using the rbinom function: Simulate one count of the number of clicked ads out of 100 viewed, where the underlying proportion of success is 10%. <<
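To illustrate how the three arguments fit together, here is a small sketch using made-up coin-flip numbers (not the answer to the exercise above):

```r
# Run the generative model 3 times, with 10 trials per run,
# where each trial succeeds with probability 0.5 (a fair coin).
# The result is a vector of 3 counts, each between 0 and 10.
n_heads <- rbinom(n = 3, size = 10, prob = 0.5)
n_heads
```

Each element of `n_heads` is one simulated count of successes out of `size` trials.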
A nice thing with rbinom is that it makes it easy to rerun the generative model many times.
>> Change the code from 1B to run the simulation 1000 times, rather than just once. Plot the resulting vector as a histogram (say, using hist()). <<
To get more visitors to your website you are considering paying for an ad to be shown 100 times on a popular social media site. According to the social media site, their ads get clicked on 10% of the time.
Assume that 10% is a reasonable number, and assume that the binomial distribution is a reasonable generative model for how people click on ads.
>> Use the rbinom function to generate a large sample that represents the probability distribution over the number of visitors your site is going to get. Visualize this probability distribution using hist(). <<
Below is a code scaffold to get you started. It’s useful to start out with this as we will refer to these variable names going forward.
n_samples <- 100000
n_ads_shown <- ...
proportion_clicks <- ...
n_visitors <- rbinom(...)
You would like the ad campaign to result in at least 5 ad clicks and subsequent visits to your site.
>> Calculate the probability that you’ll get 5 or more clicks given the model you just used. <<
Tip: Sum up how many samples in n_visitors are greater than or equal to 5 and divide by the total number of samples in n_visitors.
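In R, summing a logical vector counts the TRUEs, so the tip above turns into a one-liner. A sketch with made-up sample values (not the actual n_visitors vector):

```r
# Made-up samples, just to show the idiom
samples <- c(3, 7, 5, 2, 9)
# Proportion of samples that are >= 5: number of TRUEs / total count
sum(samples >= 5) / length(samples)  # 0.6, same as mean(samples >= 5)
```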
You’re not so sure that your ad will get clicked on exactly 10% of the time. Instead of assigning proportion_clicks a single value, you are now going to assign it a probability distribution representing (some of) this uncertainty. Here, this means assigning to it a large number of values drawn from a probability distribution.
Assume that proportion_clicks is equally likely to be anywhere from as low as 0% to as high as 20%. These assumptions translate into a uniform distribution, which you can sample from in R like this:
x <- runif(n = n_samples, min = 0.0, max = 0.2)
>> Replace the single value that is assigned to proportion_clicks with n_samples samples produced by runif as above. Run the model again and take a look at the distribution of n_visitors. <<
The distribution of n_visitors looks very different compared to before. With the added uncertainty in proportion_clicks, the uncertainty over how many visitors we’ll get has also increased.
>> Are we now more or less certain that we’ll get 5 or more clicks? Again, calculate the probability that you’ll get 5 or more clicks given the model you just used. <<
You now have a generative model plus prior uncertainty, that is, a Bayesian model, and you are ready to do some Bayesian inference. You just need some data. Here you go:
9
When the ad was shown 100 times, 9 people clicked. Now let’s use this data to update what the model knows about proportion_clicks, the underlying proportion of people who would click on the ad.
The model you put together in the last exercise resulted in two vectors: (1) proportion_clicks, which represents the uncertainty regarding the underlying proportion of clicks, and (2) n_visitors, which represents the uncertainty regarding the number of visitors you would get. Together they form a joint probability distribution that represents the model’s prior assumptions.
>> Working with the code from the last exercise, put proportion_clicks and n_visitors into a data frame, assign it to the variable prior, and plot it. <<
prior$n_visitors represents the prior uncertainty over how many visitors you would get because of the ad campaign. But now you know you got exactly 9 visitors.
>> Update prior to include this information by conditioning on this data. That is, filter away all rows where prior$n_visitors isn’t equal to 9. Store the resulting data frame in posterior. <<
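The conditioning step is just standard data frame subsetting in R. A sketch with a toy data frame (hypothetical names and values, not the actual prior):

```r
# A toy data frame standing in for a prior sample
toy <- data.frame(p = c(0.1, 0.2, 0.3, 0.4),
                  count = c(9, 5, 9, 2))
# Keep only the rows where count equals 9
toy_conditioned <- toy[toy$count == 9, ]
toy_conditioned
```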
You’ve done some Bayesian inference! Let’s look at the result and what the model has learned about proportion_clicks.
>> Compare prior$proportion_clicks and posterior$proportion_clicks, for example by plotting them as histograms. <<
The whole distribution of samples in posterior$proportion_clicks now represents the posterior (after the data) probability distribution over what proportion_clicks could be.
>> Summarize posterior$proportion_clicks. Calculate a point estimate, for example by taking the median. Calculate a 95% probability interval, for example by using the quantile function. <<
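As a reminder of how quantile works (on made-up numbers, not the actual posterior), you pass the sample vector and the probabilities you want cut points for; the 2.5% and 97.5% quantiles bracket a central 95% interval:

```r
# Made-up sample: the integers 1 to 100
x <- 1:100
# Lower bound, median, and upper bound of a central 95% interval
quantile(x, probs = c(0.025, 0.5, 0.975))
```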
You’ve now come to the end of the exercises for part 1 and you’ve implemented a Bayesian model from scratch! If you still have some energy left, here are some bonus exercises for you.
At the beginning of the tutorial you were shown the output from a Bayesian model that analyzed data that consisted of 2 cured zombies out of 13 trials:
data <- c(1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0) # 1 is cured, 0 is not cured
The model you implemented in the exercises above is exactly the same model, with one difference: The prior distribution over the underlying proportion of success should be a uniform distribution from 0.0 to 1.0 (and not from 0.0 to 0.2).
>> Make changes to your model to recreate the result from the example from the beginning of the tutorial. (Not the whole plot, just the final posterior distribution) <<
The model you created in Bonus Exercise 1 is useful and correct, but computationally it’s a disaster (it’s slow, slow, slow!). Luckily, there’s a computational shortcut built into base R for exactly this model. If n_successes is the number of successes and n_failures is the number of failures, then you can use the rbeta() function to generate n_samples draws from the posterior, like this:
posterior_prop_success <- rbeta(n_samples, 1 + n_successes, 1 + n_failures)
Why does this work, and what does it have to do with some “beta distribution”? It’s a story for another time… But it works!
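One way to convince yourself that the shortcut behaves sensibly: a Beta(1 + n_successes, 1 + n_failures) distribution has mean (1 + n_successes) / (2 + n_successes + n_failures), so the average of a large sample should land close to that. A sketch with made-up counts (not the zombie data):

```r
set.seed(42)           # for reproducibility
n_samples   <- 100000
n_successes <- 5       # made-up counts
n_failures  <- 5
posterior_prop_success <- rbeta(n_samples, 1 + n_successes, 1 + n_failures)
mean(posterior_prop_success)  # should be close to (1 + 5) / (2 + 5 + 5) = 0.5
```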
>> Use the rbeta trick to replicate the result from Bonus Exercise 1. <<
At the beginning of the tutorial I demoed a function called prop_model to analyze the zombie data. That model is just a wrapper around the rbeta trick from Bonus Exercise 2, plus some code to make a pretty plot. If you want to try out prop_model, you can find the implementation on my blog here: http://sumsar.net/blog/2018/12/visualizing-the-beta-binomial/. To run it you need the tidyverse and ggridges packages. You can install them like this:
install.packages(c("ggridges", "tidyverse"))
>> Use the prop_model function to replicate the result from Bonus Exercise 1. (Or see if you can make it break!) :) <<