In these exercises, you’ll build a Bayesian model from scratch. The path there will twist and turn, but you’ll get there in the end! :)

1. Simulating zombie cure ad clicks

You are an analyst at a company selling a zombieism cure online. You know that a previous analyst analyzed an ad campaign, but his code is a mess! Let’s start over.

1A

You’ve been tinkering with a model that can simulate how many people will click on an ad, given the number of ads shown (n_ads_shown) and the underlying proportion of people who would click (proportion_clicks). Here it is:

# A generative model for the number of clicks our 
# zombie cure ad will generate

# Parameters
proportion_clicks <- 0.1
n_ads_shown <- 100

# Simulating data
clicked_on_ad <- c()
for(nth_ad_shown in 1:n_ads_shown) {
  clicked_on_ad[nth_ad_shown] <- runif(1, min = 0, max = 1) < proportion_clicks
}
n_clicks <- sum(clicked_on_ad)
n_clicks

>> Run the generative model above a couple of times and take a look at the output. <<

1B

It turns out that this generative model already has a name. It’s called the binomial process or the binomial distribution. In R you can use the rbinom function to simulate data from a binomial distribution. The rbinom function takes three arguments:

  • n: the number of times you want to run the generative model.
  • size: the number of trials. (For example, the number of zombies you’re giving the drug.)
  • prob: the underlying proportion of success, as a number between 0.0 and 1.0.
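For example, three runs of a small generative model, each counting successes out of 10 trials with a 50% success probability, could look like this (the exact numbers will vary from run to run, since the draws are random):

```r
# Three draws from a binomial distribution: each draw is the
# number of successes out of 10 trials at 50% probability
rbinom(n = 3, size = 10, prob = 0.5)
```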

>> Replicate the result from task 1A using the rbinom function: simulate one count of the number of clicked ads out of 100 shown, where the underlying proportion of success is 10%. <<

1C

A nice thing with rbinom is that it makes it easy to rerun the generative model many times.

>> Change the code from 1B to run the simulation 1000 times rather than just one time. Plot the resulting vector as a histogram (say, using hist()). <<

2. How many clicks will we get? A generative model

Below is a code scaffold to get you started. It’s useful to start out with this as we will refer to these variable names going forward.

n_samples <- 100000
n_ads_shown <- ...
proportion_clicks <- ...
n_visitors <- rbinom(...)

2A

To get more visitors to your website you are considering paying for an ad to be shown 100 times on a popular social media site. According to the social media site, their ads get clicked on 10% of the time.

Assume that 10% is a reasonable number, and assume that the binomial distribution is a reasonable generative model for how people click on ads.

>> Use the rbinom function to generate a large sample that represents the probability distribution over what the number of visitors to your site is going to be. Visualize this probability distribution using hist(). <<

Below is a code scaffold to get you started. It’s useful to start out with this as we will refer to these variable names going forward.

n_samples <- 100000
n_ads_shown <- ...
proportion_clicks <- ...
n_visitors <- rbinom(...)

2B

You would like the ad campaign to result in at least 5 ad clicks and subsequent visits to your site.

>> Calculate the probability that you’ll get 5 or more clicks given the model you just used. <<

Tip: Sum up how many samples in n_visitors are greater than or equal to 5 and divide by the total number of samples in n_visitors.
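One way to write this tip as code, assuming n_visitors holds the samples you generated in 2A:

```r
# Proportion of simulated campaigns that produced at least 5 visitors
sum(n_visitors >= 5) / length(n_visitors)
# equivalently: mean(n_visitors >= 5)
```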

3. How many clicks will we get? Prior uncertainty

You’re not so sure that your ad will get clicked on exactly 10% of the time. Instead of assigning proportion_clicks a single value, you are now going to assign it a probability distribution representing (some of) this uncertainty. In practice, this means assigning it a large number of values drawn from a probability distribution.

3A

Assume that it’s equally likely that proportion_clicks could be anywhere from as low as 0% to as high as 20%. These assumptions translate into a uniform distribution, which you can sample from in R like this:

x <- runif(n = n_samples, min = 0.0, max = 0.2)

>> Replace the single value that is assigned to proportion_clicks with n_samples samples produced by runif as above. Run the model again and take a look at the distribution of n_visitors. <<

3B

The distribution of n_visitors now looks very different compared to before. With the added uncertainty in proportion_clicks, the uncertainty over how many visitors we’ll get also increased.

>> Are we now more or less certain that we’ll get 5 or more clicks? Again, calculate the probability that you’ll get 5 or more clicks given the model you just used. <<

4. How many clicks will we get? Bayesian inference

You now have a generative model + prior uncertainty, that is, a Bayesian model and you are ready to do some Bayesian inference. You just need some data. Here you go:

9

When the ad was shown 100 times, 9 people clicked. Now let’s use this data to update what the model knows about proportion_clicks, the underlying proportion of people who would click on the ad.

4A

The model you put together in the last exercise resulted in two vectors: (1) proportion_clicks, which represents the uncertainty regarding the underlying proportion of clicks, and (2) n_visitors, which represents the uncertainty regarding the number of visitors you would get. Together they form a joint probability distribution that represents the model’s prior assumptions.

>> Working with the code from the last exercise, put proportion_clicks and n_visitors into a data frame, assign it to the variable prior, and plot it. <<
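If you want a hint, one possible shape for this (calling plot() on a two-column data frame gives a quick scatter plot of the joint distribution):

```r
# Each row is one sample from the joint prior distribution
prior <- data.frame(proportion_clicks, n_visitors)
plot(prior)
```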

4B

prior$n_visitors represents the prior uncertainty over how many visitors you would get because of the ad campaign. But now you know you got exactly 9 visitors.

>> Update prior to include this information by conditioning on the data. That is, filter away all rows where prior$n_visitors isn’t equal to 9. Store the resulting data frame in posterior. <<
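As a hint, conditioning by filtering is just logical subsetting in base R; a sketch, assuming prior is the data frame from 4A:

```r
# Keep only the samples consistent with the observed data: 9 visitors
posterior <- prior[prior$n_visitors == 9, ]
```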

4C

You’ve done some Bayesian inference! Let’s look at the result and what the model has learned about proportion_clicks.

>> Compare prior$proportion_clicks and posterior$proportion_clicks, for example by plotting them as histograms. <<

4D

The whole distribution of samples in posterior$proportion_clicks now represents the posterior (after the data) probability distribution over what proportion_clicks could be.

>> Summarize posterior$proportion_clicks. Calculate a point estimate, for example, by taking the median. Calculate a 95% probability interval, for example, by using the quantile function. <<
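A sketch of these summaries, assuming posterior is the data frame from 4B:

```r
# Point estimate: the median of the posterior samples
median(posterior$proportion_clicks)
# 95% probability interval: the 2.5% and 97.5% quantiles
quantile(posterior$proportion_clicks, probs = c(0.025, 0.975))
```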




You’ve now come to the end of the exercises for part 1 and you’ve implemented a Bayesian model from scratch! If you still have some energy left, here are some bonus exercises for you.

Bonus exercise 1: Return of the zombie cure data

At the beginning of the tutorial you were shown the output from a Bayesian model that analyzed data that consisted of 2 cured zombies out of 13 trials:

data <- c(1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0) # 1 is cured, 0 is not cured

The model you implemented in the exercises above is exactly the same model, with one difference: The prior distribution over the underlying proportion of success should be a uniform distribution from 0.0 to 1.0 (and not from 0.0 to 0.2).
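Concretely, that one difference amounts to widening the bounds of the runif call that generates the prior samples:

```r
# Uniform prior over the full 0.0 to 1.0 range, instead of 0.0 to 0.2
proportion_clicks <- runif(n = n_samples, min = 0.0, max = 1.0)
```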

>> Make changes to your model to recreate the result from the example from the beginning of the tutorial. (Not the whole plot, just the final posterior distribution) <<

Bonus exercise 2: A computational shortcut

The model you created in Bonus exercise 1 is useful and correct, but computationally it’s a disaster (it’s slow slow slow!). Luckily, there’s a computational shortcut built into base R for exactly this model. If n_successes is the number of successes and n_failures is the number of failures, then you can use the rbeta() function to generate n_samples draws from the posterior, like this:

posterior_prop_success <- rbeta(n_samples, 1 + n_successes, 1 + n_failures)

Why does this work, and what does it have to do with some “beta distribution”? It’s a story for another time… But it works!

>> Use the rbeta trick to replicate the result from Bonus exercise 1. <<

Bonus exercise 3: The implementation of prop_model

At the beginning of the tutorial I demoed a function called prop_model to analyze the zombie data. That model is just a wrapper around the rbeta trick in Bonus exercise 2 + some code to make a pretty plot. If you want to try out prop_model you can find the implementation on my blog here: http://sumsar.net/blog/2018/12/visualizing-the-beta-binomial/

To run it you need the tidyverse and ggridges packages. You can install them like this:

install.packages(c("ggridges", "tidyverse")) 

>> Use the prop_model function to replicate the result from Bonus Exercise 1. (Or see if you can make it break!) :) <<