In these exercises, you’ll build a Bayesian model from scratch. The path there will twist and turn, but you’ll get there in the end! :)

1. Simulating zombie cure ad clicks

You are an analyst at a company selling a zombieism cure online. You know that a previous analyst analyzed an ad campaign, but his code is a mess! Let’s start over.

1A

You’ve been tinkering with a model that can simulate how many people will click on an ad, given the number of ads shown (n_ads_shown) and the underlying proportion of people who would click (proportion_clicks). Here it is:

# A generative model for the number of clicks our 
# zombie cure ad will generate

# Parameters
proportion_clicks <- 0.1
n_ads_shown <- 100

# Simulating data
clicked_on_ad <- c()
for(nth_ad_shown in 1:n_ads_shown) {
  clicked_on_ad[nth_ad_shown] <- runif(1, min = 0, max = 1) < proportion_clicks
}
n_clicks <- sum(clicked_on_ad)
n_clicks

>> Run the generative model above a couple of times and take a look at the output. <<

1B

It turns out that this generative model already has a name. It’s called the binomial process or the binomial distribution. In R you can use the rbinom function to simulate data from a binomial distribution. The rbinom function takes three arguments:

  • n: the number of times you want to run the generative model.
  • size: the number of trials. (For example, the number of zombies you’re giving the drug.)
  • prob: the underlying proportion of success, as a number between 0.0 and 1.0.
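For example, three runs of a small generative model, each counting successes out of 10 trials with a 50% success probability, could look like this (the exact numbers will vary from run to run, since the draws are random):

```r
# Three draws from a binomial distribution: each draw is the
# number of successes out of 10 trials at 50% probability
rbinom(n = 3, size = 10, prob = 0.5)
```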

>> Replicate the result from task 1A using the rbinom function: simulate one count of the number of clicked ads out of 100 shown, where the underlying proportion of success is 10%. <<

1C

A nice thing with rbinom is that it makes it easy to rerun the generative model many times.

>> Change the code from 1B to run the simulation 1000 times rather than just one time. Plot the resulting vector as a histogram (say, using hist()). <<

2. How many clicks will we get? A generative model

Below is a code scaffold to get you started. It’s useful to start out with this as we will refer to these variable names going forward.

n_samples <- 100000
n_ads_shown <- ...
proportion_clicks <- ...
n_visitors <- rbinom(...)

2A

To get more visitors to your website you are considering paying for an ad to be shown 100 times on a popular social media site. According to the social media site, their ads get clicked on 10% of the time.

Assume that 10% is a reasonable number, and assume that the binomial distribution is a reasonable generative model for how people click on ads.

>> Use the rbinom function to generate a large sample that represents the probability distribution over what the number of visitors to your site is going to be. Visualize this probability distribution using hist(). <<

Below is a code scaffold to get you started. It’s useful to start out with this as we will refer to these variable names going forward.

n_samples <- 100000
n_ads_shown <- ...
proportion_clicks <- ...
n_visitors <- rbinom(...)

2B

You would like the ad campaign to result in at least 5 ad clicks and subsequent visits to your site.

>> Calculate the probability that you’ll get 5 or more clicks given the model you just used. <<

Tip: Sum up how many samples in n_visitors are greater than or equal to 5 and divide by the total number of samples in n_visitors.
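One way to write this tip as code, assuming n_visitors holds the samples you generated in 2A:

```r
# Proportion of simulated campaigns that produced at least 5 visitors
sum(n_visitors >= 5) / length(n_visitors)
# equivalently: mean(n_visitors >= 5)
```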

3. How many clicks will we get? Prior uncertainty

You’re not so sure that your ad will get clicked on exactly 10% of the time. Instead of assigning proportion_clicks a single value, you are now going to assign it a probability distribution representing (some of) this uncertainty. In practice, this means assigning it a large number of values drawn from a probability distribution.

3A

Assume that it’s equally likely that proportion_clicks could be anywhere from as low as 0% to as high as 20%. These assumptions translate into a uniform distribution, which you can sample from in R like this:

x <- runif(n = n_samples, min = 0.0, max = 0.2)

>> Replace the single value that is assigned to proportion_clicks with n_samples samples produced by runif as above. Run the model again and take a look at the distribution of n_visitors. <<

3B

The distribution of n_visitors now looks very different compared to before. With the added uncertainty in proportion_clicks, the uncertainty over how many visitors we’ll get also increased.

>> Are we now more or less certain that we’ll get 5 or more clicks? Again, calculate the probability that you’ll get 5 or more clicks given the model you just used. <<

4. How many clicks will we get? Bayesian inference

You now have a generative model + prior uncertainty, that is, a Bayesian model and you are ready to do some Bayesian inference. You just need some data. Here you go:

9

When the ad was shown 100 times, 9 people clicked. Now let’s use this data to update what the model knows about proportion_clicks, the underlying proportion of people who would click on the ad.

4A

The model you put together in the last exercise resulted in two vectors: (1) proportion_clicks, which represents the uncertainty regarding the underlying proportion of clicks, and (2) n_visitors, which represents the uncertainty regarding the number of visitors you would get. Together they form a joint probability distribution that represents the model’s prior assumptions.

>> Working with the code from the last exercise, put proportion_clicks and n_visitors into a data frame, assign it to the variable prior, and plot it. <<
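If you want a hint, one possible shape for this (calling plot() on a two-column data frame gives a quick scatter plot of the joint distribution):

```r
# Each row is one sample from the joint prior distribution
prior <- data.frame(proportion_clicks, n_visitors)
plot(prior)
```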

4B

prior$n_visitors represents the prior uncertainty over how many visitors you would get because of the ad campaign. But now you know you got exactly 9 visitors.

>> Update prior to include this information by conditioning on the data. That is, filter away all rows where prior$n_visitors isn’t equal to 9. Store the resulting data frame in posterior. <<
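As a hint, conditioning by filtering is just logical subsetting in base R; a sketch, assuming prior is the data frame from 4A:

```r
# Keep only the samples consistent with the observed data: 9 visitors
posterior <- prior[prior$n_visitors == 9, ]
```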

4C

You’ve done some Bayesian inference! Let’s look at the result and what the model has learned about proportion_clicks.

>> Compare prior$proportion_clicks and posterior$proportion_clicks, for example by plotting them as histograms. <<

4D

The whole distribution of samples in posterior$proportion_clicks now represents the posterior (after the data) probability distribution over what proportion_clicks could be.

>> Summarize posterior$proportion_clicks. Calculate a point estimate, for example, by taking the median. Calculate a 95% probability interval, for example, by using the quantile function. <<
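A sketch of these summaries, assuming posterior is the data frame from 4B:

```r
# Point estimate: the median of the posterior samples
median(posterior$proportion_clicks)
# 95% probability interval: the 2.5% and 97.5% quantiles
quantile(posterior$proportion_clicks, probs = c(0.025, 0.975))
```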




You’ve now come to the end of the exercises for part 1 and you’ve implemented a Bayesian model from scratch! If you still have some energy left, here are some bonus exercises for you.

Bonus exercise 1: Return of the zombie cure data

At the beginning of the tutorial you were shown the output from a Bayesian model that analyzed data that consisted of 2 cured zombies out of 13 trials:

data <- c(1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0) # 1 is cured, 0 is not cured

The model you implemented in the exercises above is exactly the same model, with one difference: The prior distribution over the underlying proportion of success should be a uniform distribution from 0.0 to 1.0 (and not from 0.0 to 0.2).
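Concretely, that one difference amounts to widening the bounds of the runif call that generates the prior samples:

```r
# Uniform prior over the full 0.0 to 1.0 range, instead of 0.0 to 0.2
proportion_clicks <- runif(n = n_samples, min = 0.0, max = 1.0)
```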

>> Make changes to your model to recreate the result from the example from the beginning of the tutorial. (Not the whole plot, just the final posterior distribution) <<

Bonus exercise 2: A computational shortcut

The model you created in Bonus exercise 1 is useful and correct, but computationally it’s a disaster (it’s slow slow slow!). Luckily, there’s a computational shortcut built into base R for exactly this model. If n_successes is the number of successes and n_failures is the number of failures, then you can use the rbeta() function to generate n_samples draws from the posterior, like this:

posterior_prop_success <- rbeta(n_samples, 1 + n_successes, 1 + n_failures)

Why does this work, and what does it have to do with some “beta distribution”? It’s a story for another time… But it works!

>> Use the rbeta trick to replicate the result from Bonus exercise 1. <<

Bonus exercise 3: The implementation of prop_model

At the beginning of the tutorial I demoed a function called prop_model to analyze the zombie data. That model is just a wrapper around the rbeta trick in Bonus exercise 2 + some code to make a pretty plot. If you want to try out prop_model you can find the implementation on my blog here: http://sumsar.net/blog/2018/12/visualizing-the-beta-binomial/

To run it you need the tidyverse and ggridges packages. You can install them like this:

install.packages(c("ggridges", "tidyverse")) 

>> Use the prop_model function to replicate the result from Bonus Exercise 1. (Or see if you can make it break!) :) <<