In these exercises, you’ll get to try out two packages that implement fairly advanced statistical models and use Bayesian inference under the hood. The packages are rstanarm and Facebook’s prophet.

In addition to these packages, you’ll need the tidyverse package, which provides ggplot2 for plotting and dplyr for data manipulation. If you don’t have experience with these packages, don’t sweat it! You can peek at the solutions as much as you want. :)

1. Bayesian linear regression with rstanarm

rstanarm is a complete Bayesian replacement for many of the regression modeling functions that come with R. Instead of lm you have stan_lm, instead of glm you have stan_glm, etc. Two benefits of rstanarm are that you can supply custom priors and that you get full posterior distributions over all parameters and predictions instead of just point estimates.

To try out rstanarm you’ll use a dataset on daily ice cream sales (ice_creams) as a function of the outdoor temperature in °C (daily_temperature). You can download the data here:

Download ice_cream_sales.csv

1A

>> Download the ice cream sales data, load it in (say with read_csv), and plot it. <<
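
If you want a hint, a minimal sketch could look like this (assuming ice_cream_sales.csv is in your working directory):

library(tidyverse)

ice_cream_sales <- read_csv("ice_cream_sales.csv")
ggplot(ice_cream_sales, aes(daily_temperature, ice_creams)) +
  geom_point()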

1B

The ice_cream_sales data looks like it would be well modeled by linear regression. Perhaps we want to predict future ice cream sales based on weather reports. Who knows?

>> Fit a classical linear regression model to the ice_cream_sales data using the lm function with ice_creams as the outcome and daily_temperature as the predictor. Summarise the resulting fit using summary. <<

If you don’t have much experience with regression modeling in R, or with lm, just go ahead and peek at the solution.
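
For reference, a minimal sketch of the lm call (one way to do it, not the only one):

lm_fit <- lm(ice_creams ~ daily_temperature, data = ice_cream_sales)
summary(lm_fit)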

1C

rstanarm makes it easy to run Bayesian versions of regression models. Just prepend stan_ to the lm function name and you’re ready to go! Almost… stan_lm requires you to set an explicit prior guess for the mean \(R^2\), the proportion of variance explained by the predictors. A decent default would be the following:

stan_fit <- stan_lm(y ~ x, data = my_data, prior = R2(0.5, "mean"))

To get a summary and histograms of the posterior you can then do the following:

summary(stan_fit)
plot(stan_fit, plotfun = "hist")

>> Use stan_lm to fit a Bayesian linear regression to the ice_cream_sales data. Plot histograms of the posterior and do a summary of the fit. <<

1D

If you compare the summaries of lm and stan_lm you’ll find that the estimates of the intercept ((Intercept)) and the slope (daily_temperature) are fairly similar. But in the Bayesian case, you also get uncertainty estimates for the standard deviation (sigma) and \(R^2\) (R2)! And as with all Bayesian models, you can also get out a sample that represents the posterior. With rstanarm, you can do that by converting the stan_lm fit to a data frame (using, for example, as.data.frame).

>> Extract the posterior samples from the stan_lm fit and take a look at them with head and plot. <<
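
A minimal sketch, assuming stan_fit holds the fit from 1C:

posterior <- as.data.frame(stan_fit)
head(posterior)
plot(posterior)  # plotting a numeric data frame gives a scatterplot matrix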

1E

Looking at the joint posterior probability distribution produced by plot, we see that the intercept and slope (daily_temperature) are correlated: a high slope makes a lower intercept more probable, and vice versa. This can be seen if we visualize the uncertainty in the regression line by plotting a subset of the regression lines defined by the posterior samples.

This can be done in ggplot2 like this (assuming you put the posterior samples in posterior):

posterior_subset <- sample_n(posterior, size = 20)
ggplot(ice_cream_sales, aes(daily_temperature, ice_creams)) +
  geom_point(color = "forestgreen") +
  geom_abline(aes(intercept = `(Intercept)`, slope = daily_temperature), data = posterior_subset, alpha = 0.5)

>> Try it out! <<

2. Bayesian generalized linear models with rstanarm

So far you’ve been analyzing the ice cream data using stan_lm, which assumes the data is normally distributed around the regression line. However, since the data is a daily count of ice creams it might be more suitable to use Poisson regression.

2A

In base R you can fit a Poisson regression using the generalized linear model function glm, like this:

fit <- glm(ice_creams ~ daily_temperature, data = ice_cream_sales, family = "poisson")
summary(fit)

In rstanarm, you instead have stan_glm, which here should work as a direct replacement (unlike stan_lm, it doesn’t require you to explicitly define any priors).
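
Concretely, swapping glm for stan_glm in the call above would look something like this:

stan_fit <- stan_glm(ice_creams ~ daily_temperature, data = ice_cream_sales, family = "poisson")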

>> Run a Bayesian Poisson regression using stan_glm, make a summary of the result, and plot posterior histograms (by setting the plotfun = "hist" argument). <<

2B

If you look at the posterior for the slope parameter (daily_temperature in the plot) you’ll see that, most probably, ice cream sales increase by between 0.04 and 0.06 per °C. But confusingly, the unit here is the log number of ice creams, which is a bit hard to interpret.

But what about how many ice creams we would sell today? Using the posterior_predict function you can get the posterior predictive distribution. The arguments you need to give posterior_predict are object, a fitted rstanarm object, and newdata, a data frame with the predictors you want to generate predictions with. For example:

# Assuming the model was trained with a predictor named x
posterior_prediction <- posterior_predict(stan_fit, newdata = data.frame(x = 42))

>> Given the regression model you just fitted, generate a sample from the posterior predictive distribution given the current outdoor temperature. Store it in the variable ice_creams_demand and plot it using hist. <<

For temperature data, I recommend the Norwegian Meteorological Institute’s weather service.
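
As a sketch, assuming the Poisson fit from 2A is in stan_fit and that today’s temperature is 22 °C (substitute the actual forecast):

ice_creams_demand <- posterior_predict(stan_fit, newdata = data.frame(daily_temperature = 22))
hist(ice_creams_demand)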

2C

If you look at the posterior predictive probability distribution you can see the most likely number of ice creams you’re going to sell (that’s where the distribution peaks) and you get a sense of how uncertain this estimate is (how spread out the posterior is).

But how much money should we expect to make if we brought, say, 75 ice creams with us and tried to sell them today? Let’s calculate that! First some required background info:

  • Each ice cream costs us €0.2 to purchase.
  • We sell each ice cream for €2.
  • Our freezer is broken so the ice creams we don’t sell are going to spoil.

>> For each sample in the posterior predictive ice_creams_demand calculate how much money you would make. The resulting vector can be interpreted as a probability distribution. Take the mean of this vector to get the expected profit when bringing 75 ice creams. <<

Tip: As R is vectorized, all of this can be done in a single line using multiplication, subtraction, mean, and the pmin function.
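
As a sketch, assuming ice_creams_demand holds the posterior predictive samples:

# We can sell at most 75 ice creams at €2 each, and the 75 we bought cost €0.2 each
mean(pmin(ice_creams_demand, 75) * 2.0 - 75 * 0.2)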

2D

But how many ice creams should we bring if we go out selling ice cream today?

>> Calculate the expected profit you’ll make if you bring anything from 1 to 200 ice creams. plot the expected profit as a function of the number of ice creams brought and calculate the most profitable number of ice creams to bring. <<

Tip: This can be done with a for loop, or more functionally using sapply or purrr::map_dbl. which.max can be used to figure out the optimal number of ice creams to bring.
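
A sketch of the sapply approach:

n_brought <- 1:200
expected_profit <- sapply(n_brought, function(n) {
  mean(pmin(ice_creams_demand, n) * 2.0 - n * 0.2)
})
plot(n_brought, expected_profit, type = "l")
n_brought[which.max(expected_profit)]  # the most profitable number to bring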

2E

Congrats! :) You’ve just performed a fairly sophisticated decision analysis of how many ice creams to bring. The validity of this analysis hinges on how well the model captures reality. Checking that is out of scope here. However, we can at least check whether the model was fitted correctly by rstanarm. There are many things that can go wrong when fitting a Bayesian model, but one quick sanity check is to look at the trace plots of the MCMC chains that produced the posterior samples. Ideally, these should look stationary and without discernible patterns. Generally, you want your trace plots to look like “hairy caterpillars”.

You can produce trace plots from a fitted rstanarm model like this:

plot(stan_fit, plotfun = "trace")

Or you can launch an interactive dashboard called ShinyStan that also allows you to view the trace plots (+ a lot more!):

launch_shinystan(stan_fit)

>> Take a look at the traceplots for the Bayesian Poisson regression model to make sure that they look OK. <<



That concludes the rstanarm exercises. You’ve barely scratched the surface of rstanarm and there are many more models you can fit with it (stan_lmer, stan_glmer, stan_polr, etc.). But all are used in a similar way to stan_lm and stan_glm.


3. prophet: a time series model from Facebook

While rstanarm implements a large number of different models, prophet is focused on one case only: modeling one-dimensional time series data. Here is the time series you’ll work with in these exercises, the daily number of reported crimes in Chicago from 2009-06-20 to 2019-06-10 (source):

Download chicago_daily_crime_2009-2019.csv

3A

prophet accepts a data frame with one row per day and the following columns: ds, a column of Dates, and y, a column with the outcome variable. If y is a count, it might also make sense to log transform it before giving it to prophet.

>> Read in the chicago_daily_crime_2009-2019.csv dataset, rename the columns to ds and y, and log10 transform y. plot the resulting data frame. <<
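
A sketch of the data preparation (the original column names date and daily_crimes are assumptions; check the actual headers in the csv):

chicago_daily_crime <- read_csv("chicago_daily_crime_2009-2019.csv") %>%
  rename(ds = date, y = daily_crimes) %>%  # assumed original column names
  mutate(y = log10(y))
ggplot(chicago_daily_crime, aes(ds, y)) +
  geom_point(alpha = 0.3)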

3B

prophet has a great tutorial, and your task is now to follow the R tutorial from reading in the data to plotting the components with prophet_plot_components, but using the chicago_daily_crime dataset instead.

>> Do the steps in the R tutorial found here, but using the chicago_daily_crime dataset. <<

3C

Looking at the scatter plot with the superimposed (blue) model predictions, we see that there are many data points far above and far below the predicted values. Part of this might be explained by crime reporting being impacted by certain holidays. Fortunately, prophet supports adding the impact of special holidays to the model.

>> Follow the tutorial on adding US holidays to your prophet model. <<

Tip: In the tutorial it’s a bit confusing how they initialize the model. As you are only adding “standard” US holidays, just initialize the model without the holidays argument:

m <- prophet()
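
After that, the standard US holidays can be added before fitting, roughly like this:

m <- add_country_holidays(m, country_name = "US")
m <- fit.prophet(m, chicago_daily_crime)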

3D

>> Again, like earlier, make a forecast and plot it using plot and prophet_plot_components. Did adding the holidays have any impact? <<

3E

As you have 10 years of data, the plots become a bit crowded and it’s hard to see what’s going on.

>> Use the dyplot.prophet function to display an interactive plot. Zoom in on the data and see if you can identify some of the holidays that produce the largest “outliers” in reported crimes. <<

4. prophet: Predicting the number of reported crimes in Chicago today

As the name of the package might give away, one of prophet’s main use-cases is predicting the future. Let’s predict the number of reported crimes in Chicago today.

4A

The predictive_samples function can be used to produce samples from the posterior predictive distribution. Given a fitted prophet model and a one-column data frame with dates (as produced by make_future_dataframe), it outputs a list with two matrices, trend and yhat. Here yhat is a matrix of posterior predictive samples, with one row per date and one column per sample from the posterior predictive.

>> Use make_future_dataframe to create a data frame with dates for forecasting one year ahead. Then use predictive_samples to sample from the posterior predictive. Extract the resulting yhat matrix and assign it to pred_post. <<

Tip: When using make_future_dataframe set include_history = FALSE to only include future dates and exclude the 10-year history.
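
Put together, a sketch (assuming your fitted prophet model is m):

future <- make_future_dataframe(m, periods = 365, include_history = FALSE)
pred <- predictive_samples(m, future)
pred_post <- pred$yhat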

4B

pred_post is now a large matrix where each row represents a probability distribution over how many crimes will be reported that day. Unfortunately, all numbers in pred_post are on the log10 scale. It would be more useful if the numbers in pred_post represented counts of reported crimes.

>> Transform pred_post from the log10 scale back to the original (un-logged) scale. <<

Tip: log10(1000) == 3 and 10^3 == 1000. R is vectorized, so applying a transformation to all elements in a matrix is a short one-liner.
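
That is, the whole transformation is just:

pred_post <- 10^pred_post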

4C

>> Take a look at some of the columns of the now un-logged pred_post by plotting a couple of them. What do the columns represent? <<

Tip: ggplot2 is great, but if you just want to plot a couple of columns from a matrix x on top of each other you could do something like this:

plot(x[,1], type = "l", col = "red")
lines(x[, 2], col = "green")
lines(x[, 3], col = "blue")

4D

Finally, let’s look at the prediction for today’s number of reported crimes in Chicago.

>> Figure out which row in pred_post predicts today’s number of reported crimes. Extract this row and take a look at it using hist. Finally, calculate the probability that there will be 750 or more reported crimes in Chicago today. <<

Tip: As pred_post does not include actual dates you’ll have to cross-reference between future$ds and pred_post. For example, the following would get you the row number of "2019-07-01":

which(as.Date(future$ds) == "2019-07-01")
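
A sketch using today’s date (Sys.Date() returns the current date):

today_row <- which(as.Date(future$ds) == Sys.Date())
crimes_today <- pred_post[today_row, ]
hist(crimes_today)
mean(crimes_today >= 750)  # the probability of 750 or more reported crimes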


That concludes the prophet exercises. prophet is a fairly flexible package and there are many more options you can use to customize the model to better fit a time series.

Bonus Exercises

5. Predicting the future that never was with Google’s CausalImpact

Google’s CausalImpact is even more focused than prophet. It deals only with the case where there’s a change in the world that you believe had an impact on a time series, and you want to estimate: how would this time series have turned out if no change had happened? For example: what impact did the redesign of our site have on the number of purchases? Or, how did the emissions scandal of 2015 (a.k.a. Dieselgate) impact Volkswagen’s stock price?

Let’s take a look at this latter example. Here’s Volkswagen’s stock price over time (plus the stock prices of BMW and Allianz):

Download stock_prices.csv


This exercise is based on the well-written case study on the Volkswagen scandal by Francesco Rinaldi: https://rinaldif.github.io/causal-impact/

5A

>> Load in the CausalImpact package and read the csv data into the variable stock_prices. Have a look at the data. <<

5B

While CausalImpact can work with data frames, it will produce better plots if you give it a zoo time series. You can transform a data frame into a zoo series using the read.zoo function. Another benefit is that you can plot zoo series using the autoplot function in ggplot2.

>> Convert stock_prices to a zoo time series and plot it using autoplot. <<

Tip: Set facet = NULL when calling autoplot to view all the time series in the same plot.
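
A sketch, assuming the first column of stock_prices contains the dates (read.zoo uses the first column as the time index) and that ggplot2 is loaded:

library(zoo)
stock_zoo <- read.zoo(stock_prices)
autoplot(stock_zoo, facet = NULL)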

5C

The emissions scandal broke on the 20th of September 2015 (that’s when Volkswagen’s stock drops like a stone in the plot). For CausalImpact to do its job it needs to know the pre.period, a period of reliable data before the event, and the post.period, the period after the event for which we want to estimate the impact. Here are reasonable periods:

pre.period <- as.Date(c("2011-01-03", "2015-09-14"))
post.period <- as.Date(c("2015-09-21", "2017-03-19"))

Here’s how to fit a CausalImpact model:

impact <- CausalImpact(data, pre.period, post.period)

>> Select the "volkswagen" column in stock_prices and fit a CausalImpact model to this data. Plot the resulting object using plot. <<

Tip: You can subset a zoo time series just like a data frame. Here’s how you would select the column "my_column" from my_zoo_series:

my_zoo_series[, "my_column"]
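
Putting the pieces together, a sketch (assuming the zoo series from 5B is named stock_zoo):

impact <- CausalImpact(stock_zoo[, "volkswagen"], pre.period, post.period)
plot(impact)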

5D

The blue dashed line in the plot shows CausalImpact’s prediction for what would have happened to the stock price if the emissions scandal hadn’t happened. But it’s not a very good prediction: the stock price is just predicted to stay constant, and the uncertainty (the light blue shade) is huge!

This is because we haven’t given CausalImpact much data to work with. One of the features of CausalImpact is that it can use parallel time series to improve the prediction of the main time series.

>> Fit the CausalImpact model again, but this time give it all the data that’s in stock_prices. That is, also include the columns "BMW" (another car manufacturer) and "Allianz" (another large German company). plot the result again. <<

5E

The prediction in this new plot looks more reasonable, and it is also less uncertain. If you look at the panel labeled “pointwise” you see the difference between the prediction and what actually happened. By using the summary function on the CausalImpact object you can also get a summary of the result.

>> Looking at the plot and a summary of the result, what was the estimated impact on the stock price of Volkswagen at the beginning of 2017? <<

5F

CausalImpact has a cool feature: by passing output = "report" to the summary function it will generate an automated report explaining the result.

>> Try it out! <<
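
That is, something like:

summary(impact, "report")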


That concludes the CausalImpact exercises. CausalImpact is a fairly specific package, with one specific use case. But it’s good at what it does! :)