In these exercises, you'll get to try out three packages that implement fairly advanced statistical models and use Bayes under the hood: rstanarm, Facebook's prophet, and Google's CausalImpact. In addition to these packages, you'll also need the tidyverse package, to use ggplot2 for plotting and dplyr for data manipulation. If you don't have experience with these packages, don't sweat it! You can peek at the solutions as much as you want. :)
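If you want to follow along in your own R session, loading the packages could look like this (assuming they are already installed):

# Load the packages used throughout these exercises
library(tidyverse)  # ggplot2, dplyr, readr, etc.
library(rstanarm)
library(prophet)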
rstanarm

rstanarm is a complete Bayesian replacement for many of the regression modeling functions that come with R. Instead of lm you have stan_lm, instead of glm you have stan_glm, etc. Two benefits of rstanarm are that you can supply custom priors and that you get probability distributions over all parameters and predictions instead of just point estimates.
To try out rstanarm you'll use a dataset of daily ice cream sales (ice_creams) as a function of the outdoor temperature in °C (daily_temperature). You can download the data here:

>> Download the ice cream sales data, load it in (say with read_csv), and plot it. <<
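One possible sketch of a solution (the file name ice_cream_sales.csv is just a guess; use whatever the downloaded file is called):

ice_cream_sales <- read_csv("ice_cream_sales.csv")  # hypothetical file name
ggplot(ice_cream_sales, aes(daily_temperature, ice_creams)) +
  geom_point()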
The ice_cream_sales data looks like it would be well modeled by linear regression. Perhaps we want to predict future ice cream sales based on weather reports. Who knows?

>> Fit a classical linear regression model to the ice_cream_sales data using the lm function, with ice_creams as the outcome and daily_temperature as the predictor. Summarise the resulting fit using summary. <<

If you don't have much experience with regression modeling in R, or with lm, just go ahead and peek at the solution.
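A minimal sketch of the classical fit:

fit <- lm(ice_creams ~ daily_temperature, data = ice_cream_sales)
summary(fit)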
rstanarm makes it easy to run Bayesian versions of regression models. Just prepend stan_ to the lm function name and you're ready to go! Almost… stan_lm requires you to set an explicit prior guess for the mean \(R^2\), the variance explained by the predictors. A decent default would be the following:
stan_fit <- stan_lm(y ~ x, data = my_data, prior = R2(0.5, "mean"))
To get a summary and histograms of the posterior you can then do the following:
summary(stan_fit)
plot(stan_fit, plotfun = "hist")
>> Use stan_lm to fit a Bayesian linear regression to the ice_cream_sales data. Plot histograms of the posterior and do a summary of the fit. <<
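Using the default prior guess from above, the Bayesian version could look like this:

stan_fit <- stan_lm(ice_creams ~ daily_temperature, data = ice_cream_sales,
                    prior = R2(0.5, "mean"))
summary(stan_fit)
plot(stan_fit, plotfun = "hist")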
If you compare the summary of lm and stan_lm you'll find that the estimates of the intercept ((Intercept)) and the slope (daily_temperature) are fairly similar. But in the Bayesian case, you get uncertainty estimates for the standard deviation (sigma) and \(R^2\) (R2) too! And as with all Bayesian models, you can also get out a sample that represents the posterior. With rstanarm, you can do that by converting the stan_lm fit to a data frame (using, for example, as_data_frame).
>> Extract the posterior samples from the stan_lm fit and take a look at them with head and plot. <<
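A sketch of one way to do this (base R's as.data.frame also works directly on rstanarm fits):

posterior <- as.data.frame(stan_fit)
head(posterior)
plot(posterior)  # a pairs plot showing the joint posterior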
Looking at the joint posterior probability distribution produced by plot, we see that the intercept and the slope (daily_temperature) are correlated: a high slope means it's probable that the intercept is lower, and vice versa. This can be seen if we visualize the uncertainty in the regression line by plotting a subset of the regression lines defined by the posterior. This can be done in ggplot2 like this (assuming you put the posterior samples in posterior):
posterior_subset <- sample_n(posterior, size = 20)
ggplot(ice_cream_sales, aes(daily_temperature, ice_creams)) +
geom_point(color = "forestgreen") +
geom_abline(aes(intercept = `(Intercept)`, slope = daily_temperature), data = posterior_subset, alpha = 0.5)
>> Try it out! <<
rstanarm: Poisson regression
So far you've been analyzing the ice cream data using stan_lm, which assumes the data is normally distributed around the regression line. However, since the data is a daily count of ice creams, it might be more suitable to use Poisson regression. In base R you can fit a Poisson regression using the generalized linear model function glm, like this:
fit <- glm(ice_creams ~ daily_temperature, data = ice_cream_sales, family = "poisson")
summary(fit)
In rstanarm, you instead have stan_glm, which here should work as a direct replacement (unlike stan_lm, it doesn't require you to explicitly define any priors).
>> Run a Bayesian Poisson regression using stan_glm, make a summary of the result, and plot posterior histograms (by setting the plotfun = "hist" argument). <<
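A sketch, reusing the formula from the glm call above:

poisson_fit <- stan_glm(ice_creams ~ daily_temperature,
                        data = ice_cream_sales, family = "poisson")
summary(poisson_fit)
plot(poisson_fit, plotfun = "hist")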
If you look at the posterior for the slope parameter (daily_temperature in the plot) you'll see that ice cream sales most probably increase by between 0.04 and 0.06 per degree C. But confusingly, the unit here is the log number of ice creams, which is a bit hard to interpret.
But what about how many ice creams we would sell today? Using the posterior_predict function you can get the posterior predictive distribution. The arguments you need to give posterior_predict are object, a fitted rstanarm object, and newdata, a data frame with the predictors you want to generate predictions with. For example:
# Assuming the model was trained with a predictor named x
posterior_prediction <- posterior_predict(stan_fit, newdata = data.frame(x = 42))
>> Given the regression model you just fitted, generate a sample from the posterior predictive distribution given the current outdoor temperature. Store it in the variable ice_creams_demand and plot it using hist. <<
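For example, assuming today's outdoor temperature is 25 °C (a made-up number; plug in the actual temperature):

ice_creams_demand <- posterior_predict(poisson_fit,
                                       newdata = data.frame(daily_temperature = 25))
hist(ice_creams_demand)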
For temperature data, I recommend the Norwegian Meteorological Institute’s weather service.
If you look at the posterior predictive probability distribution, you can see the most likely number of ice creams you're going to sell (that's where the distribution peaks) and you get a sense of how uncertain this estimate is (how widely the posterior is spread out). But how much money should we expect to make if we brought, say, 75 ice creams with us and tried to sell them today? Let's calculate that! First, some required background info:
>> For each sample in the posterior predictive ice_creams_demand, calculate how much money you would make. The resulting vector can be interpreted as a probability distribution. Take the mean of this vector to get the expected profit when bringing 75 ice creams with you. <<
Tip: As R is vectorized, all of this can be done in a single line using multiplication, addition, mean, and the pmin function.
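A sketch of such a one-liner, with purely hypothetical prices (say each ice cream sells for $2 and each one you bring costs you $0.5; check the solutions for the actual numbers used):

# Hypothetical prices: $2 revenue per ice cream sold, $0.5 cost per ice cream brought
mean(pmin(ice_creams_demand, 75) * 2 - 75 * 0.5)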
But how many ice creams should we bring if we were to go out selling ice cream today?
>> Calculate the expected profit you'll make if you bring anything from 1 to 200 ice creams. plot the expected profit as a function of the number of ice creams brought, and calculate the most profitable number of ice creams to bring. <<
Tip: This can be done with a for loop, or more functionally using sapply or purrr::map_dbl. which.max can be used to figure out the optimal number of ice creams to bring.
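A sketch using sapply, with the same hypothetical prices as above:

expected_profit <- sapply(1:200, function(n) {
  mean(pmin(ice_creams_demand, n) * 2 - n * 0.5)  # hypothetical prices
})
plot(1:200, expected_profit, type = "l")
which.max(expected_profit)  # the most profitable number to bring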
Congrats! :) You've just performed a fairly sophisticated decision analysis on how many ice creams to bring. The validity of this analysis hinges on how well the model reflects reality, and checking that is out of scope here. However, we can at least check whether the model was fitted correctly by rstanarm. There are many things that can go wrong when fitting a Bayesian model, but one quick sanity check is to look at the trace plots of the MCMC chains that produced the posterior samples. Ideally, these should look stationary and without discernible patterns. Generally, you want your trace plots to look like "hairy caterpillars".
You can produce trace plots from a fitted rstanarm model like this:
plot(stan_fit, plotfun = "trace")
Or you can launch an interactive dashboard called ShinyStan that also lets you view the trace plots (+ a lot more!):
launch_shinystan(stan_fit)
>> Take a look at the trace plots for the Bayesian Poisson regression model to make sure they look OK. <<
That concludes the rstanarm exercises. You've barely scratched the surface of rstanarm, and there are many more models you can fit with it (stan_lmer, stan_glmer, stan_polr, etc.). But all of them are used in a similar way to stan_lm and stan_glm.
prophet: a time series model from Facebook

While rstanarm implements a large number of different models, prophet is focused on one case only: modeling one-dimensional time series data. Here is the time series you'll work with in these exercises, the daily number of reported crimes in Chicago from 2009-06-20 to 2019-06-10 (source):
prophet accepts a data frame with one row per day and the following columns: ds, a column of dates, and y, a column with the outcome variable. If y is a count, it might also make sense to log transform it before giving it to prophet.
>> Read in the chicago_daily_crime_2009-2019.csv dataset, rename the columns to ds and y, and log10 transform y. plot the resulting data frame. <<
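A sketch, assuming the csv has a date column followed by a daily count column:

chicago_daily_crime <- read_csv("chicago_daily_crime_2009-2019.csv")
colnames(chicago_daily_crime) <- c("ds", "y")  # the column names prophet expects
chicago_daily_crime$y <- log10(chicago_daily_crime$y)
plot(chicago_daily_crime)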
prophet has a great tutorial, and your task is now to follow the R tutorial, from reading in the data to plotting the components with prophet_plot_components, but using the chicago_daily_crime dataset instead.
>> Do the steps in the R tutorial found here, but using the chicago_daily_crime dataset. <<
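In outline, the tutorial steps applied to this dataset look something like this:

m <- prophet(chicago_daily_crime)
future <- make_future_dataframe(m, periods = 365)
forecast <- predict(m, future)
plot(m, forecast)
prophet_plot_components(m, forecast)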
Looking at the scatter plot with the superimposed (blue) model predictions, we see that there are many data points far above and far below the predicted values. Part of this might be explained by crime reporting being impacted by certain holidays. Fortunately, prophet supports adding the impact of special holidays to the model.
>> Follow the tutorial on adding US holidays to your prophet model. <<
Tip: In the tutorial, the way the model is initialized is a bit confusing. As you are just adding "standard" US holidays, initialize the model without the holidays argument:
m <- prophet()
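With that in place, a sketch of the full holiday flow, using add_country_holidays:

m <- prophet()
m <- add_country_holidays(m, country_name = "US")
m <- fit.prophet(m, chicago_daily_crime)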
>> Again, like earlier, make a forecast and plot it using plot and prophet_plot_components. Did adding the holidays have any impact? <<
As you have 10 years of data, the plots become a bit crowded and it’s hard to see what’s going on.
>> Use the dyplot.prophet function to display an interactive plot. Zoom in on the data and see if you can identify some of the holidays with the larger "outliers" in reported crimes. <<
prophet: Predicting the number of reported crimes in Chicago today

As the name of the package might give away, one of prophet's main use cases is predicting the future. Let's predict the number of reported crimes in Chicago today.

The predictive_samples function can be used to produce samples from the posterior predictive distribution. Given a fitted prophet model and a one-column data frame with dates (as produced by make_future_dataframe), it outputs a list with two matrices, trend and yhat. Here yhat is a matrix with the posterior predictive samples, with one date per row and each column being a sample from the posterior predictive.
>> Use make_future_dataframe to create a data frame with dates for forecasting one year ahead. Then use predictive_samples to sample from the posterior predictive. Extract the resulting yhat matrix and assign it to pred_post. <<
Tip: When using make_future_dataframe, set include_history = FALSE to only include future dates and exclude the 10-year history.
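Putting those pieces together:

future <- make_future_dataframe(m, periods = 365, include_history = FALSE)
pred_post <- predictive_samples(m, future)$yhat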
The pred_post matrix is now a large matrix where each row represents a probability distribution over how many crimes will be reported that day. Unfortunately, all the numbers in pred_post are on the log10 scale. It would be more useful if the numbers in pred_post represented counts of reported crimes.
>> Transform pred_post from the log10 scale back to the original (un-logged) scale. <<
Tip: log10(1000) == 3 and 10^3 == 1000. R is vectorized, so applying a transformation to all elements of a matrix is a short one-liner.
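That is, the whole transformation is just:

pred_post <- 10^pred_post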
>> Take a look at some of the columns of the now un-logged pred_post by trying to plot a couple. What do the columns represent? <<
Tip: ggplot2 is great, but if you just want to plot a couple of columns from a matrix x on top of each other, you could do something like this:
plot(x[,1], type = "l", col = "red")
lines(x[, 2], col = "green")
lines(x[, 3], col = "blue")
Finally, let’s look at the prediction for today’s number of reported crimes in Chicago.
>> Figure out which row in pred_post predicts today's number of reported crimes. Extract this row and take a look at it using hist. Finally, calculate the probability that there will be 750 or more reported crimes in Chicago today. <<
Tip: As pred_post does not include actual dates, you'll have to cross-reference between future$ds and pred_post. For example, the following would get you the row number of "2019-07-01":
which(as.Date(future$ds) == "2019-07-01")
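Building on that, a sketch for today's prediction:

row_today <- which(as.Date(future$ds) == Sys.Date())
hist(pred_post[row_today, ])
mean(pred_post[row_today, ] >= 750)  # P(750 or more reported crimes today)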
That concludes the prophet exercises. prophet is a fairly flexible package, and there are many more options you can use to customize the model to better fit a time series.
CausalImpact

Google's CausalImpact is even more focused than prophet. It deals only with the case where there's a change in the world that you believe had an impact on a time series, and you want to estimate how this time series would have turned out if no change had happened. For example: what impact did the redesign of our site have on the number of purchases? Or, how did the emissions scandal of 2015 (a.k.a. Dieselgate) impact Volkswagen's stock price?

Let's take a look at the latter example. Here's Volkswagen's stock value over time (+ the stocks of BMW and Allianz):

Download stock_prices.csv

This exercise is based on the well-written case study on the Volkswagen scandal by Francesco Rinaldi: https://rinaldif.github.io/causal-impact/
>> Load in the CausalImpact package and read the csv data into the variable stock_prices. Have a look at the data. <<
While CausalImpact can work with data frames, it will produce better plots if you give it a zoo time series. You can transform a data frame into a zoo series using the read.zoo function. Another benefit is that you can plot zoo series using ggplot2's autoplot function.
>> Convert stock_prices to a zoo time series and plot it using autoplot. <<
Tip: Set facet = NULL when calling autoplot to view all the time series in the same plot.
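A sketch of this step, assuming the first column of the csv holds the dates:

library(CausalImpact)
library(zoo)  # for read.zoo and the zoo time series class
stock_prices <- read_csv("stock_prices.csv")
stock_zoo <- read.zoo(as.data.frame(stock_prices))  # first column becomes the time index
autoplot(stock_zoo, facet = NULL)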
The emissions scandal broke on the 20th of September 2015 (that's when Volkswagen's stock drops like a stone in the plot). For CausalImpact to do its job, it needs to know the pre.period, a period with reliable data before the event, and the post.period, the period after the event for which we want to estimate the impact. Here are reasonable periods:
pre.period <- as.Date(c("2011-01-03", "2015-09-14"))
post.period <- as.Date(c("2015-09-21", "2017-03-19"))
Here's how to fit a CausalImpact model:
impact <- CausalImpact(data, pre.period, post.period)
>> Select the "volkswagen" column in stock_prices and fit a CausalImpact model to this data. Plot the resulting object using plot. <<
Tip: You can subset a zoo time series just like a data frame. Here's how you would select the column "my_column" from my_zoo_series:
my_zoo_series[, "my_column"]
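So, for the Volkswagen-only model (the column name "volkswagen" is taken from the exercise text):

impact <- CausalImpact(stock_zoo[, "volkswagen"], pre.period, post.period)
plot(impact)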
The blue dashed line in the plot shows CausalImpact's prediction for what would have happened to the stock price if the emissions scandal hadn't happened. But it's not a very good prediction: the stock price is just predicted to stay constant, and the uncertainty (the light blue shade) is huge! This is because we haven't given CausalImpact much data to work with. One of the features of CausalImpact is that it can use parallel time series to improve the prediction of the main time series.
>> Fit the CausalImpact model again, but this time give it all the data that's in stock_prices. That is, also the columns "BMW" (another car manufacturer) and "Allianz" (another large German company). plot the result again. <<
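A sketch, assuming the columns are named as in the exercise (CausalImpact treats the first column as the response and the remaining columns as covariates):

impact_all <- CausalImpact(stock_zoo[, c("volkswagen", "BMW", "Allianz")],
                           pre.period, post.period)
plot(impact_all)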
The prediction in this new plot looks more reasonable, and it is also less uncertain. If you look at the panel labeled "pointwise" you see the difference between the prediction and what actually happened. By using the summary function on the CausalImpact object you can also get a summary of the result.
>> Looking at the plot and a summary of the result, what was the estimated impact on the stock price of Volkswagen at the beginning of 2017? <<
CausalImpact has a cool feature: by passing output = "report" to the summary function, it will generate an automated report explaining the result.
>> Try it out! <<
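That is:

summary(impact_all, output = "report")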
That concludes the CausalImpact exercises. CausalImpact is a fairly specific package, with one specific use case. But it's good at what it does! :)