In these exercises, you'll get to try out three packages that implement fairly advanced statistical models and use Bayes under the hood: rstanarm, Facebook's prophet, and Google's CausalImpact. In addition to these packages, you'll also need the tidyverse package, to use ggplot2 for plotting and dplyr for data manipulation. If you don't have experience with these packages, don't sweat it! You can peek at the solutions as much as you want. :)
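If you want to follow along in your own R session, loading the packages could look like this (assuming they are already installed):

# Load the packages used throughout these exercises
library(tidyverse)  # ggplot2, dplyr, readr, etc.
library(rstanarm)
library(prophet)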
rstanarm

rstanarm is a complete Bayesian replacement for many of the regression modeling functions that come with R. Instead of lm you have stan_lm, instead of glm you have stan_glm, etc. Two benefits of rstanarm are that you can supply custom priors and that you get probability distributions over all parameters and predictions instead of just point estimates.
To try out rstanarm you'll use a dataset of daily ice cream sales (ice_creams) as a function of the outdoor temperature in °C (daily_temperature). You can download the data here:

>> Download the ice cream sales data, load it in (say with read_csv), and plot it. <<
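One possible sketch of a solution (the file name ice_cream_sales.csv is just a guess; use whatever the downloaded file is called):

ice_cream_sales <- read_csv("ice_cream_sales.csv")  # hypothetical file name
ggplot(ice_cream_sales, aes(daily_temperature, ice_creams)) +
  geom_point()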
The ice_cream_sales data looks like it would be well modeled by linear regression. Perhaps we want to predict future ice cream sales based on weather reports. Who knows?

>> Fit a classical linear regression model to the ice_cream_sales data using the lm function, with ice_creams as the outcome and daily_temperature as the predictor. Summarise the resulting fit using summary. <<

If you don't have much experience with regression modeling in R, or with lm, just go ahead and peek at the solution.
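A minimal sketch of the classical fit:

fit <- lm(ice_creams ~ daily_temperature, data = ice_cream_sales)
summary(fit)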
rstanarm makes it easy to run Bayesian versions of regression models. Just prepend stan_ to the lm function name and you're ready to go! Almost… stan_lm requires you to set an explicit prior guess for the mean \(R^2\), the variance explained by the predictors. A decent default would be the following:
stan_fit <- stan_lm(y ~ x, data = my_data, prior = R2(0.5, "mean"))
To get a summary and histograms of the posterior you can then do the following:
summary(stan_fit)
plot(stan_fit, plotfun = "hist")
>> Use stan_lm to fit a Bayesian linear regression to the ice_cream_sales data. Plot histograms of the posterior and do a summary of the fit. <<
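Using the default prior guess from above, the Bayesian version could look like this:

stan_fit <- stan_lm(ice_creams ~ daily_temperature, data = ice_cream_sales,
                    prior = R2(0.5, "mean"))
summary(stan_fit)
plot(stan_fit, plotfun = "hist")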
If you compare the summary of lm and stan_lm you'll find that the estimates of the intercept ((Intercept)) and the slope (daily_temperature) are fairly similar. But in the Bayesian case, you get uncertainty estimates for the standard deviation (sigma) and \(R^2\) (R2) too! And as with all Bayesian models, you can also get out a sample that represents the posterior. With rstanarm, you can do that by converting the stan_lm fit to a data frame (using, for example, as_data_frame).
>> Extract the posterior samples from the stan_lm fit and take a look at them with head and plot. <<
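A sketch of one way to do this (base R's as.data.frame also works directly on rstanarm fits):

posterior <- as.data.frame(stan_fit)
head(posterior)
plot(posterior)  # a pairs plot showing the joint posterior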
Looking at the joint posterior probability distribution produced by plot, we see that the intercept and the slope (daily_temperature) are correlated: a high slope means it's probable that the intercept is lower, and vice versa. This can be seen if we visualize the uncertainty in the regression line by plotting a subset of the regression lines defined by the posterior. This can be done in ggplot2 like this (assuming you put the posterior samples in posterior):
posterior_subset <- sample_n(posterior, size = 20)
ggplot(ice_cream_sales, aes(daily_temperature, ice_creams)) +
geom_point(color = "forestgreen") +
geom_abline(aes(intercept = `(Intercept)`, slope = daily_temperature), data = posterior_subset, alpha = 0.5)
>> Try it out! <<
rstanarm: Poisson regression
So far you've been analyzing the ice cream data using stan_lm, which assumes the data is normally distributed around the regression line. However, since the data is a daily count of ice creams, it might be more suitable to use Poisson regression. In base R you can fit a Poisson regression using the generalized linear model function glm, like this:
fit <- glm(ice_creams ~ daily_temperature, data = ice_cream_sales, family = "poisson")
summary(fit)
In rstanarm, you instead have stan_glm, which here should work as a direct replacement (unlike stan_lm, it doesn't require you to explicitly define any priors).
>> Run a Bayesian Poisson regression using stan_glm, make a summary of the result, and plot posterior histograms (by setting the plotfun = "hist" argument). <<
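A sketch, reusing the formula from the glm call above:

poisson_fit <- stan_glm(ice_creams ~ daily_temperature,
                        data = ice_cream_sales, family = "poisson")
summary(poisson_fit)
plot(poisson_fit, plotfun = "hist")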
If you look at the posterior for the slope parameter (daily_temperature in the plot) you'll see that ice cream sales most probably increase by between 0.04 and 0.06 per degree C. But confusingly, the unit here is the log number of ice creams, which is a bit hard to interpret.
But what about how many ice creams we would sell today? Using the posterior_predict function you can get the posterior predictive distribution. The arguments you need to give posterior_predict are object, a fitted rstanarm object, and newdata, a data frame with the predictors you want to generate predictions with. For example:
# Assuming the model was trained with a predictor named x
posterior_prediction <- posterior_predict(stan_fit, newdata = data.frame(x = 42))
>> Given the regression model you just fitted, generate a sample from the posterior predictive distribution given the current outdoor temperature. Store it in the variable ice_creams_demand and plot it using hist. <<
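For example, assuming today's outdoor temperature is 25 °C (a made-up number; plug in the actual temperature):

ice_creams_demand <- posterior_predict(poisson_fit,
                                       newdata = data.frame(daily_temperature = 25))
hist(ice_creams_demand)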
For temperature data, I recommend the Norwegian Meteorological Institute’s weather service.
If you look at the posterior predictive probability distribution, you can see the most likely number of ice creams you're going to sell (that's where the distribution peaks) and you get a sense of how uncertain this estimate is (how widely the posterior is spread out). But how much money should we expect to make if we brought, say, 75 ice creams with us and tried to sell them today? Let's calculate that! First, some required background info:
>> For each sample in the posterior predictive ice_creams_demand, calculate how much money you would make. The resulting vector can be interpreted as a probability distribution. Take the mean of this vector to get the expected profit when bringing 75 ice creams with you. <<
Tip: As R is vectorized, all of this can be done in a single line using multiplication, addition, mean, and the pmin function.
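A sketch of such a one-liner, with purely hypothetical prices (say each ice cream sells for $2 and each one you bring costs you $0.5; check the solutions for the actual numbers used):

# Hypothetical prices: $2 revenue per ice cream sold, $0.5 cost per ice cream brought
mean(pmin(ice_creams_demand, 75) * 2 - 75 * 0.5)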
But how many ice creams should we bring if we were to go out selling ice cream today?
>> Calculate the expected profit you'll make if you bring anything from 1 to 200 ice creams. plot the expected profit as a function of the number of ice creams brought, and calculate the most profitable number of ice creams to bring. <<
Tip: This can be done with a for loop, or more functionally using sapply or purrr::map_dbl. which.max can be used to figure out the optimal number of ice creams to bring.
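A sketch using sapply, with the same hypothetical prices as above:

expected_profit <- sapply(1:200, function(n) {
  mean(pmin(ice_creams_demand, n) * 2 - n * 0.5)  # hypothetical prices
})
plot(1:200, expected_profit, type = "l")
which.max(expected_profit)  # the most profitable number to bring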
Congrats! :) You've just performed a fairly sophisticated decision analysis on how many ice creams to bring. The validity of this analysis hinges on how well the model reflects reality, and checking that is out of scope here. However, we can at least check whether the model was fitted correctly by rstanarm. There are many things that can go wrong when fitting a Bayesian model, but one quick sanity check is to look at the trace plots of the MCMC chains that produced the posterior samples. Ideally, these should look stationary and without discernible patterns. Generally, you want your trace plots to look like "hairy caterpillars".
You can produce trace plots from a fitted rstanarm model like this:
plot(stan_fit, plotfun = "trace")
Or you can launch an interactive dashboard called ShinyStan that also lets you view the trace plots (+ a lot more!):
launch_shinystan(stan_fit)
>> Take a look at the trace plots for the Bayesian Poisson regression model to make sure they look OK. <<
That concludes the rstanarm exercises. You've barely scratched the surface of rstanarm, and there are many more models you can fit with it (stan_lmer, stan_glmer, stan_polr, etc.). But all of them are used in a similar way to stan_lm and stan_glm.
prophet: a time series model from Facebook

While rstanarm implements a large number of different models, prophet is focused on one case only: modeling one-dimensional time series data. Here is the time series you'll work with in these exercises, the daily number of reported crimes in Chicago from 2009-06-20 to 2019-06-10 (source):
prophet accepts a data frame with one row per day and the following columns: ds, a column of dates, and y, a column with the outcome variable. If y is a count, it might also make sense to log transform it before giving it to prophet.
>> Read in the chicago_daily_crime_2009-2019.csv dataset, rename the columns to ds and y, and log10 transform y. plot the resulting data frame. <<
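A sketch, assuming the csv has a date column followed by a daily count column:

chicago_daily_crime <- read_csv("chicago_daily_crime_2009-2019.csv")
colnames(chicago_daily_crime) <- c("ds", "y")  # the column names prophet expects
chicago_daily_crime$y <- log10(chicago_daily_crime$y)
plot(chicago_daily_crime)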
prophet has a great tutorial, and your task is now to follow the R tutorial, from reading in the data to plotting the components with prophet_plot_components, but using the chicago_daily_crime dataset instead.
>> Do the steps in the R tutorial found here, but using the chicago_daily_crime dataset. <<
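In outline, the tutorial steps applied to this dataset look something like this:

m <- prophet(chicago_daily_crime)
future <- make_future_dataframe(m, periods = 365)
forecast <- predict(m, future)
plot(m, forecast)
prophet_plot_components(m, forecast)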
Looking at the scatter plot with the superimposed (blue) model predictions, we see that there are many data points far above and far below the predicted values. Part of this might be explained by crime reporting being impacted by certain holidays. Fortunately, prophet supports adding the impact of special holidays to the model.
>> Follow the tutorial on adding US holidays to your prophet model. <<
Tip: In the tutorial, the way the model is initialized is a bit confusing. As you are just adding "standard" US holidays, initialize the model without the holidays argument:
m <- prophet()
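With that in place, a sketch of the full holiday flow, using add_country_holidays:

m <- prophet()
m <- add_country_holidays(m, country_name = "US")
m <- fit.prophet(m, chicago_daily_crime)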
>> Again, like earlier, make a forecast and plot it using plot and prophet_plot_components. Did adding the holidays have any impact? <<
As you have 10 years of data, the plots become a bit crowded and it’s hard to see what’s going on.
>> Use the dyplot.prophet function to display an interactive plot. Zoom in on the data and see if you can identify some of the holidays with the larger "outliers" in reported crimes. <<
prophet: Predicting the number of reported crimes in Chicago today

As the name of the package might give away, one of prophet's main use cases is predicting the future. Let's predict the number of reported crimes in Chicago today.

The predictive_samples function can be used to produce samples from the posterior predictive distribution. Given a fitted prophet model and a one-column data frame with dates (as produced by make_future_dataframe), it outputs a list with two matrices, trend and yhat. Here yhat is a matrix with the posterior predictive samples, with one date per row and each column being a sample from the posterior predictive.
>> Use make_future_dataframe to create a data frame with dates for forecasting one year ahead. Then use predictive_samples to sample from the posterior predictive. Extract the resulting yhat matrix and assign it to pred_post. <<
Tip: When using make_future_dataframe, set include_history = FALSE to only include future dates and exclude the 10-year history.
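Putting those pieces together:

future <- make_future_dataframe(m, periods = 365, include_history = FALSE)
pred_post <- predictive_samples(m, future)$yhat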
The pred_post matrix is now a large matrix where each row represents a probability distribution over how many crimes will be reported that day. Unfortunately, all the numbers in pred_post are on the log10 scale. It would be more useful if the numbers in pred_post represented counts of reported crimes.
>> Transform pred_post from the log10 scale back to the original (un-logged) scale. <<
Tip: log10(1000) == 3 and 10^3 == 1000. R is vectorized, so applying a transformation to all elements of a matrix is a short one-liner.
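That is, the whole transformation is just:

pred_post <- 10^pred_post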
>> Take a look at some of the columns of the now un-logged pred_post by trying to plot a couple. What do the columns represent? <<
Tip: ggplot2 is great, but if you just want to plot a couple of columns from a matrix x on top of each other, you could do something like this:
plot(x[,1], type = "l", col = "red")
lines(x[, 2], col = "green")
lines(x[, 3], col = "blue")
Finally, let’s look at the prediction for today’s number of reported crimes in Chicago.
>> Figure out which row in pred_post predicts today's number of reported crimes. Extract this row and take a look at it using hist. Finally, calculate the probability that there will be 750 or more reported crimes in Chicago today. <<
Tip: As pred_post does not include actual dates, you'll have to cross-reference between future$ds and pred_post. For example, the following would get you the row number of "2019-07-01":
which(as.Date(future$ds) == "2019-07-01")
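Building on that, a sketch for today's prediction:

row_today <- which(as.Date(future$ds) == Sys.Date())
hist(pred_post[row_today, ])
mean(pred_post[row_today, ] >= 750)  # P(750 or more reported crimes today)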
That concludes the prophet exercises. prophet is a fairly flexible package, and there are many more options you can use to customize the model to better fit a time series.
CausalImpact

Google's CausalImpact is even more focused than prophet. It deals only with the case where there's a change in the world that you believe had an impact on a time series, and you want to estimate how this time series would have turned out if no change had happened. For example: what impact did the redesign of our site have on the number of purchases? Or, how did the emissions scandal of 2015 (a.k.a. Dieselgate) impact Volkswagen's stock price?

Let's take a look at the latter example. Here's Volkswagen's stock value over time (+ the stocks of BMW and Allianz):

Download stock_prices.csv

This exercise is based on the well-written case study on the Volkswagen scandal by Francesco Rinaldi: https://rinaldif.github.io/causal-impact/
>> Load in the CausalImpact package and read the csv data into the variable stock_prices. Have a look at the data. <<
While CausalImpact can work with data frames, it will produce better plots if you give it a zoo time series. You can transform a data frame into a zoo series using the read.zoo function. Another benefit is that you can plot zoo series using ggplot2's autoplot function.
>> Convert stock_prices to a zoo time series and plot it using autoplot. <<
Tip: Set facet = NULL when calling autoplot to view all the time series in the same plot.
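A sketch of this step, assuming the first column of the csv holds the dates:

library(CausalImpact)
library(zoo)  # for read.zoo and the zoo time series class
stock_prices <- read_csv("stock_prices.csv")
stock_zoo <- read.zoo(as.data.frame(stock_prices))  # first column becomes the time index
autoplot(stock_zoo, facet = NULL)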
The emissions scandal broke on the 20th of September 2015 (that's when Volkswagen's stock drops like a stone in the plot). For CausalImpact to do its job, it needs to know the pre.period, a period with reliable data before the event, and the post.period, the period after the event for which we want to estimate the impact. Here are reasonable periods:
pre.period <- as.Date(c("2011-01-03", "2015-09-14"))
post.period <- as.Date(c("2015-09-21", "2017-03-19"))
Here's how to fit a CausalImpact model:
impact <- CausalImpact(data, pre.period, post.period)
>> Select the "volkswagen" column in stock_prices and fit a CausalImpact model to this data. Plot the resulting object using plot. <<
Tip: You can subset a zoo time series just like a data frame. Here's how you would select the column "my_column" from my_zoo_series:
my_zoo_series[, "my_column"]
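So, for the Volkswagen-only model (the column name "volkswagen" is taken from the exercise text):

impact <- CausalImpact(stock_zoo[, "volkswagen"], pre.period, post.period)
plot(impact)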
The blue dashed line in the plot shows CausalImpact's prediction for what would have happened to the stock price if the emissions scandal hadn't happened. But it's not a very good prediction: the stock price is just predicted to stay constant, and the uncertainty (the light blue shade) is huge! This is because we haven't given CausalImpact much data to work with. One of the features of CausalImpact is that it can use parallel time series to improve the prediction of the main time series.
>> Fit the CausalImpact model again, but this time give it all the data that's in stock_prices. That is, also the columns "BMW" (another car manufacturer) and "Allianz" (another large German company). plot the result again. <<
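A sketch, assuming the columns are named as in the exercise (CausalImpact treats the first column as the response and the remaining columns as covariates):

impact_all <- CausalImpact(stock_zoo[, c("volkswagen", "BMW", "Allianz")],
                           pre.period, post.period)
plot(impact_all)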
The prediction in this new plot looks more reasonable, and it is also less uncertain. If you look at the panel labeled "pointwise" you see the difference between the prediction and what actually happened. By using the summary function on the CausalImpact object you can also get a summary of the result.
>> Looking at the plot and a summary of the result, what was the estimated impact on the stock price of Volkswagen at the beginning of 2017? <<
CausalImpact has a cool feature: by passing output = "report" to the summary function, it will generate an automated report explaining the result.
>> Try it out! <<
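That is:

summary(impact_all, output = "report")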
That concludes the CausalImpact exercises. CausalImpact is a fairly specific package, with one specific use case. But it's good at what it does! :)