Predicting values of intercept only models

statistics

prediction

A post on how to predict values of intercept-only models.

Published

March 18, 2024

I frequently need to calculate a single proportion or single mean with confidence intervals. My preferred way for getting these is to run intercept-only models, such as a logistic regression for proportions and standard regression for means. In this post I show to run these models and obtain the estimates with confidence intervals, using the same workflow from my tutorial blog posts.

Run the following setup code if you want to follow along.

Code

library(tidyverse)
library(marginaleffects)
library(brms)
library(tidybayes)
library(modelr)

options(
  mc.cores = 4,
  brms.threads = 4,
  brms.backend = "cmdstanr",
  brms.file_refit = "on_change"
)

Data

I found some survey data including a question about whether the respondent follows a vegan diet. The relevant column is called vegan and contains a 1 for ‘Yes’ and a 0 for “No”. With this data we can determine the proportion of vegans and the associated confidence interval.

Code

data <- read_csv("data.csv")
head(data)

id	date	vegan
800009	02-07-2018	0
800015	02-07-2018	0
800054	02-07-2018	0
800057	04-07-2018	0
800073	03-07-2018	0
800085	03-07-2018	0

Frequentist

My frequentist approach consists of running a logistic regression using the glm() function. The outcome variable is the vegan column and there are no predictors; only an intercept.

Code

model <- glm(vegan ~ 1, family = binomial(), data = data)

We can use the predictions() function from the marginaleffects package to obtain the proportion. By default, this function calculates the regression-adjusted predicted values for every observation in the original dataset. That’s not what we want; we want only one prediction. We can specify what we want to calculate predictors for using the newdata argument. My preferred way for specifying predictor values is using helper functions like datagrid(). With this function you can specify which predictors you want to include and for which values of each predictor you want to calculate predictions. The problem is that we don’t have any predictors, so what to specify? If we use the datagrid() function from marginaleffects, the answer is nothing.

Code

predictions(model, newdata = datagrid())

rowid	estimate	p.value	s.value	conf.low	conf.high	vegan
1	0.0120167	0	961.5991	0.0095016	0.0151874	0.0120167

This gives us the estimate of interest, as well as a 95% confidence interval.

Bayesian

Now let’s do the same thing but using a Bayesian approach, without using the marginaleffects package. Below we run a model using the brm() function from brms.

Code

model <- brm(
  vegan ~ 1,
  family = bernoulli(link = "logit"),
  data = data,
  prior = prior(student_t(5, 0, 1.5), class = "Intercept"),
  file = "models/model.rds",
  silent = 2
)

Warning: Rows containing NAs were excluded from the model.

To get predicted values, I’ll use the data_grid() fuction from the modelr package and the add_epred_draws() and median_qi() functions from the tidybayes package. The logic is to specify a data frame using data_grid() with predictor values and then add predicted values using add_epred_draws() to this data frame, which are then summarized using median_qi(). However, if we run the following code, we get an error.

Code

data_grid() |>
  add_epred_draws(model) |>
  median_qi()

Error in `data_grid()`:
! Must supply at least one of `...` and `.model`

That’s because the function data_grid() can’t be empty. Using datagrid() from the marginaleffects package also wouldn’t work.

To fix the error, we need to specify the model using the .model argument of the function.

Code

data_grid(.model = model) |>
  add_epred_draws(model) |>
  median_qi()

.row	.epred	.lower	.upper	.width	.point	.interval
1	0.0121273	0.0095111	0.0151626	0.95	median	qi

That works.

We could also create the data frame ourselves without using data_grid().

Code

tibble(.rows = 1) |>
  add_epred_draws(model) |>
  median_qi()

.row	.epred	.lower	.upper	.width	.point	.interval
1	0.0121273	0.0095111	0.0151626	0.95	median	qi

But it looks weird to me to create a data frame with 1 row and no values in it (although technically that’s what the data grid functions also do).

Summary

You can run intercept-only regression models to obtain estimates of single proportions or means. These estimates, together with their confidence intervals, can be obtained using prediction functions and telling them to predict values from empty data frames, which can be created using helper functions like datagrid() from the marginaleffects package and data_grid() from the modelr package.

This post was last updated on 2024-03-18.

--- title: "Predicting values of intercept only models" description: "A post on how to predict values of intercept-only models." date: 2024-03-18 categories: - statistics - prediction code-fold: show code-tools: true toc: true df-print: kable --- I frequently need to calculate a single proportion or single mean with confidence intervals. My preferred way for getting these is to run intercept-only models, such as a logistic regression for proportions and standard regression for means. In this post I show to run these models and obtain the estimates with confidence intervals, using the same workflow from my tutorial blog posts. Run the following setup code if you want to follow along. ```{r} #| label: setup #| message: false library(tidyverse) library(marginaleffects) library(brms) library(tidybayes) library(modelr) options( mc.cores = 4, brms.threads = 4, brms.backend = "cmdstanr", brms.file_refit = "on_change" ) ``` ## Data I found some survey [data](https://doi.org/10.17026/dans-x2z-n2bh) including a question about whether the respondent follows a vegan diet. The relevant column is called `vegan` and contains a 1 for 'Yes' and a 0 for "No". With this data we can determine the proportion of vegans and the associated confidence interval. ```{r} #| label: data #| message: false data <- read_csv("data.csv") head(data) ``` ## Frequentist My frequentist approach consists of running a logistic regression using the `glm()` function. The outcome variable is the `vegan` column and there are no predictors; only an intercept. ```{r} #| label: frequentist-model model <- glm(vegan ~ 1, family = binomial(), data = data) ``` We can use the `predictions()` function from the [marginaleffects](https://marginaleffects.com) package to obtain the proportion. By default, this function calculates the regression-adjusted predicted values for every observation in the original dataset. That's not what we want; we want only one prediction. We can specify what we want to calculate predictors for using the `newdata` argument. My preferred way for specifying predictor values is using helper functions like `datagrid()`. With this function you can specify which predictors you want to include and for which values of each predictor you want to calculate predictions. The problem is that we don't have any predictors, so what to specify? If we use the `datagrid()` function from marginaleffects, the answer is nothing. ```{r} #| label: frequentist-data-grid predictions(model, newdata = datagrid()) ``` This gives us the estimate of interest, as well as a 95% confidence interval. ## Bayesian Now let's do the same thing but using a Bayesian approach, without using the marginaleffects package. Below we run a model using the `brm()` function from [brms](https://paul-buerkner.github.io/brms/). ```{r} #| label: bayesian-model model <- brm( vegan ~ 1, family = bernoulli(link = "logit"), data = data, prior = prior(student_t(5, 0, 1.5), class = "Intercept"), file = "models/model.rds", silent = 2 ) ``` To get predicted values, I'll use the `data_grid()` fuction from the [modelr](https://modelr.tidyverse.org) package and the `add_epred_draws()` and `median_qi()` functions from the [tidybayes](http://mjskay.github.io/tidybayes/) package. The logic is to specify a data frame using `data_grid()` with predictor values and then add predicted values using `add_epred_draws()` to this data frame, which are then summarized using `median_qi()`. However, if we run the following code, we get an error. ```{r} #| label: bayesian-data-grid-wrong #| error: true data_grid() |> add_epred_draws(model) |> median_qi() ``` That's because the function `data_grid()` can't be empty. Using `datagrid()` from the marginaleffects package also wouldn't work. To fix the error, we need to specify the model using the `.model` argument of the function. ```{r} #| label: bayesian-data-grid data_grid(.model = model) |> add_epred_draws(model) |> median_qi() ``` That works. We could also create the data frame ourselves without using `data_grid()`. ```{r} #| label: bayesian-tibble tibble(.rows = 1) |> add_epred_draws(model) |> median_qi() ``` But it looks weird to me to create a data frame with 1 row and no values in it (although technically that's what the data grid functions also do). ## Summary You can run intercept-only regression models to obtain estimates of single proportions or means. These estimates, together with their confidence intervals, can be obtained using prediction functions and telling them to predict values from empty data frames, which can be created using helper functions like `datagrid()` from the marginaleffects package and `data_grid()` from the modelr package. *This post was last updated on `r format(Sys.Date(), "%Y-%m-%d")`.*