Code
library(tidyverse)
library(brms)
library(tidybayes)
<- read_csv("Howell1.csv")
data <- filter(data, age >= 18)
data
theme_set(theme_minimal())
<- "#1E466E"
color <- c("#93CFDB", "#1E466E") colors
brms
to model the difference between two groups.
April 24, 2023
In this blog post I will cover how to use brms
to analyze the difference between two groups. Interestingly, this might seem like a very simple analysis, but there are actually multiple ways to go about this. I’ll try to cover a few here.
The data we’ll use is the same as in the previous posts. This data contains the sex of the participant in a column called male
, in addition to their height, weight, and age. This means we can investigate, say, whether there’s a difference in height between men and women.
Run the following setup code if you want to follow along.
Probably the most common method to analyze the difference between two groups is to regress the outcome on dummy coded data. That means the group column contains the information about both groups, usually using a 0 and 1 to indicate which group is which. The male
column in the current data frame is dummy coded, with a 0 for females and a 1 for males. The R formula for regressing height on male is height ~ male
. Let’s see what priors we need to set for this by using the get_prior()
function.
prior class coef group resp dpar nlpar lb ub
(flat) b
(flat) b male
student_t(3, 154.3, 8.5) Intercept
student_t(3, 0, 8.5) sigma 0
source
default
(vectorized)
default
default
The output shows we need to set a prior on sigma, the Intercept, and on the male coefficient. The intercept refers to the heights of female participants and the male coefficient refers to the difference between males and females, at least that’s what you would expect in standard regression. We’ll see that things are a bit more complicated. Sigma refers, as always, to the standard deviation of the residuals.
Writing down this model shows that it’s actually the same as the simple regression model:
\[\displaylines{heights_i ∼ Normal(\mu_i, \sigma) \\ \mu_i = \alpha + \beta x_i}\]
Despite this similarity, there’s an issue here that we need to get into. Let’s demonstrate this issue by setting some priors and running the model while only sampling from the priors. This means the estimates we get for the intercept and male coefficient should match our priors.
model_dummy_priors_default <- brm(
height ~ male,
data = data,
family = gaussian,
prior = c(
prior(normal(165, 5), class = "Intercept"),
prior(normal(10, 5), coef = "male"),
prior(cauchy(5, 5), class = "sigma")
),
sample_prior = "only",
cores = 4,
seed = 4,
file = "./models/model-dummy-priors-default.rds"
)
model_dummy_priors_default
Family: gaussian
Links: mu = identity; sigma = identity
Formula: height ~ male
Data: data (Number of observations: 352)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 160.32 5.63 148.76 171.25 1.00 3054 2599
male 9.92 5.12 -0.25 19.93 1.00 2832 2696
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 22.69 187.60 0.62 90.80 1.00 3302 1765
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Looking at the estimates we see that the estimate for male is indeed around 10 but the intercept estimate is not close to what we set it too. In this case, the intercept has an estimate of around 160, even though we set it to 165. How can this be? The reason is that brms
internally centers the intercept so that it corresponds to the expected response when all predictors are held at their means. This is what the prior is set to, not to the mean of the heights of female participants. brms
also re-generates the true intercept, which is the one we see in the output, so the intercept in the output does correctly indicate the mean height of female participants. Confusing.
To prevent this from happening, we need to suppress the default intercept and explicitly add one as a coefficient. This means we need to update our prior for the intercept to refer to a coef
(instead of class
).
model_dummy_priors <- brm(
height ~ 0 + Intercept + male,
data = data,
family = gaussian,
prior = c(
prior(normal(165, 5), coef = "Intercept"),
prior(normal(10, 5), coef = "male"),
prior(cauchy(5, 5), class = "sigma")
),
sample_prior = "only",
cores = 4,
seed = 4,
file = "./models/model-dummy-priors.rds"
)
model_dummy_priors
Family: gaussian
Links: mu = identity; sigma = identity
Formula: height ~ 0 + Intercept + male
Data: data (Number of observations: 352)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 165.04 5.12 154.88 175.06 1.00 2924 2468
male 10.11 5.01 0.10 19.82 1.00 3452 2556
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 17.96 96.44 0.55 83.00 1.00 3111 2025
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The estimates (i.e., our priors) show that the average height for women is 165.04 and that for men is 165.04 + 10.11 = 175.15.
But now there’s a new problem. Let’s take a look at how uncertain we should be at these estimates by predicting the means for both men and women and then calculating the width of the 95% quartile interval.
# A tibble: 2 × 9
male .row .epred .lower .upper .width .point .interval width
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 0 1 165. 155. 175. 0.95 median qi 20.2
2 1 2 175. 161. 189. 0.95 median qi 28.6
We see that the width for males is slightly wider than that for females. If you run the model multiple times at different seeds, you’ll see that this is something that happens consistently. The reason there is more uncertainty for the mean of male participants is that we’ve added a separate prior that only applies to males—the prior on the male coefficient. The mean of female participants is wholly determined by the data and the prior on the intercept, but for males it’s the data and the prior for the intercept + the prior for the male coefficient. This is not desirable.
On top of that it less intuitive to think about the priors as a mean for females and then a prior for how much males deviate from this mean. It seems nicer to just specify what we think the heights are for females and what the heights are for males, without thinking about the deviation.
Both of these issues can be solved by using index coding instead of dummy coding.
Index coding requires that we create a new column that is a factor or that contains string values to indicate the sex of the participant. Below I create a new column called sex
that contains the values male
and female
as string values.
Let’s see what the priors are that we need to set.
prior class coef group resp dpar nlpar lb ub source
(flat) b default
(flat) b sexfemale (vectorized)
(flat) b sexmale (vectorized)
student_t(3, 0, 8.5) sigma 0 default
We need a prior for sigma, as always, and two priors for the two sexes.
We should also alter how we describe this model. We can now drop the \(\beta\) parameter from the model and we should more explicitly indicate that we will model several \(\alpha\) parameters, one for each sex.
\[\displaylines{heights_i ∼ Normal(\mu_i, \sigma) \\ \mu_i = \alpha_{sex[i]}}\]
Next, I simply translated the previous priors to this new notation. The prior for the heights of female participants stays the same (a normal distribution with a mean of 165 and a standard deviation of 5). The prior for the heights of male participants is also still a normal distribution, but now with a mean of 175 (165 + 10) and the same standard deviation.
model_index_prior <- brm(
height ~ 0 + sex,
data = data,
family = gaussian,
prior = c(
prior(normal(165, 5), coef = "sexfemale"),
prior(normal(175, 5), coef = "sexmale"),
prior(cauchy(5, 5), class = "sigma")
),
sample_prior = "only",
cores = 4,
seed = 4,
file = "./models/model-index-prior.rds"
)
model_index_prior
Family: gaussian
Links: mu = identity; sigma = identity
Formula: height ~ 0 + sex
Data: data (Number of observations: 352)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sexfemale 165.01 4.99 155.25 174.56 1.00 3013 2701
sexmale 175.11 5.05 165.41 185.14 1.00 2823 2546
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 57.54 1744.14 0.56 96.43 1.00 3387 2055
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Let’s confirm that the uncertainty in the priors for the male and female average heights are the same.
# A tibble: 2 × 9
sex .row .epred .lower .upper .width .point .interval width
<chr> <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 female 2 165. 155. 175. 0.95 median qi 19.3
2 male 1 175. 165. 185. 0.95 median qi 19.7
They are. This means we can now run the model and also sample from the posterior.
Family: gaussian
Links: mu = identity; sigma = identity
Formula: height ~ 0 + sex
Data: data (Number of observations: 352)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sexfemale 149.62 0.41 148.81 150.40 1.00 4239 2995
sexmale 160.47 0.43 159.63 161.31 1.00 4245 2663
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 5.56 0.21 5.16 5.99 1.00 3837 2992
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The estimates show, more clearly now, that the average height for women is 149.62 and that for men is 160.47.
We can plot the average heights per group.
In the previous section I’ve argued that we should use index coding so that we can more easily think about, and see the results, of the priors about the two groups. I realize, though, that we are often still interested in the difference between the groups. This is easy to obtain, though. We can simply take the posterior samples from each group and subtract them from each other.
In the code below I extract the draws of each parameter (b_sexfemale
and b_sexmale
) and calculate the difference score. I then extract the draws of the difference score and calculate the median and quartile interval.
y ymin ymax .width .point .interval
1 10.84426 9.729488 12.02288 0.95 median qi
Alternatively, we can also plot it as a distribution.
Just like running a correlation, testing a group difference consists of running a simple regression. However, having groups as a predictor means the regression is not so simple after all. You have to think more carefully about what the priors mean (particularly the intercept) and you have to deal with greater uncertainty for some estimates, depending on how you code the group predictor. I’ve shown that explicitly including the intercept and using index coding makes thinking about this scenario a bit easier.
---
title: "Bayesian tutorial: Two groups"
description: "The fourth of a series of tutorial posts on Bayesian analyses. In this post I focus on using `brms` to model the difference between two groups."
date: 2023-04-24
categories:
- statistics
- tutorial
- Bayesian statistics
- regression
code-fold: true
code-tools: true
toc: true
---
In this blog post I will cover how to use `brms` to analyze the difference between two groups. Interestingly, this might seem like a very simple analysis, but there are actually multiple ways to go about this. I'll try to cover a few here.
The data we'll use is the same as in the previous posts. This data contains the sex of the participant in a column called `male`, in addition to their height, weight, and age. This means we can investigate, say, whether there's a difference in height between men and women.
Run the following setup code if you want to follow along.
```{r}
#| label: setup
#| message: false
library(tidyverse)
library(brms)
library(tidybayes)
data <- read_csv("Howell1.csv")
data <- filter(data, age >= 18)
theme_set(theme_minimal())
color <- "#1E466E"
colors <- c("#93CFDB", "#1E466E")
```
## Dummy coding
Probably the most common method to analyze the difference between two groups is to regress the outcome on dummy coded data. That means the group column contains the information about both groups, usually using a 0 and 1 to indicate which group is which. The `male` column in the current data frame is dummy coded, with a 0 for females and a 1 for males. The R formula for regressing height on male is `height ~ male`. Let's see what priors we need to set for this by using the `get_prior()` function.
```{r}
#| label: get-prior-dummy
get_prior(height ~ male, data = data)
```
The output shows we need to set a prior on sigma, the Intercept, and on the male coefficient. The intercept refers to the heights of female participants and the male coefficient refers to the difference between males and females, at least that's what you would expect in standard regression. We'll see that things are a bit more complicated. Sigma refers, as always, to the standard deviation of the residuals.
Writing down this model shows that it's actually the same as the simple regression model:
$$\displaylines{heights_i ∼ Normal(\mu_i, \sigma) \\ \mu_i = \alpha + \beta x_i}$$
Despite this similarity, there's an issue here that we need to get into. Let's demonstrate this issue by setting some priors and running the model while only sampling from the priors. This means the estimates we get for the intercept and male coefficient should match our priors.
```{r}
#| label: model-dummy-priors-default
model_dummy_priors_default <- brm(
height ~ male,
data = data,
family = gaussian,
prior = c(
prior(normal(165, 5), class = "Intercept"),
prior(normal(10, 5), coef = "male"),
prior(cauchy(5, 5), class = "sigma")
),
sample_prior = "only",
cores = 4,
seed = 4,
file = "./models/model-dummy-priors-default.rds"
)
model_dummy_priors_default
```
Looking at the estimates we see that the estimate for male is indeed around 10 but the intercept estimate is not close to what we set it too. In this case, the intercept has an estimate of around 160, even though we set it to 165. How can this be? The reason is that `brms` internally centers the intercept so that it corresponds to the expected response when all predictors are held at their means. This is what the prior is set to, not to the mean of the heights of female participants. `brms` also re-generates the true intercept, which is the one we see in the output, so the intercept in the output does correctly indicate the mean height of female participants. Confusing.
To prevent this from happening, we need to suppress the default intercept and explicitly add one as a coefficient. This means we need to update our prior for the intercept to refer to a `coef` (instead of `class`).
```{r}
#| label: model-dummy-priors
model_dummy_priors <- brm(
height ~ 0 + Intercept + male,
data = data,
family = gaussian,
prior = c(
prior(normal(165, 5), coef = "Intercept"),
prior(normal(10, 5), coef = "male"),
prior(cauchy(5, 5), class = "sigma")
),
sample_prior = "only",
cores = 4,
seed = 4,
file = "./models/model-dummy-priors.rds"
)
model_dummy_priors
```
The estimates (i.e., our priors) show that the average height for women is `r round(fixef(model_dummy_priors)[1, 1], 2)` and that for men is `r round(fixef(model_dummy_priors)[1, 1], 2)` + `r round(fixef(model_dummy_priors)[2, 1], 2)` = `r round(fixef(model_dummy_priors)[1, 1] + fixef(model_dummy_priors)[2, 1], 2)`.
But now there's a new problem. Let's take a look at how uncertain we should be at these estimates by predicting the means for both men and women and then calculating the width of the 95% quartile interval.
```{r}
#| label: model-dummy-issue
tibble(male = c(0, 1)) %>%
add_epred_draws(model_dummy_priors) %>%
median_qi() %>%
mutate(width = .upper - .lower)
```
We see that the width for males is slightly wider than that for females. If you run the model multiple times at different seeds, you'll see that this is something that happens consistently. The reason there is more uncertainty for the mean of male participants is that we've added a separate prior that only applies to males---the prior on the male coefficient. The mean of female participants is wholly determined by the data and the prior on the intercept, but for males it's the data and the prior for the intercept + the prior for the male coefficient. This is not desirable.
On top of that it less intuitive to think about the priors as a mean for females and then a prior for how much males deviate from this mean. It seems nicer to just specify what we think the heights are for females and what the heights are for males, without thinking about the deviation.
Both of these issues can be solved by using index coding instead of dummy coding.
## Index coding
Index coding requires that we create a new column that is a factor or that contains string values to indicate the sex of the participant. Below I create a new column called `sex` that contains the values `male` and `female` as string values.
```{r}
#| label: data-preparation
data <- mutate(data, sex = if_else(male == 1, "male", "female"))
```
Let's see what the priors are that we need to set.
```{r}
#| label: get-prior
get_prior(height ~ 0 + sex, data = data)
```
We need a prior for sigma, as always, and two priors for the two sexes.
We should also alter how we describe this model. We can now drop the $\beta$ parameter from the model and we should more explicitly indicate that we will model several $\alpha$ parameters, one for each sex.
$$\displaylines{heights_i ∼ Normal(\mu_i, \sigma) \\ \mu_i = \alpha_{sex[i]}}$$
Next, I simply translated the previous priors to this new notation. The prior for the heights of female participants stays the same (a normal distribution with a mean of 165 and a standard deviation of 5). The prior for the heights of male participants is also still a normal distribution, but now with a mean of 175 (165 + 10) and the same standard deviation.
```{r}
#| label: model-index-prior
model_index_prior <- brm(
height ~ 0 + sex,
data = data,
family = gaussian,
prior = c(
prior(normal(165, 5), coef = "sexfemale"),
prior(normal(175, 5), coef = "sexmale"),
prior(cauchy(5, 5), class = "sigma")
),
sample_prior = "only",
cores = 4,
seed = 4,
file = "./models/model-index-prior.rds"
)
model_index_prior
```
Let's confirm that the uncertainty in the priors for the male and female average heights are the same.
```{r}
#| label: model-dummy-solution
tibble(sex = c("male", "female")) %>%
add_epred_draws(model_index_prior) %>%
median_qi() %>%
mutate(width = .upper - .lower)
```
They are. This means we can now run the model and also sample from the posterior.
```{r}
#| label: model-index
model_index <- brm(
height ~ 0 + sex,
data = data,
family = gaussian,
prior = c(
prior(normal(165, 5), coef = "sexfemale"),
prior(normal(175, 5), coef = "sexmale"),
prior(cauchy(5, 5), class = "sigma")
),
sample_prior = TRUE,
cores = 4,
seed = 4,
file = "./models/model-index.rds"
)
model_index
```
The estimates show, more clearly now, that the average height for women is `r round(fixef(model_index)[1, 1], 2)` and that for men is `r round(fixef(model_index)[2, 1], 2)`.
We can plot the average heights per group.
```{r}
#| label: groups-plot
library(marginaleffects)
draws <- model_index %>%
predictions(newdata = tibble(sex = c("male", "female"))) %>%
posterior_draws()
```
## Calculating a difference score
In the previous section I've argued that we should use index coding so that we can more easily think about, and see the results, of the priors about the two groups. I realize, though, that we are often still interested in the difference between the groups. This is easy to obtain, though. We can simply take the posterior samples from each group and subtract them from each other.
In the code below I extract the draws of each parameter (`b_sexfemale` and `b_sexmale`) and calculate the difference score. I then extract the draws of the difference score and calculate the median and quartile interval.
```{r}
#| label: difference-score
model_index %>%
spread_draws(b_sexfemale, b_sexmale) %>%
mutate(difference = b_sexmale - b_sexfemale) %>%
pull(difference) %>%
median_qi()
```
Alternatively, we can also plot it as a distribution.
```{r}
#| label: difference-score-plot
#| warning: false
draws <- model_index %>%
spread_draws(b_sexfemale, b_sexmale) %>%
mutate(difference = b_sexmale - b_sexfemale)
ggplot(draws, aes(x = difference)) +
geom_histogram(fill = color) +
labs(x = "Difference", y = "")
```
## Summary
Just like running a correlation, testing a group difference consists of running a simple regression. However, having groups as a predictor means the regression is not so simple after all. You have to think more carefully about what the priors mean (particularly the intercept) and you have to deal with greater uncertainty for some estimates, depending on how you code the group predictor. I've shown that explicitly including the intercept and using index coding makes thinking about this scenario a bit easier.