summ — An alternative to summary for regression models

When sharing analyses with colleagues unfamiliar with R, I found that the standard output generally was not clear to them. Things were even worse if I wanted to give them information that is not included in that output, like VIFs, robust standard errors, or standardized coefficients, since the functions for estimating these don’t append them to a typical regression table. After creating output tables “by hand” on multiple occasions, I thought it best to pack things into a reusable function.
With no user-specified arguments except a fitted model, the output of summ looks like this:
# Fit model
fit <- lm(Income ~ Frost + Illiteracy + Murder, data = as.data.frame(state.x77))
summ(fit)
## MODEL INFO:
## Observations: 50
## Dependent Variable: Income
##
## MODEL FIT:
## F(3,46) = 4.049, p = 0.012
## R-squared = 0.209
## Adj. R-squared = 0.157
##
## Standard errors: OLS
## Est. S.E. t val. p
## (Intercept) 5111.097 416.576 12.269 0 ***
## Frost -1.254 2.11 -0.594 0.555
## Illiteracy -610.715 213.138 -2.865 0.006 **
## Murder 23.074 30.94 0.746 0.46
Like any output, this one is somewhat opinionated: some information is shown that perhaps not everyone would be interested in, and some may be missing. That, of course, was the motivation behind the creation of the function; the author was no fan of summary and its lack of configurability.
Much of the output from summ can be removed, and there are several other pieces of information under the hood that users can ask for. To remove the written output at the beginning, set model.info = FALSE and/or model.fit = FALSE.
summ(fit, model.info = FALSE, model.fit = FALSE)
## Standard errors: OLS
## Est. S.E. t val. p
## (Intercept) 5111.097 416.576 12.269 0 ***
## Frost -1.254 2.11 -0.594 0.555
## Illiteracy -610.715 213.138 -2.865 0.006 **
## Murder 23.074 30.94 0.746 0.46
Another, related bit of information, shown before the coefficient table, relates to model assumptions (for OLS linear regression). When model.check = TRUE, summ will report (with the help of the car package) two quantities related to linear regression assumptions: a Breusch-Pagan test of homoskedasticity and a count of high-leverage observations, as seen in the output below.
In both cases, you shouldn’t treat the results as proof of meaningful problems (or of a lack of them), but instead as a heuristic prompting more probing with graphical analyses.
summ(fit, model.check = TRUE)
## MODEL INFO:
## Observations: 50
## Dependent Variable: Income
##
## MODEL FIT:
## F(3,46) = 4.049, p = 0.012
## R-squared = 0.209
## Adj. R-squared = 0.157
##
## MODEL CHECKING:
## Homoskedasticity (Breusch-Pagan) = Assumption not violated (p = 0.131)
## Number of high-leverage observations = 2
##
## Standard errors: OLS
## Est. S.E. t val. p
## (Intercept) 5111.097 416.576 12.269 0 ***
## Frost -1.254 2.11 -0.594 0.555
## Illiteracy -610.715 213.138 -2.865 0.006 **
## Murder 23.074 30.94 0.746 0.46
One of the problems that originally motivated the creation of this function was the desire to efficiently report robust standard errors. While it is easy enough for an experienced R user to calculate them, there are not many simple ways to include the results in a regression table, as is common with the likes of Stata, SPSS, etc.
Robust standard errors require the lmtest and sandwich packages to be installed; they do not need to be loaded.
There are multiple types of robust standard errors to choose from, ranging from “HC0” to “HC5”. Per the recommendation of the sandwich package’s authors, the default is “HC3”. Stata’s default is “HC1”, so you may want to use that type if your goal is to replicate Stata analyses.
summ(fit, robust = TRUE, robust.type = "HC3")
## MODEL INFO:
## Observations: 50
## Dependent Variable: Income
##
## MODEL FIT:
## F(3,46) = 4.049, p = 0.012
## R-squared = 0.209
## Adj. R-squared = 0.157
##
## Standard errors: Robust, type = HC3
## Est. S.E. t val. p
## (Intercept) 5111.097 537.808 9.504 0 ***
## Frost -1.254 2.867 -0.437 0.664
## Illiteracy -610.715 196.879 -3.102 0.003 **
## Murder 23.074 36.846 0.626 0.534
Robust standard errors can also be calculated for generalized linear models (i.e., glm objects), though some debate whether they should be used for models fit iteratively with non-normal errors. In the case of svyglm, the standard errors that package calculates are already robust to heteroskedasticity, so robust = TRUE will be ignored with a warning.
You may also use the cluster argument, supplying either the name of a variable in the input data or a vector of clusters, to get cluster-robust standard errors.
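For instance, here is a hedged sketch of clustering; the “region” variable is illustrative, built from the base R state.region vector, and is not part of the original example.
# Illustration only: add a hypothetical clustering variable to the data
states <- as.data.frame(state.x77)
states$region <- state.region
fit2 <- lm(Income ~ Frost + Illiteracy + Murder, data = states)
summ(fit2, robust = TRUE, cluster = "region")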
Some prefer to use standardized coefficients in order to avoid dismissing an effect as “small” when it is just the units of measure that are small. Standardized betas are reported instead when standardize = TRUE. To be clear, since the meaning of “standardized beta” can vary depending on who you talk to, this option mean-centers the predictors as well but does not alter the dependent variable whatsoever. If you want to standardize the dependent variable too, just add the standardize.response = TRUE argument.
summ(fit, standardize = TRUE)
## MODEL INFO:
## Observations: 50
## Dependent Variable: Income
##
## MODEL FIT:
## F(3,46) = 4.049, p = 0.012
## R-squared = 0.209
## Adj. R-squared = 0.157
##
## Standard errors: OLS
## Est. S.E. t val. p
## (Intercept) 4435.8 79.773 55.605 0 ***
## Frost -65.188 109.686 -0.594 0.555
## Illiteracy -372.251 129.914 -2.865 0.006 **
## Murder 85.179 114.217 0.746 0.46
##
## All continuous variables are mean-centered and scaled by 1 s.d.
You can also choose a different number of standard deviations to divide by for standardization. Andrew Gelman has been a proponent of dividing by 2 standard deviations; if you want to do things that way, give the argument n.sd = 2.
summ(fit, standardize = TRUE, n.sd = 2)
## MODEL INFO:
## Observations: 50
## Dependent Variable: Income
##
## MODEL FIT:
## F(3,46) = 4.049, p = 0.012
## R-squared = 0.209
## Adj. R-squared = 0.157
##
## Standard errors: OLS
## Est. S.E. t val. p
## (Intercept) 4435.8 79.773 55.605 0 ***
## Frost -130.376 219.371 -0.594 0.555
## Illiteracy -744.502 259.829 -2.865 0.006 **
## Murder 170.357 228.435 0.746 0.46
##
## All continuous variables are mean-centered and scaled by 2 s.d.
Note that this is achieved by refitting the model. If the model took a long time to fit initially, expect a similarly long time to refit it.
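To see what that refit amounts to, here is a rough base R sketch (an assumption about what summ does internally, not its actual code): mean-center each predictor, rescale it by 2 standard deviations, and refit.
# Illustration only: manually rescale predictors by 2 s.d. and refit
states <- as.data.frame(state.x77)
vars <- c("Frost", "Illiteracy", "Murder")
states[vars] <- lapply(states[vars], function(x) (x - mean(x)) / (2 * sd(x)))
refit <- lm(Income ~ Frost + Illiteracy + Murder, data = states)
coef(refit)  # should match the 2-s.d. standardized estimates shown above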
In the same vein as the standardization feature, you can keep the original scale while still mean-centering the predictors with the center = TRUE argument.
summ(fit, center = TRUE)
## MODEL INFO:
## Observations: 50
## Dependent Variable: Income
##
## MODEL FIT:
## F(3,46) = 4.049, p = 0.012
## R-squared = 0.209
## Adj. R-squared = 0.157
##
## Standard errors: OLS
## Est. S.E. t val. p
## (Intercept) 4435.8 79.773 55.605 0 ***
## Frost -1.254 2.11 -0.594 0.555
## Illiteracy -610.715 213.138 -2.865 0.006 **
## Murder 23.074 30.94 0.746 0.46
##
## All continuous variables are mean-centered.
In many cases, you’ll learn more by looking at confidence intervals than p-values. You can request them from summ.
summ(fit, confint = TRUE, digits = 2)
## MODEL INFO:
## Observations: 50
## Dependent Variable: Income
##
## MODEL FIT:
## F(3,46) = 4.05, p = 0.01
## R-squared = 0.21
## Adj. R-squared = 0.16
##
## Standard errors: OLS
## Est. 2.5% 97.5% t val. p
## (Intercept) 5111.1 4294.62 5927.57 12.27 0 ***
## Frost -1.25 -5.39 2.88 -0.59 0.56
## Illiteracy -610.71 -1028.46 -192.97 -2.87 0.01 **
## Murder 23.07 -37.57 83.72 0.75 0.46
You can adjust the width of the confidence intervals, which are by default 95% CIs.
summ(fit, confint = TRUE, ci.width = .5, digits = 2)
## MODEL INFO:
## Observations: 50
## Dependent Variable: Income
##
## MODEL FIT:
## F(3,46) = 4.05, p = 0.01
## R-squared = 0.21
## Adj. R-squared = 0.16
##
## Standard errors: OLS
## Est. 25% 75% t val. p
## (Intercept) 5111.1 4830.12 5392.07 12.27 0 ***
## Frost -1.25 -2.68 0.17 -0.59 0.56
## Illiteracy -610.71 -754.47 -466.96 -2.87 0.01 **
## Murder 23.07 2.21 43.94 0.75 0.46
You might also want to drop the p-values altogether.
summ(fit, confint = TRUE, pvals = FALSE, digits = 2)
## MODEL INFO:
## Observations: 50
## Dependent Variable: Income
##
## MODEL FIT:
## F(3,46) = 4.05, p = 0.01
## R-squared = 0.21
## Adj. R-squared = 0.16
##
## Standard errors: OLS
## Est. 2.5% 97.5% t val.
## (Intercept) 5111.10 4294.62 5927.57 12.27
## Frost -1.25 -5.39 2.88 -0.59
## Illiteracy -610.71 -1028.46 -192.97 -2.87
## Murder 23.07 -37.57 83.72 0.75
Note that you can omit p-values regardless of whether you have requested confidence intervals.
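For example (output omitted here):
summ(fit, pvals = FALSE, model.info = FALSE, model.fit = FALSE)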
summ has been expanding its range of supported model types. glm was a natural extension and will cover most use cases.
fitg <- glm(vs ~ drat + mpg, data = mtcars, family = binomial)
summ(fitg)
## MODEL INFO:
## Observations: 32
## Dependent Variable: vs
## Error Distribution: binomial
## Link function: logit
##
## MODEL FIT:
## Pseudo R-squared (Cragg-Uhler) = 0.589
## Pseudo R-squared (McFadden) = 0.422
## AIC = 31.351, BIC = 35.748
##
## Standard errors: MLE
## Est. S.E. z val. p
## (Intercept) -7.755 4 -1.939 0.053 .
## drat -0.562 1.333 -0.422 0.673
## mpg 0.477 0.2 2.38 0.017 *
For exponential family models, especially logit and Poisson, you may be interested in getting odds ratios rather than the linear beta estimates. summ can handle that!
summ(fitg, odds.ratio = TRUE)
## MODEL INFO:
## Observations: 32
## Dependent Variable: vs
## Error Distribution: binomial
## Link function: logit
##
## MODEL FIT:
## Pseudo R-squared (Cragg-Uhler) = 0.589
## Pseudo R-squared (McFadden) = 0.422
## AIC = 31.351, BIC = 35.748
##
## Standard errors: MLE
## Odds Ratio 2.5% 97.5% z val. p
## (Intercept) 0 0 1.089 -1.939 0.053 .
## drat 0.57 0.042 7.769 -0.422 0.673
## mpg 1.611 1.088 2.387 2.38 0.017 *
Standard errors are omitted for odds ratio estimates since the confidence intervals are not symmetrical.
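As a sketch of the underlying arithmetic (assuming Wald-type intervals on the log-odds scale, which is consistent with the numbers above), the odds ratio and its bounds are just exponentiated coefficients:
# Illustration only: exponentiate the estimate and Wald interval endpoints
est <- coef(fitg)["mpg"]
se <- sqrt(diag(vcov(fitg)))["mpg"]
exp(c(est, est - qnorm(.975) * se, est + qnorm(.975) * se))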
You can also get summaries of merMod objects, the mixed models from the lme4 package.
library(lme4)
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
summ(fm1)
## MODEL INFO:
## Observations: 180
## Dependent Variable: Reaction
## Type: Mixed effects linear regression
##
## MODEL FIT:
## AIC = 1755.628, BIC = 1774.786
##
## FIXED EFFECTS:
## Est. S.E. t val. p
## (Intercept) 251.405 6.825 36.838 0 ***
## Days 10.467 1.546 6.771 0 ***
##
## p values calculated using Kenward-Roger d.f. = 17
##
## RANDOM EFFECTS:
## Group Parameter Std.Dev.
## Subject (Intercept) 24.74
## Subject Days 5.922
## Residual 25.592
##
## Grouping variables:
## Group # groups ICC
## Subject 18 0.483
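The ICC reported for Subject appears to be the intercept variance as a share of intercept plus residual variance; a quick check of that arithmetic (an inference from the output above, not documented behavior):
# Inferred from the output: intercept var / (intercept var + residual var)
24.74^2 / (24.74^2 + 25.592^2)  # approximately 0.483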
Note that summ omits p-values by default for linear mixed models unless the pbkrtest package is installed. There’s no clear-cut way to derive p-values with linear mixed models, and treating the t-values like you would for OLS models will lead to inflated Type 1 error rates. Confidence intervals are better, but not perfect. Kenward-Roger calculated degrees of freedom are fairly good under many circumstances, and those are used by default when pbkrtest is installed. See the documentation (?summ.merMod) for more info.
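Assuming the confint and pvals arguments carry over to merMod models the way they work for lm models (an assumption here; check ?summ.merMod), you could request intervals instead:
# Assumes confint/pvals behave for merMod as they do for lm models
summ(fm1, confint = TRUE, pvals = FALSE)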
I won’t run through any examples here, but svyglm models are supported and provide near-equivalent output to what you see here, depending on whether they are linear models or generalized linear models.
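If you want to try it yourself, here is a minimal sketch using the survey package’s built-in api data (the design and model are illustrative, borrowed from survey’s standard examples, not from the original text):
# Illustration only: a stratified design from survey's example data
library(survey)
data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)
sfit <- svyglm(api00 ~ ell + meals, design = dstrat)
summ(sfit)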
With the digits = argument, you can decide how precise you want the printed numbers to be. It is often inappropriate or distracting to report quantities with many digits past the decimal, given the inability to measure them so precisely or to interpret such precision in applied settings. In other cases, more digits may be necessary due to the way measures are calculated.
The default is digits = 3.
summ(fit, model.info = FALSE, digits = 5)
## MODEL FIT:
## F(3,46) = 4.04857, p = 0.01232
## R-squared = 0.20888
## Adj. R-squared = 0.15729
##
## Standard errors: OLS
## Est. S.E. t val. p
## (Intercept) 5111.09665 416.57608 12.2693 0 ***
## Frost -1.25407 2.11012 -0.59432 0.55521
## Illiteracy -610.71471 213.13769 -2.86535 0.00626 **
## Murder 23.07403 30.94034 0.74576 0.45961
summ(fit, model.info = FALSE, digits = 1)
## MODEL FIT:
## F(3,46) = 4, p = 0
## R-squared = 0.2
## Adj. R-squared = 0.2
##
## Standard errors: OLS
## Est. S.E. t val. p
## (Intercept) 5111.1 416.6 12.3 0 ***
## Frost -1.3 2.1 -0.6 0.6
## Illiteracy -610.7 213.1 -2.9 0 **
## Murder 23.1 30.9 0.7 0.5
You can pre-set the number of digits printed by all jtools functions with the jtools-digits option.
options("jtools-digits" = 2)
summ(fit, model.info = FALSE)
## MODEL FIT:
## F(3,46) = 4.05, p = 0.01
## R-squared = 0.21
## Adj. R-squared = 0.16
##
## Standard errors: OLS
## Est. S.E. t val. p
## (Intercept) 5111.1 416.58 12.27 0 ***
## Frost -1.25 2.11 -0.59 0.56
## Illiteracy -610.71 213.14 -2.87 0.01 **
## Murder 23.07 30.94 0.75 0.46
Note that the returned object retains the non-rounded values if you wish to use them later.
j <- summ(fit, digits = 3)
j$coeftable
## Est. S.E. t val. p
## (Intercept) 5111.096650 416.576083 12.2692993 4.146240e-16
## Frost -1.254074 2.110117 -0.5943151 5.552133e-01
## Illiteracy -610.714712 213.137691 -2.8653529 6.259724e-03
## Murder 23.074026 30.940339 0.7457587 4.596073e-01
When multicollinearity is a concern, it can be useful to have VIFs reported alongside each variable. This can be particularly helpful for model comparison and for checking the impact of newly added variables. To get VIFs reported in the output table, just set vifs = TRUE.
Note that the car package is needed to calculate VIFs.
summ(fit, vifs = TRUE)
## MODEL INFO:
## Observations: 50
## Dependent Variable: Income
##
## MODEL FIT:
## F(3,46) = 4.049, p = 0.012
## R-squared = 0.209
## Adj. R-squared = 0.157
##
## Standard errors: OLS
## Est. S.E. t val. p VIF
## (Intercept) 5111.097 416.576 12.269 0 ***
## Frost -1.254 2.11 -0.594 0.555 1.853
## Illiteracy -610.715 213.138 -2.865 0.006 ** 2.599
## Murder 23.074 30.94 0.746 0.46 2.009
There are many standards researchers apply for deciding whether a VIF is too large. In some domains, a VIF over 2 is worthy of suspicion. Others set the bar higher, at 5 or 10. Ultimately, the main thing to consider is that small effects are more likely to be “drowned out” by higher VIFs.
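Since the car package does the underlying calculation (per the note above), you can also pull the same VIFs directly if you want them on their own:
# The same values summ reports in its VIF column
car::vif(fit)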