Meta-analyses are generally considered to be the highest level of evidence and typically report mean differences between groups in outcomes of interest. Here we illustrate the importance of also understanding group differences in outcome variability and describe a method for approaching this meta-analytically.
Meta-analyses of randomized controlled trials (RCTs) of antidepressants in major depression find around a two-point greater reduction in symptom severity, as per the Hamilton Depression Rating Scale (HDRS), in the drug than placebo group (Hengartner, Jakobsen, Sørensen, & Plöderl, 2020; Jakobsen et al., 2017). Imagine two such trials – of drug A and drug B – both showing the same two-point greater improvement in HDRS with drug treatment than placebo. In the trial with drug A, the combined standard deviation of the treatment and placebo groups (i.e. pooled SD) is four points, which gives a medium treatment effect size (Cohen's d = 0.5) (Hengartner & Plöderl, 2018). Meanwhile, for drug B, the pooled SD is 10 points, which gives a small treatment effect size (d = 0.2) (Hengartner & Plöderl, 2018). Patients and clinicians might conclude that drug B offers a poor treatment effect on average and opt for drug A instead.
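The arithmetic behind these two effect sizes can be sketched as follows. This is a minimal illustration, assuming Cohen's d is taken as the mean difference divided by the pooled SD; the function name `cohens_d` is ours and does not come from the cited papers.

```python
def cohens_d(mean_difference: float, pooled_sd: float) -> float:
    """Standardized mean difference: mean difference divided by the pooled SD."""
    if pooled_sd <= 0:
        raise ValueError("pooled_sd must be positive")
    return mean_difference / pooled_sd


# Values from the worked example: a two-point HDRS benefit over placebo
d_drug_a = cohens_d(mean_difference=2.0, pooled_sd=4.0)
d_drug_b = cohens_d(mean_difference=2.0, pooled_sd=10.0)

print(f"Drug A: d = {d_drug_a:.1f}")
print(f"Drug B: d = {d_drug_b:.1f}")
```

The same mean benefit therefore yields a smaller d purely because the denominator, the outcome variability, is larger.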
However, what if we were more interested in the chance of a drug greatly improving symptoms than its average treatment effect? Consider that we are interested in the proportion of patients who show a large (d = 1.0) antidepressant response – previously defined as an eight-point improvement in HDRS over placebo (Hengartner & Plöderl, 2018). In Fig. 1 below, we have plotted the distribution curves for drug A and drug B and have calculated (using EasyCalculation.com's Bell Curve Calculator; retrieved from: https://easycalculation.com/statistics/bell-curve-calculator.php) the proportion of patients with at least an eight-point improvement. Whilst the mean benefit above placebo is two points for both drugs, the proportion of treated patients showing at least an eight-point greater improvement is around 7% for drug A and around 27% for drug B. The greater variability of drug B's treatment effect means that nearly four times more patients show a large effect size reduction in HDRS scores over placebo, with an odds ratio of almost five.
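These proportions can be reproduced without an online calculator. The sketch below assumes, as the bell-curve illustration does, that the benefit over placebo is normally distributed with a mean of two points and an SD of four (drug A) or 10 (drug B) points; the helper names `upper_tail` and `odds` are ours.

```python
import math


def upper_tail(threshold: float, mean: float, sd: float) -> float:
    """P(X >= threshold) for X ~ Normal(mean, sd)."""
    z = (threshold - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def odds(p: float) -> float:
    """Convert a probability into odds."""
    return p / (1.0 - p)


# Proportion of patients with at least an eight-point benefit over placebo
p_a = upper_tail(threshold=8.0, mean=2.0, sd=4.0)
p_b = upper_tail(threshold=8.0, mean=2.0, sd=10.0)
odds_ratio = odds(p_b) / odds(p_a)

print(f"Drug A: {p_a:.1%}")
print(f"Drug B: {p_b:.1%}")
print(f"Odds ratio (B vs A): {odds_ratio:.2f}")
```

Unrounded, the proportions are about 6.7% and 27.4%. An odds ratio computed from the rounded figures of 7% and 27% is about 4.9, whereas the unrounded proportions give a slightly higher value.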
Whilst our example compares two individual trials to illustrate the relationship between mean effect, outcome variability, and the frequency of extreme observations, similar effects of outcome variability may also be seen at the meta-analytical level. This is especially important to consider for meta-analyses that do not find group differences in mean outcomes. Subgroups of patients may demonstrate clinically meaningful responses to treatment not captured in comparisons of mean outcomes, but which would be reflected in measures of group variability. Whilst the presence of subgroups may be observed in raw trial data (e.g. if a probability distribution is multimodal), such data are seldom available in research articles, especially meta-analyses.
The ratio of the variance of an outcome measure in one group to that in another (variance ratio, VR) can be used to compare group variability meta-analytically (Hedges & Nowell, 1995). However, variance (the average of the squared differences of all data from the mean) is rarely reported in studies, whereas SD (the square root of the variance) is. Thus, the unbiased SD (SD adjusted for group differences in sample size) can be used to calculate the natural logarithm of VR, lnVR, as follows (McCutcheon et al., 2021; see also Nakagawa et al., 2015):
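$$\ln \mathrm{VR} = \ln\left(\frac{\hat\sigma_t}{\hat\sigma_c}\right) = \ln\left(\frac{S_t}{S_c}\right) + \frac{1}{2(n_t - 1)} - \frac{1}{2(n_c - 1)}$$

where, in the treatment ($t$) and control ($c$) groups, respectively: $\hat\sigma_t$ and $\hat\sigma_c$ are unbiased estimates of the population SD, $S_t$ and $S_c$ are the sample SDs, and $n_t$ and $n_c$ are the sample sizes.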
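A minimal sketch of this calculation is given below. It assumes the form of lnVR given by Nakagawa et al. (2015), namely the log ratio of sample SDs plus the small-sample correction $1/(2(n_t - 1)) - 1/(2(n_c - 1))$, together with their approximate sampling variance, which is not stated in the text above and is included here as an assumption. The function names and example values are hypothetical.

```python
import math


def ln_vr(sd_t: float, n_t: int, sd_c: float, n_c: int) -> float:
    """Log variability ratio with small-sample bias correction
    (form assumed from Nakagawa et al., 2015)."""
    if min(sd_t, sd_c) <= 0 or min(n_t, n_c) < 2:
        raise ValueError("SDs must be positive and each group needs n >= 2")
    return math.log(sd_t / sd_c) + 1.0 / (2 * (n_t - 1)) - 1.0 / (2 * (n_c - 1))


def ln_vr_sampling_variance(n_t: int, n_c: int) -> float:
    """Approximate sampling variance of lnVR (assumed, same source)."""
    return 1.0 / (2 * (n_t - 1)) + 1.0 / (2 * (n_c - 1))


# Hypothetical trial: treatment SD 10 (n = 50), control SD 8 (n = 50)
effect = ln_vr(sd_t=10.0, n_t=50, sd_c=8.0, n_c=50)
variance = ln_vr_sampling_variance(n_t=50, n_c=50)
print(f"lnVR = {effect:.4f}, sampling variance = {variance:.4f}")
```

With equal group sizes the two correction terms cancel, so lnVR reduces to the log of the ratio of sample SDs.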
Variance is often directly proportional to a fractional index of the mean (that is, when the mean is raised to the power of a fraction, e.g. $x^{2/3}$) (Taylor, 1961). Where this is the case, the SD scales with the mean, such that when $\mathrm{mean}_t$ exceeds $\mathrm{mean}_c$, $\mathrm{SD}_t$ is greater than $\mathrm{SD}_c$, in proportion to the difference in means. This phenomenon has now been observed in hundreds of biological systems and is perhaps driven by some common ‘context-independent’ mechanism (Giometto, Formentin, Rinaldo, Cohen, & Maritan, 2015). Thus, observed differences in group variability could be due to, or exaggerated by, mean scaling of variance. To avoid mean scaling effects, the natural logarithm of the coefficient of variation ratio (CVR), lnCVR, can be used (McCutcheon et al., 2021; see also Nakagawa et al., 2015):
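$$\ln \mathrm{CVR} = \ln\left(\frac{\hat\sigma_t / \bar{X}_t}{\hat\sigma_c / \bar{X}_c}\right) = \ln\left(\frac{S_t / \bar{X}_t}{S_c / \bar{X}_c}\right) + \frac{1}{2(n_t - 1)} - \frac{1}{2(n_c - 1)}$$

where $\bar{X}_t$ and $\bar{X}_c$ are sample means for the treatment and control groups, respectively. CVR circumvents scaling effects of the mean on SD by controlling for the mean differences between groups.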
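The effect of controlling for the mean can be illustrated with a short sketch. It assumes the lnCVR form given by Nakagawa et al. (2015), the log ratio of the groups' coefficients of variation with the same small-sample correction as lnVR. The function name and the example values are hypothetical, chosen so that the SD scales exactly with the mean.

```python
import math


def ln_cvr(mean_t: float, sd_t: float, n_t: int,
           mean_c: float, sd_c: float, n_c: int) -> float:
    """Log coefficient of variation ratio (form assumed from Nakagawa et al., 2015).
    Only meaningful for positive means on a ratio scale."""
    if min(mean_t, mean_c, sd_t, sd_c) <= 0 or min(n_t, n_c) < 2:
        raise ValueError("means and SDs must be positive; each group needs n >= 2")
    cv_t = sd_t / mean_t
    cv_c = sd_c / mean_c
    return math.log(cv_t / cv_c) + 1.0 / (2 * (n_t - 1)) - 1.0 / (2 * (n_c - 1))


# Hypothetical groups in which the SD is half the mean in both arms
result = ln_cvr(mean_t=20.0, sd_t=10.0, n_t=50,
                mean_c=16.0, sd_c=8.0, n_c=50)
raw_sd_log_ratio = math.log(10.0 / 8.0)

print(f"lnCVR = {result:.4f}")
print(f"log ratio of SDs = {raw_sd_log_ratio:.4f}")
```

Here the raw SDs differ by 25%, but the coefficients of variation are identical, so lnCVR is zero: the apparent difference in variability is entirely attributable to the difference in means.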
lnVR and lnCVR can be back-transformed to give VR or CVR, where a VR or CVR greater than one indicates greater variability in the treatment group than in the control group, and vice versa. Modest variability ratios may be interpreted as percentage differences, e.g. a VR of 0.97 approximates a 3% group difference in variability (Winkelbeiner, Leucht, Kane, & Homan, 2019); however, this approximation may be less accurate when VR or CVR is large. We have described these formulae as applicable to reviews of RCTs but they are equally useful for meta-analyses of other study types (such as case–control studies of striatal dopamine function in schizophrenia; Brugger et al., 2020).
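Back-transformation is simply exponentiation. The sketch below assumes the percentage difference is read as the ratio minus one; the helper names are ours.

```python
import math


def back_transform(ln_ratio: float) -> float:
    """Convert lnVR or lnCVR back to VR or CVR."""
    return math.exp(ln_ratio)


def percent_difference(ratio: float) -> float:
    """Express a ratio as a percentage difference from one."""
    return (ratio - 1.0) * 100.0


ratio = back_transform(math.log(0.97))  # a pooled lnVR equal to ln(0.97)
pct = percent_difference(ratio)
print(f"VR = {ratio:.2f}, i.e. {pct:+.1f}% variability in treatment vs control")
```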
No guidelines yet exist for the interpretation of variance ratios. We suggest this could be addressed using an approach similar to Jacob Cohen's interpretation of standardized mean differences, d (Cohen, 1988). Cohen proposed that a d requiring measurement to detect, such as the mean standing height difference between groups of 15- and 16-year-old women (~1 cm, d = 0.2), is small; a d just about perceptible to the naked eye, such as the mean height difference between groups of 14- and 18-year-old women (~2.5 cm, d = 0.5), is medium; and a d that is easily perceptible, such as the mean height difference between groups of 13- and 18-year-old women (~4 cm, d = 0.8), is large.
Although these benchmarks are a helpful guide, Cohen was clear that grading a given effect as small, medium, or large should be based on the question at hand. For instance, in a small-scale RCT of a new antidepressant medication, a d less than 0.2 might represent an unsatisfactory treatment effect. Meanwhile, in epidemiological studies, similar effect sizes can have profound effects at the population level. For example, Carey, Ridler, Ford, and Stringaris (2023) suggest that a small (d = 0.14) increase in depressive symptom scores following the COVID pandemic may have resulted in as many as 160 000 excess cases of adolescent depression.
Using data analogous to Cohen (1988) from the Centers for Disease Control and Prevention National Health and Nutrition Examination Survey (NHANES) 2001–2002 (CDC, 2002), we find that the SD of standing heights of age groups of US women is ~0.5 cm greater in 13- than 18-year-old women, for which VR ≈ 1.19; ~2.5 cm greater in 8- than 13-year-old women, for which VR ≈ 2.01; and ~3.3 cm greater in 8- than 25-year-old women, for which VR ≈ 2.63. We therefore propose that a VR or CVR around 1.2 is considered small, around 2.0 is considered medium, and around 2.6 is considered large. Whilst this provides general benchmarks, we echo Cohen's caution in relation to d: that what is deemed to be a small, medium, or large effect should ultimately depend on the issue under consideration.
There are some important considerations for the interpretation of VR and CVR. We must first distinguish between variability and heterogeneity. VR and CVR are measures of the dispersion of study data relative to their mean. In contrast, heterogeneity describes the extent to which effect sizes differ across studies included in a meta-analysis (Higgins, Thompson, Deeks, & Altman, 2003). Heterogeneity is commonly assessed using the Q or $I^2$ statistics, which indicate whether differences in effect sizes across studies are due to chance.
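To make the distinction concrete, the sketch below computes Cochran's Q and $I^2$ for a set of study-level effect sizes (for example, lnVR values), using the standard inverse-variance definitions. The three effect sizes and their sampling variances are hypothetical and do not come from any of the meta-analyses cited here.

```python
from typing import Sequence


def cochran_q(effects: Sequence[float], variances: Sequence[float]) -> float:
    """Cochran's Q: weighted sum of squared deviations from the
    inverse-variance (fixed-effect) pooled estimate."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))


def i_squared(q: float, n_studies: int) -> float:
    """I^2 as a percentage: share of total variation across studies
    attributed to heterogeneity rather than chance."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical study-level lnVR estimates and sampling variances
study_effects = [0.10, 0.30, 0.50]
study_variances = [0.02, 0.02, 0.02]

q = cochran_q(study_effects, study_variances)
i2 = i_squared(q, n_studies=len(study_effects))
print(f"Q = {q:.2f}, I^2 = {i2:.0f}%")
```

Q and $I^2$ here describe how much the study-level estimates disagree with one another, whereas each lnVR value describes the spread of participant-level data within a study.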
Second, a VR or CVR greater than one does not necessarily reflect greater variability in the patient/treatment group. It could, instead, reflect homogeneity in the comparator group, perhaps following recruitment of unrepresentatively healthy controls (i.e. volunteer bias) (Brugger et al., 2020). Similarly, exclusion of some patients could artificially reduce VR and CVR (McCutcheon et al., 2021). This may be an unintended effect of study exclusion criteria (e.g. excluding patients with treatment resistance, polypharmacy, or comorbidities) or recruitment strategies (e.g. excluding severely unwell individuals by solely recruiting outpatients).
Third, a VR or CVR equal to one may conceal mathematically cancelling group deviations, such as a bimodal distribution of non- and ultra-good responders (McCutcheon et al., 2021). It could also represent treatment-by-participant effects, which vary depending on environmental, rater, and statistical factors (Winkelbeiner et al., 2019), coincidentally giving equal variances in both groups (Plöderl & Hengartner, 2019). It should also be appreciated that using total scores for a given scale might hide variability in specific symptom domains within the scale. This may be overcome by meta-analyzing the variability of individual items or subscale scores (McCutcheon et al., 2021). Furthermore, CVR is only valid for positive values of scales with a true zero point, so cannot be used for interval rating scales nor mean change values unless they are first converted to a ratio scale (Homan et al., 2021).
Finally, there may be more specific considerations related to the topic under study. Of note, in a meta-analysis of RCTs of antipsychotic medication in schizophrenia, McCutcheon et al. (2021) reported significantly lower variability of treatment response in patients receiving antipsychotics than in those receiving placebo. However, a re-analysis of the same 17 202 subjects found significantly higher variability of treatment response in patients receiving active treatment (McCutcheon et al., 2022). The authors reason that the former meta-analysis relied upon an invalid assumption that treatment and placebo effects are positively correlated. This was corrected in the later meta-analysis, in which the authors use patient- and study-level data to show that treatment and placebo effects are, in fact, negatively correlated (McCutcheon et al., 2022).
In summary, using variance ratios to synthesize group differences in variability provides important additional information to refine conclusions of meta-analyses of mean differences. This relatively novel meta-analytical approach has the potential to provide robust new insights into psychiatric and other illnesses, in terms of biology, treatment response, and other outcomes. Variability ratios of 1.2, 2.0, and 2.6 can be considered small, medium, and large, but interpretation must always consider the specific question at hand.
Data
For the purpose of open access, a Creative Commons licence (CC-BY) has been applied to any accepted author manuscript version arising from this submission.
Funding statement
No direct funding was received to support the preparation of this article. Professor Howes is supported by Medical Research Council-UK (MC_U120097115, MR/W005557/1 and MR/V013734/1) and Wellcome Trust (094849/Z/10/Z) grants, and the National Institute for Health and Care Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London. Dr Chapman is currently an NIHR Academic Clinical Fellow. The views expressed are those of the author(s) and not necessarily those of these funders, NHS, or Department of Health and Social Care. The funders had no input into the conceptualization, preparation, review, or approval of the manuscript, nor the decision to submit the manuscript for publication.
Competing interest
Professor Howes has received investigator-initiated research funding from and/or participated in advisory/speaker meetings organized by Angelini, Autifony, Biogen, Boehringer-Ingelheim, Eli Lilly, Elysium, Heptares, Global Medical Education, Invicro, Janssen, Karuna, Lundbeck, Merck, Neurocrine, Ontrack/Pangea, Otsuka, Sunovion, Recordati, Roche, Rovi, and Viatris/Mylan. Professor Howes was previously a part-time employee of Lundbeck and has a patent for the use of dopaminergic imaging. Dr Chapman has no relevant disclosures.