1 Predicting Senate Races
In recent years, there has been an explosion of interest in election prediction models. Primarily, this has been driven by media outlets and popular forecasting websites like fivethirtyeight.com. However, although the most prominent forecasting efforts have been housed in media organizations, these models often build from, or are inspired by, research in political science.
Broadly speaking, academic election forecasting in the U.S. context can be divided into two approaches. First, there are static models that make a single prediction for a given election (e.g., Abramowitz 2008; Fair 1978; Lewis-Beck and Tien 2008). These models are sometimes referred to as “fundamentals” models and rely primarily on economic indicators, incumbency status, and other factors that shape the general context of an election. To the extent they incorporate polling data at all, it is through proxies such as presidential approval (e.g., Erikson and Wlezien 2008) or snapshots taken well before Election Day (e.g., Campbell and Wink 1990).
A smaller body of research focuses on dynamic forecasting models that change over the course of the campaign as new polling data arrive. In particular, Linzer (2013) introduces a dynamic Bayesian model forecasting the U.S. presidential election results for all 50 states. This model has served as a basis for presidential forecasts produced by major media outlets including The Economist and Daily Kos.Footnote 1 Another example is Jackman (2005), which presents a somewhat related Bayesian model, although the goal is to aggregate polls rather than to make a prediction per se.
While the U.S. presidential election has received the most attention, a smaller body of research has focused on predicting legislative elections. Most examples in this domain do not seek to predict individual races, but rather the aggregate number of seats that swing to a specific party (e.g., Campbell 2018; Lockerbie 2012). Still, some previous work has built models of individual U.S. Senate races based on fundamental factors (Hummel and Rothschild 2014; Klarner 2008; Klarner 2012; Klarner and Buchanan 2006). However, to the best of our knowledge, there are no published models in political science providing dynamic forecasts of individual Senate races.
The relative lack of attention to the Senate is probably the result of the intense popular interest in the presidential race. However, it also reflects the fact that predicting individual Senate elections is actually a more difficult task than it may first appear. To begin, there is far less polling data for any given Senate election relative to national races, especially early in the cycle. Some states with close races and large media markets may have dozens of polls, but in many others there are very few. In addition, Senate races are relatively low salience to voters, especially early in the cycle. Even strong challengers can be unfamiliar to voters until the final weeks. As a consequence, public opinion can be far more dynamic as voters learn about their options in the lead-up to Election Day. In short, Senate elections offer fewer polls, and the polls that do exist can be noisy predictors.
A further problem is that local context and other “fundamentals” are often only weakly predictive of candidate performance. Although the general partisan disposition of each state tends to heavily structure presidential outcomes, results in Senate elections are far less geographically determined. That is, knowing how a party's candidate performed in one election is often a poor predictor of performance in subsequent years. A recent example is West Virginia, where Democrat Joe Manchin won over 60% of the vote in 2012 and Republican Shelley Moore Capito won over 62% just 2 years later. These kinds of dramatic partisan swings occur regularly, making “fundamentals” forecasts difficult. Klarner notes that his fundamentals-based model “has never performed well, being off by three seats in 2006 and five seats in 2008. U.S. Senate elections appear to be influenced by race-specific factors that are difficult to include in forecasting models” (Klarner 2013, p. 45). Meanwhile, Hummel and Rothschild (2014) predicted only 83% of races correctly in-sample and performed similarly out-of-sample.
In combination, this means that for any single cycle it is difficult to provide accurate forecasts in the absence of polling. Yet polling data are relatively sparse and subject to significant trends over the course of the election. And, of course, even where polling data are not missing, taking them at face value can be misleading, making simple polling averages suboptimal.
On the other hand, there is one very important advantage to working in this setting relative to national elections: there are many more observations. While presidential elections offer only one observation every 4 years, the Senate has roughly 33 election outcomes attached to hundreds of polls every 2 years. In our dataset, which covers only the post-1992 period, we have 501 election results and over 7,900 published polls. This gives us some hope that we can train a model that learns from the past to predict future outcomes and, crucially, correctly calibrates our uncertainty.
Below, we present a hierarchical Dirichlet regression model in a Bayesian framework that enables us to combine polls and fundamentals to accurately forecast election outcomes at various time horizons. The model provides a structured balance between time-dependent opinion polls and state/candidate-level fundamentals. Unlike fundamentals-based models, ours updates throughout the election cycle to reflect recent polling trends. Yet unlike existing dynamic models, ours is trained on a set of historical election outcomes rather than a single election cycle. The result is a model whose uncertainty estimates arise naturally from the posterior induced by historical data and therefore provide a better sense of our true uncertainty. Experiments show that our model can achieve high levels of accuracy and correct coverage at various forecasting horizons.
The most important contribution we make in this article is proposing an accurate, dynamic model appropriate for subnational elections at the district level—something the discipline currently lacks.Footnote 2 However, we also advance the broader election forecasting literature in two ways. First, the hierarchical structure we propose combines the unique strengths of poll-based dynamic models and fundamentals-based static models in a single framework. Existing dynamic models are fit to polling data from a single election cycle (Jackman 2005; Linzer 2013). To the extent historical data are used at all, they enter only through informative priors or hyperparameter selection. This makes it more difficult to understand whether the final predictive intervals accurately represent our uncertainty about the outcomes, since the likelihood itself is fit to polls and not to election results. Our approach differs in that the dynamic component feeds into a higher-level model trained on historical election results. Since this higher-level model also includes fundamental factors (e.g., the partisan orientation of the state), final predictions are better calibrated to reflect our actual uncertainty about unobserved election outcomes and can weight polling and fundamentals to reflect their actual predictive performance at different time horizons.
Second, we introduce a Gaussian process (GP) framework for modeling trends in latent public opinion that is more appropriate for elections with fewer polls—a common feature outside of U.S. presidential races. Our GP approach offers a significant advantage in that we can model polling trends as a linear process where nonlinear deviations are allowed given sufficient data. This added structure offers a significant improvement in out-of-sample prediction relative to a random walk (Linzer 2013) while also relaxing strict linearity assumptions when needed. It has the further advantage that it allows us to derive the posterior for these time trends analytically, significantly reducing computation time for any one election. This in turn allows us to fit the full hierarchical model including hundreds of elections.
In the next section, we provide a basic intuition for our modeling framework before giving a more detailed presentation in Section 3. We then test the model using historical data in Section 4 and evaluate a true out-of-sample forecast for the 2020 election cycle in Section 5. We show that our approach achieves state-of-the-art accuracy and coverage despite relying on few covariates. We conclude with a discussion of how our model could be improved in future iterations or adjusted for other election settings.
2 Intuition and Related Work
Before introducing the model, we want to focus on the core ideas that inform our approach. First, we suppose that polling for a candidate is a noisy measure of true underlying public opinion, $f(t)$, at any given time $t$. That is, we assume that there is a true level of underlying support for each candidate that moves smoothly over time and that polling results imperfectly follow these trends.
While modeling smooth latent public opinion is consistent with previous efforts to aggregate polls (Jackman 2005; Linzer 2013; Stoetzer et al. 2019), we adopt a strategy that is more appropriate given the sparseness of polling in many Senate elections. Our approach assumes a linear trend in the data with mild nonlinear deviations. This provides a sensible compromise between a simple linear model of public opinion and the trend-free smoothing procedures adopted in Jackman (2005) and Linzer (2013) (see also Stoetzer et al. 2019; Walther 2015). Indeed, these other approaches can be viewed as special cases of our more general model in which no linear trend is included.
Third, our modeling strategy assumes that latent public opinion is only one predictor of election outcomes. That is, latent public opinion is not assumed to translate directly into election outcomes as in Linzer (2013). Instead, the model learns the degree to which public opinion accurately predicts elections relative to other “fundamental” factors, including state-level voting history, candidate quality, and the like. This approach has two advantages. To begin, it allows us to easily train our model at different time horizons such that public opinion is weighted more heavily as the election approaches and polling becomes more predictive. More fundamentally, however, it allows us to explicitly model the inherent uncertainty in election outcomes that cannot be adequately predicted from polling and contextual factors. That is, we assume that even if we knew public support for a candidate perfectly, there would still be uncertainty in the outcome due to turnout and other unmodeled factors. Our aim is to use historical data to calibrate our uncertainty and achieve correct coverage rates at various time horizons in a way that reflects this irreducible uncertainty.
Finally, the model is tuned to accurately predict elections, not polls. Thus, while polling outcomes are included in the model, the key model parameters are selected not to reduce the error in predicting polls but to reduce the error in predicting vote share. We intentionally select hyperparameters that under-predict individual polling results but provide a better basis for predicting candidate vote share. The result is a parsimonious but accurate and well-calibrated model of elections.Footnote 3 The model takes in only four variables: polling data, Cook's partisan voting index (Campbell 2018; MacWilliams 2015), party affiliation of candidates, and candidate quality (Jacobson 1989; Jacobson and Carson 2019). However, it still makes accurate predictions for races at various time horizons while maintaining correct coverage. Indeed, in the 2020 election our model outperformed the model published in The Economist (Economist 2020) and provided comparable (and by some metrics superior) performance to the popular fivethirtyeight.com forecasts (Fivethirtyeight 2020).
In the next section, we introduce the model in stages. Section 3.1 provides important background information on Gaussian process regression, an approach that has appeared rarely in political science research. Section 3.2 then applies this framework to the task of projecting latent public support for each candidate. Section 3.3 then explains how this is combined with contextual factors in our Dirichlet regression model of vote share. We then briefly contrast our approach with other forecasting models in the literature in Section 3.4 before turning to our results in Section 4.
3 A Predictive Model of U.S. Senate Elections
Our proposed model has two components, as depicted in Figure 1. First, we use candidate-level polling data to predict latent public support for candidate $i$ on Election Day ($t=0$), which we denote $f_i(0)$.Footnote 4 If we are predicting before the election ($t<0$), this quantity is predicted based on all polling data up to the current date as well as an informative prior reflecting the general electoral context. Note that the goal is not to create a point prediction, but to estimate a distribution on $f_i(0)$ that reflects our uncertainty about the trajectory of public opinion over the course of the election as well as the inherent uncertainty in polling data itself. We refer to this as our candidate-level model.
Second, we use the predicted public support as an input for an election-level model,Footnote 5 with the goal of predicting the proportion of the vote divided among all candidates in a given race (that is, the entire vote share and not only the winner). We model this with a Dirichlet regression with year-level random effects, using a training dataset of elections starting in 1992. Importantly, this Dirichlet regression takes in $f_i(0)$ as an input along with contextual factors. Thus, we are able to use historical data to estimate the degree to which electoral context, public opinion, or some mix of the two best predicts vote shares at different time horizons.
The final output is a prediction for Senate elections that accounts for two levels of uncertainty. We have uncertainty about where latent public opinion will be on Election Day given the polling data we have observed so far. But we also have uncertainty as to how well public opinion and contextual factors predict election outcomes based on historical data.
3.1 Background on Gaussian Process Regression
Our model for latent public opinion over time is a linear trend with smooth nonlinear deviations. Here, we subsume both components into a single GP model of latent opinion. GPs offer a flexible Bayesian framework for nonlinear regression widely adopted in machine learning (Rasmussen and Williams 2006). GP models have not been used widely in political science, although they have appeared under the label Bayesian kriging (Gill 2020; Monogan and Gill 2016). Mathematically, however, they can be considered a Bayesian variant of kernel regularized least squares (KRLS) (Hainmueller and Hazlett 2014; Mohanty and Shaffer 2019).
To define a GP, consider a function $f\colon \mathcal{X} \to \mathbb{R}$ on some arbitrary domain $\mathcal{X}$; for our model of latent opinion, $\mathcal{X} = (-\infty, 0]$ is the span of times at which we may wish to predict. The defining property of a Gaussian process is that if $\mathbf{X} \subset \mathcal{X}$ is a finite vector of input locations, then the associated function values $f(\mathbf{X})$ have a multivariate normal distribution. The moments of this distribution are provided by a mean function $\mu(x) = \mathbb{E}[f(x) \mid x]$ and covariance function $K(x, x') = \text{cov}[f(x), f(x') \mid x, x']$; evaluating these pointwise provides the mean vector and covariance matrix for any desired vector of function values $f(\mathbf{X})$. Modeling with the GP entails designing the mean and covariance functions to encode the desired statistical properties of $f$, such as correlations over the domain.
A critical property of GPs is that they enable exact, closed-form regression inference for observations corrupted by additive Gaussian noise. Let $f \sim \mathcal{GP}(\mu, K)$ have a GP prior and suppose we obtain a vector of observed values $\mathbf{y}$ at locations $\mathbf{X}$, where $y_i = f(x_i) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0,\sigma^2)$. Then the posterior belief of $f$ given $D = (\mathbf{X}, \mathbf{y})$ is again a GP with updated mean and covariance function:
$$\mu_{D}(x) = \mu(x) + K(x, \mathbf{X})\big[K(\mathbf{X}, \mathbf{X}) + \sigma^2\mathbf{I}\big]^{-1}\big(\mathbf{y} - \mu(\mathbf{X})\big),$$
$$K_{D}(x, x') = K(x, x') - K(x, \mathbf{X})\big[K(\mathbf{X}, \mathbf{X}) + \sigma^2\mathbf{I}\big]^{-1}K(\mathbf{X}, x').$$
Hence, appealing to the definition above, the posterior predictive distribution of any function value $f(x^*)$ is normal:
$$f(x^*) \mid D \sim \mathcal{N}\big(\mu_{D}(x^*),\; K_{D}(x^*, x^*)\big).$$
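These closed-form updates are simple to implement directly. The following minimal sketch is illustrative only: it assumes a zero-mean prior and a generic squared-exponential kernel rather than the mean and covariance functions used in our model.

```python
import numpy as np

def rbf(x1, x2, length=10.0, scale=0.05):
    # Squared-exponential kernel: a stand-in for any valid covariance K.
    d = x1[:, None] - x2[None, :]
    return scale * np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(X, y, Xstar, noise=0.01, kernel=rbf):
    """Exact GP regression posterior under additive Gaussian noise.

    Returns the mean and covariance of f(Xstar) given y = f(X) + eps,
    eps ~ N(0, noise), for a zero-mean GP prior with covariance `kernel`.
    """
    K = kernel(X, X) + noise * np.eye(len(X))   # K(X, X) + sigma^2 I
    Ks = kernel(Xstar, X)                       # K(X*, X)
    Kss = kernel(Xstar, Xstar)                  # K(X*, X*)
    mu = Ks @ np.linalg.solve(K, y)             # posterior mean
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)   # posterior covariance
    return mu, cov
```

Observing even a few noisy values shrinks the posterior variance at nearby inputs below its prior value, which is exactly the behavior we exploit when projecting sparse polls forward.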
This final point is important for our application. When modeling latent opinion with a Gaussian process, our prediction for latent public opinion on Election Day, $f_i(0)$, is a normal distribution that can be derived directly. This in turn becomes a normal prior for public opinion that is passed directly into the election-level model. This allows us to propagate our uncertainty about where public opinion will be on Election Day into the election-level model while at the same time significantly reducing computation time relative to Linzer (2013).
3.2 Projecting Public Support via GP Regression
We next outline our approach for forecasting voter preferences throughout an election given polling results. Our approach entails building independent GP models for each race conditioned on available polling outcomes.Footnote 6 The model includes only polls, hyperparameters, and priors (discussed below).
Denote by $\mathcal {C}$ the set of all candidates in all races we wish to reason about. We will consider the unknown proportion of voters preferring candidate $i \in \mathcal {C}$ in a given race a function of time, writing $f_{i}\colon (-\infty , 0] \to [0,1]$ . Here, the domain of the function is time (measured in days), where the election is defined to occur at time $t = 0\,\text {days}$ . Let $\mathcal {T}_i$ be the set of times when opinion polls for candidate i were conducted.Footnote 7
We model the trend of voter preferences $f_{i}$ as a sum of an underlying linear trend, $a_i + b_i t$, and smooth nonlinear deviations from this trend, $\eta_i(t)$. We place independent Gaussian priors on the intercept $a_i$ (i.e., the prior mean of the voter preferences on Election Day) and slope $b_i$ of the linear trend, and place an independent, zero-mean GP prior on the nonlinear component $\eta_i$. The covariance function $K$ determines the correlation of deviations from the linear trend as a function of time and is taken to be identical across all races. Here, we used the Matérn covariance function with $\nu =\tfrac {3}{2}$, which models isotropic, once-differentiable functions (Rasmussen and Williams 2006). This covariance function has two hyperparameters that we estimate from training data: a length scale $\rho$ determining the scale of correlations, and an output scale $\lambda$ determining the pointwise variance of the process. Intuitively, we can think of $\rho$ as determining the “window” of days over which nonlinear deviations are estimated and $\lambda$ as controlling the degree of nonlinearity we expect, such that higher values lead to more dramatic deviations.
The model can be summarized as:
$$f_i(t) = a_i + b_i t + \eta_i(t), \qquad a_i \sim \mathcal{N}(\bar{a}_i, \sigma_a^2), \qquad b_i \sim \mathcal{N}(0, \sigma_b^2), \qquad \eta_i \sim \mathcal{GP}(0, K),$$
where the covariance function for the nonlinear deviations is the Matérn covariance with $\nu = \tfrac{3}{2}$:
$$K(t, t') = \lambda \left(1 + \frac{\sqrt{3}\,|t - t'|}{\rho}\right) \exp\left(-\frac{\sqrt{3}\,|t - t'|}{\rho}\right).$$
The priors on the linear parameters are constructed to be broad for the slope (so that over a time period of roughly 100 days the linear trend could plausibly assume any possible value) and vaguely informative for the intercept; we will discuss the intercept mean parameter $\bar {a}_i$ shortly.
The above prior choices induce the following joint prior over the voter preferences, as shown in (5): marginalizing the linear coefficients yields
$$f_i \sim \mathcal{GP}(m_i, K_f), \qquad m_i(t) = \bar{a}_i, \qquad K_f(t, t') = \sigma_a^2 + \sigma_b^2\, t t' + K(t, t').$$
Notice that our model provides an automatic marginalization over the linear slope parameters, since the covariance function in our GP model has absorbed the hyperparameters controlling the prior distribution of the linear function parameters.
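Under these choices, the prior covariance of $f_i$ can be computed directly. The sketch below builds the Matérn-3/2 component and the full trend-plus-deviation kernel; the hyperparameter values are illustrative placeholders, not the fitted values reported in Table 1.

```python
import numpy as np

def matern32(t1, t2, rho=21.0, lam=0.02):
    # Matern nu=3/2 component for the nonlinear deviations eta(t);
    # rho = length scale (days), lam = output scale (pointwise variance).
    r = np.abs(t1[:, None] - t2[None, :]) / rho
    return lam * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def trend_kernel(t1, t2, var_a=0.01, var_b=1e-4, rho=21.0, lam=0.02):
    """Prior covariance of f(t) = a + b*t + eta(t) with the Gaussian
    priors on the intercept a and slope b marginalized out:
    sigma_a^2 + sigma_b^2 * t * t' + K_Matern(t, t')."""
    return var_a + var_b * np.outer(t1, t2) + matern32(t1, t2, rho, lam)
```

The resulting matrix is a valid (positive semi-definite) covariance because each of the three summands is itself a valid kernel.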
Our goal is to infer the latent voter preference trend from opinion poll outcomes, which are by their nature noisy. Our approach is to model the observed poll outcomes as binomially distributed, then approximate each binomial with a Gaussian for mathematical convenience. This will allow closed-form exact inference, yielding a posterior GP belief about underlying voter trends conditioned on available data. As discussed in Section 3.4, this step is an important innovation, allowing us to exactly solve for the posterior predictive distribution of $f_i(t)$ .
For a candidate $i\in \mathcal{C}$ with $S_i$ conducted opinion polls, let $\mathcal{D}_i=\{t_{is}, n_{is}, x_{is}\}$, $s=1,\dots,S_i$, denote the outcomes of all available polls involving that candidate. Here, $t_{is}$ is the time of the poll, $n_{is}$ is the sample size of the poll, and $x_{is}$ is the number of polled people expressing support for the candidate. Dropping subscripts momentarily, consider one such polling outcome $(t, x, n)\in \mathcal{D}$. We make the natural assumption that the number of supporters $x$ is binomially distributed given the sample size $n$ and the true (unknown) voter support $f$ at time $t$:
$$x \mid n, f(t) \sim \text{Binomial}\big(n, f(t)\big).$$
Unfortunately, it is not possible to condition a GP exactly on observations with a binomial likelihood. However, sample sizes for election polls tend to be large enough (often in the hundreds) that we can safely make a Gaussian approximation to the likelihood by moment matching. Here, we also explicitly include an additional general noise term $\sigma^2$, which captures another level of noise stemming from the polling data. Let $p = x/n$ be the observed proportion of support in a poll, so (8) can be approximated with
$$p \mid f(t) \sim \mathcal{N}\!\left(f(t),\; \frac{\hat{p}(1-\hat{p})}{n} + \sigma^2\right),$$
where we have substituted the estimate $\hat{p}$ for the true unknown proportion $f(t)$ in the variance (in our case, $\hat{p} = p$ after observation). This likelihood is now conjugate to our GP prior on $f$ and allows exact inference.
Let us define the vector $\mathbf{p}$ to collect the polling outcomes observed at times $\mathbf{t}$, with $p_{s}=x_{s}/n_{s}$, and further define $\mathbf{B}$ to be an $S \times S$ diagonal matrix with $B_{ss} = p_{s}(1 - p_{s})/n_{s}+\sigma^2$. This is the approximate noise variance for each of these measurements appearing in (9). Using the results in Section 3.1, the posterior predictive distribution of the voter preference at any time $t$ is:
$$f(t) \mid \mathcal{D} \sim \mathcal{N}\big(\mu(t) + K(t, \mathbf{t})(\mathbf{V} + \mathbf{B})^{-1}(\mathbf{p} - \boldsymbol{\mu}),\; K(t, t) - K(t, \mathbf{t})(\mathbf{V} + \mathbf{B})^{-1}K(\mathbf{t}, t)\big),$$
where $\boldsymbol\mu$ and $\mathbf{V}$ are the prior mean and covariance of $f(\mathbf{t})$, respectively. Although we may make forecasts for any time $t$, we are especially interested in public opinion on Election Day, $f(t=0)$. This will also be normal following Equation (10). For notational convenience below, we will again use subscripts for candidates and write $f_i(0) \sim \mathcal{N}\big(\mu_{f_i}, K_{f_i}\big)$.
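Putting the pieces together, the Election Day posterior for a single candidate can be computed in a few lines. The sketch below is a simplified illustration: it uses a constant prior mean $\bar{a}$ and a Matérn-only prior covariance (omitting the linear-trend component), with placeholder hyperparameter values.

```python
import numpy as np

def matern32(t1, t2, rho=21.0, lam=0.02):
    # Matern nu=3/2 covariance; rho = length scale (days), lam = output scale.
    r = np.abs(t1[:, None] - t2[None, :]) / rho
    return lam * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def election_day_posterior(t, x, n, a_bar=0.5, sigma2=1e-4):
    """Posterior mean and variance of latent support f(0) given polls.

    t: poll dates (days, negative before the election), x: supporters,
    n: sample sizes.  Each poll proportion p = x/n carries moment-matched
    noise B_ss = p(1 - p)/n + sigma2, as in the binomial approximation.
    """
    p = x / n
    B = np.diag(p * (1.0 - p) / n + sigma2)   # approximate poll noise
    V = matern32(t, t)                        # prior covariance of f(t)
    k0 = matern32(np.array([0.0]), t)         # cov(f(0), f(t))
    resid = p - a_bar                         # deviations from prior mean
    mu = a_bar + (k0 @ np.linalg.solve(V + B, resid)).item()
    var = (matern32(np.zeros(1), np.zeros(1))
           - k0 @ np.linalg.solve(V + B, k0.T)).item()
    return mu, var
```

The returned mean and variance are exactly the $\mu_{f_i}$ and $K_{f_i}$ that feed into the election-level model below.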
The candidate-level model is completed by choosing values for the intercepts $\{\bar {a}_i\}$ and the set of shared hyperparameters $\boldsymbol \omega = (\rho , \lambda , \sigma ^2)$ . Here, $\sigma ^2$ represents the level of unmodeled noise remaining in the polling data, $\lambda $ controls the degree to which the time trend deviates from linearity, and $\rho $ represents the “bandwidth” of the smoothing window for these nonlinear deviations.
We chose informative but wide hyperpriors for $\{a_i\}$ so that projections could be made in races with few or zero polls, but polling data would quickly swamp the prior when plentiful. Since the standard deviation for the hyperprior is set at $0.1$, any vote share within $\pm 30$ points of the prior mean should be well supported.Footnote 8 To set $\{\bar{a}_i\}$, we ran a simple regression in the training set with normalized vote share as the dependent variable and party, lagged partisan vote index (PVI), and level of prior experience as covariates.Footnote 9 While not an accurate model by itself, it proved to be an adequate prior.Footnote 10
For $\boldsymbol \omega $ , we adopt a leave-one-year-out (loyo) cross-validation approach using the training period from 1992 to 2016. The motivation is to choose hyperparameters that maximize predictive performance for election results even at the expense of choosing parameters that reduce fit for the polling data.
First, we define the search region for both the output scale and the shared noise to be $[0,0.05]$. We search the length scale with a minimum of 7 days and a maximum of 56 days.Footnote 11 Empirically, we generate candidate $\boldsymbol \omega$'s for the validation procedure from a low-discrepancy Sobol sequence (Sobol 1979) in the search region, since it covers the space more efficiently than a grid. We fit the complete model, including the election-level model, for each of 100 values of $\boldsymbol{\omega}$ at each time horizon ($\tau$), leaving out each year in turn.
For example, for choosing the hyperparameters for the model predicting 4 weeks in advance of the election, we used all of the polling data up to day $t=-28$ . We then fit the GP models and trained the election-level model described below leaving out each cycle in turn. We then generate out-of-sample predictions for vote shares and choose the hyperparameter setting that maximized the loyo log-likelihood averaged across all cycles.
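The search itself can be sketched as follows; `fit_and_predict` is a placeholder standing in for the full candidate- and election-level fitting pipeline, not part of our actual codebase.

```python
import numpy as np
from scipy.stats import qmc

# Search region from the text: length scale in [7, 56] days, output
# scale and shared noise each in [0, 0.05].
sampler = qmc.Sobol(d=3, scramble=True, seed=0)
sample = sampler.random(128)   # powers of two balance the Sobol sequence
candidates = qmc.scale(sample, [7.0, 0.0, 0.0], [56.0, 0.05, 0.05])[:100]

def loyo_score(omega, years, fit_and_predict):
    """Leave-one-year-out score: average held-out log-likelihood of
    election results when fitting with hyperparameters `omega` and
    predicting each held-out cycle in turn."""
    return np.mean([fit_and_predict(omega, held_out=yr) for yr in years])

# best_omega = max(candidates,
#                  key=lambda w: loyo_score(w, years, fit_and_predict))
```

Because the score targets held-out election results rather than poll fit, the selected hyperparameters deliberately trade poll-level accuracy for better vote-share prediction.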
The chosen hyperparameters for each time horizon are shown in Table 1, and examples of the resulting candidate-level models for one candidate (John McCain in 2016) are shown at various time horizons in Figure 2. (Supplementary Appendix E shows another example for Democrat Katie McGinty [PA-2016].) This approach yields models that begin as linear far from Election Day but become increasingly nonlinear as $\tau$ approaches zero. Note also that the uncertainty in $f_i(0)$ narrows considerably in the run-up to Election Day.
3.3 Election-Level Model
The goal of the candidate-level model (Section 3.2) is to project forward, at any time horizon, a predictive distribution of latent public support on Election Day, $f_i(0)$. The election-level model in this section takes $f_i(0)$ as an input and combines it with additional contextual factors to generate a predictive distribution over vote shares. Our method is based on Dirichlet regression, which allows prediction of the election vote shares for multiple candidates.Footnote 12
In our setting, the vast majority of races involve only two credible candidates.Footnote 13 Indeed, in the 1992–2018 period, we included more than two candidates in only 11 elections.Footnote 14 However, we retain the Dirichlet presentation here because it is more general, and races with third parties can be critical in any given cycle.
Relying on the Dirichlet likelihood contrasts with some work in political science on multi-party elections, which builds on the logistic normal distribution (or t-distribution) applied to log-ratios of the votes (Katz and King 1999) or seemingly unrelated regression (Tomz, Tucker, and Wittenberg 2002). The primary criticism of Dirichlet regression is that it assumes that ratios of outcomes are independent (Aitchison 1982; Katz and King 1999; Philips, Rutherford, and Whitten 2016), which is unrealistic in more standard settings such as multi-party elections. However, while the outcome in U.S. Senate races is always compositional, the meanings of the categories do not correspond across races as these alternative models require. That is, the “third choices” are typically idiosyncratic to each race. So, for instance, the third-party candidate in the 2018 New Mexico race was Libertarian Gary Johnson, while in the 2008 Minnesota race it was Independence Party candidate Dean Barkley. In other cases, even the major party candidate labels can be confused. For instance, in the 2012 Maine election Cynthia Dill was the official Democratic nominee, while Independent Angus King garnered a significant amount of support from Democrats and caucused with the party once he joined the Senate. Indeed, in some instances the category meanings are unstable even when there are only two choices. For example, the 2016 California race featured two Democrats. We therefore retain the Dirichlet regression approach despite the independence assumption, since modeling the dependence between the choice categories is impossible when the choice set itself changes from observation to observation.Footnote 15
We model the parameters of the Dirichlet distribution as a linear function of voter preferences and “fundamentals.” This is similar in nature to other generalized linear models, where a linear combination of terms is passed to a link function. In this case, each candidate is represented by a “concentration parameter,” $\alpha_i$, which we model as a linear combination of $f_i(0)$ and other covariates. The unique feature of the Dirichlet regression is that each race is characterized by a vector of concentration parameters, $\boldsymbol{\alpha}_j$, with one $\alpha_i$ for each candidate. When the concentration parameter for candidate $i$ is relatively large, she is expected to earn a higher proportion of the vote. Furthermore, the predictive density is more concentrated around this expected value when the individual components of $\boldsymbol{\alpha}_j$ are large.
More formally, consider an arbitrary race $j$ with $m_j$ candidates and a specific candidate $i$. We assume a simple linear model that maps the voter preference $f_{i}(0)$ to the underlying concentration parameter $\alpha_{i}$. Although there are many possible covariates we could include, we found that very few actually improved out-of-sample predictive performance.Footnote 16 We therefore include only the lagged PVI generated by the Cook Political Report (Campbell 2018; MacWilliams 2015) and an indicator for candidate experience, coded one if the candidate has held elected office before and zero otherwise (Jacobson 1989; Jacobson and Carson 2019). We also include a year-level random effect to accommodate unmodeled electoral “swings” associated with specific election cycles. PVI and the year random effects are reverse coded by party.
Now collect $\boldsymbol{\alpha}_j=(\alpha_{1j}+\tilde{\alpha},\dots,\alpha_{m_j j}+\tilde{\alpha})$ from all candidates in the race ($\tilde{\alpha} \ge 0$). The base parameter $\tilde{\alpha}$ is introduced for two reasons. First, it can reduce the variance of samples and thus stabilize the MCMC sampling. Second, $\tilde{\alpha}$ encodes the prior belief about how equally the vote shares should be distributed absent any additional information. We assume that the actual vote share vector $\mathbf{y}_j=(y_{1j},\dots,y_{m_j j})^\top$ follows a Dirichlet distribution, $\mathbf{y}_j\sim \text{Dir}(\mathbf{y}_j; \boldsymbol{\alpha}_j)$, where $\alpha_{ij}$ is a linear function of $f_i(0)$ and contextual predictors. We also need to integrate over the distribution of $f_i(0)$; in our case, the distribution of $f_i(0)$ is a truncated Gaussian. In total, we assume the election outcomes follow the data generating process in (13)–(16):
Here, we allow party to be equal to 1 for Democratic candidates, $-1$ for Republicans, and 0 for independents.Footnote 17 This allows for PVI and the year random effects to have opposite effects by party.
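To make the structure concrete, the sketch below simulates one hypothetical two-candidate race following the logic described above: each candidate's concentration parameter is built from $f_i(0)$, PVI (reverse coded by party), experience, and a year random effect, and the vote-share vector is drawn from a Dirichlet. The softplus link, the coefficient values, and all variable names are our own illustrative assumptions; the paper's equations (13)–(16) give the exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def concentration(f0, pvi, experience, party, theta, gamma_year, alpha_base):
    """Map a candidate's predicted Election Day preference f_i(0) and
    contextual covariates to a Dirichlet concentration parameter.
    The softplus link and coefficient values are illustrative choices,
    not the paper's exact parameterization."""
    eta = (theta[0]
           + theta[1] * f0                      # predicted latent support
           + theta[2] * party * pvi             # PVI, reverse coded by party
           + theta[3] * experience              # has held elected office before
           + party * gamma_year)                # year random effect, by party
    return np.log1p(np.exp(eta)) + alpha_base   # softplus keeps alpha_i > 0

# Hypothetical race: an experienced Democrat polling at 52% in a D+10 state
# versus a Republican challenger polling at 48%.
theta = np.array([0.5, 8.0, 0.05, 0.3])
alpha = np.array([
    concentration(0.52, 10.0, 1, +1, theta, 0.1, 1.0),
    concentration(0.48, 10.0, 0, -1, theta, 0.1, 1.0),
])
vote_share = rng.dirichlet(alpha)   # one simulated outcome y_j
```

Note how the party coding flips the sign of both the PVI term and the year effect, so a pro-Democratic state helps the Democrat and hurts the Republican symmetrically.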
The model is completed by placing proper but vague priors across all parameters. The priors for the $\theta $ parameters are set to be wide based on the scale of the relevant variable. Specifically, we set independent truncated Gaussian priors on the $\theta $ coefficients and the year-level random effects.
We can combine all of these parameters together in $\boldsymbol \Theta = \big (\{\theta \}, \{\gamma _{\text {year}}\}, \sigma _{\text {year}}^2, \tilde {\alpha }\big )$ and let $\mathbf {z}_j$ be the vector of contextual factors for election j. We obtain the posterior $p(\boldsymbol \Theta \mid \{\mathbf {y}_j \},\{\mathbf z_j\}, \{f_i(0)\})$ using MCMC estimation. Specifically, we use no-U-turn sampling in Stan (Carpenter et al. Reference Carpenter, Gelman, Hoffman, Lee, Goodrich, Betancourt, Brubaker, Guo, Li and Riddell2017; Hoffman and Gelman Reference Hoffman and Gelman2014).
With this posterior, the final predictive distribution of future election outcomes with new $\{\mathbf {f}_i^*(0)\}, \{\mathbf {z}_j^*\}$ will be defined by (13)–(16) marginalized by the posteriors:
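In practice, this marginalization amounts to Monte Carlo averaging: for each predictive draw, sample $\boldsymbol\Theta$ from the stored MCMC output, sample each $f_i^*(0)$ from its truncated Gaussian, form the concentration vector, and draw a vote-share vector. The sketch below illustrates this under assumed names; the exp link, the coefficient layout, and the synthetic "posterior" are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

def trunc_normal(mean, sd, lo=0.0, hi=1.0):
    """Draw a vector from independent Gaussians truncated to [lo, hi]
    via rejection sampling (fine when most of the mass is inside)."""
    while True:
        x = rng.normal(mean, sd)
        if np.all((x >= lo) & (x <= hi)):
            return x

def predictive_draws(post_theta, post_alpha_base, f_mean, f_sd, z, n_draws=2000):
    """Monte Carlo approximation of the posterior predictive density:
    each draw picks one posterior sample of Theta, samples f_i(0) from its
    truncated Gaussian, forms the concentration vector, and draws a
    vote-share vector from the implied Dirichlet."""
    shares = np.empty((n_draws, len(f_mean)))
    for d in range(n_draws):
        k = rng.integers(len(post_theta))       # one posterior draw of Theta
        f0 = trunc_normal(f_mean, f_sd)         # predictive draw of f_i(0)
        alpha = np.exp(post_theta[k, 0] + post_theta[k, 1] * f0
                       + post_theta[k, 2] * z)
        shares[d] = rng.dirichlet(alpha + post_alpha_base[k])
    return shares

# Hypothetical posterior: 500 draws of (intercept, f-coefficient, z-coefficient).
post_theta = rng.normal([0.5, 6.0, 0.2], 0.05, size=(500, 3))
post_alpha_base = np.full(500, 1.0)
draws = predictive_draws(post_theta, post_alpha_base,
                         f_mean=np.array([0.53, 0.47]),
                         f_sd=np.array([0.03, 0.03]),
                         z=np.array([0.5, -0.5]))
win_prob = (draws[:, 0] > draws[:, 1]).mean()   # P(candidate 1 wins)
```

Because each draw resamples both $\boldsymbol\Theta$ and $f_i^*(0)$, the resulting predictive intervals reflect parameter uncertainty as well as uncertainty in the polling-based forecast.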
A final issue is how to handle the dynamic nature of our forecasting task. While we have the complete set of polls for elections in our training set, when making real-time forecasts we have only the polls up to the current date. Training the model on the complete set of polls (all the way up to Election Day) is likely to lead to too much weight being assigned to polling data and thus to poor predictive performance at remote time horizons. For instance, the coefficients of the Dirichlet regression component in the election model may place too much confidence in the polling. As noted above, the same issue applies to hyperparameter selection for the candidate-level model.
To address this concern, we train the complete model at various time horizons denoted by $\tau $ . For any threshold, we discard all data where $|t|<\tau $ . Thus, when $\tau =28$ , we ignore all polls in the training data closer than 28 days to the election. This again helps calibrate the model for the levels of accuracy we can expect at various horizons. Table F.1 in Supplementary Appendix F shows the summaries for the posteriors of the model parameters at horizons ranging from $\tau =0$ to $\tau =56$ (8 weeks before the election). As expected, the $\theta $ parameter associated with $f_i(0)$ increases as Election Day approaches, while the fundamentals parameters become relatively less important.
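The truncation rule itself is simple to state in code. A minimal sketch, assuming polls are stored as (field date, result) pairs; the schema is hypothetical:

```python
from datetime import date

def polls_at_horizon(polls, election_day, tau):
    """Keep only polls fielded at least tau days before Election Day,
    mimicking the training rule of discarding data with |t| < tau."""
    return [p for p in polls if (election_day - p[0]).days >= tau]

# Three hypothetical polls for a race with Election Day on Nov 8, 2016.
polls = [(date(2016, 8, 15), 0.51),
         (date(2016, 10, 20), 0.49),
         (date(2016, 11, 5), 0.50)]
election_day = date(2016, 11, 8)

train_28 = polls_at_horizon(polls, election_day, 28)  # only the August poll survives
train_0 = polls_at_horizon(polls, election_day, 0)    # all polls retained
```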
Figure 3 shows the posterior prediction for Senator John McCain in 2016 for various time horizons. Note that the outcome (marked with the vertical blue line) is near the center of the posterior for all horizons, but that the prediction becomes more concentrated as Election Day nears. This reflects both more certainty in $f_i(0)$ and changing weights in the Dirichlet regression.
3.4 Discussion
Before turning to our results, we briefly contrast our approach with existing forecasting models in the literature. Most importantly, we combine a dynamic, poll-based model with an election-level model trained on historical data to make predictions about individual Senate races. Some existing poll-based models are dynamic (e.g., Jackman Reference Jackman2005; Linzer Reference Linzer2013), while others create district-level forecasts based on historical election results (e.g., Klarner Reference Klarner2012). However, to the best of our knowledge, this is the first published model to explicitly combine these two approaches.Footnote 18
Second, we introduce a GP framework for modeling trends in public opinion. Although related to the random walk model in Linzer (Reference Linzer2013), it differs in two crucial ways. To begin with, the GP model allows us to model polling trends as a linear process with nonlinear deviations, which (as we show below) offers significant improvements in predictive performance when polling data is sparse. Further, by adopting the Gaussian approximation to the binomial likelihood in Equation (9), we can exactly derive the posteriors for each candidate. This computational efficiency allows us to build the election-level model and facilitates our loyo cross-validation approach.
Finally, it is also worth considering the computational resources required by the model. Assuming that the hyperparameters have been selected, running the complete model is quite fast. A standard run with 5,000 MCMC iterations takes roughly 5 min on an Intel i7 CPU (running three chains in parallel). The GP component is very fast because the results can be computed exactly without sampling, usually completing in under a minute. This contrasts with, for instance, a Stan implementation of the Linzer (Reference Linzer2013) model, which takes approximately 30 min for a given election cycle. Thus, during any one election, the computational load is very reasonable.
The computational bottleneck with our approach is the loyo cross-validation procedure for choosing hyperparameters. As described above, we ran the loyo validation for the 1992–2016 period with 100 hyperparameter settings at seven forecasting horizons. With three MCMC chains for each model, this results in 27,300 posteriors. Thus, even with a 5-min run time, the cumulative computation is intensive enough to require a computing cluster. This exercise only needs to be done once in advance of any specific election cycle, but it is nonetheless time consuming. We return to this point in our concluding discussion.
4 Empirical Evaluation
In this section, we investigate our model using historical polling data and vote shares in U.S. Senate elections from 1992 to 2018. Throughout our model building process, we held out the 2018 election as a test case and it was not involved in any hyperparameter tuning, variable selection, or other decisions. Therefore, we can assess the model’s predictive performance using the 1992–2016 period, but also approximate its true out-of-sample performance using the 2018 data. In the next section, we report predictions for 2020 actually made in advance of Election Day.
4.1 Data and Evaluation Criteria
We obtained opinion polls and election outcomes for all Senate races from 1992 to 2018 from www.fivethirtyeight.com and from CNN. On average, 16 polls were conducted for each race, although some races, such as the 2016 Florida election, had over 80. Most surveys were conducted 2 weeks to 4 months prior to the election, with a median of 635 respondents. Over 470 entities have conducted these polls, but several active pollsters collectively contribute over half of them, including Rasmussen Reports, Mason-Dixon Polling, Public Policy Polling, SurveyUSA, YouGov, SurveyMonkey, Quinnipiac, and Zogby Interactive. To guard against bias, we eliminated polls sponsored by parties or candidates, since these sources tend not to release unfavorable polls.
We also acquired the partisan voting index (PVI) for each election state in every cycle. The Cook PVI measures how strongly a U.S. congressional district or state leans toward the Democratic or Republican Party compared to the nation as a whole. For example, the PVI for California in 2018 is 10.76, indicating a strong preference for Democratic candidates, while the PVI for the pro-Republican state of Texas in 2018 is $-7.02$ . For each candidate, we coded partisan affiliation and past experience (whether or not they previously held office). Where not provided in the CNN data, we coded these manually using ballotpedia.com.
To evaluate performance, we examine both the forecasting precision and the validity of our model. Hence, we consider the following measures: (i) the average root-mean-squared error (RMSE) between the expectation of the Dirichlet posterior samples and the actual vote shares; (ii) the accuracy of winner predictions, where the predicted winner is the candidate with the highest winning probability (calculated as the proportion of Dirichlet posterior samples in which that candidate earns the highest vote share); (iii) the coverage rate of actual vote shares within the 95% credible intervals of the Dirichlet posteriors; (iv) the average multinomial predictive likelihood; and (v) the average log-scaled Dirichlet predictive likelihood (LL). RMSE and prediction accuracy capture forecasting precision, while the coverage rate assesses the validity of the claimed credible intervals. The two likelihood measures serve as out-of-sample evaluation criteria for both vote shares (Dirichlet) and final outcomes (multinomial), and they also reflect the uncertainty in the full posterior.
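These criteria can all be computed directly from the predictive vote-share draws. The sketch below implements RMSE, winning probability, winner accuracy, and 95% coverage for a single race under assumed names; the synthetic Dirichlet "posterior" is purely illustrative.

```python
import numpy as np

def evaluate(post_shares, actual):
    """Evaluation criteria from predictive vote-share samples.
    post_shares: (n_samples, n_candidates) draws for one race;
    actual: the observed vote-share vector."""
    mean_pred = post_shares.mean(axis=0)
    rmse = np.sqrt(np.mean((mean_pred - actual) ** 2))
    # Winning probability: the share of draws in which each candidate leads.
    win_prob = np.bincount(post_shares.argmax(axis=1),
                           minlength=post_shares.shape[1]) / len(post_shares)
    correct_winner = win_prob.argmax() == np.argmax(actual)
    # Coverage: do the observed shares fall in the 95% credible intervals?
    lo, hi = np.percentile(post_shares, [2.5, 97.5], axis=0)
    coverage = np.mean((actual >= lo) & (actual <= hi))
    return rmse, win_prob, correct_winner, coverage

rng = np.random.default_rng(2)
draws = rng.dirichlet([60.0, 40.0], size=4000)   # favorite vs. underdog
rmse, win_prob, correct, coverage = evaluate(draws, np.array([0.58, 0.42]))
```

Averaging these per-race quantities over all held-out races yields the summary statistics reported below; the likelihood-based criteria additionally require evaluating the Dirichlet and multinomial densities at the observed outcomes.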
4.2 Baselines
We compare the performance of our combined GP and Dirichlet regression (GP+DR) model against three benchmarks. First, we compare our model to the dynamic Bayesian model in Linzer (Reference Linzer2013). This model was developed for predicting state-level results in presidential elections, but we adjusted it for predicting Senate races. Intuitively, this model is a dynamic Bayesian random walk (BRW) model similar to the nonlinear component in the model described above except that latent public opinion is assumed to be a random walk. We use the same informative prior for $\{{a}\}$ as used above and use the tuning parameters and basic estimation procedures as described in Linzer (Reference Linzer2013).Footnote 19
Second, we consider a baseline Dirichlet regression model that uses a Bayesian linear regression to forecast voter preferences. We refer to this second baseline as LM+DR. To ensure a fair comparison, we choose the same priors for the linear coefficients as those used in the GP priors. We also chose the $\sigma ^2$ hyperparameter using the same cross-validation approach described above. This model is, in essence, the same as the one described above but without allowing for deviations from linearity. Finally, we examine the performance of the GP model in isolation, excluding the Dirichlet regression portion of the hierarchy. Note that while we frame these as competitors to our favored model, both of these baselines are also novel.
4.3 Results
First, we present results from a loyo cross validation exercise where each election cycle from 1992 to 2016 was held out. This has the advantage that we can use the complete set of election outcomes to validate the model. However, since we followed an identical procedure when choosing our hyperparameters above, there may still be some risk of overfitting.Footnote 20 We therefore also present results for the 2018 election separately which serves as a stronger out-of-sample test.
We simulate a real forecasting scenario and examine the model's forecasting ability at various horizons $\tau $ . Specifically, we consider horizons of 8 weeks, 6 weeks, 4 weeks, 3 weeks, 2 weeks, 1 week, and Election Day, where $\tau =56,42,28,21,14,7,0$ . As noted above, Table 1 summarizes the hyperparameters learnt for the candidate-level model used throughout this exercise.
Table 2 shows the results of the loyo cross validation exercise for the 1992–2016 period. The results show that the GP+DR model on average outperforms the other baselines across metrics. The closest competitor is actually the LM+DR model, which performs quite well in terms of coverage and accuracy. This is explained in part by the fact that the GP model itself is mostly linear at distant horizons and when there is little polling data. However, the nonlinear component in the GP does provide measurable improvements over the linear version in the final lead-up to the election, when the hyperparameters most enable nonlinear deviations (see Figure 2). In Supplementary Appendix G, we use a paired t-test to show that this improvement in accuracy is statistically significant when $\tau \le 21$ .
Cells report fit statistics at various simulated time horizons using a leave-one-year-out cross validation. RMSE is root mean squared error for the point predictions, while the 95% coverage is the percent of vote shares that fall within the predicted 95% credible intervals. Predictive accuracy measures the percent of races predicted correctly across cycles. Average predicted log-likelihoods (APLL) are computed using the Dirichlet likelihood (for vote share predictions) and the multinomial likelihood (for winner predictions).
We then predict the 2018 cycle, which was not used in our model development or cross validation, and find a nearly identical pattern. (Full results for 2018 are shown in Supplementary Appendix H.) The RMSE for the Election Day forecast was 0.053, 0.055, 0.060, and 0.075 for the GP+DR, LM+DR, BRW, and GP models, respectively. Meanwhile, the predictive accuracy was 0.951, 0.932, 0.898, and 0.936.Footnote 21
Figure 4 shows the predictions, 95% predictive credible intervals, and outcomes for the 2018 Senate elections with $\tau =7$ . The results show that all election outcomes fell within the 95% credible range and that, on average, the forecast tracked the actual election outcomes very closely. Moreover, the elections where the model is incorrect at a 7-day range are also among the closest contests in that cycle (Arizona and Nevada). Finally, the width of the credible interval can vary significantly depending on the number and recency of polls for that election. For instance, the credible intervals for Wyoming are very wide, reflecting the fact that we had only one poll. This contrasts with, for instance, Missouri, where dozens of polls were reported.
5 Predicting the 2020 Cycle
Finally, we turn to the task of predicting the 2020 Senate elections. For this cycle, we again acquired all data from the fivethirtyeight.com website. Following the procedures outlined above, we exclude all partisan polls and date each poll based on the first day it was fielded. We did not include any third-party candidates,Footnote 22 and we exclude the Georgia special election and the Louisiana Senate race due to the potential for a runoff after November.Footnote 23 We used the same hyperparameters as shown in Table 1, but refit the Dirichlet regression using the complete 1992–2018 training period.
The final predictive densities for the Democratic candidates are shown in Figure 5 (we show only one party since we modeled only two candidates in each state). The model predicted that the Democrats were favored to win in four Republican-held seats (CO, ME, AZ, and NC) and to lose Alabama. However, the election outcomes were predicted to be very close in many states including MS, AK, MT, SC, GA, IA, NC, AZ, ME, and CO (states here are ordered by the degree to which they favor the Democratic candidates).Footnote 24
In all, the forecast was accurate, missing only two election outcomes. One miss was North Carolina, which our model predicted as being a narrow Democratic victory and turned out to be a narrow Republican victory. The only serious miss was Maine, where pre-election polling was dramatically off.Footnote 25 Maine was also the only case where the result fell outside of our 95% predictive CI, giving us 96.9% coverage.
We can compare this performance to The Economist and fivethirtyeight.com models, although it is important to note that their methods are not public. These results are shown in Table 3. Our model outperformed The Economist model on all metrics. In addition to NC and ME, The Economist also missed Iowa and (the plurality winner in) Georgia. Their 95% out-of-sample coverage rate was 90.6%, as, in addition to Maine, their model also missed New Jersey and West Virginia.
It is not as easy to directly compare performance with the fivethirtyeight.com forecasts, as they predict non-normalized vote share (not two-party vote share), provide only 80% predictive intervals, and actually produce three predictions. Thus, for instance, the RMSE metric is not on the same scale as our model, which predicts the normalized vote share (excluding write-ins, third-party votes, etc.). However, the results in Table 3 indicate that our model performed at comparable levels of accuracy and coverage as their forecasts, although ours is perhaps slightly conservative in having an 87.5% coverage rate for the 80% CIs. Notably, our model made the same winner prediction for all of the 2020 elections as their "Deluxe" model, while their other variants missed the plurality winner in Georgia. We also have a lower RMSE than all three variations. In all, we consider this to be evidence that our model is at least as accurate as fivethirtyeight.com's while having the advantage of being a public and transparent methodology that can be studied and improved upon by other forecasters.
6 Conclusion
In this article, we offer a novel approach to dynamic election prediction that combines both poll-based and fundamentals-based forecasting. Although the model itself is somewhat complex, in the end it includes only a few variables: polling, PVI, experience, and party. The novelty here is not in what factors go into the model, but how they are combined to create accurate, well-calibrated predictions.
Our approach contains two basic stages. The first is to treat polling data as a probabilistic representation of latent public support for a candidate, where this latent support has both a linear and a nonlinear trend. By fitting a model to this trend, we can accurately predict where public opinion will be on Election Day. Second, we incorporate predictions about this latent position into a Dirichlet regression that uses a few simple features of each election to estimate, from historical data, the degree to which polling predicts election outcomes. A final innovation is that we train the complete model at different time horizons to ensure that our final predictions reflect an appropriate level of uncertainty.
While we believe that this model improves upon other Senate forecasting models in the literature, it could be refined in several ways. First, we might better extend it to handle unusual cases like runoff elections or special elections (e.g., the 2020 Georgia special election) or the potential for instant runoffs in states adopting ranked-choice voting. We could, in theory, also extend the model to account for "house effects" of various polling firms or weight more accurate firms more highly in the candidate-level model. Likewise, we could try alternative variables in the construction of the candidate-level prior or in the election-level model.Footnote 26 However, adding such complications should be done with caution, as they may lead to overfitting. Many variables (e.g., money raised or incumbency status) should be reflected in the polling data. Once we have conditioned on latent public support, the list of accurate predictors of outcomes is much smaller. Finally, in retrospect it is relatively easy to identify which third-party candidates should be included in a prediction model, since they appear regularly in the polling data and receive a considerable vote share. However, future work might improve upon our efforts by more clearly defining a rule for when to include minor candidates based on ex ante conditions.
A further shortcoming is that our model does not allow online updating of hyperparameters: forecasters must learn customized hyperparameters from scratch for every new horizon. In Table 1, the learnt length scales and noise standard deviations are roughly constant across horizons, while the learnt output scales shrink at earlier horizons. When computational capacity is limited, practitioners may use the same optimal hyperparameters across horizons and warp the output scale according to the forecasting horizon.
A third extension would be to adjust the model to handle elections at different levels. The model would be relatively straightforward to extend to, for example, gubernatorial races. However, more significant adjustments may be needed for lower-level (e.g., races for the U.S. House of Representatives) or higher-level (presidential) elections. Lower-level races are unusual in that there is even less polling data available for most races, which may require heavier reliance on contextual factors or cycle-level factors such as generic ballots. Meanwhile, presidential races usually offer many more polls, but the election-level training data is necessarily very sparse at the national level and the state-level outcomes are much more correlated. Researchers wishing to extend this basic approach to those settings should think carefully about how to construct the election-level and candidate-level models to account for these important differences. It will also be important to consider how well our approach to, for instance, cross validation will work given smaller sample sizes.
Finally, it is important to remember that while we have taken steps to gauge the accuracy of the model, there is no feasible way to assess its true long-term out-of-sample performance until we observe more election outcomes. We created a held-out prediction for 2018 and a true prediction for 2020, but there is always the risk that idiosyncratic features of these election cycles are driving the results. It will be important to re-evaluate the model's performance in future cycles.
Acknowledgments
We are grateful to Harry Enten at CNN for providing data and to David Carlson for help and collaboration on an earlier attempt at modeling elections. We also appreciated the help of the Political Analysis editorial staff as well as the comments from our reviewers.
Funding
YC and RG were supported by the National Science Foundation (NSF) under award number IIS–1845434.
Data Availability Statement
Replication code for this article has been published in Code Ocean, a computational reproducibility platform that enables users to run the code, and can be viewed interactively at https://doi.org/10.24433/CO.4154884.v1 (Chen, Garnett, and Montgomery Reference Chen, Garnett and Montgomery2021a). A preservation copy of the same code and data can also be accessed via Harvard Dataverse at https://doi.org/10.7910/DVN/GNHESM (Chen, Garnett, and Montgomery Reference Chen, Garnett and Montgomery2021b).
Supplementary Material
For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2021.42.