Summary and Critique By Claude Bingham
Because Bayesian models are considered strong simulations of scenarios, a group of researchers tested whether combining multiple models into an ensemble improved forecasting accuracy. The researchers pointed out that political scientists rarely use their models to predict future events, preferring instead to construct models that validate theories against past events. This experiment attempted to show that future events could be forecast in a similar way.
To do so, they developed a variation of Bayesian statistics called Ensemble Bayesian Model Averaging (EBMA), which pools information from multiple forecasts into a single ensemble prediction, similar to a weighted average of the component models. To obtain the weights, a validation period was used to ascertain the relative accuracy of each component model. The aim is not to show which single model is most valid, but to show that combining multiple model variations yields a more accurate result.
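The pooling step described above can be sketched as a simple weighted average of component forecasts. This is a minimal illustration, not the authors' implementation: the forecasts and weights below are hypothetical, and real EBMA estimates its weights (and a calibration transform for each component) from the validation period rather than taking them as given.

```python
def ensemble_forecast(forecasts, weights):
    """Pool component forecasts into one prediction via a weighted average."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(w * f for w, f in zip(forecasts, weights))

# Hypothetical predicted probabilities from three component models
# for a single observation:
forecasts = [0.70, 0.40, 0.95]

# Hypothetical validation-period weights; a poorly performing model
# can receive a weight near zero, effectively dropping it:
weights = [0.85, 0.15, 0.00]

print(ensemble_forecast(forecasts, weights))  # 0.655
```

Note how the zero-weighted model contributes nothing to the ensemble, mirroring the insurgency case described below in which one component was weighted at 0.00.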
The researchers tested this method in multiple scenarios. The first was a prediction of violent insurgency, using data for 29 countries across the 12 calendar months of 2010. Three models were constructed, with a machine-learning (training) period from January 1999 to December 2007 and a validation period from January 2008 to December 2009; the test period was 2010. One model proved so inaccurate that it was weighted at 0.00; the other two were weighted at 0.85 and 0.15. Against the observed 2010 outcomes, the EBMA model reduced prediction error by 43%. The EBMA model also showed a higher percentage of correct classifications and a lower average squared deviation of the predicted probability from the true event, meaning its predicted probabilities were, on average, closer to the actual observations than those of rival models.
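The "average squared deviation of the predicted probability from the true event" mentioned above is the Brier score, where lower is better. A minimal sketch, using made-up predictions and outcomes purely for illustration:

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# Hypothetical predicted probabilities and observed indicators
# (1 = insurgency occurred, 0 = it did not):
probs = [0.9, 0.2, 0.6, 0.1]
outcomes = [1, 0, 1, 0]

print(brier_score(probs, outcomes))  # 0.055
```

A model that assigns high probability to events that occur and low probability to events that do not will score near zero, which is why the metric rewards ensembles whose pooled probabilities sit closer to the realized outcomes.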
In the example of presidential election forecasts, it was shown that including too many highly correlated models can actually harm the accuracy and validity of the ensemble. While the EBMA forecast was closer on average and deviated less from the observed results, it was never the single most accurate model; however, it was also never the least accurate.
Finally, two models were used to test the accuracy of Supreme Court decision predictions: one drew on subject-matter experts, the other on a statistical algorithm based on case factors. When combined, the EBMA ensemble outperformed both individual methods.
I appreciated the use of multiple case scenarios with vastly different parameters and outcome types. Additionally, the researchers' candor about the flaws of their method in each scenario helps readers plan for statistical defects should they apply this model. I would have liked to see a scenario with one-off characteristics as well. I understand that one-off events are hard to organize into standardized variables, but as an intelligence professional, I find that many of the events that truly matter manifest as one-offs.
The original research can be viewed here: https://pages.wustl.edu/montgomery/ebma