Condorcet’s Jury Theorem
Repurposing an Obscure 18th Century Political Science Theorem for Machine Learning
We’ve finally arrived at the end of the NFL Prediction Series — thanks for hanging in there! Today we’re going to combine the three models developed over the last few posts into a single voting model. In the process we’ll learn about Condorcet’s Jury Theorem[1][2] and how it applies to machine learning. Let’s dive in!
So What is it?
Condorcet’s Jury Theorem is a mathematical theorem exploring the idea of a group of individuals, i.e. a jury, arriving at the correct decision by majority vote. In this “courtroom”, only binary verdicts are allowed: the defendant is either innocent or guilty. You also need an odd number of jurors so that there is no possibility of a hung jury (no ties). In the end, the theorem posits that if certain criteria are met, the jury can become infallible … sounds intriguing, doesn’t it?!
To put the requirements and results in black & white:
Condorcet’s Jury Theorem requires that the following three criteria be met:
Unconditional Independence: the jurors arrive at their vote independently
Unconditional Competence: each juror is greater than 50% accurate
Uniformity: all jurors have the same accuracy
If you meet all three of these assumptions, your jury has Growing Reliability and Crowd Infallibility
Growing Reliability: the jury’s accuracy increases as the number of jurors increases
Crowd Infallibility: the jury’s accuracy increases to 100% as the number of jurors goes to infinity
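In symbols (this is the standard binomial statement of the theorem), if each of the n jurors (n odd) independently votes correctly with probability p, then the probability that the majority verdict is correct is:

$$P_{\text{majority}} = \sum_{k=\frac{n+1}{2}}^{n} \binom{n}{k}\, p^{k}\,(1-p)^{n-k}$$

For any p > 0.5, this sum grows with n and approaches 1 as n goes to infinity, which is exactly Growing Reliability and Crowd Infallibility.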
Why is it Cool?
I really like this framework because, although it was birthed in the 18th century as a political science idea, it can be easily repurposed for machine learning by making the following translations:
Juror —> Prediction Model
Jury —> Ensemble Model
Vote —> Prediction
And in fact, the ensemble models that you may be familiar with, like a Random Forest, leverage this concept to great effect.
The power of this concept is that if you have many independent models that are each better than 50% accurate, then the accuracy of your ensemble classifier will increase as you add more models. So no matter the difficulty of your prediction task, in theory, you can improve by finding more and more independent models!
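To make Growing Reliability concrete, here’s a minimal sketch in plain Python that evaluates the binomial sum above, assuming identical, independent voters with accuracy p:

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n independent voters,
    each correct with probability p, reaches the right verdict."""
    assert n % 2 == 1, "use an odd number of voters to avoid ties"
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range((n + 1) // 2, n + 1))

# Growing Reliability in action: accuracy climbs toward 1.0 as n grows
for n in (1, 3, 11, 101):
    print(n, round(majority_accuracy(0.6, n), 3))
# 1 -> 0.6, 3 -> 0.648, 11 -> 0.753, 101 -> ~0.98
```

Note that the effect cuts both ways: if p dips below 0.5, adding more voters makes the jury worse, not better.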
The Model
As I stated earlier, in this post we’ll combine the Elo Model, the Implied Vegas Model, and the Decision Rating Model into a single ensemble voting classifier. But before we do so, we need to see how well these models fit the requirements of Condorcet’s Jury Theorem. Because if they do, our ensemble classifier should be able to pick games more accurately than any of the individual models alone.
Unconditional Competence
This requirement is easy for us to prove because we’ve already done it! As a refresher, below is each of the three models’ prediction accuracy. I’ve tacked on the Vegas Model for comparison, and the results against the spread (ATS) for fun!
All of them are above 50%, even against the spread, so we meet this requirement!
Uniformity
By quick inspection of the table, we can see that the Elo and Implied Vegas Models are extremely close. The Decision Rating Model lags them by ~2%, which I think is probably close enough. Or at least that’s my hope!
Unconditional Independence
This is, by far, the trickiest requirement of the theorem. How do we know if a model is truly independent? Is it enough for the models to have different architectures? Are different data sources also required? There is frustratingly little information available to answer these questions, so I’ve had to forge my own path. Here’s where I’ve arrived on the question.
Different Data Sources: It’s certainly desirable to have different data sources for your models. The ultimate goal of modeling is to convert raw data into information / insights, so 1) the more varied your information and 2) the more efficiently you can wring the information from your raw data, the better.
But you need to make sure that your data sources are actually different — the information extracted from the raw data needs to provide different insights. A trivial example is creating two stock-picking models, one based on the S&P 500 index and another based on the Nikkei 225 index. You may think these are independent because they’re different datasets, from different time zones, composed of different companies, but the fact remains that they are correlated due to the interconnectedness of the global economy. This is a silly example, but you might be able to imagine more pernicious versions of this thought exercise.
A useful example for football is using game scores for one model and Expected Points Added (EPA) for another — EPA and game scores may seem different, but they are highly correlated, so you’re not buying much informational independence with just these two data sources!
Different Model Architectures: As referenced in the previous point, we want to wring as much information / insights out of the raw data as possible and the model architecture is largely how we do that. A tree-based model is going to analyze the data differently than a neural network. And a linear model will analyze differently than both of them. So in theory, completely different model architectures should provide independence.
Bringing it back to our Ensemble Classifier, I think the independence of our three models is modest but not overwhelming.
They all use similar inputs. The Implied Vegas Model is the most different, since it only uses Vegas closing lines. The Elo and Decision Rating Models are both essentially using game outcomes as their input data.
Where I think they derive the most independence is in their architecture. The IVM is a simple linear-system-of-equations solution, the Decision Rating Model looks for differences between process and results for each team, and the Elo Model values each team based on its results and level of competition.
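There’s no single test for independence, but one rough sanity check is to measure how often each pair of models votes the same way. Below is a hedged sketch; the prediction arrays are random 0/1 placeholders, not the actual outputs of the three models:

```python
import numpy as np
from itertools import combinations

# Hypothetical 0/1 picks (1 = home team wins); swap in the actual
# Elo / Implied Vegas / Decision Rating predictions here.
rng = np.random.default_rng(0)
preds = {
    "Elo": rng.integers(0, 2, size=500),
    "Implied Vegas": rng.integers(0, 2, size=500),
    "Decision Rating": rng.integers(0, 2, size=500),
}

# Pairwise agreement: fraction of games where two models pick the same side
for (name_a, a), (name_b, b) in combinations(preds.items(), 2):
    print(f"{name_a} vs {name_b}: {np.mean(a == b):.1%} agreement")
```

One caveat: two genuinely independent, better-than-coin-flip models will still agree more often than 50% of the time (they’re both usually right!), so high agreement alone doesn’t prove dependence. But agreement approaching 100% means the second model is adding almost no new information.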
Ensemble Classifier Setup
In the end, we’ve got to test the model to see if it’s any good. And in order to test it, we must first set it up. This will be our simplest model to date: for the NFL games from 2003 to 2023, each model will provide a vote and the side with the most votes wins!
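In code, that majority vote is only a few lines. Here’s a minimal sketch, with hypothetical 0/1 picks (1 = home team wins) standing in for the real model outputs:

```python
import numpy as np

def majority_vote(*preds: np.ndarray) -> np.ndarray:
    """Majority vote across an odd number of binary prediction arrays."""
    votes = np.sum(preds, axis=0)            # count of models picking 1
    return (votes * 2 > len(preds)).astype(int)

# Hypothetical picks for five games
elo = np.array([1, 0, 1, 1, 0])
ivm = np.array([1, 0, 1, 0, 0])
drm = np.array([0, 1, 1, 1, 1])
print(majority_vote(elo, ivm, drm))  # -> [1 0 1 1 0]
```

With an odd number of models, ties are impossible, just like the jury in the theorem.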
The Results
The jury has finished deliberating and their results are shown below! You can see that the Ensemble Classifier has improved accuracy over the individual models, but the improvement is very small. I think this is due to a combination of (1) the DRM being <64% accurate (Uniformity violated) and (2) the models lacking true independence (Unconditional Independence violated).
An interesting way to look at how each of the models is voting is to create a Venn diagram of the overlap of their predictions, shown below for the binary win / lose outcome we are considering. The voting is heavily skewed in the Elo and Implied Vegas Models’ direction, as evidenced by the large number of solo “Visiting Team Wins” predictions by the Decision Rating Model. From this diagram, we can viscerally see the impact of violating the Uniformity assumption — the DRM frequently picks by itself because it’s not as accurate — and of violating Unconditional Independence — the Elo and Implied Vegas Models “think” very much alike, despite different data sources and architectures!
One last interesting thing we can look at is the case of a unanimous verdict: the accuracy shoots up to >70%! … Unfortunately, the Vegas Model does even better than that on this sample because, logically, these are the easy games to pick. Typically, these are the games where one team is a big favorite (think Harlem Globetrotters vs. Washington Generals).
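If you want to isolate those unanimous verdicts yourself, the bookkeeping is simple. Here’s a hedged sketch, again with hypothetical picks and outcomes rather than the actual series data:

```python
import numpy as np

# Hypothetical 0/1 picks and true outcomes (1 = home team wins) for five games
elo = np.array([1, 0, 1, 1, 0])
ivm = np.array([1, 0, 1, 0, 0])
drm = np.array([1, 0, 1, 1, 1])
actual = np.array([1, 0, 0, 1, 0])

unanimous = (elo == ivm) & (ivm == drm)      # games where all three agree
accuracy = np.mean(elo[unanimous] == actual[unanimous])
print(f"{unanimous.sum()} unanimous games, accuracy {accuracy:.0%}")
# -> 3 unanimous games, accuracy 67%
```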
The Final Dive
In this post, we used the three models developed in this series to build an ensemble classifier by leveraging Condorcet’s Jury Theorem. Although the Ensemble Classifier delivered only a marginal improvement in accuracy, learning about the assumptions that underpin ensemble models let us pinpoint why the improvement was so small. For example, if we can find a more accurate, yet still independent, model to replace the Decision Rating Model, that change should significantly improve the Ensemble Classifier’s accuracy.
Though, as it stands now, the nominal increase in accuracy of our Ensemble Classifier likely isn’t worth the expense of maintaining three different models. If it were me, I would stick with either the Elo or the Implied Vegas Model alone, because either one provides 99.5% of the performance of the Ensemble Classifier.
Thanks again for reading along! I hope you had fun and learned throughout this series, even if you don’t like football! Until next time!