Monday, November 7, 2016

The US Presidential election is tomorrow. Here's the final run of my Monte Carlo Presidential Election Simulator (code here):

Election 2016 Monte Carlo Simulation
Run date: Monday Nov 7 2016 14:59:07
There are 1 days until the election. 
Collecting survey data for the great states of NE, ND, MS, ID, SD, IN, MA, KY, LA, AR, RI, MT, OR, GA, AL, OH, MI, IA, HI, DC, PA, WI, MD, CT, FL, NM, WV, UT, KS, CO, TX, NJ, DE, OK, MO, SC, NV, AZ, NY, VT, ME, IL, TN, WY, VA, WA, AK, MN, NH, CA, NC. 
Swing States:
Probability of Clinton winning OH: 10.44%
Probability of Clinton winning IA: 0.02%
Probability of Clinton winning PA: 99.72%
Probability of Clinton winning WI: 97.13%
Probability of Clinton winning FL: 88.54%
Probability of Clinton winning CO: 99.84%
Probability of Clinton winning NV: 69.71%
Probability of Clinton winning VA: 100.00%
Probability of Clinton winning NH: 100.00%
Probability of Clinton winning NC: 99.94%
12 states have no polls, so were assigned 2012 outcomes 
Clinton election probability: 99.97%
Trump election probability: 0.03%
Average electoral votes for Clinton: 323
Average electoral votes for Trump: 215

The results have been consistent throughout the election season. At no point did Trump have any meaningful probability of winning. Florida has moved back and forth; as of today, it is strongly predicted to go to Clinton.

One note: this simulation has consistently predicted greater certainty of a Clinton win (99.97%) than other electoral college simulators, such as FiveThirtyEight's Election Forecast (69.2%) and Daily Kos Election 2016 (88%). It turns out that there's a reason for that: some of these sites put their finger on the scale!

From this article, How bad is it for Donald Trump? Let's do the math:

These histograms—and the chances of Clinton winning—are different from what each model is actually reporting as their national-level forecast because, like us, none of the other forecasters assume that state election outcomes are independent. If the polls are wrong, or if there’s a national swing in voter preferences toward Trump, then his odds should increase in many states at once: Nevada, Ohio, Florida, and so forth.

This is saying that the data and modeling are providing a clear result, but the forecasters are modifying the results because, they say, the separate state models shouldn't be treated as independent. So they modify the results to include other states' results.

That's a nonsensical thing to do for a couple of reasons. First, all it does is regress results to the mean. There's no precise way to guess how much one state's vote should influence another, so this isn't modeling; it's guessing, straying away from the data and the science to influence the result in an unprincipled way.

Worse, though, such adjustments forget that the polling data already reflects the dependence between states. When voters answer pollsters' questions, they already know about national polls, national news, candidates' news, and as much statewide and nearby state news as they're interested to know. Of course they factor all of this available information into their answers to pollsters about their candidate choice. There's no need to try to add it in later.

The Huffington Post's 2016 President Forecast, which provides the polling data for my simulation, is closer at 98.2%. That could be because we're using the same data, but also likely means they aren't including national polls or other non-state data in their state probability estimates. Both Huffington Post and my simulation predict 323 Electoral College votes for Clinton.

Monday, June 20, 2016

Hillary Clinton will be the next president

On November 8, 2016, Hillary Clinton will be elected the 45th president of the United States. As sure as the sun will rise tomorrow, you can be certain of it. This fact isn’t something you’ll read about in the media for the simple reason that there’s no news in the sun rising tomorrow. The media reports contests and races, not the ordinary and obvious. So how can we be sure that Clinton winning is so obvious, given the daily barrage of head-to-head polling and pundit analysis? Simple: history and math.

To understand why, you just need to understand a few facts about US elections and how to interpret polls. 

1. Electoral College

First, remember that presidential elections in the US are decided by the electoral college, not by popular vote. This means that there are 51 winner-take-all contests. That is, there are 50 states plus DC in which the winner of the most votes receives all of the electoral college votes. So, for example, whoever wins the most votes in Hawaii wins all of Hawaii’s 4 votes.

What’s interesting about the electoral college system is that it means that presidential elections are decided by how the electoral college votes combine to create a winner. There are only a finite number of ways that electoral college votes can combine to exceed 270, the winning number of votes, for one candidate.

2. Swing states

But we don’t have to consider every possible way that 50 states & DC can combine. It’s much simpler because we already know how most of the states will vote.

Hawaii reliably votes for the Democratic candidate and will this year, too. Of course it’s possible that it could switch and vote Republican this year. But we have no recent history to suggest such an outcome and no polling or news to believe it will switch this year. Similarly, California will give its 55 votes to Clinton. Texas will give its 38 votes to the Republican candidate, which is, at the moment, Trump. That just leaves a few states in which there are contests to watch because their outcomes are less certain. We call these “swing” or “toss-up” states.

As a result, when we study how the electoral college votes can combine to exceed 270, we actually only have to consider the ways the swing state votes can combine. That fact makes the problem much simpler to analyze. It also means that there’s much more known already about the outcome than just knowing that there are 51 separate elections on November 8.
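To make the counting concrete, here is a minimal sketch that enumerates every win/lose pattern of the swing states. It rests on assumptions that are not spelled out in this post: the ten swing states are the ones listed in the simulation output, all other states and DC vote as they did in 2012 (which leaves the Democratic candidate with 217 electoral votes before any swing state is counted), and the names `SWING_VOTES`, `SAFE_DEM_VOTES`, and `count_winning_combinations` are made up for illustration.

```python
from itertools import product

# Electoral votes for the ten swing states listed in the simulation output.
SWING_VOTES = {
    "OH": 18, "IA": 6, "PA": 20, "WI": 10, "FL": 29,
    "CO": 9, "NV": 6, "VA": 13, "NH": 4, "NC": 15,
}

# Assumption: every non-swing state and DC votes as in 2012, giving the
# Democratic candidate 217 electoral votes before any swing state.
SAFE_DEM_VOTES = 217
VOTES_TO_WIN = 270


def count_winning_combinations(swing_votes, safe_votes, votes_to_win=VOTES_TO_WIN):
    """Return (winning, total) over every win/lose pattern of the swing states."""
    votes = list(swing_votes.values())
    winning = 0
    total = 0
    # Each pattern is a tuple of 0/1 flags, one per swing state.
    for pattern in product((0, 1), repeat=len(votes)):
        total += 1
        swing_total = sum(v for v, won in zip(votes, pattern) if won)
        if safe_votes + swing_total >= votes_to_win:
            winning += 1
    return winning, total


winning, total = count_winning_combinations(SWING_VOTES, SAFE_DEM_VOTES)
print(f"{winning} of {total} swing-state combinations reach {VOTES_TO_WIN}")
```

This only counts combinations; it treats every pattern as equally likely, which is not true once polling is taken into account.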

3. State Relative Sizes

New Hampshire is considered a swing state, but only has 4 electoral college votes. Florida has 29. This means that there are many electoral college combinations in which New Hampshire or other small swing states go Republican or Democratic. But it just doesn’t matter because Florida (29), Pennsylvania (20), and Ohio (18) are so comparatively dominant.

The small relative sizes of some of the swing states further simplify the problem. We can’t ignore them completely because small states could combine to outweigh a large state. But it means that there is less variation and more known in advance of the election about how it will turn out.

4. Cumulative Probabilities

Now this is where it really gets interesting. The media loves to report the close polls, but here’s some news: they’re not actually close.

Again, there’s much more known about the outcomes than you would think from seeing the horse-race media reports and reading close polls. It turns out that their outcomes, even with close polling, aren’t actually very uncertain. Here’s why.

A poll captures an estimate of a random variable. That is, we think there’s some true value of how a state prefers a candidate, say 52% for Clinton, but we don’t know the exact value. The poll includes an estimate of how sure we are of the result, usually loosely called the “margin of error.” Statistically, it looks like this curve, centered on 52%:

A normal distribution centered on 52%, with probabilities above 50% shaded. [Diagram source]

The error is shown by the width of the curve. If we were certain the value was 52%, this curve would be so thin that it would contain only 52%. But with our uncertainty, the true value could be 53% or 48%, or, with less likelihood, 44% or 60%.

Given the poll, we want to estimate the probability of one candidate winning. It’s not the same as the poll result. If the polling is 52% Clinton and 48% Trump, it doesn’t mean that Clinton has a 52% probability of winning. Instead, we use the curve above. It captures our certainty about various outcomes; we include that information to calculate the probability. For example, the most likely outcome is that the vote is 52% Clinton. With a little less likelihood, it could be 56% Clinton, so we should increase our probability to include that outcome. Similarly, it could be 48% Clinton, in which case she loses, so that should reduce the probability of her winning. There’s a tiny probability that we’re really off and Clinton wins with 60% of the vote; that possibility should increase our estimate a little.

Mathematically, the correct way to estimate the probability of a candidate winning is to calculate the area under the normal curve above the required value. For our election simulation, we’ll assume two candidates, so the winning value is 50%. How likely the candidate is to exceed 50% is answered by adding all of the probabilities above 50%. The diagram above shades in grey all of the values above 50% for a candidate who is estimated by polling to win 52% of the vote. The probability of that candidate winning is the percentage of area of the shaded part compared to all of the area under the curve. In this case, the shaded area occupies 69% of the total area. The candidate has a 69% probability of winning the election.
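The area calculation can be sketched in a few lines of Python. This is an illustration rather than the simulator's actual code: the function name `win_probability` is made up, and the standard deviation of 4 percentage points is an assumption, chosen because it is the width that reproduces the 69% figure for a 52% poll.

```python
import math


def win_probability(poll_share, std_dev, threshold=50.0):
    """Area under a normal curve (mean=poll_share, sd=std_dev) above threshold."""
    # Number of standard deviations the poll sits above the winning line.
    z = (poll_share - threshold) / std_dev
    # Standard normal CDF evaluated at z, written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Assumption: a 52% poll with a standard deviation of 4 percentage points.
print(f"{win_probability(52.0, 4.0):.1%}")  # prints 69.1%
```

A poll sitting exactly on 50% gives a probability of 50%, and the probability climbs quickly as the poll moves even a point or two above the line.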

That’s amazing! A poll showing Clinton 52% and Trump 48% doesn’t mean she’s barely likely to win the election with a 52% probability. It means she’s very likely to win with about 70% probability.

Now you might argue that this increase in probability is a mathematical or modeling effect. No, the math is accurately capturing that while there’s uncertainty about the exact outcome, that uncertainty is still leaning toward the positive side. There may be different outcomes, but they’re more likely positive for our 52% candidate. Including all of them in the probability estimate means the positive outcome is much more probable. Intuitively, it’s like balancing: You don’t have to lean very far to move a lot of weight to one side.

So again, close polling isn’t close; we know more about the outcome than the polls would seem to indicate. 

5. Polling Accuracy

While any given poll may have a margin of error of say 4%, we can combine polls and reduce the error. Polling is expensive, so polling companies only call people until they have a representative sample with the desired margin of error. But we can increase certainty for our simulations by aggregating polls. With a few thousand samples, the sampling error shrinks to near zero. Aggregating across pollsters reduces their systematic bias.
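Here is a rough sketch of why pooling helps. It uses the textbook margin-of-error formula for a simple random sample at 95% confidence and assumes the polls sample the same population, so their respondents can simply be added together. The function names and the poll sizes (five polls of 600 people) are hypothetical, and the sketch says nothing about systematic bias.

```python
import math


def margin_of_error(sample_size, share=0.5, z=1.96):
    """95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(share * (1.0 - share) / sample_size)


def pooled_margin_of_error(sample_sizes, share=0.5):
    """Margin of error after pooling polls, assuming they sample one population."""
    return margin_of_error(sum(sample_sizes), share)


# Hypothetical example: one poll of 600 people versus five such polls pooled.
single = margin_of_error(600)
pooled = pooled_margin_of_error([600] * 5)
print(f"single poll: {single:.1%}, five polls pooled: {pooled:.1%}")
```

The error falls with the square root of the sample size, so five times the respondents cuts the margin by a bit more than half.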

Now at some point, reducing the margin of error to near zero does become a modeling over-reach because we can never be sure our sample is truly representative of everyone who will actually get off their sofas and vote. And of course, every poll lacks the final and complete information that voters will take into voting booths on November 8th. So aggregating the polling data helps make the best predictions today, or more precisely, given the information we have at the times of the polls.

Summarizing What We Know

The media reports margins of error in polls that make them look uncertain. But polls are easily combined to increase certainty. Close polls appear closer than they really are; even a small shift in polling indicates a large increase in the probability that one side will win. We really only have to care about polls in a few states, the swing states. Many of them are small so their polling and outcomes matter less. Most states will vote the way they did last election. There are only a few ways to combine the swing states to reach an electoral college winning total of 270. All of this means that far more is known about the outcome of the election well in advance.

Monte Carlo Simulation

Finally, we can combine all the available data into a Monte Carlo Simulation to calculate the actual probability of a Clinton or Trump victory.

Monte Carlo analysis is a method for calculating probabilities when no simpler closed-form formula is known. It allows us to calculate probabilities for complex systems such as presidential elections. It works by simulating the system under study enough times that the percentages of outcomes can be considered estimates of the true probabilities. The name Monte Carlo comes from the town with the famous casino because the simulations are like repeatedly throwing dice to see what happens.

Specifically, the process is:
• Read polling data from the Huffington Post API for each state.
• Transform the polling data into probabilities. Assume that where there's no polling data, the state will vote as it did in 2012.
• Use the probabilities to simulate an election for a state, repeating for all 50 states and DC to create a trial election. The electoral college votes are counted to determine the winning candidate.
• Run 25,000 or more simulated elections, counting how many times each candidate wins.

The percentage of these wins is then the final probability of the candidate winning. So if Clinton wins 92% of 50,000 trial elections, we conclude she has a 92% chance of winning in November.
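The steps above can be sketched as follows. This is a simplified illustration, not the published simulator: the function `simulate_elections` and its parameters are made up, each state is drawn independently, and the usage example fixes every non-swing state at its 2012 outcome (217 safe electoral votes for Clinton), so its numbers will not exactly match the runs shown in this post. The swing-state probabilities are the ones from the November 7 run.

```python
import random


def simulate_elections(states, safe_votes=0, trials=25_000, votes_to_win=270, seed=None):
    """Return (win probability, average electoral votes) for one candidate.

    states maps a state code to (electoral_votes, probability of winning it).
    """
    rng = random.Random(seed)
    wins = 0
    total_votes = 0
    for _ in range(trials):
        votes = safe_votes
        # One trial election: draw each state independently.
        for electoral_votes, win_prob in states.values():
            if rng.random() < win_prob:
                votes += electoral_votes
        total_votes += votes
        if votes >= votes_to_win:
            wins += 1
    return wins / trials, total_votes / trials


# Swing-state probabilities from the November 7 run, with electoral votes.
SWING = {
    "OH": (18, 0.1044), "IA": (6, 0.0002), "PA": (20, 0.9972),
    "WI": (10, 0.9713), "FL": (29, 0.8854), "CO": (9, 0.9984),
    "NV": (6, 0.6971), "VA": (13, 1.0), "NH": (4, 1.0), "NC": (15, 0.9994),
}

# Assumption: all other states and DC vote as in 2012 (217 votes for Clinton).
probability, average_votes = simulate_elections(SWING, safe_votes=217, seed=2016)
print(f"Clinton election probability: {probability:.2%}")
print(f"Average electoral votes for Clinton: {average_votes:.0f}")
```

Passing a `seed` makes a run repeatable; leaving it as `None` gives a fresh set of trial elections each time.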

What Monte Carlo simulation does is allow you to find certainty from many uncertain components, such as close state polls. It could be that the overall simulation of combining uncertain components also leads to uncertain outcomes. But what's so interesting in the case of the presidential election is that the simulation shows that the outcome of the 2016 Presidential election is certain. Hillary Clinton will win.

I wrote a 2016 election Monte Carlo Simulation. The code is available here. Here’s an example run:

Election 2016 Monte Carlo Simulation
Run date: Monday Jun 20 2016 11:07:02
There are 141 days until the election.

Collecting survey data for the great states of IL, WY, IN, WA, NJ, SD, TX, NV, VT, UT, HI, CO, KS, ND, MS, MA, ID, MT, AR, AK, OR, NE, LA, ME, TN, SC, DC, NM, WV, OK, CT, RI, AZ, KY, DE, AL, MN, NY, GA, MI, MD, CA, MO, NC, IA, WI, FL, VA, OH, NH, PA.

Swing States:
NV has no polls yet, so it is assigned to Clinton based on 2012 outcome.
CO has no polls yet, so it is assigned to Clinton based on 2012 outcome.
Probability of Clinton winning NC: 75.69%
Probability of Clinton winning IA: 99.86%
Probability of Clinton winning WI: 100.00%
Probability of Clinton winning FL: 97.44%
Probability of Clinton winning VA: 98.13%
Probability of Clinton winning OH: 84.77%
Probability of Clinton winning NH: 99.78%
Probability of Clinton winning PA: 100.00%
35 states have no polls, so were assigned 2012 outcomes

Clinton election probability: 100.00%
Trump election probability: 0.00%
Average electoral votes for Clinton: 346
Average electoral votes for Trump: 192

We see that there’s a lot of polling already happening in the swing states. So given the information we have today, 141 days before the election, the data is showing that the swing states aren’t actually in play. Even this early in the election, before the candidates are officially nominated, we can already figure out how the states and DC will vote, how their electoral college votes will combine to create 270 votes for a winner, and therefore who will be the next president. Further, with statistics we can determine not only the outcome, but how sure we are of it. Answer: President Clinton. Certainty: same as sun rising tomorrow. 

Now suppose we inject uncertainty into the system. We can hold the polls at some minimum margin of error, just to see how much the predicted winner changes if the polls vary. Here’s a run with a 7% margin of error:

Election 2016 Monte Carlo Simulation
Run date: Monday Jun 20 2016 11:13:45
There are 141 days until the election.

Collecting survey data for the great states of CT, RI, VT, MA, SC, NV, TX, IN, AR, AZ, CO, IL, AL, HI, ND, MT, WV, SD, KY, DE, MS, NM, WY, OR, LA, WA, KS, NE, TN, DC, NJ, UT, AK, CA, MN, MO, GA, MI, MD, NY, FL, NH, PA, OH, NC, IA, VA, WI, ME, ID, OK.

Swing States:
NV has no polls yet, so it is assigned to Clinton based on 2012 outcome.
CO has no polls yet, so it is assigned to Clinton based on 2012 outcome.
Probability of Clinton winning FL: 59.45%
Probability of Clinton winning NH: 65.03%
Probability of Clinton winning PA: 75.68%
Probability of Clinton winning OH: 56.14%
Probability of Clinton winning NC: 53.89%
Probability of Clinton winning IA: 64.88%
Probability of Clinton winning VA: 62.09%
Probability of Clinton winning WI: 82.44%
35 states have no polls, so were assigned 2012 outcomes

Clinton election probability: 92.62%
Trump election probability: 7.38%
Average electoral votes for Clinton: 311
Average electoral votes for Trump: 227

Ok, now we’re only 93% sure Clinton will win. More of the larger swing states have flipped sides in more of the elections. Nevertheless, in 25,000 elections simulated with the above probabilities, Clinton won 93% of them. Still pretty certain...
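The effect of holding polls at a minimum margin of error can be sketched like this. How the simulator turns a margin of error into the width of the normal curve is not described in this post, so the sketch assumes the margin is a 95% interval (standard deviation = margin / 1.96); the function name and the example poll (52% with a 2-point margin) are hypothetical.

```python
import math


def win_probability_with_floor(poll_share, margin_of_error, min_margin=0.0, z=1.96):
    """Probability of exceeding 50% when the margin of error has a lower bound."""
    # Never let the poll look more precise than the chosen minimum margin.
    effective_margin = max(margin_of_error, min_margin)
    # Assumption: the margin is a 95% interval, so sd = margin / 1.96.
    std_dev = effective_margin / z
    z_score = (poll_share - 50.0) / std_dev
    return 0.5 * (1.0 + math.erf(z_score / math.sqrt(2.0)))


# Hypothetical poll: 52% with a 2-point margin, with and without a 7-point floor.
tight = win_probability_with_floor(52.0, 2.0)
floored = win_probability_with_floor(52.0, 2.0, min_margin=7.0)
print(f"as polled: {tight:.1%}, with 7% floor: {floored:.1%}")
```

Widening the curve pulls each state's probability back toward 50%, which is the pattern in the swing-state numbers of the run above.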

Onward to November 8, 2016

Given the information we have today, we can be certain that Clinton will become the next President. As the election approaches, I’ll keep running the simulation as more polls fill in. At this point, before the candidates have been officially nominated, this simulation assumes a Clinton/Trump contest. But that may change. Maybe Trump doesn’t receive the nomination. Or maybe a third-party candidate appears who earns enough votes that the simulation will have to be updated to include their impact.

Finally, the certainty demonstrated by this simulation is predicated on people actually voting. The one variable that could undermine it is voter complacency. Vote. 

Monte Carlo Simulation code:

Sunday, May 15, 2016

Here's my latest project, pirNet. It's a set of devices you put around the house that detect and display motion. So you can tell where people are in the house or, if you mount them low, watch the pets wander from room to room. The display is abstract so it looks cool, but if you know which squares of LEDs correspond to which rooms, you can read it like a map.

pirNet in action

Put devices in different rooms. When motion is detected in one room, it's displayed in all of the rooms.

Directions and source code are here: