Overview
Now that the 2020 election is officially over and Biden was elected as the President of the United States, it is important that I reflect on my prediction model. I am excited to see how I could learn from my model for future models that I create.
Model Recap and Predictions
Let’s first recap on my prediction model to get a better picture of what it was.
My prediction model was an ensemble model that predicted Trump’s popular vote share for each state. .
Given that the Time For Change Model was an inspiration, I decided to focus my model on historical republican vote share as Trump was the incumbent for the 2020 election and incumbency was used as a predictor in the Time For Change Model.
I decided to separate America into three categories - red states, blue states, and battleground states - for my model to adjust for overfitting. The grouping were based on how FiveThirtyEight grouped states.
- Red States - AK, IN, KS, MO, AL, AR, ID, KY, LA, MS, ND, OK, SD, MT, TN, WV, WY, SC, UT, NE
- Blue States - CO, VA, CA, CT, DE, HI, IL, MD, MA, NJ, NY, OR, RI, VT, WA, ME, NM, NH
- Battleground states - FL, IA, OH, GA, NC, MI, MN, PA, WI, NV, AZ, TX
My model used the following data:
- Historical polling, approval, and turnout data was based on data given from class
- Historical and present economic data was sourced from the Bureau of Economic Analysis
- Present polling and approval data was based on FiveThirtyEight’s forecast data
In my model, I decided to classify approval, Q2 GDP growth, and change in turnout as fundamentals.
Thus, my ensemble model weighted the poll model (using only polls) by 0.96 and the fundamental model (using only fundamentals) by 0.04 as I weighted the model based on FiveThirtyEight’s reasoning that polls are better predictors as the election nears since fundamentals become more noisy instead.
- Trump vote share = 0.96(Poll) + 0.04(Fundamental)
My final prediction using the ensemble model was that Biden was projected to win 310 electoral votes while Trump was projected to win 228 electoral votes, predicting Biden would become president-elect of the United States.
Patterns and Accuracy
Overall, I am pretty satisfied with how my model turned out. While I did miss a few states and this is my first election forecast, I was quite happy that I predicted some battleground states correctly.

Above is a comparison between my predictions and the actual results of the 2020 election. As you can see, the states that I got wrong were battleground states. However, I would like to say that the predictive intervals for the battleground states did capture the true result.
- The states that I predicted incorrectly were Arizona, Nevada, Florida, Wisconsin, Georgia, and Ohio.
- This means that my Classification Accuracy is 88%.
- My prediction overpredicted 4 electoral college votes for Biden.
Moreover, let’s take a look into the plot above, which plots the actual two-party vote share for Trump against my predictions for Trump. The blue points represent states Biden won and the red points represent states Trump won.
- The correlation between actual and predicted two-party vote share for each state is 0.88, which is fairly strong.
- The root mean squared error of all my state predictions is approximately 5.02 percentage points. While this isn’t too bad, some states were very off from my predictions.
- It is interesting to note that there is a fair share of both overpredicting and underpredicting Trump’s vote share in each state, evidenced by the plot above.
Furthermore, the map above shows the difference between Trump’s actual and predicted two party vote share in each state. A negative difference means that Trump was overpredicted for that particular state while a positive difference means that Trump was underpredicted for that particular state.
- It is interesting to note that states where Trump was greatly overpredicted or greatly underpredicted are states that are not battleground states. This makes sense because states that are traditionally red or blue and not battleground typically have less polling as there is a small chance that those states will flip. This is why we may see a state like Alaska with little polling where Trump is greatly overpredicted there.
- The above histogram shows the error distribution for my prediction model and the errors appear to be normally distributed around 0.
Hypotheses for why my model was inaccurate
Now that we have went over my prediction model, it is important to look at possible hypotheses for the inaccuracies seen in my model. My model seemed to incorrectly predict the results for battleground states in particular and it is important we pay attention to the reasons why. Below are my hypotheses for explaining the overall inaccuracies of my model:
- One hypothesis to explain the inaccuracy of my model was that it failed to take into account the recent voting trends in particular states. For example, Georgia and Texas have been trending blue recently but my model failed to take note of this. This could be because my model relied more heavily on historical popular vote share and polling and so since Georgia and Texas were traditionally red states, my model would predict the same for 2020.
- Moreover, my model failed to consider the recent voting trends in populous counties. For example, Miami-Dade County became much more red in 2020 than in 2016 and heavily helped Trump win Florida again. Likewise, Fulton County in Georgia heavily favored Joe Biden in 2020, which played a significant role in turning Georgia blue. Thus, it is important that prediction models take into consideration trends not just in states but in counties as well since some counties alone can significantly impact the overall result for the state it is in.
- While my model took into consideration the expected increase in the overall turnout rate for the 2020 election, my model failed to take into consideration the change in turnout rates for different demographics.
- For example, Stacy Abrams played a crucial role in black voter-turnout in Georgia in favor for Biden. The same goes for the large Latinx turnout in Arizona and Nevada, which also helped Biden. However, there were also many Latinos that voted for Trump particularly in South Texas and Florida.
- Given the large turnout rates for some of these groups, they can thus play a significant role in forecasting the election.
- Another hypothesis is that my model relied heavily on inaccurate polls. Some polls in 2020 were fairly inaccurate because they were non-representative of voters and there was non-response bias, particularly from conservatives.
- On average, polls were off by 2.5 points in battleground states and blue states.
- Given the inaccuracy of polls, this may explain why my model had inaccuracies, especially since I weighted the poll model by 0.96 in my ensemble model.
- Another hypothesis is that using the state Q2 GDP growth rate as a fundamental variable may have hurt Trump more than it was supposed to, especially in battleground states and traditionally red states. This is because economic predictors were very noisy this year due to a recession caused by Trump’s handling of the Covid pandemic.
- Since the 2020 economy was an anomaly, it probably would be best to not use economic predictors in my model.
- I would also mention that my model used 2020 Q2 GDP growth rate, which may not be reflective of the current economy as the election was taken place during Q3, and GDP growth rates are drastically different in Q3 from Q2 in 2020.
Proposed tests to test hypotheses
- To test the first hypothesis that states and counties have partisan shifts (which can impact the accuracy of my model), I can look at recent voting trends in such states and counties.
- These states are likely battleground states and the counties are likely in battleground states. Moreover, we can look at how states and counties voted in the 2016 presidential election, the 2018 midterm election, and the 2020 election.
- We can thus analyze any trends using regressions and correlations and if we see any trends where certain states and counties are shifting towards blue or red, that is something to take note of.
- One example of a trend that we may see is how southern Texas counties have been voting towards more red overtime in comparison to the 2008 election.
- To test the second hypothesis, we can run a linear regression between the popular vote share for a presidential candidate (say the incumbent) and the change in voter turnout for different demographics.
- Through this regression, we may have a better idea of not only how changes in turnout rates from different demographics may impact election forecasting but also how they may affect democratic or republican popular vote share. * Furthermore, it may also make sense to run the regression on a per county basis since it was evidenced from 2020 that certain counties see greater turnout rates from particular demographics than other counties.
- To test the third hypothesis, one test that can be used is create a predictive linear model for the popular vote share for a candidate only using recent polls.
- Given that historical polls may not be as predictive for today’s elections, it may make sense to only use recent polls like from 2016 onward. This might be because there was never really a president that had the character of Trump and so there may be non-response bias among republicans as some republicans may be afraid to alert pollsters that they will vote for Trump.
- Moreover, it would also make sense to use polls that are high in quality, which can be measured using FiveThirtyEight poll grades. My prediction model did not filter out for high quality polls, so the quality of polls may impact the results from election forecasting as they may be more representative of society.
- I would also mention that polls need to do a better job in reaching out to hard to reach demographics like Hispanic Americans, and so I would be interested to use more polls that target these demographics for predictive models.
- To test the fourth hypothesis, we can use my prediction model but not include economic predictors as part of the fundamental.
- As mentioned before, the 2020 economy was an anomaly, so it is best to not use economic predictors.
- Moreover, it would be interesting to use economic predictors to predict the 2024 election and other future elections given that the economic predictors during those elections are not all over the place. If we do use economic predictors to predict those future elections, it makes sense to leave out economic variables from 2020 then.
Changes to my model
Now that we have a better grasp of understanding my model and where it went wrong, the following are changes I would like to do to my model:
I would use recent polling(with high grades from FiveThirtyEight) instead of historical polling in my model. This is because many states are recently having partisan shifts in vote share, and so historical polling may not reflect that. Additionally, the 2020 Election map did not differ greatly from 2016, so using recent polling may be more accurate for forecasting. Plus, polls with high grades may mean they are higher in quality and thus should be more representative of society.
Instead of accounting for overall change in turnout rate, it may make sense to focus on expected change in turnout rate for particular demographics, such as Hispanics and African Americans. This is because many battleground states were partially determined by the turnout from these demographic groups, evidenced by the Cuban vote in Miami-Dade County and the African American vote in Fulton County.
I would not use Q2 GDP growth rate or any economic predictor for this model because the 2020 economy is an anomaly and so historical economic predictors may not be good for predicting the 2020 election.
I would also like to potentially make a prediction model that generates county-level predictions instead of state-level predictions. This is because it appears that the results of many states are determined by specific counties and so it would be noteworthy to make a prediction model that generates county-level predictions to see if it is more accurate in forecasting.
Conclusion
I really enjoyed making my prediction model and learning from it. I now have a better grasp in making prediction models and I am eager to see how future elections will differ from the 2020 election and use my skills in the future. I want to thank my teaching fellow for Gov 1347, Sun Young Park, as well as Professor Ryan D. Enos and Soubhik Barari, for their teaching and help throughout the semester.