By Kabir Khanna and Anthony Salvanto
come from a model, not just from a survey result. The difference between the two is that while a survey attempts to measure the whole of something by drawing a microcosm of it -- a sample -- a model takes that information and combines it with other factors about the people in the district, and the district itself more generally, to try to come up with an even better estimate of what's going on.
We use a procedure called multilevel regression and post-stratification (MRP), which is a fancy name for a method that blends data about individuals' choices with wider district and national factors (the "levels" in "multilevel"). We worked on this model in collaboration with Professors Ben Lauderdale, Jack Blumenau, and Doug Rivers, as well as YouGov's Data Science team. We extend the models they developed for the U.K. election in 2015 and the U.S. election in 2016. Here's a primer on how it works. (For a more technical description, see the many writings of Andrew Gelman.)
The first step is to talk to many people across the country. There are over 60 congressional districts where the race between Democrats and Republicans is either likely to be competitive or might become close. That's a generously large number – not all of these districts will flip – but it's within the range of the possible. We surveyed close to 6,000 voters across all these possibly competitive districts. We also collected nearly 25,000 interviews across the country, including in places outside the competitive districts, to learn about voters everywhere else. We asked them whether they are planning to vote Democratic or Republican in the House election in the district where they live.
The next step is to figure out how the way people vote depends on their measurable characteristics. These characteristics include age, gender, race, education, who they voted for in 2016, where they live, and so on. Each voter has a certain combination of these, which we'll call their "profile" for shorthand. For example, one voter profile is someone who is 60 years old, female, white, college educated, a resident of New Jersey's 7th congressional district, and a Republican voter in 2016. If you change any of these characteristics, you get a different profile. For each of the many possible profiles, we estimate the proportion who intend to vote Democratic and the proportion who intend to vote Republican this year.
A nice feature of MRP is that it efficiently combines information about similar types of voters no matter where they live in the country. This is particularly helpful in districts where we got fewer interviews. So if we know a lot about, let's say, white working class voters, we can assume they have some commonalities, because opinions rarely stop at district or even state boundaries, especially as our politics becomes more nationalized. If we have a good estimate of this subgroup across the country, we can improve our estimate of the subgroup for any particular district. We also add local factors, which are important too, such as past party vote in the district and whether or not there's an incumbent running this year.
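To make the idea of borrowing information across districts concrete, here is a minimal Python sketch. It is a deliberate simplification and not the actual MRP model: it blends a subgroup's raw result in one district with the national result for the same subgroup, leaning on the national figure when the district has few interviews. The function name and the `prior_strength` value are hypothetical, chosen only for illustration.

```python
def pooled_estimate(district_dem, district_n, national_share, prior_strength=50.0):
    """Blend a district subgroup's raw Democratic share with the national share.

    district_dem   -- respondents in the district subgroup choosing the Democrat
    district_n     -- total respondents in that district subgroup
    national_share -- Democratic share for the same subgroup nationwide
    prior_strength -- hypothetical tuning value: how many interviews the
                      national estimate is "worth" when the two are blended
    """
    if district_n == 0:
        # No local interviews at all: fall back entirely on the national figure.
        return national_share
    raw_share = district_dem / district_n
    # The weight on local data grows as the number of local interviews grows.
    weight = district_n / (district_n + prior_strength)
    return weight * raw_share + (1.0 - weight) * national_share


# Same raw share (60%) in both cases, but very different sample sizes.
few_interviews = pooled_estimate(6, 10, national_share=0.40)
many_interviews = pooled_estimate(600, 1000, national_share=0.40)
print(f"10 interviews:   {few_interviews:.3f}")
print(f"1000 interviews: {many_interviews:.3f}")
```

With only 10 interviews the estimate is pulled most of the way toward the national figure, while with 1,000 interviews it stays close to what the district's own respondents said.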
The next step is estimating how many people with each voter profile live in each congressional district, again using Census data and other auxiliary data. (We augment these data by imputing 2016 turnout and vote choice using Current Population Survey and YouGov data, plus the knowledge of how many people voted for each party in each congressional district and state in 2016.) That helps us determine each party's vote share in a district. In each district, we multiply the number of people of a given profile by the proportions of voters with that profile choosing the Democrats and Republicans. When we add up these numbers across all voter profiles in the district, we get an estimate of each party's vote share there.
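The multiply-and-add step described above can be sketched in a few lines of Python. The three profiles, their head counts, and their Democratic shares are made-up numbers for a hypothetical district, and the sketch assumes a two-party race, so the Republican share is simply the remainder.

```python
# Hypothetical cells for one district:
# (profile label, number of people, estimated Democratic share for that profile)
cells = [
    ("60, female, white, college, 2016 Republican voter", 12000, 0.20),
    ("30, male, Hispanic, no college, 2016 Democratic voter", 8000, 0.90),
    ("45, female, Black, college, 2016 nonvoter", 5000, 0.80),
]


def poststratify(cells):
    """Return (Democratic share, Republican share) for the district."""
    total_people = sum(count for _, count, _ in cells)
    # Multiply each profile's head count by its Democratic share, then add up.
    dem_votes = sum(count * dem_share for _, count, dem_share in cells)
    dem = dem_votes / total_people
    return dem, 1.0 - dem  # two-party assumption


dem_share, rep_share = poststratify(cells)
print(f"Democratic: {dem_share:.3f}  Republican: {rep_share:.3f}")
```

Here the Democratic total is 2,400 + 7,200 + 4,000 = 13,600 out of 25,000 people, or 54.4%.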
Each of the steps in this process has some statistical uncertainty, which we incorporate into our final estimates. In the final step, we simulate the congressional elections 1,000 times and count how many seats each party wins in each simulation. Our seat estimate is the average across simulations, and the range of results gives us a sense of what's possible. (We report a 90% confidence interval, using the range between the 5th and 95th percentiles of simulated results.)
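A toy version of the simulation step might look like the Python sketch below. The five districts, their estimated Democratic shares, and their uncertainties are invented. The sketch also assumes each district's error is an independent draw from a normal distribution, which is a simplification made here for illustration and not a description of the actual model.

```python
import random

# Hypothetical districts: (estimated Democratic two-party share, standard deviation)
district_estimates = [
    (0.52, 0.03),
    (0.48, 0.03),
    (0.55, 0.03),
    (0.45, 0.03),
    (0.50, 0.03),
]


def simulate_seats(district_estimates, n_sims=1000, seed=0):
    """Simulate the elections n_sims times; return Democratic seats won in each."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    seat_counts = []
    for _ in range(n_sims):
        # Draw one plausible vote share per district; above 50% wins the seat.
        seats = sum(1 for mean, sd in district_estimates if rng.gauss(mean, sd) > 0.5)
        seat_counts.append(seats)
    return seat_counts


def summarize(seat_counts):
    """Return (average seats, 5th percentile, 95th percentile)."""
    ordered = sorted(seat_counts)
    n = len(ordered)
    average = sum(ordered) / n
    low = ordered[int(0.05 * n)]
    high = ordered[min(int(0.95 * n), n - 1)]
    return average, low, high


seat_counts = simulate_seats(district_estimates)
average, low, high = summarize(seat_counts)
print(f"Average Democratic seats: {average:.2f} (90% range: {low} to {high})")
```

The percentiles are read straight off the sorted list of simulated seat counts, which is a simple approximation that works well with 1,000 simulations.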
There are a couple of important caveats about this method. First, we are estimating how the race for the House stands as of now, with the expectation that things will change in the coming months. There is still plenty of time left before the midterms, during which more primaries will occur and voters will become more familiar with the candidates running in their districts.
Second, we are estimating which voters are likely to show up at the polls in November based on recent historical patterns (this is known as a likely voter model). We estimate the proportion of voters of each profile who will turn out using a similar model based on individual-level and geographic variables. Down the road, we'll consider the implications of different turnout scenarios for which party wins the House. For example, if Democrats can replicate the turnout patterns of a presidential year or come close to them, they will improve their chances of winning the House. Republicans, on the other hand, are hoping for a turnout pattern more like a typical midterm year.
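To see why turnout assumptions matter, consider this small Python sketch with two made-up voter profiles in a hypothetical district. Each profile gets a turnout rate under a midterm-like scenario and under a presidential-like scenario. All of the numbers are invented for illustration and are not estimates from the model.

```python
# Hypothetical cells:
# (profile, people, midterm turnout rate, presidential turnout rate, Democratic share)
cells = [
    ("older, 2016 Republican voter", 10000, 0.60, 0.70, 0.20),
    ("younger, 2016 Democratic voter", 10000, 0.30, 0.50, 0.85),
]


def dem_share_under(cells, scenario):
    """Democratic share among those expected to vote under a turnout scenario."""
    turnout_column = {"midterm": 2, "presidential": 3}[scenario]
    voters = 0.0
    dem_votes = 0.0
    for cell in cells:
        # Expected voters = people with this profile times their turnout rate.
        expected_voters = cell[1] * cell[turnout_column]
        voters += expected_voters
        dem_votes += expected_voters * cell[4]
    return dem_votes / voters


midterm = dem_share_under(cells, "midterm")
presidential = dem_share_under(cells, "presidential")
print(f"Midterm-like turnout:      {midterm:.3f}")
print(f"Presidential-like turnout: {presidential:.3f}")
```

In this toy district the Democratic share rises from about 41.7% to about 47.1% when turnout looks like a presidential year, because the Democratic-leaning profile makes up a larger part of the electorate.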