[This is the third (and final?) installment of a three-part series of posts investigating the possibility of using the YourMorals data to make inferences about the general population. In this installment, we finally make the leap.]
In my previous posts, I’ve discussed the many potential difficulties in using an entirely self-selected internet sample for inferences about general population parameters (whether or not a particular state or congressional district scores higher than another in terms of its moral foundations) as opposed to the intra-individual comparisons that are the bread and butter of psychologists (like how the foundations tend to correlate with ideology). I think that I have shown that the raw data are unsuitable for talking about the general population. The sample is demographically unrepresentative (see here) and somewhat attitudinally unrepresentative (see here).
What we need is a method to correct for the biases in the sample. Enter Multilevel Regression with Poststratification (or Mr. P as he is affectionately known to statisticians).*
[Those uninterested in the technical aspects of modeling can skip down to the maps below.]
Multilevel Regression with Poststratification proceeds in (basically) three steps. First, we construct a model to obtain the expected values of the variable of interest as a function of variables whose population distributions we know (typically this means only items that show up in the Census: geography, age, education, income, race, gender, maybe a few others).
Second, we use the model to predict the expected value of the variable of interest for each combination of predictor categories (or cell) in the model. For example, if the model used four regions (Northeast, South, Midwest, West), three categories of age (18-30, 31-60, 61+), three categories of education (HS, College, Graduate), two categories of income (Less than $50k, More than $50k), two categories of race (white, non-white), and gender as predictors, there would be a total of 4*3*3*2*2*2 = 288 cells. The cells range from an 18-year-old white male in Maine with a HS degree or less who makes less than $50k to a 75-year-old Asian female in New Mexico with a PhD who makes more than $50k. All individuals in the sample fit into one (and only one) of the cells defined by the combinations of predictors in the model. Many of the cells will be empty. In cells where there are no data, the model borrows strength from the other cells to come up with an expected value for every cell.
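To make the cell arithmetic concrete, here is a minimal Python sketch that enumerates the cells from the example above. The second function is only an illustration of what "borrowing" means: it is a simplified precision-weighted average that pulls a sparse cell toward the grand mean, not the model actually fit for these posts (a real multilevel model pools toward a regression prediction and estimates the variances from the data). The function name and the variance parameters are assumptions made for illustration.

```python
from itertools import product

# Predictor categories taken from the example in the text.
PREDICTORS = {
    "region": ["Northeast", "South", "Midwest", "West"],
    "age": ["18-30", "31-60", "61+"],
    "education": ["HS", "College", "Graduate"],
    "income": ["Less than $50k", "More than $50k"],
    "race": ["white", "non-white"],
    "gender": ["male", "female"],
}

# Every combination of categories is one poststratification cell.
cells = [dict(zip(PREDICTORS, combo)) for combo in product(*PREDICTORS.values())]
print(len(cells))  # 288


def partially_pooled_mean(cell_mean, n, grand_mean, sigma_within, sigma_between):
    """Hypothetical helper: compromise between a cell's own mean and the grand mean.

    sigma_within is the assumed spread of individuals inside a cell and
    sigma_between the assumed spread of true cell means. With n = 0 the
    estimate is just the grand mean; as n grows it approaches cell_mean.
    """
    if n == 0:
        return grand_mean
    w_cell = n / sigma_within**2   # precision of the cell's own mean
    w_pool = 1 / sigma_between**2  # precision of the pooled information
    return (w_cell * cell_mean + w_pool * grand_mean) / (w_cell + w_pool)
```

An empty cell gets the pooled value, while a cell with many respondents is left close to its own average.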
Finally, we weight each of the estimated cell values by that cell's share of the population in each geographic unit of interest, and sum the weighted values to come up with a prediction for that unit. In this case, we would have predictions for each region.**
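The weighting step is just a population-weighted average. The sketch below assumes we already have a model prediction for each cell and a Census count for each cell; the function name, the cell keys, and all of the numbers are made up for illustration.

```python
def poststratify(cell_estimates, cell_counts):
    """Population-weighted average of cell estimates for one geographic unit.

    cell_estimates: {cell_key: model-predicted value for that cell}
    cell_counts:    {cell_key: number of people in that cell in the population}
    """
    total = sum(cell_counts[c] for c in cell_estimates)
    weighted = sum(cell_estimates[c] * cell_counts[c] for c in cell_estimates)
    return weighted / total


# Hypothetical numbers for a single region with only two cells.
estimates = {("South", "18-30"): 0.5, ("South", "61+"): -0.25}
counts = {("South", "18-30"): 1000, ("South", "61+"): 3000}

south = poststratify(estimates, counts)
print(south)  # -0.0625
```

Because the older cell is three times as large in the (made-up) population, it dominates the regional estimate no matter how many respondents of each type happen to be in the sample.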
The map above (click on the image for a better view) plots the predictions from the MRP estimates of the difference between the liberal and conservative foundations for each congressional district. Districts in which YourMorals users valued the two “liberal” foundations (H and F) more than the three “conservative” foundations (I, A, P) are shown in dark green, and those districts cluster in the regions that we know to be the most liberal parts of the country: the Northeast and the West Coast (excluding the agricultural parts of California). The districts within which YourMorals users gave the most conservative pattern (IAP > HF) are shown in red, and these districts fall overwhelmingly within the South.
One way to test the validity of these measures of district level foundations is to compare them to observable characteristics of the districts. One ready comparison we can make is to the district’s share of the vote for Obama in the 2008 presidential election.
The figure below shows the simple bivariate relationship between each foundation and vote for Obama. In every case, we see the expected relationships. Districts that scored more highly on the Harm and Fairness foundations were more likely to go for Obama in the election. On the other hand, there is a strong negative relationship between a district’s score on the Purity foundation and its vote for Obama.
Interestingly, for several foundations (especially Harm and Authority), there appears to be a noticeable “kink” in the fitted line at the foundation's midpoint. Districts that score highly on the Harm foundation (or those with low scores on the Authority foundation) are associated with increased showings for Obama, but that relationship dissipates once the midpoint of the foundation is crossed (all foundation scores are measured in standard deviation units). This is an aspect of the data that deserves further attention, but I’ve run out of time and space to do so here.
Multiple regression estimates confirm the overall story we see here in the bivariate plots. The estimated foundation scores seem to meaningfully correlate with real-world phenomena. This bodes well for the validity of the measure and method.***
So, with a little work, it appears as if we can have our cake and eat it too when it comes to the YourMorals data. Scores on the foundations (after adjusting for the biases in the sample) are significantly related to district voting behavior.
*For a more detailed explanation of the methods involved, see here. See also Andrew Gelman and Jennifer Hill’s excellent book, Data Analysis Using Regression and Multilevel/Hierarchical Models.
**I made an editorial decision not to include the details of the model, etc., as they didn’t seem of general interest. I’m more than happy to talk about them, but the post was getting wordy as it was.
***The careful reader might well protest that the relationship we see in the figures presented is merely the product of the correlations with demography picked up by the MRP method. Since I used district demographics to adjust the scores obtained from the convenience sample, is it possible that the positive findings are simply a reflection of secondary correlations between the moral foundation scores and demographics? One easy way to test for this is to include demographic controls in the model. I was happy to see that none of the findings are substantially changed by including district demographics on the right-hand side of the regression.