Americans Checking Trump's Twitter En Masse Signals A Change In His Approval Rating

As President Trump received criticism for how he responded to the death of Senator John McCain, I noticed that his approval rating took a small but noticeable dip. Coincidence or correlation? This presidency has upended the traditional understanding of how Americans respond to political events. Is it possible to identify any signal or patterns in Trump’s approval ratings, or has fake news penetrated so deeply that the American public is responding to untracked rumors?

In this post, I link Google search trends for “Trump twitter” to his approval ratings. Scroll to the bottom if you want to skip the statistical details :)


Approval ratings: FiveThirtyEight compiles approval polls and calculates a daily weighted average to estimate Trump’s true approval rating. The data are available on FiveThirtyEight’s GitHub. I used the “All polls” model output. It is important to note that building a model on another model’s output is akin to making copies of a copy—it adds noise (and possibly bias). This is a key limitation of these data.

Google search trends: I used the gtrendsR package to extract Google trends data for the search terms “Trump” and “Trump twitter.” Unfortunately, daily trends are only available if querying the past three months from present; any other queries provide weekly data. This is the major limitation of the search trends data. While it is possible to interpolate between the weekly data points or aggregate the daily approval ratings, more granular data generally allows for better models. The “interest index” provided by Google is relative to all other searches and the maximum value in the set.

Exploratory Analysis

Let’s quickly take a look at these datasets. First is the FiveThirtyEight estimated approval rating for President Trump. Looks like it ranges from about 37% to 47%. I’ve included the confidence bands in this plot, but ignored them in the analysis.

Next are the Google search trends. I included some state-level data, which show remarkably similar trends despite very different political leanings. Thus I only used the Overall US “Trump Twitter” search data for the rest of the analysis. I chose the search term “Trump Twitter” over “Trump” since this is how many people (including myself) access the President’s Twitter account (rather than navigating through Twitter). A quick comparison of “Trump” vs. “Trump twitter” (not shown) suggested to me that “Trump twitter” may be more related to his approval rating.

When do people go to the President’s Twitter? A reasonable guess: when they hear he tweeted something controversial.

Lastly, here is a combined look at Approval rating and “Trump Twitter” search trends. Both have been normalized so we can look at them on a similar scale. It looks like there’s some relationship between the signals! We can see that spikes in search activity often correspond to dips or spikes in approval.
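The post does not say which normalization was used for that plot. A minimal sketch, assuming a z-score (subtract the mean, divide by the standard deviation) and using made-up weekly values rather than the real data:

```python
from statistics import mean, pstdev

def zscore(values):
    """Rescale a series to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical weekly values, for illustration only (not the real data).
approval = [41.2, 40.8, 39.9, 40.5, 42.1, 41.7]
search_interest = [12, 35, 80, 22, 15, 18]

approval_z = zscore(approval)
search_z = zscore(search_interest)
```

After rescaling, both series are unitless, so a spike in one can be compared by eye against a dip in the other.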


Based on the last plot, it seems reasonable to build a regression model where a unit change in search activity leads to some change in approval. Or better yet, a lagged regression model where, for example, a unit change in search activity the week before leads to some change in approval.
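To make the lagged idea concrete, here is a minimal sketch of a one-lag regression fit by ordinary least squares. It is a simplified stand-in, not the ARIMA models discussed next, and the function name `lagged_ols` and the weekly numbers are hypothetical.

```python
from statistics import mean

def lagged_ols(y, x, lag=1):
    """Fit y[t] = a + b * x[t - lag] by ordinary least squares."""
    y_t = y[lag:]                 # response, starting at time `lag`
    x_lag = x[:len(x) - lag]      # predictor, shifted back by `lag`
    x_bar, y_bar = mean(x_lag), mean(y_t)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x_lag, y_t))
    sxx = sum((xi - x_bar) ** 2 for xi in x_lag)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical series: approval drops 0.05 points per unit of
# search interest in the previous week.
search = [10, 50, 20, 80, 15, 30, 60]
approval = [42.0] + [42.0 - 0.05 * s for s in search[:-1]]

intercept, slope = lagged_ols(approval, search, lag=1)
```

Note the single slope: a model like this can only say that search activity pushes approval in one direction, which matters for the critique further down.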

I fit a number of ARIMA (autoregressive integrated moving average) time series lagged regression models to the data and they were lackluster. In general, these models identified some effect of search activity, but with very small coefficients.

Below is one such model, an ARIMA(0,1,1). I trained it on a subset of the data and forecasted the most recent period. The grey line represents the true approval rating and the blue line is the prediction based on search activity.


It doesn’t look great. While there may be some use to this model, I think a regression model is likely inappropriate for this problem because:

  • There can be a positive or negative effect on approval from a bump in search activity.

  • The magnitude of the effect is highly context-dependent. For example, a ridiculous but harmless tweet may generate a lot of traffic but not affect Pres. Trump’s approval rating.

A Random Walk with Pres. Trump

Okay, maybe we can’t predict Pres. Trump’s approval rating. Maybe his approval rating is just a random walk process, where at each point in time, his approval goes up or down by a draw from some distribution (e.g., a normal distribution).

If this were the case and we looked at changes in approval, we would see a symmetric squiggle. For a normal random walk, we would see the squiggle contained in a uniform band.
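A quick simulation shows what that benchmark looks like. This is a sketch with made-up parameters (a starting value of 42 and a step standard deviation of 0.2), not values fitted to the approval data.

```python
import random
from statistics import mean, pstdev

def simulate_random_walk(n_steps, start=42.0, step_sd=0.2, seed=0):
    """Random walk whose steps are independent normal draws."""
    rng = random.Random(seed)
    walk = [start]
    for _ in range(n_steps):
        walk.append(walk[-1] + rng.gauss(0.0, step_sd))
    return walk

def skewness(values):
    """Sample skewness: close to 0 for a symmetric distribution."""
    mu, sigma = mean(values), pstdev(values)
    return mean(((v - mu) / sigma) ** 3 for v in values)

walk = simulate_random_walk(5000)
changes = [b - a for a, b in zip(walk, walk[1:])]
change_skew = skewness(changes)
```

For a normal random walk the skewness of the changes should sit near zero, which gives a simple numeric check to go with the visual one.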


We definitely do not see a normal random walk. We can see “bursts” where there are large changes in approval. If we make a histogram of the changes in approval, we can see there is a “long left tail” and a non-symmetric distribution.

To me, this suggests Trump’s approval is not a random walk.

The Changepoints Paradigm

Here is another way to think about approval ratings:

  • Let’s assume that Trump’s approval is constant plus some noise in some intervals.

  • There are external shocks/bursts to his approval that lead to some new equilibrium.

This is a changepoints approach to modeling our problem. Changepoints detection has a long history in statistics, but has recently heated up. The most basic changepoint model assumes independent normal observations with a change in mean (see Hinkley 1970). I used a non-parametric changepoint detection method based on empirical distributions (Haynes et al. 2017). So let’s change our model:

  • Assume that Trump’s approval follows some (fixed) distribution in intervals.

  • And let’s add another assumption: only large shocks above some threshold lead to a new equilibrium.

Looking at the search trends data, it looks like a search interest index above 40 may qualify as a large shock. I wrote an algorithm that attempts to identify only the beginning of a “shock,” since there may be sustained activity afterwards.
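The shock-detection algorithm itself is not shown in this post. One plausible reading, sketched below with a hypothetical function name and made-up weekly values, is to flag only the first week of each run above the threshold:

```python
def shock_onsets(interest, threshold=40):
    """Return indices where the series first rises above the threshold.

    Consecutive values above the threshold count as one sustained shock,
    so only the first index of each run is reported.
    """
    onsets = []
    in_shock = False
    for i, value in enumerate(interest):
        if value > threshold and not in_shock:
            onsets.append(i)
        in_shock = value > threshold
    return onsets

# Hypothetical weekly interest index, for illustration only.
weekly_interest = [12, 18, 55, 62, 30, 25, 41, 20, 100, 70, 45, 10]
onsets = shock_onsets(weekly_interest)
```

In this toy series the three runs above 40 start at weeks 2, 6, and 8, and the sustained activity in weeks 9 and 10 is not counted again.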

Putting this model together, we have these shocks identified by a vertical line and my best guess for the reason for the search traffic. The horizontal lines represent Trump’s mean approval between the identified changepoints.

Looks pretty good! The “shocks” in search activity are generally aligned with changepoints in approval rating. A back-of-the-envelope estimate of RMSE using the search “shocks” to identify changepoints in approval is 13.9 days. This suggests there is room for improvement, perhaps by attempting to measure sentiment on social media. But not everyone tweets—and almost everyone googles!
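How that RMSE was computed is not spelled out above. A minimal sketch of one way to do it, assuming each shock is matched to its nearest changepoint and the gap is measured in days (the dates below are hypothetical):

```python
from datetime import date
from math import sqrt

def nearest_gap_rmse(shocks, changepoints):
    """RMSE (in days) between each shock and its nearest changepoint."""
    squared = []
    for s in shocks:
        gap = min(abs((s - c).days) for c in changepoints)
        squared.append(gap ** 2)
    return sqrt(sum(squared) / len(squared))

# Hypothetical dates, for illustration only.
shocks = [date(2018, 1, 7), date(2018, 4, 8), date(2018, 7, 15)]
changepoints = [date(2018, 1, 10), date(2018, 4, 20), date(2018, 7, 15)]

rmse_days = nearest_gap_rmse(shocks, changepoints)
```

Here the gaps are 3, 12, and 0 days, so the result is the square root of 51, a little over 7 days.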


  • This analysis suggests that when the Google search interest index for “Trump Twitter” exceeds 40 (see note below), we should expect to see a change in Trump’s approval rating either up or down.

  • Political scientists have long theorized that the “bully pulpit” (i.e., the fact that media will report the president’s speeches, press releases, etc.) serves as an “agenda setting” tool for presidents. It is unprecedented for a president to use the bully pulpit to damage their own popularity and agenda. This analysis suggests that is the case for Pres. Trump.

Note on Google Trends Search: Due to how trends are adjusted, the time period that you select (and any comparisons to other search terms) will determine what this threshold value is. It is approximately the April 8th 2018 value.

Code for this analysis can be found on my GitHub.

How Injuries Affect Fitness via VO2Max

In April 2017, I tore the meniscus in my right knee and in July 2017 I finally had surgery to repair it. It took me about a week to walk again, but due to atrophy in my right quad, it was months before I could run again. Now more than a year later, I'm curious how this injury affected my fitness. How does it compare to my peak months climbing North America's highest mountains?

VO2Max: A Better Measure of Fitness

Back in the 1990s, baseball fans and teams alike used two primary metrics to measure player performance: batting average (BA) and earned run average (ERA). By the late 1990s and early 2000s, "sabermetricians" (i.e., baseball nerds) developed better metrics to measure player performance and value: first came on-base percentage (OBP) and walks/hits per inning pitched (WHIP), then a bunch of intermediate "sabermetrics," and today we have Wins Above Replacement (WAR) and weighted on-base average (wOBA). The beginning of this transition was the subject of the book and film "Moneyball." WAR is such a good measure of player value that even old-school broadcasters on ESPN are using it.

When we think about personal aerobic fitness, we are often stuck using a basic metric akin to batting average: pace (min/mile). Pace has a lot of problems: we go slower uphill and faster downhill. Our pace varies with altitude. Sometimes we deliberately train at a faster pace or a slower pace. In fact, low-intensity endurance training is frequently recommended to be the base of an aerobic athlete's training regimen. While Strava now calculates Grade-Adjusted Pace, this does not give us a measure of fitness when we aren't trying to maximize our pace. Wouldn't it be great if we had a barometer of fitness for any activity? Enter \(VO_{2}Max\).

 \(VO_{2}Max\) is defined as the maximal rate of oxygen uptake and consumption during exercise. It is measured in mL/(kg·min). In some sense, it's a measure of the horsepower of the engine that is our aerobic system. The more oxygen we are able to use per min, the faster we are able to go. While elite athletes may hit a ceiling trying to over-optimize \(VO_{2}Max\), the rest of us will likely experience fitness gains when increasing our \(VO_{2}Max\).

Fortunately, most fitness and smart watches that track heart rate (including the Apple Watch) now provide an estimate of \(VO_{2}Max\). How good are these estimates? It depends on the equipment and manufacturer, but using a heart rate strap, the estimates may be as close as 2%. The methodology, as far as I can tell, has been developed by a single company: Firstbeat. My best guess is that the estimate depends on the relationship between heart rate and pace. I would love to reverse-engineer the calculation, so stay tuned for that!

The Data

I have a Suunto Ambit2 fitness watch with a heart rate strap. I try to use the heart rate strap on runs and occasionally on hikes, climbs, and bike rides. Here's a plot of my activities using the Ambit2. Larger circles are more mileage.


For the purposes of this analysis, I chose to only look at the running data, since the estimates of \(VO_{2}Max\) seem to be quite different for other activities. For example, while "trekking," I frequently carry a heavy pack.

Suunto has an online platform called "MovesCount," which provides an estimate of \(VO_{2}Max\). Unfortunately, there is no export function for these estimates! To download these data, I first found a list of my MoveIDs. I then used Selenium in Python to access each "move" (necessary because MovesCount is JavaScript-heavy) and then fed the resulting XML into BeautifulSoup and scraped the data. This was a massive pain, but I guess Suunto is trying to protect customer data.

VO2Max Over Time

Here is the plot of my \(VO_{2}Max\) from 2015 to 2018. I've annotated it with some climbs and the red line is the date of my surgery. Somewhat surprisingly, there is not an immediate dip—a short October run had an estimated \(VO_{2}Max\) of 48—but there is a dip after a winter of neither bike commuting nor running.


My fitness is highest right before major climbs and often crashes immediately after, perhaps because of fatigue and recovery. It seems to generally increase after those climbs. And the good news is it seems to be increasing again!

So in conclusion: it appears the surgery itself did not impact my fitness, but a winter of minimal aerobic exercise did. Ohio winters…

Possible Issues with the Estimated VO2Max

Anecdotally, I have noticed that the estimated \(VO_{2}Max\) is higher for more intense runs. Are there other "external" factors that affect these estimates? Short answer: yes.


The above scatterplot matrix shows correlation between variables. We can see estimated \(VO_{2}Max\) is correlated with average heart rate (r = 0.47), temperature (r=-0.34), and % of the run in the "hard" heart rate zone (r=0.42). So just working harder means a higher estimate of \(VO_{2}Max\) . And higher temperature means a lower estimate of \(VO_{2}Max\). We can also see a strong relationship between % hard and average heart rate.
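For readers who want to check numbers like these on their own runs, here is a minimal sketch of the Pearson correlation coefficient. The per-run values are hypothetical, not the data behind the scatterplot matrix.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical per-run values, for illustration only.
avg_heart_rate = [140, 152, 148, 160, 155, 145]
vo2max_estimate = [46, 48, 47, 50, 47, 47]

r = pearson_r(avg_heart_rate, vo2max_estimate)
```

A value near 0 would mean the estimate ignores how hard the run was. A value near 1 would mean the estimate mostly tracks effort rather than fitness.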

So the methodology appears to have some issues—ideally, we would see a consistent estimate of \(VO_{2}Max\) between a low-intensity and high-intensity workout for an equal level of fitness. And we're getting docked in hot weather.

On the flip side, the correlation between average heart rate and estimated \(VO_{2}Max\) is not that strong, which is encouraging. Additionally, the estimates of \(VO_{2}Max\) over time don't appear to vary too wildly.

How should we view these estimates? I think they do capture trends of my true \(VO_{2}Max\), but there are some problems with the method that should be addressed. There is room for improvement and hopefully we will see it as fitness and smart watches gain additional sensors. Perhaps I'll give it a shot!

A Fully Bayesian Approach to the Job Search

The job search is a long and random process. So long that my statistical skills may need some sharpening! To put them to use, I developed a fully Bayesian model to estimate how long it will take me to get a job!


I am estimating how many applications it will take before I get a job using a Bayesian model. With some simplification, there are two major steps to the application process: (1) getting a first-round interview, and (2) getting a job offer after the interview. There may actually be many interviews between the first-round interview and landing an offer, but I will likely not get enough data in those intervening steps to build a worthwhile model.

For this model, I am considering each application submitted to be a Bernoulli trial where a success is getting a first-round interview. I'm going to assume this is a fixed probability \(P(\text{First-round Interview})= p_1\). Then, if there is a success in this first trial, there is a second Bernoulli trial where a success is getting a job offer. Again I'm going to assume a fixed, but different parameter for this conditional probability of success, which I will call \(P(\text{Job Offer}|\text{First-round Interview}) = p_2\).

Note that I have data for both Bernoulli steps. For the first step, I have \(n_1=9\) trials and \(y_1=1\) success. For the second step, I have \(y_1=1\) trial (this is the number of first-round interviews I've had - we could also call this \(n_2\)) and \(y_2=0\) successes.

Honestly, I have no idea what my success probability is. So I chose a uniform prior for \(p_1\) on [0,1], which leads to a neat posterior. The posterior distribution of \(p_1|y_1, n_1\) is \(Beta(y_1+1,n_1-y_1+1)\) where \(n_1\) is the number of applications submitted. The same holds for \(p_2\) with a uniform prior; the posterior for \(p_2|y_1, y_2\) is \(Beta(y_2+1,y_1-y_2+1)\).
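For reference, the conjugacy behind that posterior takes one line. A uniform prior is \(Beta(1,1)\), so multiplying it by the binomial likelihood for \(y_1\) successes in \(n_1\) trials gives

\[
p(p_1 \mid y_1, n_1) \;\propto\; p_1^{y_1}(1-p_1)^{n_1-y_1} \times 1 \;=\; p_1^{(y_1+1)-1}(1-p_1)^{(n_1-y_1+1)-1},
\]

which is the kernel of a \(Beta(y_1+1,\, n_1-y_1+1)\) density. Plugging in the counts above, \(n_1=9\) and \(y_1=1\) give \(Beta(2,9)\) for \(p_1\), and \(y_1=1\) trial with \(y_2=0\) successes gives \(Beta(1,2)\) for \(p_2\).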

Posterior Distributions

Let's visualize these densities.


Nothing crazy here. The median of the posterior for \(p_1\) (chance of getting a first-round interview) is 0.16, with a 90% credible interval of 0.04 to 0.39. So the uniform prior is nudging that upwards from the \(\hat{p}\) we would get from \(\frac{y_1}{n_1}\).

The median of the posterior for \(p_2\) (chance of getting a job offer after first-round interview) is 0.29, which is not really saying much since our 90% credible interval is 0.03 to 0.78. We just don't have much data, which is doubly sad :( The prior in this case is giving me the benefit of the doubt, saying that I have a roughly one-quarter chance of getting a job offer even though I've received no job offers!
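These summaries can be approximated with nothing but Python's standard library. The sketch below uses Monte Carlo draws from `random.betavariate` rather than exact quantile functions, so its results should match the figures above only up to simulation noise. The helper name `beta_posterior_summary` is made up for this example.

```python
import random

def beta_posterior_summary(successes, trials, n_draws=100_000, seed=1):
    """Median and 90% credible interval for a Bernoulli probability.

    Assumes a uniform prior, so the posterior is
    Beta(successes + 1, trials - successes + 1).
    """
    rng = random.Random(seed)
    draws = sorted(
        rng.betavariate(successes + 1, trials - successes + 1)
        for _ in range(n_draws)
    )

    def quantile(q):
        return draws[int(q * (n_draws - 1))]

    return quantile(0.5), quantile(0.05), quantile(0.95)

# p1: 1 first-round interview out of 9 applications -> Beta(2, 9)
p1_median, p1_low, p1_high = beta_posterior_summary(successes=1, trials=9)
# p2: 0 offers out of 1 first-round interview -> Beta(1, 2)
p2_median, p2_low, p2_high = beta_posterior_summary(successes=0, trials=1)
```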

So how many jobs do I need to apply for?

Let's return to the original question. We have \(P(\text{First-round Interview})= p_1\) and \(P(\text{Job Offer}|\text{First-round Interview}) = p_2\). Then \(P(\text{First-round Interview and Job Offer}) = p_1*p_2\). Let's call this joint probability \(q\). Now we can conceive of a set of Bernoulli trials with this joint probability. The expected number of trials required to achieve one success comes from the geometric distribution and is \(\frac{1}{q}\).

Okay… so how does this help me? Well, I can simply simulate draws from my known posteriors and then combine the draws to estimate the posterior distribution of \(\frac{1}{q}\), which is the number of applications required to land a job! Here is the posterior for \(\frac{1}{q}|y_1,n_1,y_2\):


The median of the posterior is 25.11 with a 90% credible interval of 5.22 to 383.9. Note that due to the memoryless property of the geometric distribution, we should interpret those as additional applications, i.e., my best estimate is 25 additional applications. But take a look at that upper bound. Yikes! So the job search may take me another week to ... another year. Better get back to the job search!
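The simulation described above is only a few lines. This is a sketch of the approach, not the original code: it assumes the \(Beta(2,9)\) and \(Beta(1,2)\) posteriors implied by the counts in this post and uses an arbitrary seed, so the summaries will differ slightly from the ones quoted.

```python
import random

rng = random.Random(42)
n_draws = 100_000

applications_needed = []
for _ in range(n_draws):
    p1 = rng.betavariate(2, 9)   # posterior draw: first-round interview
    p2 = rng.betavariate(1, 2)   # posterior draw: offer given interview
    q = p1 * p2                  # chance that one application ends in an offer
    applications_needed.append(1.0 / q)  # geometric mean waiting time

applications_needed.sort()
median = applications_needed[n_draws // 2]
low = applications_needed[int(0.05 * n_draws)]
high = applications_needed[int(0.95 * n_draws)]
```

The long upper tail comes from draws where either probability lands close to zero, which sends \(\frac{1}{q}\) into the hundreds.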

The Data Science Job Search

I moved to Colorado about a month ago and have been on the job hunt for about three weeks. It's never easy when you're making a bit of a career transition, and I find myself in the middle ground where I have relevant experience, but not in tech or big data or a role that is defined as "data scientist." So I am probably best qualified for entry-level data science jobs, which are less common than positions demanding 3-5 years of experience.

To track my progress (and to feel like I have some control over the process), I developed a Tableau dashboard. It's always more fun to have something with color. Note that the colored numbers here are Glassdoor ratings. I also constructed a very rough, naive model using a geometric distribution to estimate how many applications I'll need to submit. A Bayesian approach would be better suited, so perhaps if this drags on I'll whip out RStan. Only 17 more applications to submit!

The Two-Sample Location Problem: Don't Do A Preliminary Test For Equal Variances

Ah... the two-sample t-test. It's old, it's classic, and it's still used thousands of times every day.

There's a lot we can learn from the ol' two-sample t-test. Yes, it can tell us with Type I error level α whether two means from two random samples from independent populations are the same, or not the same (different?). But how we use the t-test has some lessons for how we conduct data analysis in general.

In fact, this story hints at the impossibility of truly "objective" analysis and suggests that, despite the robo-hysteria, an algorithmic approach to data analysis is unlikely to replace statisticians anytime soon. Or so I hope.

The story goes something like this:

The problem

The two-sample location problem in the typical setup (i.e., ignoring the non-parametric tests) involves two means from independent populations.

Formally, the populations should be normally distributed so that the sample means are exactly normally distributed, but in practice, we often invoke asymptotic arguments that tell us, for large sample sizes, the sample means are approximately normally distributed. 

To characterize a normal distribution, we need to know the mean and the variance (or standard deviation). Most of the time, we don't know variance (and we obviously don't know the means), so we must estimate both. When we're dealing with a single population, this is easy and leads to the one-sample t-test. When we're dealing with the difference of sample means, it's a bit more complicated.

The fork in the road

The variance question leads to a fork in the road, with one path leading to Welch's test and the other to the pooled two-sample t-test. The simple heuristic taught in intro stats classes is: 

  • If you know the variances in the two approximately normal populations are equal, use the pooled two-sample t-test.
  • If you know the variances in the two populations are not equal, use Welch’s test.
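The two tests in that heuristic differ only in how they estimate the standard error and the degrees of freedom. A minimal sketch using the textbook formulas and hypothetical samples (turning the statistics into p-values would additionally need a t-distribution CDF, which is left out here):

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(x, y):
    """Pooled two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

def welch_t(x, y):
    """Welch's t statistic with Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x) / n1, variance(y) / n2
    t = (mean(x) - mean(y)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical samples: the smaller group is the noisier one.
a = [5.1, 4.9, 5.6, 5.2, 4.8, 5.4]
b = [6.0, 4.1, 7.2, 3.9]

t_pooled, df_pooled = pooled_t(a, b)
t_welch, df_welch = welch_t(a, b)
```

When the two samples are the same size, the two statistics coincide and only the degrees of freedom differ.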

But if you don't know if the variances are equal (the "Behrens-Fisher problem"), you're up a tree. What procedure should a good statistician use?

A well-intentioned but misleading solution

In the 80s or 90s, some researchers decided that a simple solution to this problem is to create a conditional procedure, so named because the choice of second test is conditional on the results of the first:

  • First, conduct a test for equality of variances, using either an F-test or Levene’s test.
  • Second, we can use the simple heuristic: use the pooled two-sample t-test if we fail to reject the null in the first test, and use Welch's test if we reject the null in the first test.

Implicit in this reasoning is a conjecture that this procedure leads to greater power, since the pooled t-test has greater power than Welch’s test when the variances of the populations are equal. Another potential advantage of this procedure is the appearance of the “unbiased statistician,” since the procedure requires no prior knowledge or intuition in selecting tests.

This procedure has gained sufficient traction to become incorporated in statistical packages such as SPSS, noted on Wikipedia, and described on “how-to” blogs. As shown below, SPSS provides the results of Levene’s test alongside outputs from the pooled t-test and Welch’s test. This presentation of results, while not forcing the user to select one test, certainly implies the user would be foolish to select Welch’s test if Levene’s test fails to reject the null hypothesis of equal variances, and vice versa.

 What kind of dummy would ignore the Levene's test output when selecting which t-test to interpret?

It's not just software. Wikipedia states the following on the page for Levene’s test:

Levene’s test is often used before a comparison of means. When Levene’s test shows significance, one should switch to more generalized tests that is [sic] free from homoscedasticity assumptions (sometimes even non-parametric tests).

And if you were to google "how to conduct a two-sample t-test in R"? You might get pointed to a blog post that states:

Before proceeding with the t-test, it is necessary to evaluate the sample variances of the two groups, using a Fisher’s F-test to verify the homoskedasticity (homogeneity of variances).

Wow, pretty convincing! How could so many sources be wrong?

Tests of Homoscedasticity Break Down Too

The major problem with the conditional procedure is that the test for equality of variances (homoscedasticity) breaks down when variances are close, but not actually equal. It has trouble detecting this difference—statistically this is called "low power"—and then frequently selects the "equal variances" pooled t-test, which is the incorrect choice. The pooled t-test then produces type I errors at up to a 50% higher rate than the specified alpha level. 

What compounds this problem is unbalanced sample sizes—one sample is much larger than the other. The theory behind the pooled t-test only holds if (1) variances are equal, or (2) sample sizes are equal in a balanced experiment. If one of these conditions is exactly true, we can do without the other—but approximately true won't cut it.
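A small simulation makes the point about unbalanced samples tangible. This sketch is not the simulation study discussed in this post: it skips the preliminary test entirely, picks an illustrative 5:1 sample size ratio with the smaller sample having twice the standard deviation, and uses a normal critical value as a large-sample stand-in for the t quantile. Both groups have the same mean, so every rejection is a Type I error.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance

def type_one_error_rates(n1=250, n2=50, sd1=1.0, sd2=2.0,
                         n_sims=2000, alpha=0.05, seed=0):
    """Estimate Type I error for the pooled and Welch statistics."""
    rng = random.Random(seed)
    # Assumption: normal critical value instead of the exact t quantile.
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    pooled_rejects = welch_rejects = 0
    for _ in range(n_sims):
        x = [rng.gauss(0.0, sd1) for _ in range(n1)]
        y = [rng.gauss(0.0, sd2) for _ in range(n2)]
        diff = mean(x) - mean(y)
        vx, vy = variance(x), variance(y)
        sp2 = ((n1 - 1) * vx + (n2 - 1) * vy) / (n1 + n2 - 2)
        if abs(diff / sqrt(sp2 * (1 / n1 + 1 / n2))) > crit:
            pooled_rejects += 1
        if abs(diff / sqrt(vx / n1 + vy / n2)) > crit:
            welch_rejects += 1
    return pooled_rejects / n_sims, welch_rejects / n_sims

pooled_rate, welch_rate = type_one_error_rates()
```

With these settings the pooled statistic should reject well above the nominal 5%, while Welch's statistic should stay close to it. Setting `n1 == n2` or `sd1 == sd2` should bring the two back into rough agreement.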

If variances are slightly unequal and our sample sizes are very unbalanced, we get the worst case scenario where type I error for the conditional procedure can be 50% higher than the specified alpha, shown below. This figure compares type I error for the conditional procedure (with Levene's test as the first step) to the unconditional procedure (just Welch's test) using results from a simulation study (10,000 simulations per data point). The unconditional procedure does well across the board. The conditional procedure is only okay when sample sizes are close to balanced.

 Recall we want the Type I error to be 0.05 for a level alpha=0.05 test. Any deviations over 0.05 are bad news!

Next, we'll fix the sample size ratio at 0.2 and vary the variances. Here we see that the conditional procedure is okay when the variances are very different—when Levene's test has high power—but breaks down in the middle area where the variances are slightly different. The unconditional procedure is again fine across the board.



So we have seen that the conditional procedure has some pretty serious issues! The following Q&A addresses what we can take away from this.

Question: Are there any scenarios when we should use the conditional procedure?

Answer: No.

It turns out the argument that the conditional procedure has greater power does not hold water. While it does have greater power in some circumstances, it has higher rates of Type I error in those same circumstances! You can email me for those results or consult the literature.

In the borderline scenario where the variances are close, one should definitely not use the conditional procedure. Welch's test performs well in such scenarios.

Question: Why is the conditional procedure so prevalent?

Answer: At a high level, it's because the people recommending it don't understand the statistical theory of the tests.

I also think the following factors contribute: the appearance of the "unbiased statistician," the idea that math is mechanical, an engineering approach to statistics, and the idea that we can extract a deterministic knowledge from data (not subject to randomness).

Question: What lessons should we take away from this?

Answer: I learned that relying on tests to assess assumptions can lead to unintended consequences. The conditional procedure is perhaps akin to an overfitting problem where researchers may attempt to extract more information from the data than is available.

Lastly, mechanical procedures don't necessarily lead to better outcomes. It's best to use your own knowledge of the data, the data-generating process, and the context of the data to select methods.

Which is why we need statisticians!

Using Changepoints in Heart Rate Variability to Identify Activities

In the course of a project on changepoints methods for a time series course, I found that there is not a lot of literature documenting analysis of heart rate and fitness tracking data. Presumably there are some neat analytics in a product like Strava Premium, but these methods are proprietary. Since the data are relatively straightforward (pace, heart rate, and altitude), it would be neat to explore whether there are some easy insights in the data.

For this project, I looked at heart rate variability (HRV), which is commonly known as a metric for recovery. It seems like it's also a good indicator of "activity types." See below for a quick demonstration of changepoint in variance methods on heart rate data collected by me on the Mailbox Peak hike in North Bend, WA (8,690 m, 1,220 m elevation gain) using a Suunto Ambit2 HR watch and heart rate belt. Although heart rate data is discrete, I found that lag-1 differenced heart rate is approximately normal for a given activity (e.g., running) with minimal autocorrelation beyond lag 0. Thus the assumption of an i.i.d. Gaussian process appeared to be reasonable for these data.

Applying the PELT method to detect changepoints in variance with a normal likelihood cost appeared to identify different activities and phases in the hike: an initial "try hard" phase where pace was erratic, a cardiovascular phase where heart rate variability was lower as pace settled, a rest phase (around the discontinuity in the data), and the descent phase. The binary segmentation method with a cumulative sums of squares cost identified a few additional changepoints, suggesting possible overfitting. Note that we omitted some outliers corresponding to short breaks (visible in the figure) to clean up the output.
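The core idea can be sketched without a changepoint library. The code below is a deliberately simplified stand-in for PELT: it finds a single changepoint in variance by exhaustive search, using a zero-mean Gaussian likelihood cost on lag-1 differences. The heart rate series is synthetic and the function name is made up for this example.

```python
import random
from math import log

def variance_changepoint(diffs, min_len=10):
    """Find the single split that best separates two variance regimes.

    The cost of a segment is n * log(mean squared value), which is the
    Gaussian negative log-likelihood (up to constants) with mean zero.
    """
    n = len(diffs)
    prefix = [0.0]  # running sum of squares for O(1) segment costs
    for d in diffs:
        prefix.append(prefix[-1] + d * d)

    def cost(start, end):
        length = end - start
        return length * log((prefix[end] - prefix[start]) / length)

    best_tau, best_cost = None, float("inf")
    for tau in range(min_len, n - min_len + 1):
        total = cost(0, tau) + cost(tau, n)
        if total < best_cost:
            best_tau, best_cost = tau, total
    return best_tau

# Synthetic heart rate: 300 steady samples, then 300 erratic ones.
rng = random.Random(7)
heart_rate = [120.0]
for i in range(600):
    step_sd = 1.0 if i < 300 else 4.0
    heart_rate.append(heart_rate[-1] + rng.gauss(0.0, step_sd))

diffs = [b - a for a, b in zip(heart_rate, heart_rate[1:])]
tau = variance_changepoint(diffs)
```

PELT extends this to an unknown number of changepoints by adding a penalty per changepoint and pruning candidate split locations, which is what makes it practical on a full hike's worth of data.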


Mapping Public Transit Accessibility

A few months ago, I moved almost exactly one mile up Capitol Hill, from an area nearly adjacent to downtown to a more residential and quiet area. My old apartment was kitty-corner to a dance club and Occupy Seattle and had experienced a break-in where the only thing stolen was my old bike. It was a crazy and fun place to live for nearly two years.

My new neighborhood is beautiful and peaceful, but the move has added considerably more travel time to any public transit travel outside of the downtown core of Seattle. Andrew Hardin at the University of Colorado recently created an interactive visualization that demonstrates exactly how the move affected my transit times.

Trips to the neighborhoods and cities of Fremont (north), Magnolia (west), West Seattle (southwest), Georgetown (south), Kirkland (east), and Issaquah (east) are now 50 minute long hauls—all of these places were previously within 40 minutes. (Ballard remains a long haul from Capitol Hill.)

Seattle is investing in express buses, streetcars, and rapid transit that should increase my transit range, but I still feel public transit in the city has a long way to go. These maps don't reflect the frequent bus delays that can make transfers difficult to time—one of the major advantages of rapid transit.

The King County transit system can boast, however, of being one of few systems that allows for bus-to-hike. You can see some of these options off the 520 in Issaquah (southeast). In less than two hours, I can take a series of buses to Mt. Si and hike 3,500 feet to a snow-covered peak. Now that's a different kind of accessibility!

Interactive Storytelling

The New York Times has put together a superb archive of interactive infographics, visualizations, and photo/video journalism that they are calling 2013: The Year of Interactive Storytelling. This is, in fact, a misuse of the term: Wikipedia defines interactive storytelling as "a form of digital entertainment in which users create or influence a dramatic storyline through actions." But it is nonetheless a term that seems to encompass the experimental and innovative formats the Times has begun incorporating into their reporting.

My favorite piece was How Y’all, Youse and You Guys Talk, a combination of a survey and map depicting linguistic similarity to the user (within the US). I think it's fair to say this was the most talked about visualization of the year—and among my friends, probably the most discussed visualization ever. It popped up in social media, in my email inbox, and in conversations over beers.

Addendum: It turns out the piece was the most read article of the year, despite coming online on December 21st. Remarkable.

 My map: born and raised in Berkeley, CA, college in Minnesota and Southern California, current resident of Seattle, WA. One of my biggest work clients is located in Michigan—how many milliseconds does it take them to realize I'm an out-of-towner?

I think this visualization succeeds because it reminds us, in a highly personal way, of the communities and cultures we come from, years after we have physically left them. My dad's map reflects the decade he spent growing up in Washington D.C., despite the 40 years he's spent in California.

The results are memorable because they challenge some of our conventional notions of place divisions. In the West, the urban/rural divisions seen in voting patterns are not discernible (Minneapolis, Chicago, and Washington D.C. are easily spotted, however). State lines seem to matter to some extent but the trends bleed across the borders.

I do wish there was additional annotation and explanation. The visualization presents you with the words most definitive of the three most and three least similar cities. But I have no idea what pronunciations or vocabulary I share with South Carolina and Maine.

In a high school linguistics class, I remember being told the US has a remarkably low number of dialects given its size, which is of course a product of the country's young history. This visualization does not refute that, but does show a surprising amount of linguistic diversity in light of a dominant national media and high rates of mobility between states and regions.

In conclusion, it's a hella savage visualization.

Scientific Integrity

Scientific integrity ... corresponds to a kind of utter honesty--a kind of leaning over backwards. For example, if you're doing an experiment, you should report everything that you think might make it invalid--not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you've eliminated by some other experiment, and how they worked--to make sure the other fellow can tell they have been eliminated.

Important words from Richard Feynman's 1974 Caltech Commencement Address.

Cited in Arthur Lupia's great plenary talk at Evaluation 2013.