One Path to Becoming a Data Scientist

Well, I am happy to announce that the job search is finally over and I am officially a “data scientist.” It was a long and circuitous path to getting there, and there is certainly not one path to becoming a data scientist. I’d like to describe my path here for those aspiring to a career in data science.

Estimating Applications

Before that, let’s all bask in the glory that is Bayesian statistics. After nine applications, I constructed a Bayesian model to estimate how many additional job applications I would need to submit. I submitted 11 additional applications, receiving a job offer on the 8th, for a total of 20 applications.

Here is the posterior distribution of additional applications from that model, plus the actual outcome.
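The model itself isn't shown in this post, but a minimal sketch of one way to set it up looks like the following. The Beta(1, 1) prior and the "9 applications, 0 offers" starting point are my own illustrative assumptions, not the actual model:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: 9 applications submitted, no offer yet.
# With a Beta(1, 1) prior on the per-application offer probability p,
# the posterior after 9 failures is Beta(1, 1 + 9) = Beta(1, 10).
n_sims = 100_000
p_draws = rng.beta(1, 10, size=n_sims)

# Posterior predictive: the number of additional applications until
# the first offer is geometric in p (each application is an
# independent Bernoulli trial).
additional_apps = rng.geometric(p_draws)

print(np.median(additional_apps))          # a central estimate
print(np.percentile(additional_apps, 90))  # an upper bound to plan around
```

The heavy right tail of those geometric draws is a big part of why "just keep applying" is sound advice.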


What did I take away from this exercise?

  1. Applying to jobs is a random process. Don’t take rejections personally—just keep applying and trying to optimize your chances.

  2. You’ll have a good sense of how many applications you need to submit after 10 to 20 applications.

  3. With a Master's degree from a reputable school (e.g., Tier 2 in the US News rankings), a reasonable lower bound for your interview success rate is 10%. If you're above 20%, you're in a good spot, provided you're not botching interviews.

  4. If your interview success rate is approaching or below 10%, it's a good time to regroup and take a long look at your résumé.

To elaborate on point 4, midway through the process I completely overhauled my résumé and, most importantly, added new projects to it. And yes, this breaks the Bayesian model's assumption of a constant probability of success. But all models are wrong and some are useful, and I'd argue that model was useful.

Applying to jobs is a random process. But from my handful of job interviews, as well as a few informational interviews, I think there are a couple of "core competencies" that employers are looking for. A highly technical position assuredly requires more, but I think these are common across any position calling itself "data science."

Core Competency #1: Track Record of GTD

When hiring for a technical role, a manager needs to see that you have a track record of getting things done. GTD is the name of an entire productivity framework; here I just mean a track record of managing tasks and projects. I can imagine two worst-case scenarios for a hiring manager: (1) hiring someone who isn't proficient in stats and machine learning, and (2) hiring someone who gets lost in the weeds, doesn't deliver projects on time, and can't convince people that the model is useful. This competency addresses the latter.

This track record can take many forms. I have a background of five years in consulting, using data and research to improve client programs. You can easily talk about managing graduate school research or graduate school teaching. It’s just a matter of articulating your experience.

A major challenge of many data science teams is convincing other elements of the company to adopt their models. Make sure your experience shows you can deliver the models and communicate their value.

Core Competency #2: Applying Models to Projects

Coming straight out of grad school, I could only talk about a few class projects as evidence that I knew how to apply stats and machine learning methods to "real world" problems. Even in grad school, a class project is a two- to three-week affair (and of course some students do them in a weekend). A data science project can span many months, and employers want to see evidence you are prepared to tackle something of that scale. A thesis demonstrates this competency, but as a non-thesis Master's student, I could not point to a long-term research project.

The solution? Find some interesting problems, and apply a model to them. My MLB attendance model was a big undertaking—not only to wrangle the data, but also to tune the model, interpret the model outputs, and produce the visualizations. I think that project was the single biggest factor in landing my job.

Core Competency #3: Articulate Your Data Science Strength

Data science can be described as the intersection of statistics, hacking, and subject matter knowledge. Where do you sit within that intersection? I am a strong believer in the importance of understanding statistical theory. It’s why I went back to school, why I chose my program, and what I prioritized in terms of coursework. Some of my colleagues come from an engineering and computer science background, while others come from an applied math background. While we all aspire to be strong in all disciplines of data science, it’s important to plant your flag somewhere and articulate what you believe your greatest strength to be.

Good luck to all the aspiring data scientists out there! There is no single path to the career, and mine is just one example. However, I hope it gives you some nuggets of information. The “no data” problem is no fun! 


The Outdoors Industry in One Chart

One of my favorite websites is OutdoorGearLab, founded by former Yosemite big wall climber Chris McNamara (who also founded SuperTopo). The best way to describe it is the Pitchfork of outdoor gear: a little site that has become a big site that can make or break your product. The coveted "Editors' Choice" is the equivalent of Pitchfork's "Best New Music" badge. I'm sure it comes with a bump in sales.

But in my opinion, OutdoorGearLab is a more reliable review site than Pitchfork because its reviews are far less subjective. OutdoorGearLab is a breath of fresh air in the world of gear reviews because (1) manufacturers do not pay for or provide the gear for review, (2) a team of reviewers assigns ratings (rather than an individual), and (3) each score is a sum of sub-scores, making the scores transparent.

Brand Analysis

OutdoorGearLab has more than 194 reviews, covering 2,198 products, on its website. Each review provides a great summary of the products, including a scatterplot of score versus price. However, there is no public analysis of brands across reviews.

So here is a November 2018 snapshot of brand performance compared to brand value.

All data from

Briefly: on the Y-axis we have review scores, and on the X-axis we have value (i.e., points per dollar; much more explanation below). Z-scores are related to a normal distribution, where higher is better and a typical range is -3 to 3. A Z-score can be interpreted as the number of standard deviations above the mean. The size of the dots is the number of reviews.

The Four Quadrants of the Outdoor Industry

This figure gives us four quadrants in the outdoor industry:

  1. Neither Performance nor Value: Brands such as MSR and Five Ten haven't delivered consistently above-average products or consistently high-value products.

  2. Capable Budget Brands: These brands are typically below-average performers, but they consistently provide good value. REI is the face of this group. Metolius and Mad Rock are undoubtedly the value brands of climbing. The North Face is surprisingly in this group.

  3. Premium Brands at Premium Prices: These are the “technical” brands, which consistently have received above-average ratings but do not provide above-average value. Arc’teryx is the face of this group; they frequently offer top-of-the-line products for insane prices.

  4. Value and Performance: These are the brands the gearheads should love. Above-average performance at above-average value. Outdoor Research and Black Diamond are most representative of this group.

Four Quadrants in Outdoor Sub-Industries

It’s interesting to see that some brands position themselves differently in some sub-industries. For example, REI is in value and performance in the camping and hiking sub-industry.


Similarly, The North Face is value and performance in the shoe sub-industry, whereas Arc’teryx is barely a premium brand. This makes sense since Arc’teryx is a new entrant into shoes.


Patagonia (although only having a few products reviewed) meets value and performance in the climbing sub-industry, which is consistent with my experience. You don’t have to pay a premium for their climbing packs and they perform well.


In snow sports, we have Salomon positioned as a budget brand, even though their shoes are premium and their clothes are neither performance nor value.


The quadrants for the outdoor clothing sub-industry are spot on, from my experience.


And lastly, we have bikes, which I don't know much about. Let me know if this jibes with your feel for the industry!


There is an interactive dashboard below where you can look at specific gear categories. But before setting you loose, a bit more explanation.


First, how were the data collected? I scraped OutdoorGearLab in November 2018, but only downloaded data from reviews where a 1–100 score was given.

As mentioned, Z-scores are related to a normal distribution. When data follow a skewed distribution, the data must first be transformed before a Z-score is calculated.

Review scores were skewed left, so the best transformation was the square transformation. Because scores can vary widely across reviews, I calculated the z-score within a review. So the average review z-score is the average number of standard deviations above the mean squared score for a review. For example, Petzl products are on average 0.4 standard deviations better than other products in a review.

Value was calculated as points per dollar. Again value can differ a lot across reviews (e.g., bikes vs. bike helmets), so I calculated the z-scores within a review. Value was skewed right, so a log transform was appropriate. Thus average value z-score is the average number of standard deviations above the mean log value for a review. For example, Mad Rock products have on average 0.7 standard deviations better value than the average product.
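As a sketch, the two transformations and within-review z-scores can be computed with pandas as follows. The toy data and column names here are my own stand-ins, not the actual scraped dataset:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the scraped reviews (hypothetical values).
df = pd.DataFrame({
    "review": ["packs", "packs", "packs", "tents", "tents", "tents"],
    "brand":  ["A", "B", "C", "A", "B", "C"],
    "score":  [88, 75, 70, 90, 80, 60],    # 1-100 review score
    "price":  [120, 60, 45, 400, 250, 150],
})

# Scores are skewed left, so square them; value (points per dollar)
# is skewed right, so log it.
df["score_t"] = df["score"] ** 2
df["value_t"] = np.log(df["score"] / df["price"])

# Z-score within each review, so brands are compared only against
# the products they were reviewed alongside.
zscore = lambda s: (s - s.mean()) / s.std(ddof=0)
df["score_z"] = df.groupby("review")["score_t"].transform(zscore)
df["value_z"] = df.groupby("review")["value_t"].transform(zscore)

# Average the within-review z-scores by brand for the scatterplot axes.
brand_summary = df.groupby("brand")[["score_z", "value_z"]].mean()
print(brand_summary)
```

Standardizing within a review is what makes the final axes comparable across categories as different as bikes and bike helmets.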

Interactive Dashboard

Here is the interactive dashboard created in Tableau. Things you can do in the dashboard:

  • Select one or multiple categories using the checkboxes.

  • Set a different minimum number of reviews for a brand. I recommend this be set to at least 3. (There are many brands with only one review on OutdoorGearLab.)

  • Click on a brand to see it highlighted in other charts.

  • Mouse over the histogram bars to see individual product ratings, price, and awards.

The Immigration Act of 1924 Ruined American Cuisine

This is the first of a series of posts on the Japanese-American internment.

Okay, this post is not really about American cuisine, but it is a bit of a thought experiment on what the US would look like if not for a series of xenophobic policies enacted in the early 20th century. The policies targeting Japanese immigration—the Gentlemen's Agreement of 1907 and the Immigration Act of 1924 (a.k.a. the Asian Exclusion Act)—were in fact effective at limiting immigration and the growth of the Japanese-American population.

My ancestral homeland (prison): Topaz, UT

Much like investing in a Roth IRA in your 20s, permitting more immigration in the early 20th century would have had a noticeable—and perhaps dramatic—effect on the demographics of today's Western US. These policies nipped the "Japanese problem" in the bud and prevented an exponential growth of the Japanese-American community. Indeed, my grandfather kept editorials from the San Francisco Examiner that described the Japanese-Americans as "breeding like rabbits." Without the Immigration Act of 1924, the editors at the Examiner would have seen a lot more rabbits in the 1930s and 40s.

Rabbits who cook delicious Japanese food. See where this is going?


The Japanese-American internment/relocation/imprisonment was a massive government effort that required detailed record keeping for each internee/prisoner. This resulted in a number of comprehensive and detailed datasets on the Japanese-American population, including immigration and birth years. The US National Archives and Records Administration thankfully has made the information available, and it has since been compiled and cleaned. I was able to use these data to look at Japanese immigration by year, estimate the number of adult, non-senior women, and then estimate birthrates.

To simulate immigration and population without the policies, I modeled immigration as an AR(1) (autoregressive) process with normal errors, estimated from the period before the policies were enacted. Then I took a pseudo-Bayesian approach by taking random draws from a normal distribution to estimate birthrate. Despite the stats jargon, this was a simplistic approach meant to capture the general trends. Precision was not an important part of the exercise.
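The AR(1) step can be sketched as below. The parameter values are purely illustrative, not the estimates from the pre-policy data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical AR(1) parameters, as if estimated from pre-policy years:
# x_t = c + phi * x_{t-1} + e_t, with e_t ~ Normal(0, sigma).
c, phi, sigma = 500.0, 0.9, 300.0

years = 40
x = np.empty(years)
x[0] = 2_000.0  # starting annual immigration level (illustrative)

for t in range(1, years):
    x[t] = c + phi * x[t - 1] + rng.normal(0.0, sigma)

x = np.clip(x, 0, None)  # annual immigration can't go negative
print(x[:5].round(0))
```

Each simulated year depends on the previous one plus noise, which is what lets the counterfactual carry forward the pre-policy trend rather than a flat average.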


Looking at the historical record (solid lines), it is immediately obvious that the Gentlemen’s Agreement and Immigration Act of 1924 had dramatic and immediate effects on immigration. The first Japanese immigrants to the US were predominantly men. The Gentlemen’s Agreement put an end to immigration for workers; however, family members of existing immigrants were still permitted to immigrate. This led to many brothers coming over to the US, as well as “picture brides” who would claim to be married to an existing immigrant. But in reality, many men would select a bride using pictures and then the marriage would be arranged by the families in Japan. My great-grandparents were one such couple (it worked out for them).

It’s hard to say what would have happened without the Gentlemen’s Agreement. If the linear trend in immigration had continued, the West Coast of the US would look a lot different.

The Immigration Act of 1924 put a hard stop to Japanese immigration. But before 1924, immigration from Japan had already begun to dip. I can only speculate that this may be because of improving economic conditions in Japan, informal immigration restrictions imposed by the US, animosity experienced by Japanese immigrants, or a realization that prosperity may not be easy to come by in America.

There are many echoes of 1924 today.

The Immigration Act of 1924 did not target only non-whites, nearly all of whom were excluded from immigration (with the exception of black African immigrants). According to Wikipedia, the law was "primarily aimed at further decreasing immigration of Southern Europeans, especially Italians; and, to a lesser extent, immigrants from countries with Roman Catholic majorities, Eastern Europeans, Arabs, and Jews." Countries with Roman Catholic majorities included Spain and France.

So a list of the targeted countries included:

  • Spain

  • France

  • Italy

  • Japan

  • China

  • Thailand

  • India

  • Greece

What do all these countries have in common? They represent eight of the top eight food cultures.

It seems safe to conclude that American cuisine would be in a better place without the Immigration Act of 1924. I can only imagine the amount of constipation that could have been prevented (Source: my one week eating sausage and potatoes in Germany).

Japanese-American Population

With immigration curtailed more than 90 years ago, the Japanese-American population has slowly dwindled over time largely due to intermarriage with other ethnic groups. As a fourth-generation Japanese-American, I am incredibly distanced from my ethnic roots. While we maintain some traditions and a few institutions such as ethnic churches, the Japanese-American identity is now essentially an American identity with a few twists—and hopefully, a deepened appreciation for history and civil rights.

It’s abundantly clear that the Japanese-American population would have continued growing at an exponential rate without the Immigration Act of 1924. It would have likely been twice as large as the actual pre-war population.

History may have been very different. Some possibilities:

  • A much larger Japanese-American population would have been more difficult to forcibly relocate into internment camps, affecting the nature of that decision.

  • A larger Japanese-American and Asian population in general may have encouraged further immigration from Japan and other Asian countries post-war.

  • Regardless, this would have meant more, and larger, Japanese-American cultural institutions post-war.

  • This would have made it a lot less likely that I would have been the only Japanese-American in my graduating high school class and the only Japanese-American male in my graduating college class.

And we already know that this would mean more delicious Japanese food in markets and restaurants.


Significantly Bad: Luck Does Not Explain Oakland A's Playoff Struggles

The A's lost the 2018 American League Wild Card game to the Yankees yesterday, October 3rd 2018, in a game that favored the Yankees roughly 60-40 according to FiveThirtyEight's win probabilities. The outcome on its own was not surprising, but it was another devastating loss for a long-time A's fan such as myself. The A's have trouble winning in the playoffs, particularly in games where they could advance to the next round.

Since 2000, the A’s are 15-24 in the playoffs for a winning percentage of 0.38.

They are 1-14 (a 0.07 winning percentage) in games where they could advance.

Those numbers don’t tell us how many games the A’s should have won. The A’s were favored to win in many of those games. How many games should we expect the A’s to have won? How unusual is their record in the playoffs?

It’s significantly bad.

Simulating the A’s Playoffs Since 2000

As far as statistics goes, this is a simple problem. I simply downloaded FiveThirtyEight’s game-by-game win probabilities. I extracted the “rating probabilities” which factor in home field advantage, pitchers, rest, and travel. Then I simulated each A’s playoff game as a Bernoulli trial using the probabilities as the p parameter. I ran 200,000 simulations for the A’s and Yankees playoff scenarios and 100,000 simulations for all other teams.
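That simulation can be sketched as below, with made-up win probabilities standing in for the FiveThirtyEight download:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for FiveThirtyEight's per-game "rating probabilities"
# for the A's 39 playoff games since 2000 (hypothetical values).
p = rng.uniform(0.35, 0.65, size=39)

# Each simulated postseason history: one Bernoulli trial per game.
n_sims = 200_000
wins = rng.binomial(1, p, size=(n_sims, 39)).sum(axis=1)

expected_wins = p.sum()        # expected wins = sum of the probabilities
p_value = (wins <= 15).mean()  # share of simulations as bad or worse

print(round(expected_wins, 1), round(p_value, 4))
```

The share of simulations at or below the actual win total is what gets interpreted as a p-value.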

Since the A’s just faced the Yankees, I used them as a point of comparison.


Again, the A's have won 15 of their 39 playoff games since 2000. Of the 200,000 simulations, the A's won 15 or fewer games only 2.8% of the time. This can be interpreted as a p-value, and the general rule of thumb places statistical significance at a p-value of 5% or lower. Hence, the A's have been significantly bad in the playoffs. Based on the simulation, we would expect them to have won 21 games; they've won only 15.

The Yankees have won 70 out of 131 playoff games since 2000. Of the 200,000 simulations, the Yankees won 70 or fewer games 78.8% of the time. So they have gotten lucky, but not unusually lucky. We would expect them to win 66 of their playoff games.

Simulating A’s and Yankees Clinching Games Since 2000

A p-value of 2.8% indicates we’re seeing something unusual with the A’s playoff record. It’s unusual at the one in forty level. But with regards to clinching games (where they could have advanced with a win), the A’s are almost impossibly bad.

Of my 200,000 simulations, in only 0.008% of simulations did the A’s win either one or zero games. That’s a p-value very close to zero. Extremely significantly bad. Based on the simulations, we would expect them to have won 8 games. Not just one.

The Yankees' performance in clinching games is very reasonable. They have won 16 of 33 clinching games since 2000. In the simulations, they won 16 or fewer games 47.1% of the time. The Yankees' expected win total for clinching games is 17.


The little yellow bar that would represent as bad or worse scenarios for the A’s is invisible in the plot because it is so rare. The A’s record in clinching games cannot be explained by luck. It’s too unusual. Too extreme.

This is very strong evidence that the win probabilities I used are wrong. The A’s must have had a greater chance of losing. Why?

I really don’t know. I’d like to look into it more. Some possibilities:

  • Youth: Younger teams can’t handle the pressure as well.

  • Lack of premier pitchers: Maybe the A’s haven’t had enough high velocity, premium pitchers.

  • Sports psychology: Some sort of small-market-team inferiority complex that afflicts the A's band of misfit toys.

League Playoff Performance

I wasn't able to look at clinching games across the league, but I did look at playoff performance across MLB since 2000. The A's are not the worst; they have done slightly better than the Braves and Twins. The vertical lines indicate statistical significance. On the other end are the Giants and Royals. There is a less than 1% chance that the Giants or Royals would win as many playoff games as they did if we were to replay their postseasons (assuming the probabilities are correct).


This plot is unusual on its own. Remember how I said the A’s playoff record is unusual at the 1 in 40 level? We have more teams underachieving and overachieving than we would expect. The teams should be bunched up in the middle of the plot.

This leaves me thinking that there is a different dynamic at play in the playoffs. FiveThirtyEight’s win probabilities are missing something.

Wouldn’t we all like to know what it is?

A New Ballpark Can't Hide Mismanagement: Measuring Fan Engagement in New MLB Stadiums

In Part 1, I predicted Oakland A’s attendance in a hypothetical new ballpark.

In Part 1b, I look at MLB teams with actual new ballparks and predict attendance if they had hypothetically stayed in their old ballpark.

Teams build new stadiums to mobilize fans and generate excitement. Some teams use their new stadium to engage fans far more effectively than others.

In Part 1, I attempted to measure a “new stadium effect” independently from fan engagement. I essentially hard-coded fan engagement into the model. For example, from 2007-2011, the A’s were actively seeking to move away from Oakland. Fan engagement was very poor, and my predictions for a new stadium in 2008 took that into account.

This is because I had no way to predict how fan engagement would improve with a new stadium.

But there is a way to retrospectively measure this: why not look at teams with new stadiums and compare attendance to what it would be with fan engagement at a fixed, baseline level?

There is a huge advantage to the second approach: we can now measure how fan engagement changed with the new ballpark by comparing actual attendance to predicted attendance. And it quickly emerges that there are three scenarios when a team opens a new ballpark:

  1. High fan engagement: They capitalize on the new ballpark by winning, producing a long run of seasons with high attendance.

  2. Medium fan engagement: A middle-of-the-road scenario where there is clearly a benefit from the new stadium, but not a huge one.

  3. Low fan engagement: They whiff on the opportunity by fielding a terrible team, alienating fans within one season.

Thinking of recent new stadiums, the Giants, Phillies, and Twins generated high fan engagement with AT&T Park, Citizens Bank Park, and Target Field. The Padres and Pirates generated medium fan engagement with Petco Park and PNC Park. And the Mariners, Mets, and Marlins have generated little or even negative fan engagement after opening Safeco Field, Citi Field, and Marlins Park.


As described in Part 1, my model for predicting MLB attendance has the following parts:

  • On-field performance data: runs scored, games back from first place, Elo rating, win probabilities, etc.

  • Weather data: average wind speed, average temperature, low temperature, high temperature

  • Location data: stadium age, capacity, metropolitan area population

  • Team+year/fan engagement effect: Having a team and year variable allows for a “team+year” effect that is akin to fan engagement. This may be influenced by long-term performance of the team, the economy, and popularity of baseball in general. The fan engagement effect measures how attendance responds to on-field performance and weather—elasticity of demand. Are fans more willing to come out for a mediocre team on a cold night? Also can be thought of as fan excitement/stoke.

For this analysis, I used the same model as Part 1 with one change: I trained the model on 1990-2018 data instead of 1995-2018 data. This allowed the model to have more training data for the old stadiums.

I kept fan engagement fixed by setting the “year” variable to the year before the new stadium opened.

For the Giants, AT&T Park opened in 2000. So all game and weather data for the Giants from 2000 to 2018 were used but labeled as "1999," the baseline year. Additionally, I set the stadium to "Candlestick Park" and set stadium age and capacity accordingly. Then I used this dataset for prediction.
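A sketch of that relabeling step is below. The column names and values are illustrative stand-ins for the real feature matrix, not the model's actual features:

```python
import pandas as pd

# A few Giants game rows as they might appear in the feature matrix
# (hypothetical columns, not the model's actual feature names).
games = pd.DataFrame({
    "team": ["SFG", "SFG", "SFG"],
    "year": [2000, 2009, 2018],
    "stadium": ["AT&T Park"] * 3,
    "stadium_age": [0, 9, 18],
    "capacity": [41_000] * 3,   # approximate
})

# Freeze fan engagement at the last pre-AT&T season and put the
# Giants back in Candlestick Park before predicting.
counterfactual = games.copy()
counterfactual["year"] = 1999
counterfactual["stadium"] = "Candlestick Park"
counterfactual["stadium_age"] = 1999 - 1960  # Candlestick opened in 1960
counterfactual["capacity"] = 58_000          # approximate baseball capacity

# model.predict(counterfactual) would then estimate attendance had the
# Giants stayed put, with engagement fixed at its 1999 level.
print(counterfactual)
```

Comparing these predictions to actual attendance is what isolates the change in fan engagement from the change in ballpark.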

Low Fan Engagement

We'll go from lowest fan engagement to highest. The Mets come in dead last. Part of this is luck; the Mets had their highest attendance the year before Citi Field opened. People came out in droves in 2008, but the Mets collapsed down the stretch and were bad in 2009. Fans and players alike complained about the "cavernous" dimensions of the park. It was a bad start. The model thinks fan engagement is worse than in 2008, and I don't think many people would argue.


Next we have the Mariners. They have been absent from the postseason since 2001, the longest playoff drought in baseball. Safeco Field is beautiful and conveniently located in downtown Seattle, yet fans aren’t willing to show up or invest their time in the team because of the many years of disappointment. Fan engagement is now worse than it was in 1999, when there was at least the promise of a new ballpark.

This means if the Mariners had played their 2018 season in the Kingdome and fans were as engaged/excited as 1999, they would have drawn more fans.


The Marlins are the last low engagement team. Although their stadium is only a few years old, attendance in 2018 is abysmal. The team was sold in 2017 and the new ownership subsequently traded many of their best players. The team is tanking, and fans aren't showing up. The new stadium can't stop the plummet.


Medium Fan Engagement

I consider the Padres and Pirates to have generated medium fan engagement with Petco Park and PNC Park.

The Pirates drew huge crowds at PNC Park in its first year, but the on-field product was terrible in the second year. In general, attendance at PNC Park is slightly higher than it would be at Three Rivers Stadium, using a fixed 2000 baseline for fan engagement/excitement. However, PNC Park is frequently considered to be one of the best—if not the best—ballpark in MLB. The Pirates have failed to capitalize on their gem.


The Padres are probably the middle-of-the-road case for a new ballpark. The Padres have had a few very bad years, but are usually an average baseball club. They have also been slightly above average at times. Likewise, Petco Park has helped them increase attendance and fan engagement, but not as much as the teams in the next section.


High Fan Engagement

Lastly, we have the best case scenario: new stadium plus winning and championships. That’s the Giants and the Phillies.

The Phillies were a powerhouse in the late 2000s and won the World Series in 2008. Citizens Bank Park was packed in those years. While they had a rough streak, they are once again interesting and attendance is on the upswing. Interestingly, their previous ballpark, Veterans Stadium, showed that the Phillies could draw crowds in the mid-1990s. The model thinks the Vet could have drawn similar crowds in the late 2000s as well, but not as big as the crowds that showed up in Citizens Bank Park.


A’s fans, myself included, now consider the Giants to be an arch-rival. They built a classic, beautiful stadium by the water in San Francisco in 2000 and instantly drew sold-out crowds. They had the Barry Bonds show for the early 2000s and a National League Pennant in 2002. Then they won the World Series in 2010, 2012, and 2014. Now they are an attendance juggernaut even though their on-field product was very, very bad in 2018.


I hate to say it, but the transformation achieved by the Giants in AT&T Park blows all the other stadiums out of the water. They have the best fan engagement of all the teams to build a stadium in the past 20 years.

If I were the A’s, I would study all of these examples closely. There’s a mix of ballpark qualities, winning, location, and timing that leads to these different outcomes. Since the A’s are privately financing their stadium, they may particularly want to study Citi Field, Safeco Field, and Marlins Park. From an attendance and fan engagement perspective, those stadiums did not warrant the investment.

The Oakland A's Need More Than a New Ballpark To Galvanize Fans

Part 1 in a two-part series on MLB attendance. In Part 2, I will look more deeply into how weather affected attendance in 2018.

This is a summary of my analysis built on an XGBoost model of MLB attendance from 1995-2018, constructed with the goal of predicting attendance for a new Oakland A's stadium. I simulated a variety of stadiums at two sites: Howard Terminal by Jack London Square and the current Oakland Coliseum site. Here are the main takeaways:

  1. New stadiums provide a steep one-year bump in attendance, but increased attendance is hard to sustain.

  2. Downtown stadiums draw more people to games and may help sustain attendance, but my model did not show a large effect. An amazing stadium at the Coliseum site with a winning team could also be a draw, especially with a ballpark village that can serve as a “downtown.”

  3. Local weather differences between sites have a negligible impact on annual attendance, with the caveat that I could not find a weather station that showed the rumored Candlestick-like weather at Howard Terminal. The difference between West Alameda and Oakland Airport weather had a minimal impact on predicted attendance.

  4. Based on 33 years of attendance and game data, the weight of evidence suggests most new stadiums do not substantially alter attendance long-term. The only obvious recipe for long-term attendance is a new stadium coinciding with on-field success.

  5. The A’s have two things working against them: (a) they don’t have a track record of drawing large crowds, and (b) there are few precedents for the long-term transformation that the A’s are seeking. Dreaming BIG definitely seems like a prudent strategy if the A’s want to break out of the “small market” bubble.


Even though they play in the cavernous Oakland Coliseum with its sewage leaks and possums, being an Oakland A’s fan is one of the most fulfilling experiences in sports. The franchise consistently outperforms its low budget (lowest in the majors in 2018) and comes out of nowhere to make the playoffs. After three consecutive last-place finishes from 2015 to 2017, they are officially back in the playoffs for at least one game.

Led by Moneyball genius Billy Beane and playing in the concrete bowl that is the consensus worst or second-worst stadium in MLB, the A’s represent the East Bay nexus of intellectuals and blue-collar workers. The diverse fan base developed a soccer-style culture of drums and vuvuzelas before Major League Soccer was a thing. When the Coliseum is packed, it’s the craziest, zaniest, and loudest venue in MLB.

Most of the time it isn’t packed. It’s been that way for decades.

Somewhere between the early 2000s Moneyball era and 2011, A’s fans abandoned all other chants in favor of a single, unabashedly local one: “Let’s Go Oakland! clap-clap clap-clap-clap.”


There is a simple explanation: the A’s ownership under Lew Wolff attempted numerous times to move the A’s out of Oakland. First it was Fremont, then San Jose. Both fell through. They looked in Oakland as well, and finally in 2016, frustrated with the never-ending search for a new ballpark, Lew Wolff stepped aside and Dave Kaval stepped in as president, coinciding with a recommitment of the franchise to Oakland. 

MLB’s revenue sharing, which redistributes money from the richest teams to the poorest, has played a part in keeping the A’s competitive. However, MLB is pulling the plug on revenue sharing for the A’s as the new stadium search has dragged on too long. MLB considers the current Oakland Coliseum to be subpar for the league and the A’s consider a new stadium to be the only hope in increasing revenue.

So a new stadium is a must. Currently there are two sites on the table:

  1. Howard Terminal, a waterfront site close to Jack London Square, and

  2. The Oakland Coliseum site, which will soon be vacated by the Raiders and Warriors, leaving only the A’s on the East Oakland parcel.

Neither is a true “downtown” site, as Howard Terminal is about a 20-minute walk from the 12th Street BART station and a 10-minute walk from Jack London. The Coliseum is located in an industrial warehouse district where I have heard gunshots leaving a game. Both sites will likely be accompanied by ancillary development (i.e., a ballpark village with housing, restaurants, and offices), especially if the A’s choose to build on the Coliseum site.

New Ballpark Scenarios

It is no secret that the A’s are hoping a new ballpark will be transformative for the franchise. When the San Francisco Giants moved to AT&T Park, they sold out the stadium for 10 years and became the Bay Area’s baseball team. The A’s are hoping to strike back. But a more reasonable expectation is a 2-3 year bump in attendance and increased revenue from pricier seats and suites.

The figure below shows the increase in average attendance for the first six years for the following new ballparks: AT&T Park in San Francisco, PNC Park in downtown Pittsburgh, Petco Park in downtown San Diego, Citizens Bank Park in Philadelphia, Citi Field in Queens, Nationals Park in DC, Target Field in downtown Minneapolis, Marlins Park in Miami, and SunTrust Park in Georgia (only open for two years). Attendance is being compared to the average attendance in the old ballpark (back to 1995).


So the best case appears to be AT&T Park, with an amazing increase of more than 15,000 fans per game, and the worst case appears to be Marlins Park. The Marlins were one of the worst teams in baseball one year after opening—as were the Pirates in PNC Park. A weird anomaly is Citi Field—in 2008, the year before Citi Field opened, the Mets were good before collapsing down the stretch. The beneficiaries of the collapse were the Phillies and Citizens Bank Park; the Phils won the World Series in 2008, the fourth year for the stadium.

Clearly team performance is wrapped up in the attendance numbers above. It’s reasonable to guess that stadium location and weather could also affect attendance.

So to predict attendance in a new A’s ballpark, let’s lay out some scenarios:

  • Team performance: (1) The A’s are good: similar to 2011–2014, (2) The A’s are bad: similar to 2007–2010.

  • Stadium location: (1) Howard Terminal (downtown-ish): AT&T, Target, or PNC are comps, (2) Oakland Coliseum site: SunTrust or Citizens Bank are comps.

  • Weather at stadium location: (1) Howard Terminal is rumored to have weather as cold and as windy as the Giants’ old Candlestick Park, (2) we can assume weather at the Oakland Coliseum will remain the same. For Howard Terminal, I used West Alameda weather data and for the Coliseum, I used Oakland International Airport data.

Once we have a model for predicting attendance, we can plug in these scenarios and predict the A’s attendance in a new ballpark!

Model: XGBoost

I used boosted trees with XGBoost to model MLB attendance for individual games for all teams 1995-2018. The dataset included daily weather data, FiveThirtyEight Elo ratings, and game information from Baseball-Reference. A decision tree is an intuitive method that allows for “decision rules,” e.g., if the A’s are playing in the Coliseum in June and the temperature is above 70, then the attendance will be about 30,000. Decision trees on their own don't work very well because reality is usually not so structured. Boosting is a method of sequentially building a series of small trees to focus on unexplained variance in the data, and the XGBoost package is an efficient implementation of Gradient Boosted Trees that is frequently a part of winning Kaggle entries.
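To make “sequentially building a series of small trees” concrete, here is a bare-bones boosting loop over depth-1 “stumps” on a single feature. The data and learning rate are made up for illustration; XGBoost adds regularization, second-order gradients, column subsampling, and much more.

```python
# A minimal gradient-boosting sketch: each new stump is fit to the
# residuals of the ensemble so far. Toy numbers, not the real dataset.
def fit_stump(x, residuals):
    """Return the two-leaf rule minimizing squared error over all splits."""
    best = None
    for split in sorted(set(x))[1:]:
        left = [r for xi, r in zip(x, residuals) if xi < split]
        right = [r for xi, r in zip(x, residuals) if xi >= split]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda xi: lmean if xi < split else rmean

x = [55, 60, 65, 70, 75, 80]                          # game-day temperature (made up)
y = [12_000, 14_000, 15_000, 28_000, 30_000, 31_000]  # attendance (made up)

pred, trees, lr = [0.0] * len(x), [], 0.5
for _ in range(50):
    residuals = [yi - pi for yi, pi in zip(y, pred)]
    stump = fit_stump(x, residuals)
    trees.append(stump)
    pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
# after 50 rounds the ensemble reproduces the toy training data closely
```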

The model achieved a score of 0.854. For additional discussion of methods, skip to the end of this document.


First off, let’s validate the model. I excluded the second half of 2018 from the model training set to use for qualitative testing. Here’s what the predictions and actual data look like for the A’s second half.


Aside from the A’s-Giants series from 7/20 to 7/22—a rivalry which has heated up lately—the model seems to predict when attendance spikes and dips while having more of a central tendency than the actual data. A big missing variable is promotions: fireworks nights, bobblehead days, etc. I thought that could be a deal breaker, but overall I think the model performs quite well, as the 0.854 score reflects.

Predicting Attendance in a new Oakland Ballpark

Predicting A’s attendance in a new ballpark is, on the surface, an impossible task. This is because the effect of a given ballpark is tangled up with the effect of the team, which we can’t separate from the ballpark. Is the reason the Giants draw so many fans a product of fan loyalty or the beautiful ballpark location?

A tree-based model will sometimes split on the stadium and make its adjustment there. Other times, it will split on the team and make the adjustment there instead. So any stadium “effect” is going to be diluted, and therefore we should understand that these are conservative estimates of the differences between stadiums.

To predict attendance in a new ballpark, we kept all the game data the same, but changed the ballpark, the ballpark age, a flag for downtown location, and ballpark capacity (set at 35,000). So this is a retrospective prediction.
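The feature swap can be sketched as a plain transformation on each game record. The field names and numbers below are illustrative, not the actual columns in the dataset.

```python
# Counterfactual setup: keep each real game's features but overwrite the
# ballpark-related fields before scoring with the fitted model.
def to_new_ballpark(game, ballpark, downtown):
    g = dict(game)  # copy; leave the real game record untouched
    g["ballpark"] = ballpark
    g["ballpark_age"] = 0      # brand-new stadium
    g["downtown"] = downtown
    g["capacity"] = 35_000     # the capacity assumed in this analysis
    return g

game = {"ballpark": "Oakland Coliseum", "ballpark_age": 42,   # illustrative values
        "downtown": False, "capacity": 47_000, "day_of_week": "Sat"}
counterfactual = to_new_ballpark(game, "Howard Terminal", downtown=True)
# counterfactual keeps day_of_week etc. but swaps the stadium fields
```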

Making predictions retrospectively means the A’s chilly relationship with their fans circa 2008-2011 is coded into the model (“team+year” or “fan engagement effect” discussed in Part 1b). If the A’s were to build a new ballpark, fan loyalty would likely improve. So the estimate of the effect of a new ballpark is again conservative.

Bad A’s Team

Let’s first look at the attendance predictions for a hypothetical new ballpark constructed in 2008 when the A’s were bad.


We see a bump of 6,000 to 7,000 fans per game in the first year for the new ballpark, and that substantial first-year bump is the only effect that looks guaranteed. With a very bad A’s team in 2009, attendance quickly drops. Some of the simulated stadiums withstand the drop slightly better than others. But the model, in a sense, “knows” that when the A’s are bad, their attendance is terrible. There is no data to suggest otherwise.

Let’s take a closer look at the effect of location. This analysis uses the same years (2007 to 2010) but just one stadium: Target Field. The two locations are simulated using different weather data: West Alameda (supposedly by the shore) for Howard Terminal and Oakland International Airport for the Coliseum. I marked one of the Howard Terminal simulations as “downtown.”


This is actually quite interesting. We see that the downtown effect is negligible in the initial year but it emerges as the stadium hits 3 to 4 years old. So downtown stadiums appear to sustain attendance better than suburban stadiums. How much? This graphic says up to 700 fans per game but again, we have to interpret these effects as conservative.

The second finding is the lack of a weather effect. The weather station used for Howard Terminal was slightly colder but with less wind than the weather station for the Coliseum. I don’t know if that is representative of the specific Howard Terminal site, and perhaps the wind and temperature offset each other. In Part 2 of this series, I’ll be looking at weather in detail.

Good A’s Team

Next we look at predicted attendance for a hypothetical new A’s ballpark constructed in 2012, during a good A’s run from 2011 to 2014.


The good team scenario clearly has a longer new-stadium bump than the bad team scenario. Because the team was good in 2013, the first-year attendance momentum carries forward—there is an autoregressive effect (i.e., looking at the past year’s attendance) in the model. Winning in the first year of the stadium leads to a large bump that becomes spread across time. In other words, first-year winning plus a new stadium yields sustained attendance over the medium term.

But notice that the slope of the predicted attendance lines for most stadiums is slightly less than the slope of the actual attendance from 2013 to 2014. This is a period when the A’s were getting even better. It’s as if the A’s have a cap on attendance in the model, and no matter how good they got, the model refuses to predict more than 30,000 per game in attendance. I would guess this is because the A’s last drew more than 30,000 fans per game in 1992—which is before the training data for the model.

The exception, of course, is if they played in AT&T Park, or an AT&T-equivalent stadium. Then they would finally cross the 30,000 average attendance line. AT&T Park was transformative for the Giants.

As discussed, I think these simulations are conservative. I would love for the A’s to break my model. But they first need to build the damn ballpark.


  • This study captures the effect of the average new ballpark. The estimated effect is conservative, particularly for the difference between various new ballparks.

  • The average new ballpark has a sharp first year bump. Sustaining that bump likely depends on whether the team is winning and the location and quality of the stadium. Most teams are not able to sustain the bump long-term.

  • A downtown stadium is likely to draw at least slightly larger crowds than a suburban stadium, particularly when the team is not good. I did not see a notable effect of local weather between sites.

  • There are few precedents for a new stadium transforming a team’s attendance patterns long-term. The A’s will have to think outside of my “black box” model to achieve this goal.



Methods: Details

Note: Boosting is a machine learning method and cannot be interpreted as easily as a typical linear model. It isn’t possible, for example, to estimate some “universal” effect of a stadium being downtown. Statistical models are often built for generalization. This model is not. You can think of it as a complicated model for reality that we are using to extrapolate—with no measure of uncertainty—to hypothetical scenarios that are not in reality. That is dangerous! But there is frankly no way a linear model would work well for this problem.

I relied heavily on Troy Hepper’s scraping and cleaning code to download 1990-2018 game data from Baseball-Reference. Troy added moving averages for runs scored and allowed and an indicator for whether a game is within a division. Troy used stadium capacity data manually compiled by Chris Leonard. I also added some new predictors:

  • Temperature by day: average daily temperature, low temperature, high temperature, heating degree days, and cooling degree days from the NOAA NCDC weather station closest to the stadium.

  • Average wind speed by day

  • Metro-area population by year, extrapolated to the 1990s

  • FiveThirtyEight Elo ratings (i.e., how good the team is), pitcher adjustments, and win probabilities by game

  • Lag-1 (previous year) winning percentage, playoff status, average attendance, and Wins Above Replacement for the best player on the team (a proxy for star power).

  • Stadium construction and renovation dates to calculate stadium age

I downloaded weather data using the rnoaa package. Despite being familiar with these data from EMI Consulting, I spent many hours identifying missing and incomplete data and then replacing it. For example, I had to duplicate 2007 data for Alameda since 2006 and 2011 data were missing. I used more than 14GB of hourly data to produce the daily weather dataset. The final dataset had 59,661 games and 165 predictors per game.

I also fit the data with LightGBM and assessed whether an ensemble of XGBoost and LightGBM improved predictions. It didn’t. I “one hot encoded” my categorical variables after seeing higher predictions for stadiums at the beginning of the alphabet… don't let anyone tell you it’s not necessary. Parameters were tuned with cross-validation. I achieved a model score of 0.854, which is akin to an R-squared value.

Lastly, I used game data from 2007-2015 to predict A’s attendance in a new ballpark. A huge part of the model is how good the team is, who they’re playing, the day of the week, etc. so it was necessary to use real game data. The time period was split in two so that 2007-2011 represented a new stadium opening when the team was bad and 2011-2015 represented a new stadium opening when the team was good. I included an autoregressive component so predictions were made one year at a time.
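The one-year-at-a-time loop can be sketched as below. The `predict_season` function is a made-up stand-in for the fitted XGBoost model; the real point is that each season’s prediction feeds forward as the next season’s lag-attendance feature.

```python
# Autoregressive prediction loop, one season at a time.
def predict_season(features):
    bump = 7_000 / (1 + features["ballpark_age"])   # decaying new-park bump (hypothetical)
    return 0.8 * features["lag_attendance"] + bump  # hypothetical dynamics

lag = 20_000.0  # final season in the old park (made up)
preds = []
for age in range(4):  # four seasons in the new park
    pred = predict_season({"ballpark_age": age, "lag_attendance": lag})
    preds.append(pred)
    lag = pred  # feed the prediction forward as next year's lag feature
```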

Let me know if you have any questions about this analysis. I will share some of the code on GitHub in the coming weeks.

Americans Checking Trump's Twitter En Masse Signals A Change In His Approval Rating

As President Trump received criticism for how he responded to the death of Senator John McCain, I noticed that his approval rating took a small but noticeable dip. Coincidence or correlation? This presidency has upended the traditional understanding of how Americans respond to political events. Is it possible to identify any signal or patterns in Trump’s approval ratings, or has fake news penetrated so deeply that the American public is responding to untracked rumors?

In this post, I link Google search trends for “Trump twitter” to his approval ratings. Scroll to the bottom if you want to skip the statistical details :)


Approval ratings: FiveThirtyEight compiles approval polls and calculates a daily weighted average to estimate Trump’s true approval rating. The data are available on FiveThirtyEight’s GitHub. I used the “All polls” model output. It is important to note that building a model on another model’s output is akin to making copies of a copy—it adds noise (and possibly bias). This is a key limitation of these data.

Google search trends: I used the gtrendsR package to extract Google trends data for the search terms “Trump” and “Trump twitter.” Unfortunately, daily trends are only available if querying the past three months from present; any other queries provide weekly data. This is the major limitation of the search trends data. While it is possible to interpolate between the weekly data points or aggregate the daily approval ratings, more granular data generally allows for better models. The “interest index” provided by Google is relative to all other searches and the maximum value in the set.
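If one did interpolate, a simple linear stretch from weekly to daily values might look like this. It is a smoothing assumption, not real daily data, and the index values here are made up.

```python
# Linear interpolation from weekly Google Trends points to a daily series.
def weekly_to_daily(weekly):
    daily = []
    for a, b in zip(weekly, weekly[1:]):
        daily.extend(a + (b - a) * d / 7 for d in range(7))
    daily.append(weekly[-1])  # keep the final weekly value
    return daily

daily = weekly_to_daily([14, 21, 70])
# 15 points: 7 steps from 14 toward 21, 7 from 21 toward 70, plus the last value
```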

Exploratory Analysis

Let’s quickly take a look at these datasets. First is the FiveThirtyEight estimated approval rating for President Trump. Looks like it ranges from about 37% to 47%. I’ve included the confidence bands in this plot, but ignored them in the analysis.

Next are the Google search trends. I included some state-level data, which show remarkably similar trends despite very different political leanings. Thus I only used the Overall US “Trump Twitter” search data for the rest of the analysis. I chose the search term “Trump Twitter” over “Trump” since this is how many people (including myself) access the President’s Twitter account (rather than navigating through Twitter). A quick comparison of “Trump” vs. “Trump twitter” (not shown) suggested to me that “Trump twitter” may be more related to his approval rating.

When do people go to the President’s Twitter? A reasonable guess: when they hear he tweeted something controversial.

Lastly, here is a combined look at Approval rating and “Trump Twitter” search trends. Both have been normalized so we can look at them on a similar scale. It looks like there’s some relationship between the signals! We can see that spikes in search activity often correspond to dips or spikes in approval.


Based on the last plot, it seems reasonable to build a regression model where a unit change in search activity leads to some change in approval. Or better yet, a lagged regression model where, for example, a unit change in search activity the week before leads to some change in approval.

I fit a number of ARIMA (autoregressive integrated moving average) time series lagged regression models to the data and they were lackluster. In general, these models identified some effect of search activity, but with very small coefficients.
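For concreteness, one common parameterization of such a lagged regression with ARIMA(0,1,1) errors is (the exact form depends on how the software enters the regression term; \(x_{t-1}\), \(\beta\), and \(\theta\) are my notation, not fitted values):

\[
y_t = \beta\, x_{t-1} + n_t, \qquad (1 - B)\, n_t = (1 + \theta B)\, \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma^2),
\]

where \(y_t\) is the approval rating, \(x_{t-1}\) is the prior period’s search interest, \(B\) is the backshift operator, and \(\theta\) is the moving-average coefficient.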

Below is one such model, an ARIMA(0,1,1). I trained it on a subset of the data and forecasted the most recent period. The grey line represents the true approval rating and the blue line is the prediction based on search activity.


It doesn’t look great. While there may be some use to this model, I think a regression model is likely inappropriate for this problem because:

  • There can be a positive or negative effect on approval from a bump in search activity.

  • The magnitude of the effect is highly context-dependent. For example, a ridiculous but harmless tweet may generate a lot of traffic but not affect Pres. Trump’s approval rating.

A Random Walk with Pres. Trump

Okay, maybe we can’t predict Pres. Trump’s approval rating. Maybe his approval rating is just a random walk process, where at each point in time, his approval goes up or down by a draw from some distribution (e.g., a normal distribution).

If this were the case and we looked at changes in approval, we would see a symmetric squiggle. For a normal random walk, we would see the squiggle contained in a band of roughly constant width.


We definitely do not see a normal random walk. We can see “bursts” where there are large changes in approval. If we make a histogram of the changes in approval, we can see there is a “long left tail” and a non-symmetric distribution.

To me, this suggests Trump’s approval is not a random walk.

The Changepoints Paradigm

Here is another way to think about approval ratings:

  • Let’s assume that Trump’s approval is constant plus some noise in some intervals.

  • There are external shocks/bursts to his approval that lead to some new equilibrium.

This is a changepoints approach to modeling our problem. Changepoint detection has a long history in statistics, but has recently heated up. The most basic changepoint model assumes independent normal observations with a change in mean (see Hinkley 1970). I used a non-parametric changepoint detection method based on empirical distributions (Haynes et al. 2017). So let’s change our model:

  • Assume that Trump’s approval follows some (fixed) distribution in intervals.

  • And let’s add another assumption: only large shocks above some threshold lead to a new equilibrium.
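The most basic changepoint model mentioned above (a single change in mean) can be sketched by scanning every split point for the one that minimizes squared error around the two segment means. This is the Hinkley-style textbook version, not the non-parametric method used in the analysis, and the series below is idealized.

```python
# Single mean-shift changepoint: pick the split minimizing total SSE.
def single_changepoint(series):
    def sse(seg):
        m = sum(seg) / len(seg)
        return sum((v - m) ** 2 for v in seg)
    best_tau, best_cost = None, float("inf")
    for tau in range(1, len(series)):
        cost = sse(series[:tau]) + sse(series[tau:])
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    return best_tau

approval = [42.0] * 15 + [39.5] * 15  # idealized: one shock, no noise
print(single_changepoint(approval))   # 15
```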

Looking at the search trends data, it looks like a search interest index above 40 may qualify as a large shock. I wrote an algorithm that attempts to identify only the beginning of a “shock,” since there may be sustained activity afterwards.
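A minimal version of that onset rule flags only the first period a burst crosses the threshold. The index values below are made up.

```python
# Flag the start of each "shock": a crossing from below to at/above threshold.
def shock_onsets(interest, threshold=40):
    onsets = []
    for i, v in enumerate(interest):
        prev = interest[i - 1] if i > 0 else 0
        if v >= threshold and prev < threshold:
            onsets.append(i)
    return onsets

weekly = [10, 15, 55, 60, 30, 12, 45, 20]
print(shock_onsets(weekly))  # [2, 6]
```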

Putting this model together, we have these shocks identified by a vertical line and my best guess for the reason for the search traffic. The horizontal lines represent Trump’s mean approval between the identified changepoints. Click on the graphic to enlarge!

Looks pretty good! The “shocks” in search activity are generally aligned with changepoints in approval rating. A back-of-the-envelope estimate of RMSE using the search “shocks” to identify changepoints in approval is 13.9 days. This suggests there is room for improvement, perhaps by attempting to measure sentiment on social media. But not everyone tweets—and almost everyone googles!


  • This analysis suggests that when the Google search interest index for “Trump Twitter” exceeds 40 (see note below), we should expect to see a change in Trump’s approval rating either up or down.

  • Political scientists have long theorized that the “bully pulpit” (i.e., the fact that media will report the president’s speeches, press releases, etc.) serves as an “agenda setting” tool for presidents. It is unprecedented for a president to use the bully pulpit to damage their own popularity and agenda. This analysis suggests that is the case for Pres. Trump.

Note on Google Trends Search: Due to how trends are adjusted, the time period that you select (and any comparisons to other search terms) will determine what this threshold value is. It is approximately the April 8th 2018 value.

Code for this analysis can be found on my GitHub.

How Injuries Affect Fitness via VO2Max

In April 2017, I tore the meniscus in my right knee and in July 2017 I finally had surgery to repair it. It took me about a week to walk again, but due to atrophy in my right quad, it was months before I could run again. Now more than a year later, I'm curious how this injury affected my fitness. How does it compare to my peak months climbing North America's highest mountains?

VO2Max: A Better Measure of Fitness

Back in the 1990s, baseball fans and teams alike used two primary metrics to measure player performance: batting average (BA) and earned run average (ERA). By the late 1990s and early 2000s, "sabermetricians" (i.e., baseball nerds) developed better metrics to measure player performance and value: first came on-base percentage (OBP) and walks/hits per inning pitched (WHIP), then a bunch of intermediate "sabermetrics," and today we have Wins Above Replacement (WAR) and weighted on-base average (wOBA). The beginning of this transition was the subject of the book and film "Moneyball." WAR is such a good measure of player value that even old-school broadcasters on ESPN are using it.

When we think about personal aerobic fitness, we are often stuck using a basic metric akin to batting average: pace (min/mile). Pace has a lot of problems: we go slower uphill and faster downhill. Our pace varies with altitude. Sometimes we deliberately train at a faster pace or a slower pace. In fact, low-intensity endurance training is frequently recommended to be the base of an aerobic athlete's training regimen. While Strava now calculates Grade-Adjusted Pace, this does not give us a measure of fitness when we aren't trying to maximize our pace. Wouldn't it be great if we had a barometer of fitness for any activity? Enter \(VO_{2}Max\).

 \(VO_{2}Max\) is defined as the maximal rate of oxygen uptake and consumption during exercise. It is measured in mL/(kg·min). In some sense, it's a measure of the horsepower of the engine that is our aerobic system. The more oxygen we are able to use per min, the faster we are able to go. While elite athletes may hit a ceiling trying to over-optimize \(VO_{2}Max\), the rest of us will likely experience fitness gains when increasing our \(VO_{2}Max\).

Fortunately, most fitness and smart watches that track heart rate (including the Apple Watch) now provide an estimate of \(VO_{2}Max\). How good are these estimates? It depends on the equipment and manufacturer, but using a heart rate strap, the estimates may be as close as 2%. The methodology, as far as I can tell, has been developed by a single company: Firstbeat. My best guess is that the estimate depends on the relationship between heart rate and pace. I would love to reverse-engineer the calculation, so stay tuned for that!

The Data

I have a Suunto Ambit2 fitness watch with a heart rate strap. I try to use the heart rate strap on runs and occasionally on hikes, climbs, and bike rides. Here's a plot of my activities using the Ambit2. Larger circles are more mileage.


For the purposes of this analysis, I chose to only look at the running data, since the estimates of \(VO_{2}Max\) seem to be quite different for other activities. For example, while "trekking," I frequently carry a heavy pack.

Suunto has an online platform called "MovesCount," which provides an estimate of \(VO_{2}Max\). Unfortunately, there is no export function for these estimates! To download these data, I first found a list of my MoveIDs. I then used Selenium in Python to access each "move" (necessary because MovesCount is Javascript-heavy) and then fed the resulting XML into BeautifulSoup and scraped the data. This was a massive pain, but I guess Suunto is trying to protect customer data.
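The parsing step (after Selenium renders each “move” page) might look like the sketch below, using the standard library’s ElementTree in place of BeautifulSoup. The XML layout and tag names here are hypothetical, not MovesCount’s actual schema.

```python
# Parse one rendered "move" page and pull out the VO2Max estimate.
import xml.etree.ElementTree as ET

page_xml = """
<move id="123456">
  <activity>Running</activity>
  <vo2max>48</vo2max>
</move>
"""

root = ET.fromstring(page_xml)
record = {"move_id": root.get("id"),
          "activity": root.findtext("activity"),
          "vo2max": float(root.findtext("vo2max"))}
print(record["vo2max"])  # 48.0
```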

VO2Max Over Time

Here is the plot of my \(VO_{2}Max\) from 2015 to 2018. I've annotated it with some climbs and the red line is the date of my surgery. Somewhat surprisingly, there is not an immediate dip—a short October run had an estimated \(VO_{2}Max\) of 48—but there is a dip after a winter of neither bike commuting nor running.


My fitness is highest right before major climbs and often crashes immediately after, perhaps because of fatigue and recovery. It seems to generally increase after those climbs. And the good news is it seems to be increasing again!

So in conclusion: it appears that the surgery itself did not impact my fitness, but a winter of minimal aerobic exercise did. Ohio winters…

Possible Issues with the Estimated VO2Max

Anecdotally, I have noticed that the estimated \(VO_{2}Max\) is higher for more intense runs. Are there other "external" factors that affect these estimates? Short answer: yes.


The above scatterplot matrix shows correlations between variables. We can see estimated \(VO_{2}Max\) is correlated with average heart rate (r = 0.47), temperature (r = -0.34), and % of the run in the "hard" heart rate zone (r = 0.42). So just working harder means a higher estimate of \(VO_{2}Max\). And higher temperature means a lower estimate of \(VO_{2}Max\). We can also see a strong relationship between % hard and average heart rate.
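For reference, the r values quoted are Pearson correlations, which can be computed by hand. The data below are toy vectors, not my actual run log.

```python
# Pearson correlation coefficient from scratch.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3], [2, 4, 6]))  # 1.0 (perfect positive correlation)
```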

So the methodology appears to have some issues—ideally, we would see a consistent estimate of \(VO_{2}Max\) between a low-intensity and high-intensity workout for an equal level of fitness. And we're getting docked in hot weather.

On the flip side, the correlation between average heart rate and estimated \(VO_{2}Max\) is not that strong, which is encouraging. Additionally, the estimated \(VO_{2}Max\) values over time don't appear to vary too wildly.

How should we view these estimates? I think they do capture trends of my true \(VO_{2}Max\), but there are some problems with the method that should be addressed. There is room for improvement and hopefully we will see it as fitness and smart watches gain additional sensors. Perhaps I'll give it a shot!

A Fully Bayesian Approach to the Job Search

The job search is a long and random process. So long that my statistical skills may need some sharpening! To put them to use, I developed a fully Bayesian model to estimate how long it will take me to get a job!


I am estimating how many applications it will take before I get a job using a Bayesian model. With some simplification, there are two major steps to the application process: (1) getting a first-round interview, and (2) getting a job offer after the interview. There may actually be many interviews between the first-round interview and landing an offer, but I will likely not get enough data in those intervening steps to build a worthwhile model.

For this model, I am considering each application submitted to be a Bernoulli trial where a success is getting a first-round interview. I'm going to assume this is a fixed probability \(P(\text{First-round Interview})= p_1\). Then, if there is a success in this first trial, there is a second Bernoulli trial where a success is getting a job offer. Again I'm going to assume a fixed, but different parameter for this conditional probability of success, which I will call \(P(\text{Job Offer}|\text{First-round Interview}) = p_2\).

Note that I have data for both Bernoulli steps. For the first step, I have \(n_1=9\) trials and \(y_1=1\) success. For the second step, I have \(y_1=1\) trial (this is the number of first-round interviews I've had; we could also call this \(n_2\)) and \(y_2=0\) successes.

Honestly, I have no idea what my success probability is. So I chose a uniform prior for \(p_1\) on [0,1], which leads to a neat posterior. The posterior distribution of \(p_1|y_1, n_1\) is \(Beta(y_1+1,n_1-y_1+1)\) where \(n_1\) is the number of applications submitted. The same holds for \(p_2\) with a uniform prior; the posterior for \(p_2|y_1, y_2\) is \(Beta(y_2+1,y_1-y_2+1)\).

Posterior Distributions

Let's visualize these densities.


Nothing crazy here. The median of the posterior for \(p_1\) (chance of getting a first-round interview) is 0.16, with a 90% credible interval of 0.04 to 0.39. So the uniform prior is nudging that upwards from the \(\hat{p}\) we would get from \(\frac{y_1}{n_1}\).

The median of the posterior for \(p_2\) (chance of getting a job offer after first-round interview) is 0.29, which is not really saying much since our 90% credible interval is 0.03 to 0.78. We just don't have much data, which is doubly sad :( The prior in this case is giving me the benefit of the doubt, saying that I have a roughly one-quarter chance of getting a job offer even though I've received no job offers!

So how many jobs do I need to apply for?

Let's return back to the original question. We have \(P(\text{First-round Interview})= p_1\) and \(P(\text{Job Offer}|\text{First-round Interview}) = p_2\). Then \(P(\text{First-round Interview & Job Offer}) = p_1*p_2\). Let's call this joint probability \(q\). Now we can conceive of a set of Bernoulli trials with this joint probability. The expected number of trials required to achieve one success comes from the geometric distribution and is \(\frac{1}{q}\).

Okay… so how does this help me? Well, I can simply simulate draws from my known posteriors and then combine the draws to estimate the posterior distribution of \(\frac{1}{q}\), which is the number of applications required to land a job! Here is the posterior for \(q|y_1,n_1,y_2\):


The median of the posterior is 25.11 with a 90% credible interval of 5.22 to 383.9. Note that due to the memoryless property of the geometric distribution, we should interpret those as additional applications, i.e., my best estimate is 25 additional applications. But take a look at that upper bound. Yikes! So the job search may take me another week to ... another year. Better get back to the job search!
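The simulation above can be reproduced with nothing but the standard library, drawing from the two Beta posteriors derived earlier (\(Beta(2,9)\) for \(p_1\) and \(Beta(1,2)\) for \(p_2\)) and combining them into draws of \(1/q\):

```python
# Monte Carlo draws from the posterior of 1/q, the expected number of
# additional applications.
import random
from statistics import median

random.seed(1)
draws = []
for _ in range(200_000):
    p1 = random.betavariate(2, 9)  # posterior of P(first-round interview)
    p2 = random.betavariate(1, 2)  # posterior of P(offer | interview)
    draws.append(1 / (p1 * p2))

print(round(median(draws), 1))  # median of the draws; compare with the ~25 above
```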

The Data Science Job Search

I moved to Colorado about a month ago and have been on the job hunt for about three weeks. It's never easy when you're making a bit of a career transition, and I find myself in the middle ground where I have relevant experience, but not in tech or big data or a role that is defined as "data scientist." So I am probably best qualified for entry-level data science jobs, which are less common than positions demanding 3-5 years of experience.

To track my progress (and to feel like I have some control over the process), I developed a Tableau dashboard. It's always more fun to have something with color. Note that the colored numbers here are Glassdoor ratings. I also constructed a very rough, naive model using a geometric distribution to estimate how many applications I'll need to submit. A Bayesian approach would be better suited, so perhaps if this drags on I'll whip out RStan. Only 17 more applications to submit!


The Two-Sample Location Problem: Don't Do A Preliminary Test For Equal Variances

Ah... the two-sample t-test. It's old, it's classic, and it's still used thousands of times every day.

There's a lot we can learn from the ol' two-sample t-test. Yes, it can tell us with Type I error level α whether two means from two random samples from independent populations are the same, or not the same (different?). But how we use the t-test has some lessons for how we conduct data analysis in general.

In fact, this story hints at the impossibility of truly "objective" analysis and suggests that, despite the robo-hysteria, an algorithmic approach to data analysis is unlikely to replace statisticians anytime soon. Or so I hope.

The story goes something like this:

The problem

The two-sample location problem in the typical setup (i.e., ignoring the non-parametric tests) involves two means from independent populations.

Formally, the populations should be normally distributed so that the sample means are exactly normally distributed, but in practice, we often invoke asymptotic arguments that tell us, for large sample sizes, the sample means are approximately normally distributed. 

To characterize a normal distribution, we need to know the mean and the variance (or standard deviation). Most of the time, we don't know variance (and we obviously don't know the means), so we must estimate both. When we're dealing with a single population, this is easy and leads to the one-sample t-test. When we're dealing with the difference of sample means, it's a bit more complicated.

The fork in the road

The variance question leads to a fork in the road, with one path leading to Welch's test and the other to the pooled two-sample t-test. The simple heuristic taught in intro stats classes is: 

  • If you know the variances in the two approximately normal populations are equal, use the pooled two-sample t-test.
  • If you know the variances in the two populations are not equal, use Welch's test.
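Concretely, the two branches differ only in how the standard error of the mean difference is estimated. A quick pure-Python sketch with made-up samples:

```python
# Toy illustration of the two branches (made-up samples): the two tests
# differ only in the standard error of the mean difference.
import math
from statistics import mean, variance

a = [4.1, 5.0, 3.8, 4.6, 5.2, 4.4]
b = [5.9, 6.4, 5.1, 7.2, 4.8, 6.6, 5.7, 6.9]
na, nb = len(a), len(b)

# Pooled t-test: one common variance estimate, weighted by degrees of freedom.
sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
t_pooled = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

# Welch's test: each sample keeps its own variance estimate.
t_welch = (mean(a) - mean(b)) / math.sqrt(variance(a) / na + variance(b) / nb)

print(t_pooled, t_welch)
```

The pooled version averages the two sample variances into one estimate; Welch's version lets each sample keep its own (the two statistics also use different degrees of freedom, which I've omitted here).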

But if you don't know whether the variances are equal (the "Behrens–Fisher problem"), you're up a tree. What procedure should a good statistician use?

A well-intentioned but misleading solution

In the '80s or '90s, some researchers decided that a simple solution to this problem is to create a conditional procedure, so named because the choice of the second test is conditional on the result of the first:

  • First, conduct a test for equality of variances, using either an F-test or Levene’s test.
  • Second, we can use the simple heuristic: use the pooled two-sample t-test if we fail to reject the null in the first test, and use Welch's test if we reject the null in the first test.

Implicit in this reasoning is a conjecture that this procedure leads to greater power, since the pooled t-test has greater power than Welch's test when the variances of the populations are equal. Another potential advantage of this procedure is the appearance of the "unbiased statistician," since the procedure requires no prior knowledge or intuition in selecting tests.

This procedure has gained sufficient traction to become incorporated in statistical packages such as SPSS, noted on Wikipedia, and described on "how-to" blogs. As shown below, SPSS provides the results of Levene's test alongside outputs from the pooled t-test and Welch's test. This presentation of results, while not forcing the user to select one test, certainly implies the user would be foolish to select Welch's test if Levene's test fails to reject the null hypothesis of equal variances, and vice versa.

What kind of dummy would ignore Levene's test output when selecting which t-test to interpret?


It's not just software. Wikipedia states the following on the page for Levene’s test:

Levene’s test is often used before a comparison of means. When Levene’s test shows significance, one should switch to more generalized tests that is [sic] free from homoscedasticity assumptions (sometimes even non-parametric tests).

And if you were to google "how to conduct a two-sample t-test in R"? You might get pointed to a blog post that states:

Before proceeding with the t-test, it is necessary to evaluate the sample variances of the two groups, using a Fisher’s F-test to verify the homoskedasticity (homogeneity of variances).

Wow, pretty convincing! How could so many sources be wrong?

Tests of Homoscedasticity Break Down Too

The major problem with the conditional procedure is that the test for equality of variances (homoscedasticity) breaks down when variances are close but not actually equal. It has trouble detecting this difference—statistically, this is called "low power"—and so it frequently selects the "equal variances" pooled t-test, which is the incorrect choice. The pooled t-test then produces Type I errors at up to a 50% higher rate than the specified alpha level.

What compounds this problem is unbalanced sample sizes—one sample is much larger than the other. The theory behind the pooled t-test only holds if (1) variances are equal, or (2) sample sizes are equal in a balanced experiment. If one of these conditions is exactly true, we can do without the other—but approximately true won't cut it.
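A quick back-of-the-envelope calculation shows why (the sample sizes and variances below are assumed values for illustration): with unbalanced samples and unequal variances, the pooled variance estimator converges to the wrong quantity.

```python
# Why "approximately equal" isn't enough: with unbalanced samples, the
# pooled estimator targets the wrong quantity. Assumed illustrative values.
n1, n2 = 10, 50
var1, var2 = 2.25, 1.0   # slightly unequal population variances

# True variance of the difference in sample means:
true_var_diff = var1 / n1 + var2 / n2

# The pooled estimator converges to the df-weighted average variance,
# which underweights the small, high-variance sample:
pooled_target = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
pooled_var_diff = pooled_target * (1 / n1 + 1 / n2)

print(true_var_diff, pooled_var_diff)
```

Because the pooled estimate understates the true variance of the mean difference, the pooled t statistic is inflated and the test rejects too often.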

If the variances are slightly unequal and the sample sizes are very unbalanced, we get the worst-case scenario, where the Type I error of the conditional procedure can be 50% higher than the specified alpha, as shown below. This figure compares the Type I error of the conditional procedure (with Levene's test as the first step) to that of the unconditional procedure (just Welch's test) using results from a simulation study (10,000 simulations per data point). The unconditional procedure does well across the board. The conditional procedure is only okay when sample sizes are close to balanced.

Recall we want the Type I error to be 0.05 for a level alpha=0.05 test. Any deviations over 0.05 are bad news!


Next, we'll fix the sample size ratio at 0.2 and vary the variances. Here we see that the conditional procedure is okay when the variances are very different—when Levene's test has high power—but breaks down in the middle area where the variances are slightly different. The unconditional procedure is again fine across the board.
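To give a flavor of this kind of simulation study (this is not the code behind the figures—it's a small pure-Python sketch with assumed parameters, large-sample normal critical values, and a log-variance-ratio test standing in for Levene's test):

```python
# Sketch of a Type I error simulation: conditional procedure (variance
# test first, then pooled vs Welch) versus always using Welch's test.
# All parameters are assumed values for illustration.
import math
import random

random.seed(1)

N_SIM = 4000
N1, N2 = 30, 150          # unbalanced: ratio 0.2
SD1, SD2 = 1.25, 1.0      # slightly unequal variances; means are equal
Z_CRIT = 1.96             # two-sided 5% critical value (normal approximation)

def mean_var(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

def reject_pooled(m1, v1, m2, v2):
    sp2 = ((N1 - 1) * v1 + (N2 - 1) * v2) / (N1 + N2 - 2)
    return abs(m1 - m2) / math.sqrt(sp2 * (1 / N1 + 1 / N2)) > Z_CRIT

def reject_welch(m1, v1, m2, v2):
    return abs(m1 - m2) / math.sqrt(v1 / N1 + v2 / N2) > Z_CRIT

def variances_differ(v1, v2):
    # Large-sample test of H0: equal variances via the log variance ratio
    # (a dependency-free stand-in for Levene's test).
    z = math.log(v1 / v2) / math.sqrt(2 / (N1 - 1) + 2 / (N2 - 1))
    return abs(z) > Z_CRIT

cond_rej = welch_rej = 0
for _ in range(N_SIM):
    m1, v1 = mean_var([random.gauss(0, SD1) for _ in range(N1)])
    m2, v2 = mean_var([random.gauss(0, SD2) for _ in range(N2)])
    if variances_differ(v1, v2):
        cond_rej += reject_welch(m1, v1, m2, v2)
    else:
        cond_rej += reject_pooled(m1, v1, m2, v2)
    welch_rej += reject_welch(m1, v1, m2, v2)

print(f"conditional Type I error: {cond_rej / N_SIM:.3f}")
print(f"Welch Type I error:       {welch_rej / N_SIM:.3f}")
```

Since the null hypothesis of equal means is true in every simulated dataset, any rejection is a Type I error—and the conditional procedure rejects noticeably more often than nominal while Welch's test stays near 5%.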



So we have seen that the conditional procedure has some pretty serious issues! The following Q&A addresses what we can take away from this.

Question: Are there any scenarios when we should use the conditional procedure?

Answer: No.

It turns out the argument that the conditional procedure has greater power does not hold water. While it does have greater power in some circumstances, it has higher rates of Type I error in those same circumstances! You can email me for those results or consult the literature.

In the borderline scenario where the variances are close, one should definitely not use the conditional procedure. Welch's test performs well in such scenarios.

Question: Why is the conditional procedure so prevalent?

Answer: At a high level, it's because the people recommending it don't understand the statistical theory of the tests.

I also think the following factors contribute: the appearance of the "unbiased statistician," the idea that math is mechanical, an engineering approach to statistics, and the idea that we can extract deterministic knowledge from data (not subject to randomness).

Question: What lessons should we take away from this?

Answer: I learned that relying on tests to assess assumptions can lead to unintended consequences. The conditional procedure is perhaps akin to an overfitting problem where researchers may attempt to extract more information from the data than is available.

Lastly, mechanical procedures don't necessarily lead to better outcomes. It's best to use your own knowledge of the data, the data-generating process, and the context of the data to select methods.

Which is why we need statisticians!

Using Changepoints in Heart Rate Variability to Identify Activities

In the course of a project on changepoint methods for a time series course, I found that there is not a lot of literature documenting analysis of heart rate and fitness-tracking data. Presumably there are some neat analytics in a product like Strava Premium, but those methods are proprietary. Since the data are relatively straightforward (pace, heart rate, and altitude), it would be neat to explore whether there are some easy insights in the data.

For this project, I looked at heart rate variability (HRV), which is commonly known as a metric for recovery. It seems like it's also a good indicator of "activity types." See below for a quick demonstration of changepoint-in-variance methods on heart rate data I collected on the Mailbox Peak hike in North Bend, WA (8,690 m round trip, 1,220 m elevation gain) using a Suunto Ambit2 HR watch and heart rate belt. Although heart rate data is discrete, I found that lag-1 differenced heart rate is approximately normal for a given activity (e.g., running) with minimal autocorrelation beyond lag 0. Thus the assumption of an i.i.d. Gaussian process appeared to be reasonable for these data.

Applying the PELT method to detect changepoints in variance with a normal likelihood cost appeared to identify different activities and phases in the hike: an initial "try hard" phase where pace was erratic, a cardiovascular phase where heart rate variability was lower as pace settled, a rest phase (around the discontinuity in the data), and the descent phase. The binary segmentation method with a cumulative sums of squares cost identified a few additional changepoints, suggesting possible overfitting. Note that we omitted some outliers corresponding to short breaks (visible in the figure) to clean up the output.
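To give a flavor of the approach (this is not my actual analysis or the PELT algorithm—just a minimal single-changepoint likelihood scan on simulated lag-1 heart rate differences):

```python
# Minimal illustration of changepoint-in-variance detection on simulated
# lag-1 heart-rate differences. The real analysis used PELT and binary
# segmentation; this is a single-changepoint Gaussian likelihood scan.
import math
import random

random.seed(7)

# Simulate two activity phases: erratic pace (sd = 3), then settled (sd = 1).
# The true changepoint is at index 120.
diffs = ([random.gauss(0, 3) for _ in range(120)] +
         [random.gauss(0, 1) for _ in range(180)])

def neg_loglik(seg):
    # Gaussian cost for a mean-zero segment: n/2 * (log var + 1), up to constants.
    n = len(seg)
    var = sum(x * x for x in seg) / n
    return 0.5 * n * (math.log(var) + 1)

# Scan all candidate split points; keep the one minimizing the total cost.
best_tau = min(range(10, len(diffs) - 10),
               key=lambda t: neg_loglik(diffs[:t]) + neg_loglik(diffs[t:]))
print("estimated changepoint:", best_tau)
```

PELT generalizes this idea to an unknown number of changepoints with a per-changepoint penalty, pruning candidate splits so it runs in roughly linear time.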


Mapping Public Transit Accessibility

A few months ago, I moved almost exactly one mile up Capitol Hill, from an area nearly adjacent to downtown to a more residential and quiet area. My old apartment was kitty-corner to a dance club and Occupy Seattle and had experienced a break-in where the only thing stolen was my old bike. It was a crazy and fun place to live for nearly two years.

My new neighborhood is beautiful and peaceful, but the move has added considerably more travel time to any public transit travel outside of the downtown core of Seattle. Andrew Hardin at the University of Colorado recently created an interactive visualization that demonstrates exactly how the move affected my transit times.

Trips to the neighborhoods and cities of Fremont (north), Magnolia (west), West Seattle (southwest), Georgetown (south), Kirkland (east), and Issaquah (east) are now 50-minute-long hauls—all of these places were previously within 40 minutes. (Ballard remains a long haul from Capitol Hill.)

Seattle is investing in express buses, streetcars, and rapid transit that should increase my transit range, but I still feel public transit in the city has a long way to go. These maps don't reflect the frequent bus delays that can make transfers difficult to time—one of the major advantages of rapid transit.

The King County transit system can boast, however, of being one of the few systems that allows for bus-to-hike. You can see some of these options off the 520 in Issaquah (southeast). In less than two hours, I can take a series of buses to Mt. Si and hike 3,500 feet to a snow-covered peak. Now that's a different kind of accessibility!

Interactive Storytelling

The New York Times has put together a superb archive of interactive infographics, visualizations, and photo/video journalism that they are calling 2013: The Year of Interactive Storytelling. This is, in fact, a misuse of the term—Wikipedia defines interactive storytelling as "a form of digital entertainment in which users create or influence a dramatic storyline through actions." But it is nonetheless a term that seems to encompass the experimental and innovative formats the Times has begun incorporating into their reporting.

My favorite piece was How Y’all, Youse and You Guys Talk, a combination of a survey and map depicting linguistic similarity to the user (within the US). I think it's fair to say this was the most talked about visualization of the year—and among my friends, probably the most discussed visualization ever. It popped up in social media, in my email inbox, and in conversations over beers.

Addendum: It turns out the piece was the most read article of the year on the Times' site, despite coming online on December 21st. Remarkable.

My map: born and raised in Berkeley, CA, college in Minnesota and Southern California, current resident of Seattle, WA. One of my biggest work clients is located in Michigan—how many milliseconds does it take them to realize I'm an out-of-towner?


I think this visualization succeeds because it reminds us, in a highly personal way, of the communities and cultures we come from, years after we have physically left them. My dad's map reflects the decade he spent growing up in Washington D.C., despite the 40 years he's spent in California.

The results are memorable because they challenge some of our conventional notions of place divisions. In the West, the urban/rural divisions seen in voting patterns are not discernible (Minneapolis, Chicago, and Washington D.C. are easily spotted, however). State lines seem to matter to some extent, but the trends bleed across the borders.

I do wish there was additional annotation and explanation. The visualization presents you with the words most definitive of the three most and three least similar cities. But I have no idea what pronunciations or vocabulary I share with South Carolina and Maine.

In a high school linguistics class, I remember being told the US has a remarkably low number of dialects given its size, which is of course a product of the country's young history. This visualization does not refute that, but it does show a surprising amount of linguistic diversity in light of a dominant national media and high rates of mobility between states and regions.

In conclusion, it's a hella savage visualization.

Scientific Integrity

Scientific integrity ... corresponds to a kind of utter honesty--a kind of leaning over backwards. For example, if you're doing an experiment, you should report everything that you think might make it invalid--not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you've eliminated by some other experiment, and how they worked--to make sure the other fellow can tell they have been eliminated.

Important words from Richard Feynman's 1974 Caltech Commencement Address.

Cited in Arthur Lupia's great plenary talk at Evaluation 2013.