Part 1 in a two-part series on MLB attendance. In Part 2, I will look more deeply into how weather affected attendance in 2018.
This is a summary of analysis of an XGBoost model for MLB attendance from 1995-2018, constructed with the goal of predicting attendance for a new Oakland A’s stadium. I simulated a variety of stadiums at two sites: Howard Terminal by Jack London Square and the current Oakland Coliseum site. Here are the main takeaways:
New stadiums provide a steep one-year bump in attendance, but increased attendance is hard to sustain.
Downtown stadiums draw more people to games and may help sustain attendance, but my model did not show a large effect. An amazing stadium at the Coliseum site with a winning team could also be a draw, especially with a ballpark village that can serve as a “downtown.”
Local weather differences between sites have a negligible impact on annual attendance, with the caveat that I could not find a weather station that showed the rumored Candlestick-like weather at Howard Terminal. The difference between West Alameda and Oakland Airport weather had a minimal impact on predicted attendance.
Based on attendance and game data from 1995 to 2018, the weight of evidence suggests most new stadiums do not substantially alter attendance long-term. The only obvious recipe for long-term attendance is a new stadium coinciding with on-field success.
The A’s have two things working against them: (a) they don’t have a track record of drawing large crowds, and (b) there are few precedents for the long-term transformation that the A’s are seeking. Dreaming BIG definitely seems like a prudent strategy if the A’s want to break out of the “small market” bubble.
Even though they play in the cavernous Oakland Coliseum with its sewage leaks and possums, being an Oakland A’s fan is one of the most fulfilling experiences in sports. The franchise consistently outperforms its low budget (lowest in the majors in 2018) and comes out of nowhere to make the playoffs. After three consecutive last-place finishes from 2015 to 2017, they are officially back in the playoffs for at least one game.
Led by Moneyball genius Billy Beane and playing in the concrete bowl that is the consensus worst or second-worst stadium in MLB, the A’s represent the East Bay nexus of intellectuals and blue-collar workers. The diverse fan base developed a soccer-style culture of drums and vuvuzelas before Major League Soccer was a thing. When the Coliseum is packed, it’s the craziest, zaniest, and loudest venue in MLB.
Most of the time it isn’t packed. It’s been that way for decades.
Somewhere between the early 2000s Moneyball era and 2011, A’s fans abandoned all other chants in favor of a single, unabashedly local one: “Let’s Go Oakland! clap-clap clap-clap-clap.”
There is a simple explanation: the A’s ownership under Lew Wolff attempted numerous times to move the A’s out of Oakland. First it was Fremont, then San Jose. Both fell through. They looked in Oakland as well, and finally in 2016, frustrated with the never-ending search for a new ballpark, Lew Wolff stepped aside and Dave Kaval stepped in as president, coinciding with a recommitment of the franchise to Oakland.
MLB’s revenue sharing, which redistributes money from the richest teams to the poorest, has played a part in keeping the A’s competitive. However, MLB is pulling the plug on revenue sharing for the A’s because the new stadium search has dragged on too long. MLB considers the current Oakland Coliseum to be subpar for the league, and the A’s consider a new stadium to be their only hope for increasing revenue.
So a new stadium is a must. Currently there are two sites on the table:
Howard Terminal, a waterfront site close to Jack London Square, and
The Oakland Coliseum site, which will soon be vacated by the Raiders and Warriors, leaving only the A’s on the East Oakland parcel.
Neither is a true “downtown” site, as Howard Terminal is about a 20-minute walk from the 12th Street BART station and a 10-minute walk from Jack London Square. The Coliseum is located in an industrial warehouse district where I have heard gunshots leaving a game. Both sites will likely be accompanied by ancillary development (i.e., a ballpark village with housing, restaurants, and offices), especially if the A’s choose to build on the Coliseum site.
New Ballpark Scenarios
It is no secret that the A’s are hoping a new ballpark will be transformative for the franchise. When the San Francisco Giants moved to AT&T Park, they sold out the stadium for 10 years and became the Bay Area’s baseball team. The A’s are hoping to strike back. But a more reasonable expectation is a 2-3 year bump in attendance and increased revenue from pricier seats and suites.
The figure below shows the increase in average attendance for the first six years for the following new ballparks: AT&T Park in San Francisco, PNC Park in downtown Pittsburgh, Petco Park in downtown San Diego, Citizens Bank Park in Philadelphia, Citi Field in Queens, Nationals Park in DC, Target Field in downtown Minneapolis, Marlins Park in Miami, and SunTrust Park in Georgia (only open for two years). Attendance is being compared to the average attendance in the old ballpark (back to 1995).
So the best case appears to be AT&T Park, with an amazing increase of more than 15,000 fans per game, and the worst case appears to be Marlins Park. The Marlins were one of the worst teams in baseball one year after opening, as were the Pirates in PNC Park. A weird anomaly is Citi Field: in 2008, the year before Citi Field opened, the Mets were good before collapsing down the stretch. The beneficiaries of the collapse were the Phillies and Citizens Bank Park; the Phils won the World Series in 2008, the fifth year for the stadium, which opened in 2004.
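The comparison behind those numbers is simple arithmetic: each new-park season's average attendance minus the team's average in its old park. Here is a minimal sketch of that calculation. The function name and the attendance figures are invented for illustration and are not taken from the dataset.

```python
from statistics import mean


def attendance_bump(old_park_avgs, new_park_avgs):
    """Return each new-park season's average minus the old-park baseline."""
    baseline = mean(old_park_avgs)  # average per-game attendance in the old park
    return [round(avg - baseline) for avg in new_park_avgs]


# Hypothetical per-game averages, made up for this example only.
old_park = [20000, 22000, 21000]
new_park = [30000, 27000, 24000]

bumps = attendance_bump(old_park, new_park)
print(bumps)
```

A shrinking sequence of bumps is the pattern described above: a big first year that fades.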
Clearly team performance is wrapped up in the attendance numbers above. It’s reasonable to guess that stadium location and weather could also affect attendance.
So to predict attendance in a new A’s ballpark, let’s lay out some scenarios:
Team performance: (1) The A’s are good: similar to 2011–2014, (2) The A’s are bad: similar to 2007-2010.
Stadium location: (1) Howard Terminal (downtown-ish): AT&T, Target, or PNC are comps, (2) Oakland Coliseum site: SunTrust or Citizens Bank are comps.
Weather at stadium location: (1) Howard Terminal is rumored to have weather as cold and as windy as the Giants’ old Candlestick Park, (2) we can assume weather at the Oakland Coliseum will remain the same. For Howard Terminal, I used West Alameda weather data and for the Coliseum, I used Oakland International Airport data.
Once we have a model for predicting attendance, we can plug in these scenarios and predict the A’s attendance in a new ballpark!
I used boosted trees with XGBoost to model MLB attendance for individual games for all teams 1995-2018. The dataset included daily weather data, FiveThirtyEight Elo ratings, and game information from Baseball-Reference. A decision tree is an intuitive method that allows for “decision rules,” e.g., if the A’s are playing in the Coliseum in June and the temperature is above 70, then the attendance will be about 30,000. Decision trees on their own don't work very well because reality is usually not so structured. Boosting is a method of sequentially building a series of small trees to focus on unexplained variance in the data, and the XGBoost package is an efficient implementation of Gradient Boosted Trees that is frequently a part of winning Kaggle entries.
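To make the boosting idea concrete, here is a toy sketch in plain Python. It is not XGBoost and not the model used in this analysis: it fits one-split trees ("stumps") to the residuals of the previous rounds, using a single made-up temperature feature. All names and numbers are assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """Find the one-threshold split on x that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict(model, x):
    """Base prediction plus the (already shrunken) contribution of every stump."""
    base, stumps = model
    return base + sum(left if x <= threshold else right
                      for threshold, left, right in stumps)


def boost(xs, ys, rounds=50, learning_rate=0.3):
    """Sequentially add stumps, each one fit to what is still unexplained."""
    base = sum(ys) / len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - predict((base, stumps), x) for x, y in zip(xs, ys)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, learning_rate * left_mean, learning_rate * right_mean))
    return base, stumps


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical data: game-time temperature (F) and attendance.
temps = [55, 60, 65, 70, 75, 80]
attendance = [15000, 16000, 18000, 24000, 29000, 30000]

model = boost(temps, attendance)
baseline_mse = mse(attendance, [sum(attendance) / len(attendance)] * len(attendance))
boosted_mse = mse(attendance, [predict(model, t) for t in temps])
print(baseline_mse, boosted_mse)
```

XGBoost adds regularization, deeper trees, and many features on top of this basic loop, but the "fit the leftovers" idea is the same.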
The model achieved a score of 0.854. For additional discussion of methods, skip to the end of this document.
First off, let’s validate the model. I excluded the second half of 2018 from the model training set to use for qualitative testing. Here’s what the predictions and actual data look like for the A’s second half.
Aside from the A’s-Giants series from 7/20 to 7/22—a rivalry which has heated up lately—the model seems to predict when attendance spikes and dips while having more of a central tendency than the actual data. A big missing variable is promotions: fireworks nights, bobblehead days, etc. I thought that could be a deal breaker, but overall I think the model performs quite well, as the 0.854 score reflects.
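The holdout itself is just a date split. A minimal sketch, assuming each game is a record with a date field; the July 1 cutoff is my placeholder for "second half of 2018," not necessarily the exact date used.

```python
from datetime import date

# Assumed cutoff for the "second half" of 2018; the exact date is a guess.
HOLDOUT_START = date(2018, 7, 1)


def split_holdout(games, start=HOLDOUT_START):
    """Games before the cutoff go to training; the rest are held out."""
    train = [g for g in games if g["date"] < start]
    holdout = [g for g in games if g["date"] >= start]
    return train, holdout


# Hypothetical rows for illustration.
games = [
    {"date": date(2018, 4, 3), "attendance": 11000},
    {"date": date(2018, 6, 30), "attendance": 19000},
    {"date": date(2018, 7, 21), "attendance": 45000},
]
train, holdout = split_holdout(games)
print(len(train), len(holdout))
```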
Predicting Attendance in a new Oakland Ballpark
Predicting A’s attendance in a new ballpark is, on the surface, an impossible task. This is because the effect of a given ballpark is tangled up with the effect of the team, which we can’t separate from the ballpark. Is the reason the Giants draw so many fans a product of fan loyalty or the beautiful ballpark location?
A tree-based model will occasionally look at the stadium and adjust the model based on the stadium. Other times, it will look at the team and make the adjustment. So any stadium “effect” is going to be diluted and therefore we should understand that these are conservative estimates of the differences between stadiums.
To predict attendance in a new ballpark, I kept all the game data the same but changed the ballpark, the ballpark age, a flag for downtown location, and ballpark capacity (set at 35,000). So this is a retrospective prediction: real past games replayed in a hypothetical stadium.
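In code, that swap might look something like the sketch below. The field names (`ballpark`, `ballpark_age`, `downtown`, `capacity`, `season`) are hypothetical, and counting the opening season as age 0 is my assumption for the example.

```python
def simulate_new_ballpark(games, park_name, opening_year, downtown, capacity=35000):
    """Copy real game rows and overwrite only the stadium-related fields."""
    simulated = []
    for game in games:
        row = dict(game)  # copy so the real data is left untouched
        if row["season"] >= opening_year:  # earlier seasons stay in the old park
            row["ballpark"] = park_name
            row["ballpark_age"] = row["season"] - opening_year  # assumed: 0 in year one
            row["downtown"] = downtown
            row["capacity"] = capacity
        simulated.append(row)
    return simulated


# Hypothetical game rows for illustration.
real_games = [
    {"season": 2007, "opponent": "SEA", "ballpark": "Coliseum",
     "ballpark_age": 41, "downtown": False, "capacity": 35067},
    {"season": 2009, "opponent": "TEX", "ballpark": "Coliseum",
     "ballpark_age": 43, "downtown": False, "capacity": 35067},
]
what_if = simulate_new_ballpark(real_games, "Target Field", 2008, downtown=True)
print(what_if[1])
```

Everything about the game itself (opponent, date, team quality) is left alone, so any change in predicted attendance comes from the stadium fields.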
Making predictions retrospectively means the A’s chilly relationship with their fans circa 2008-2011 is coded into the model (“team+year” or “fan engagement effect” discussed in Part 1b). If the A’s were to build a new ballpark, fan loyalty would likely improve. So the estimate of the effect of a new ballpark is again conservative.
Bad A’s Team
Let’s first look at the attendance predictions for a hypothetical new ballpark constructed in 2008 when the A’s were bad.
We see a bump of 6,000 to 7,000 fans per game in the first year for the new ballpark. Only a substantial first year bump is guaranteed. With a very bad A’s team in 2009, attendance quickly drops. Some of the simulated stadiums withstand the drop slightly better than others. But the model, in a sense, “knows” that when the A’s are bad, their attendance is terrible. There is no data to suggest otherwise.
Let’s take a closer look at the effect of location. This analysis uses the same years (2007 to 2010) but just one stadium: Target Field. The two locations are simulated using different weather data: West Alameda (supposedly by the shore) for Howard Terminal and Oakland International Airport for the Coliseum. I marked one of the Howard Terminal simulations as “downtown.”
This is actually quite interesting. We see that the downtown effect is negligible in the initial year but it emerges as the stadium hits 3 to 4 years old. So downtown stadiums appear to sustain attendance better than suburban stadiums. How much? This graphic says up to 700 fans per game but again, we have to interpret these effects as conservative.
The second finding is the lack of a weather effect. The weather station used for Howard Terminal was slightly colder but with less wind than the weather station for the Coliseum. I don’t know if that is representative of the specific Howard Terminal site, and perhaps the wind and temperature offset each other. In Part 2 of this series, I’ll be looking at weather in detail.
Good A’s Team
Next we look at predicted attendance for a hypothetical new A’s ballpark constructed in 2012, during a good A’s run from 2011 to 2014.
The good team scenario clearly has a longer new stadium bump than the bad team scenario. Because the team was good in 2012, the first-year attendance momentum carries forward: there is an autoregressive effect (i.e., the model looks at the previous year’s attendance). Winning in the first year of the stadium leads to a large bump that becomes spread across time. In other words, first-year winning plus a new stadium equals attendance over the medium term.
But notice that the slope of the predicted attendance lines for most stadiums is slightly less than the slope of the actual attendance from 2013 to 2014. This is a period when the A’s were getting even better. It’s as if the A’s have a cap on attendance in the model, and no matter how good they got, the model refuses to predict more than 30,000 per game in attendance. I would guess this is because the A’s last drew more than 30,000 fans per game in 1992—which is before the training data for the model.
The exception, of course, is if they played in AT&T Park, or an AT&T-equivalent stadium. Then they would finally cross the 30,000 average attendance line. AT&T Park was transformative for the Giants.
As discussed, I think these simulations are conservative. I would love for the A’s to break my model. But they first need to build the damn ballpark.
This study captures the effect of the average new ballpark. The estimated effect is conservative, particularly for the difference between various new ballparks.
The average new ballpark has a sharp first year bump. Sustaining that bump likely depends on whether the team is winning and the location and quality of the stadium. Most teams are not able to sustain the bump long-term.
A downtown stadium is likely to draw at least slightly larger crowds than a suburban stadium, particularly when the team is not good. I did not see a notable effect of local weather between sites.
There are few precedents for a new stadium transforming a team’s attendance patterns long-term. The A’s will have to think outside of my “black box” model to achieve this goal.
Note: Boosting is a machine learning method and cannot be interpreted as easily as a typical linear model. It isn’t possible, for example, to estimate some “universal” effect of a stadium being downtown. Statistical models are often built for generalization. This model is not. You can think of it as a complicated model for reality that we are using to extrapolate, with no measure of uncertainty, to hypothetical scenarios that are not in reality. That is dangerous! But there is frankly no way a linear model would work well for this problem.
I relied heavily on Troy Hepper’s scraping and cleaning code to download 1990-2018 game data from baseball-reference.com. Troy added moving averages for runs scored and allowed and an indicator for whether a game is within a division. Troy used stadium capacity data manually compiled by Chris Leonard. I also added some new predictors:
Temperature by day: average daily temperature, low temperature, high temperature, heating degree days, and cooling degree days from the NOAA NCDC weather station closest to the stadium.
Average wind speed by day
Metro-area population by year, extrapolated to the 1990s
FiveThirtyEight Elo ratings (i.e., how good the team is), pitcher adjustments, and win probabilities by game
Lag-1 (previous year) winning percentage, playoff status, average attendance, and Wins Above Replacement for the best player on the team (a proxy for star power).
Stadium construction and renovation dates to calculate stadium age
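Two of these predictors are easy to show in code. Degree days are conventionally measured against a 65°F base in the US (the NOAA files may already include them, so computing them by hand is just for illustration), and a lag-1 feature is the previous season's value attached to the current season. The function names and sample values are hypothetical.

```python
BASE_TEMP_F = 65.0  # conventional US base temperature for degree days


def degree_days(avg_temp_f, base=BASE_TEMP_F):
    """Return (heating, cooling) degree days for one day's average temperature."""
    heating = max(0.0, base - avg_temp_f)
    cooling = max(0.0, avg_temp_f - base)
    return heating, cooling


def add_lag1(season_values):
    """Map each season to the previous season's value (None if unavailable)."""
    return {year: season_values.get(year - 1) for year in season_values}


# Hypothetical values for illustration.
print(degree_days(58.0))
print(add_lag1({2011: 18000, 2012: 20000, 2013: 22000}))
```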
I downloaded weather data using the rnoaa package. Despite being familiar with these data from EMI Consulting, I spent many hours identifying missing and incomplete data and then replacing it. For example, I had to duplicate 2007 data for Alameda since 2006 and 2011 data were missing. I used more than 14GB of hourly data to produce the daily weather dataset. The final dataset had 59,661 games and 165 predictors per game.
I also fit the data with LightGBM and assessed whether an ensemble of XGBoost and LightGBM improved predictions. It didn’t. I “one hot encoded” my categorical variables after seeing higher predictions for stadiums at the beginning of the alphabet… don't let anyone tell you it’s not necessary. Parameters were tuned with cross-validation. I achieved a model score of 0.854, which is akin to an R-squared value.
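For readers unfamiliar with the two terms above, here is a small plain-Python sketch of one-hot encoding and of an R-squared calculation. I am assuming the standard definition of R-squared; the exact scoring function behind the 0.854 may differ. Stadium names and numbers are for illustration only.

```python
def one_hot(values):
    """Turn a categorical column into 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return rows, categories


def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


rows, categories = one_hot(["Coliseum", "AT&T Park", "Coliseum"])
print(categories)
print(rows)
```

With integer codes instead, a tree can split on "stadium code < 5," which imposes an alphabetical ordering that means nothing; indicator columns avoid that.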
Lastly, I used game data from 2007-2014 to predict A’s attendance in a new ballpark. A huge part of the model is how good the team is, who they’re playing, the day of the week, etc., so it was necessary to use real game data. The time period was split in two so that 2007-2010 represented a new stadium opening when the team was bad and 2011-2014 represented a new stadium opening when the team was good. I included an autoregressive component so predictions were made one year at a time.
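Predicting one year at a time means each season's predicted average is fed back in as the next season's lagged attendance. Here is a minimal sketch of that loop. The `toy_model` function is a made-up stand-in for the fitted model, and its coefficients are invented.

```python
def forecast_by_year(seasons, first_lag, predict_season):
    """Predict seasons in order, feeding each prediction in as the next lag."""
    lag = first_lag
    forecasts = {}
    for year in sorted(seasons):
        forecasts[year] = predict_season(seasons[year], lag)
        lag = forecasts[year]  # this year's prediction becomes next year's lag
    return forecasts


def toy_model(features, lag_attendance):
    """Hypothetical stand-in for the fitted model, with invented coefficients."""
    return 0.5 * lag_attendance + 10000 + 4000 * features["new_park"]


# Hypothetical inputs: a new park opens in 2012.
seasons = {2012: {"new_park": 1}, 2013: {"new_park": 0}, 2014: {"new_park": 0}}
forecasts = forecast_by_year(seasons, first_lag=20000, predict_season=toy_model)
print(forecasts)
```

Even in this toy version, the first-year bump leaks into later seasons through the lag, which is the "spread across time" behavior described in the good-team scenario.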
Let me know if you have any questions about this analysis. I will share some of the code on GitHub in the coming weeks.