Eric Fischer works on data visualization and analysis tools at Mapbox. He was previously an artist in residence at the Exploratorium and before that was on the Android team at Google. He is best known for “big data” projects using geotagged photos and tweets, but has also spent a lot of time in libraries over the years searching through old plans and reports trying to understand how the world got to be the way it is.
Eric was interviewed for GeoHipster by Amy Smith.
Q: You’re coming up on four years at Mapbox, is that right? What do you do there?
A: I still feel like I must be pretty new there, but it actually has been a long time, and the company has grown tremendously since I started. My most important work at Mapbox has been Tippecanoe, an open-source tool whose goal is to be able to ingest just about any kind of geographic data, from continents to parcels to individual GPS readings, numbering into the hundreds of millions of features, and to create appropriate vector tiles from them for visualization and analysis at any scale. (The name is a joke on “Tippecanoe and Tyler Too,” the 1840 US Presidential campaign song, because it makes tiles, so it’s a Tyler.)
Q: I read that you’re working on improving the accuracy of the OpenStreetMap base map. Can you describe that process? I’m guessing one would need to figure out how accurate it is in the first place?
A: I should probably update my bio, because that was originally a reference to a project from long ago: to figure out whether it would be possible to automatically apply all the changes that the US Census had made to their TIGER/Line base map of the United States since it was imported into OpenStreetMap in 2006, without overriding or creating conflicts with any of the millions of edits that had already been made directly to OpenStreetMap. Automated updates proved to be too ambitious, and the project was scaled back to identifying areas where TIGER and OpenStreetMap differed substantially so they could be reconciled manually.
But the work continues. These days, TIGER is valuable to OpenStreetMap mostly as a source of street names and political boundaries, while missing and misaligned streets are now identified mostly through anonymized GPS data. Tile-count is an open-source tool that I wrote a few months ago for accumulating, normalizing, and visualizing the density of these GPS tracks so they can be used to find streets and trails that are missing from OpenStreetMap.
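For readers who want a concrete picture of "accumulating and normalizing density," here is a minimal sketch, not Tile-count's actual implementation. It assumes GPS readings arrive as (longitude, latitude) pairs and bins them into standard Web Mercator tiles. The function names and the logarithmic scaling are illustrative assumptions, not anything described in the interview.

```python
import math
from collections import Counter

def lonlat_to_tile(lon, lat, zoom):
    """Return the Web Mercator (slippy map) tile x, y containing a WGS84 point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the far edge of the map stay inside the tile grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

def accumulate(points, zoom):
    """Count how many GPS readings fall in each tile at the given zoom."""
    counts = Counter()
    for lon, lat in points:
        counts[lonlat_to_tile(lon, lat, zoom)] += 1
    return counts

def normalize(counts):
    """Scale counts to the range (0, 1] on a log scale (an assumed choice),
    so that a few very busy tiles do not wash out quieter ones."""
    if not counts:
        return {}
    peak = max(counts.values())
    return {tile: math.log1p(c) / math.log1p(peak) for tile, c in counts.items()}
```

Tiles with high normalized density that have no matching street in the base map would then be candidates for a closer look.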
Q: In the professional mapping world, I’ve noticed there’s a nervousness around datasets that aren’t time-tested, clearly documented, and from an authoritative source such as the US Census. These official datasets are great resources of course, but there’s a growing amount of data at our fingertips that’s not always so clean or complete. You’ve been successful at getting others to see that there’s a lot to learn about cities and people with dynamic (and sometimes messy) data that comes from many different sources. Do you have any advice on warming people up to thinking creatively and constructively with unconventional datasets?
A: I think the key thing to be aware of is that all data has errors, just varying in type and degree. I don’t think you can spend very much time working with Census data from before 2010 without discovering that a lot of features on the TIGER base map were missing or don’t really exist or are tagged with the wrong name or mapped at the wrong location. TIGER is much better now, but a lot of cases still stand out where Census counts are assigned to the wrong block, either by mistake or for privacy reasons. The big difference isn’t that the Census is necessarily correct, but that it tries to be comprehensive and systematic. With other data sets whose compilers don’t or can’t make that effort, the accuracy might be better or it might be worse, but you have to figure out for yourself where the gaps and biases are and how much noise there is mixed in with the signal. If you learn something interesting from it, it’s worth putting in that extra effort.
Q: Speaking of unconventional data: you maintain a GitHub repository with traffic count data scraped from old planning documents. For those who may not be familiar, traffic counts are usually collected for specific studies or benchmarks, put into a model or summarized in a report… and then rarely revisited. But you’ve brought them back from the grave for many cities and put them in handy easy-to-use-and-access formats, such as these ones from San Francisco. Are you using them for a particular project? How do you anticipate/hope that others will use them?
A: The traffic count repository began as a way of working through my own anxieties about what unconventional datasets really represent. I could refer to clusters of geotagged photos as “interesting” and clusters of geotagged tweets as “popular” without being challenged, but the lack of rigor made it hard to draw any solid conclusions about these places.
And I wanted solid conclusions because I wasn’t making these maps in a vacuum for their own sake. I wanted to know what places were interesting and popular so that I could ask the follow-up questions: What do these places have in common? What are the necessary and sufficient characteristics of their surroundings? What existing regulations prevent, and what different regulations would encourage, making more places like them? What else would be sacrificed if we made these changes? Or is the concentration of all sparks of life into a handful of neighborhoods in a handful of metro areas the inevitable consequence of a 150-year-long cycle of adoption of transportation technology?
So it was a relief to discover Toronto’s traffic count data and that the tweet counts near intersections correlated reasonably well with the pedestrian counts. Instead of handwaving about “popularity” I could relate the tweet counts to a directly observable phenomenon.
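A comparison of this kind can be sketched as a simple correlation between two counts per intersection. The example below is an illustration only: the numbers are invented placeholders, not Toronto's data, and taking logarithms before correlating is an assumption rather than a method described in the interview.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Invented placeholder values, one pair per hypothetical intersection.
tweet_counts = [12, 40, 95, 310, 720]
pedestrian_counts = [800, 2100, 5200, 14000, 31000]

# Counts span orders of magnitude, so compare them on a log scale (assumed).
r = pearson([math.log(t) for t in tweet_counts],
            [math.log(p) for p in pedestrian_counts])
```

A value of `r` near 1 would indicate that intersections with more tweets also tend to have more pedestrians. It would not say anything about why.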
And in fact the pedestrian counts seemed to be closer than tweet counts to what I was really looking for in the first place: an indicator of where people prefer to spend time and which places they prefer to avoid. Tweets are reflective of this, but also capture lots of places where people are enduring long waits (airport terminals being the most blatant case) rather than choosing to be present. Not every pedestrian street crossing is by choice either, but even when people don’t control the origin and destination of their trips, they do generally have flexibility to choose the most pleasant route in between.
That was enough to get me fixated on the idea that high pedestrian volume was the key to everything and that I should find as many public sources of pedestrian counts as possible so I could understand what the numbers look like and where they come from. Ironically, a lot of these reports that I downloaded were collecting pedestrian counts so they could calculate Pedestrian Level of Service, which assumes that high crossing volumes are bad, because if volumes are very high, people are crowded. But the numbers are still valid even if the conclusions being drawn from them are the opposite.
What I got out of it was, first of all, basic numeracy about the typical magnitudes of pedestrian volumes in different contexts and over the course of each day. Second, I was able to make a model to predict pedestrian volumes from surrounding residential and employment density, convincing myself that proximity to retail and restaurants is almost solely responsible for the number, and that streetscape design and traffic engineering are secondary concerns. Third, I disproved my original premise, because the data showed me that there are places with very similar pedestrian volumes that I feel very differently about.
If “revealed preference” measured by people crossing the street doesn’t actually reveal my own preferences, what does? The ratio of pedestrians to vehicles is still a kind of revealed preference, of mode choice, but the best fit between that and my “stated preference” opinions, while better than pedestrian volume alone, requires an exponent of 1.5 on the vehicle count, which puts it back into the realm of modeling, not measuring. There may yet be an objective measure of the goodness of places, but I haven’t found it.
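One possible reading of that fitted ratio is pedestrians divided by vehicles raised to the power 1.5. The sketch below assumes that form. The function name and the sample volumes are hypothetical, and the interview does not give the exact formula.

```python
def pedestrian_vehicle_index(pedestrians, vehicles, exponent=1.5):
    """Pedestrian volume divided by vehicle volume raised to a fitted exponent.

    With exponent=1 this is the plain pedestrian-to-vehicle ratio. The 1.5
    default is the exponent mentioned in the interview; the functional form
    itself is an assumption.
    """
    if vehicles <= 0:
        raise ValueError("vehicle count must be positive")
    return pedestrians / vehicles ** exponent

# Hypothetical intersections with the same plain ratio (10 pedestrians per vehicle).
quiet_street = pedestrian_vehicle_index(1_000, 100)
busy_arterial = pedestrian_vehicle_index(10_000, 1_000)
```

With an exponent above 1, the busy arterial scores lower than the quiet street even though their plain ratios are identical, which is why the exponent makes this a modeling choice and not a direct measurement.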
Why did I put the data on GitHub? Because of a general hope that if data is useful to me, it might also be useful to someone else. The National Bicycle and Pedestrian Documentation Project is supposedly collecting this same sort of data for general benefit, but as far as I can tell has not made any of it available. Portland State University has another pedestrian data collection project with no public data. Someday someone may come up with the perfect data portal and maybe even release some data into it, but in the meantime, pushing out CSVs gets the data that actually exists but has previously been scattered across hundreds of unrelated reports into a form that is accessible and usable.
Q: What tools do you use the most these days to work with spatial data (including any tools you’ve created — by the way, thanks for sharing your geotools on GitHub)?
A: My current processes are usually very Mapbox-centric: Turf.js or ad hoc scripts for data analysis, Tippecanoe for simplification and tiling, MBView for previewing, and Mapbox Studio for styling. Sometimes I still generate PostScript files instead of web maps. The tool from outside the Mapbox world that I use most frequently is ogr2ogr for reprojection and file format conversion. It is still a constant struggle to try to make myself use GeoJSON for everything instead of inventing new file formats all the time, and to use Node and standard packages instead of writing one-of-a-kind tools in Perl or C++.
Q: You’re prolific on Twitter. What do you like about it, and what do you wish was better?
A: I was an early enough adopter of Twitter to get a three-letter username, but it wasn’t until the start of 2011 that I started really using it. Now it is my main source of news and conversation about maps, data, housing policy, transportation planning, history, and the latest catastrophes of national politics, and a place to share discoveries and things to read. I’ve also used long reply-to-myself Twitter threads as a way of taking notes in public as I’ve read through the scientific literature on colorblindness and then a century of San Francisco Chronicle articles revealing the shifting power structures of city planning.
That said, the Twitter timeline interface has become increasingly unusable as they have pulled tweets out of sequence into “in case you missed it” sections and polluted the remainder of the feed with a barrage of tweets that other people marked as favorites. I recently gave up entirely on the timeline and started reading Twitter only through a list, the interface for which still keeps the old promise that it will show you exactly what you subscribed to, in order.
Q: If you could go back in time, what data would you collect, from when, and where?
A: I would love to have pedestrian (and animal) intersection crossing volume data from the days before cars took over. Was the median pedestrian trip length substantially longer then, or can the changes in pedestrian volumes since motorization all be attributed to changes in population and employment density?
Speaking of which, I wish comprehensive block-level or even tract-level population and employment data went back more than a few decades, and had been collected more frequently. So much of the story of 20th century suburbanization, urban and small-town decline, and reconsolidation can only be told through infrequent, coarse snapshots.
And I wish I had been carrying a GPS receiver around with me (or that it had even been possible to do so) for longer, so that I could understand my own historic travel patterns better. I faintly remember walking to school as a kid and wondering, if I don’t remember this walk, did it really happen? Now my perspective is, if there is no GPS track, did it really happen?
Q: Are you a geohipster? Why or why not?
A: I think the most hipster thing I’ve got going on is a conviction that I’m going to find a hidden gem in a pile of forgotten old songs, except that I’m doing my searching in promo copies of 70-year-old sheet music instead of in the used record stores.