GeoData: Messy (and a peek at the future)
Tuesday, July 10th, 2007
I don’t think people realize how dirty all aspects of local are. Really - other than maybe weather and movies, everything else is as suspect as you can get.
Local data: beyond dirty. There is no central body that contains even the most basic of business info.
Roads data: dirty. Again, no central body that contains all roads/highways/etc. in the US. Even the US government’s data source comes up woefully short.
Geo data: dirty. Rarely mentioned, but data pertaining to both informal space and formal space borders on ridiculous.
Local data gets a lot of attention, roads data gets mentioned here and there, but geo data is rarely (if ever) mentioned.
For example, one thing we wanted to offer at iBegin Source was metro areas - a lot of local data work is done for a metro, and we wanted to make that process simple. So we went to work on the biggest one: New York.
First step was the official US census data. ‘Massive’ barely covers it. See the picture below:

All that area encompassed in black is the official metro area.
The New York County area didn’t help either - a lot of area to the east that should be included (Brooklyn) wasn’t in the county.

DMAs (designated market areas) were even worse, trying to rope as many people as possible into a metro (obviously to charge higher prices).
Private sources didn’t fare better. For example, one of our data sources (very large company) had pre-assigned cities to metro areas. The NY Metro (each point is a city):

In the end, all we ended up with was a headache. And this was a (relatively) simple problem - New York Metro. Sure, there may be issues about a city being included (or not being included), but in general this should be relatively easy. And it only gets far worse with other geo-data.
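To make the problem concrete, here is a minimal sketch of one way a pre-assigned city list could be sanity-checked: flag every city that sits further than some cutoff from the metro's core. This is not how any of the sources above work. The function names, the 100 km cutoff, and the three sample cities are all assumptions made up for illustration, and the coordinates are approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def flag_outliers(center, cities, max_km):
    """Return names of cities further than max_km from the metro center."""
    clat, clon = center
    return [name for name, (lat, lon) in cities.items()
            if haversine_km(clat, clon, lat, lon) > max_km]


# Hypothetical sample data (approximate coordinates), NOT from any vendor file.
NYC_CENTER = (40.7128, -74.0060)
assigned_to_ny_metro = {
    "Newark": (40.7357, -74.1724),
    "Yonkers": (40.9312, -73.8988),
    "Albany": (42.6526, -73.7562),
}

# The 100 km cutoff is an arbitrary assumption; picking it is the hard part.
suspects = flag_outliers(NYC_CENTER, assigned_to_ny_metro, max_km=100)
print(suspects)
```

A plain radius is obviously crude (metros are not circles), but it is enough to surface the kind of far-flung points that made the vendor's NY Metro look so odd.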
And so a little teaser for the future:

That was the Yorkville neighbourhood in Toronto. More to come.