Personal Geotagging: Data wrangling

I recently learned that the process of cleaning datasets so that they can really be used is called “data wrangling”.

At first, I thought that the main data wrangling task in personal geotagging was going to be cleaning the GPS data. Sometimes the GPS is too coarse, sometimes it’s off because not enough satellites were in view, sometimes the GPS was just not turned on when I took the photo.

But, in the course of resetting my cameras’ clocks to standard time today, I realized that I’m going to have to wrangle the timestamps on the photos as well. I had thought that one-hour increments due to daylight saving time and time-zone travel would be the main problem, but now I see that the minutes matter, too. Camera clocks don’t sync with the network or with GPS (at least not on my relatively ancient cameras), and I can move a significant distance in the three or seven minutes by which the clock has drifted away from GPS time.

I’m going to have to be able to apply short time offsets to the timestamps of all photos taken by a given camera on a given day. Setting the correction is going to require precisely recognizing at least one key location in each batch of photos in order to compute the offset.
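The correction itself is trivial once the offset is known. A minimal sketch, assuming photos arrive as (filename, datetime) pairs and that one reference shot in the batch can be matched to a GPS track point (the filenames and times below are invented for illustration):

```python
from datetime import datetime, timedelta

def offset_from_reference(camera_time, gps_time):
    """Clock-drift offset for one camera on one day, computed from a
    single reference shot whose true time is known from the GPS track."""
    return gps_time - camera_time

def correct_timestamps(photos, offset):
    """Apply the same offset to every (filename, datetime) pair in a
    batch taken by that camera on that day."""
    return [(name, taken + offset) for name, taken in photos]

# Suppose the camera ran seven minutes slow that day:
offset = offset_from_reference(
    datetime(2009, 11, 1, 14, 3, 0),    # timestamp the camera recorded
    datetime(2009, 11, 1, 14, 10, 0),   # true time from the GPS track
)
batch = [("IMG_0001.jpg", datetime(2009, 11, 1, 14, 3, 0)),
         ("IMG_0002.jpg", datetime(2009, 11, 1, 14, 5, 30))]
corrected = correct_timestamps(batch, offset)
```

The corrected times, not the camera’s originals, would then be the values looked up in the GPS log.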

That’s for past photos. For the future, I can get in the habit of either correcting the clock frequently or of snapping a reference shot with my cellphone along with each set of camera shots.

Space-time is tricky.

Posted in Geotagging

Personal geotagging: note 1

The common item that lets you find a photo’s location in a GPS track is the time it was taken. In database terminology, time is the join column.

Since I’m starting with photos that are already timestamped, I want to be able to look up their timestamps in some sort of GPS database to see whether there’s a recorded time/location event near that time. Unfortunately, my initial perusal of a couple of geo-extensions to well-known databases (SpatiaLite, PostGIS) shows a generous helping of geometry-oriented query capabilities but no obvious time-query capabilities.
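It may really be that simple: a plain timestamp column with an ordinary index already supports nearest-time lookups, no spatial extension required. A sketch using Python’s built-in SQLite (the table layout, column names, and coordinates are all invented):

```python
import sqlite3

# Storing timestamps as ISO-8601 UTC text means text order == time order,
# so the primary-key index doubles as a time index.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE trackpoint (
                  ts  TEXT PRIMARY KEY,  -- ISO-8601 UTC
                  lat REAL,
                  lon REAL)""")
db.executemany("INSERT INTO trackpoint VALUES (?, ?, ?)", [
    ("2009-11-01T14:08:30Z", 37.4419, -122.1430),
    ("2009-11-01T14:10:15Z", 37.4430, -122.1410),
])

def nearest_fix(db, photo_ts):
    """Return the recorded (ts, lat, lon) closest in time to photo_ts.
    SQLite's julianday() accepts ISO-8601 strings directly."""
    return db.execute(
        """SELECT ts, lat, lon FROM trackpoint
           ORDER BY ABS(julianday(ts) - julianday(?)) LIMIT 1""",
        (photo_ts,)).fetchone()

fix = nearest_fix(db, "2009-11-01T14:10:00Z")
```

A production version would also want a cutoff (no match if the nearest fix is hours away), but the time-query machinery itself is unremarkable.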

It could be that time-indexing is so simple compared with spatial indexing that what I’m looking for is just swamped in the documentation, or it could be that I’ll have to dig some more.

Posted in Geographic, Geotagging, Personal information

Personal data: geotagging photos

I have over 50,000 photos I’d like to tag with the locations they were taken.

In a perfect world, this would be a relatively simple matter of looking up each photo’s creation date and time, finding that date and time in my years of GPS logs, and associating my GPS location for that date and time with the photo.
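That perfect-world join is only a few lines of code. A sketch, assuming each GPS log is a time-sorted list of (unix_seconds, lat, lon) fixes, with linear interpolation between the two fixes that bracket the photo:

```python
import bisect

def locate(track, photo_ts):
    """Interpolate a (lat, lon) for photo_ts from a time-sorted track of
    (unix_seconds, lat, lon) fixes; None if photo_ts is outside the track."""
    times = [t for t, _, _ in track]
    if not track or photo_ts < times[0] or photo_ts > times[-1]:
        return None
    i = bisect.bisect_left(times, photo_ts)
    if times[i] == photo_ts:          # exact hit on a recorded fix
        return track[i][1], track[i][2]
    (t0, lat0, lon0), (t1, lat1, lon1) = track[i - 1], track[i]
    f = (photo_ts - t0) / (t1 - t0)   # fraction of the way to the next fix
    return lat0 + f * (lat1 - lat0), lon0 + f * (lon1 - lon0)

# Two fixes a minute apart; a photo taken 30 seconds in lands midway.
track = [(1000, 37.0, -122.0), (1060, 37.0006, -122.0008)]
midpoint = locate(track, 1030)
```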

The world isn’t perfect. I took a lot of photos before I got a GPS device. Sometimes the device’s batteries give out or I forget to turn it on. There may be time-zone and daylight-saving-time issues that make nonsense of the date-and-time link between the photo and the location, and I need to find and fix them.

Even if I manage to bring perfection to my existing data sets, I had better design a smooth-enough workflow for my future photos that I’ll actually keep up, and of course I’ll need to correct future mistakes and omissions.

Further, tagging the photos really becomes useful if I put them on maps, so I need a tool to make located map icons that point to the photos. I already tag my photos with broad categories, and I’d like to be able to vary the icon according to the category, or to sort the categories into separate map layers.

These are the basic goals. I’ll be blogging more as I design and build this thing.

Posted in Geographic, Geotagging, Personal information

Finding data, not just information, on the Web

Matthew Hurst at Data Mining points to his experimental site d8taplex, for exploring data sets found on the web. So far the site has only 50,000 data sets from a few countries, with a limited set of visualizations and analyses available. He’s working on visual design, not on scaling up to full web search.

The experiment looks promising, and I hope he finds a way to hook up the high-quality graphics to a good indexing of available data sets Web-wide.

Posted in Data search, Information visualization

Martin Odersky founds Scala-based startup

Peter Delevett reports in the San Jose Mercury News that Martin Odersky (with whom I did my Ph.D. research at Yale) is starting up a company to serve the Scala programming language, which he developed.

[via TechMeme]

Posted in Uncategorized

Slower than paper

Palo Alto’s Peninsula Creamery operates two restaurants with identical menus. The one downtown uses traditional paper order pads; the one at the Stanford Shopping Center uses a bulky portable electronic gadget.

Yesterday I ate at the shopping center branch, and I noticed the server hunting for the button as I reeled off options (“wheat toast”, “hash browns”). The process was distinctly slower than with paper.

The custom of using selection lists in computer interfaces derives from work showing that it’s easier for humans to select from a list than to recall a command, but maybe both are slower than having the server scrawl the brief codes that the diner has been using for years.

Posted in User experience

Economist: Information Overload

The Economist has a special pull-out section on information overload this week. It’s a useful non-technical overview of where things are.

Posted in Uncategorized

Thoughts on XML namespaces from James Clark

Influential XML personage James Clark has posted a very carefully thought-out essay on XML namespaces.

Everybody loves to bash XML namespaces, but this essay is the most careful and dispassionate I have seen to date. You should read the whole thing if you deal with XML (or even if you just like to read the product of clear thinking).

Here’s a key quote:

I would claim that the aspect of XML Namespaces that causes pain is the URI/prefix duality: the thing that occurs in the document (the prefix + local name) is not the same as the thing that is semantically significant (the namespace URI + local name). As soon as you accept this duality, I believe you are doomed to a significant extra layer of complexity.
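The duality is easy to see in any namespace-aware parser. A small demonstration with Python’s standard ElementTree (the namespace URI is made up): two documents that differ only in prefix parse to identical expanded names.

```python
import xml.etree.ElementTree as ET

# Two documents that differ only in prefix: "a:title" vs. "b:title".
doc1 = '<a:title xmlns:a="http://example.org/ns">X</a:title>'
doc2 = '<b:title xmlns:b="http://example.org/ns">X</b:title>'

e1, e2 = ET.fromstring(doc1), ET.fromstring(doc2)

# The prefix is gone after parsing; what remains is the semantically
# significant pair Clark describes: namespace URI + local name.
print(e1.tag)            # {http://example.org/ns}title
assert e1.tag == e2.tag  # the two documents name the same element type
```

The parser throws the prefix away precisely because it is not the semantically significant thing, which is Clark’s point: the document-level name and the semantic name are different objects.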

Posted in Uncategorized

New York Times article on mining local government data

The New York Times has an article on mining local-government data for unforeseen purposes.

Nothing new here, but its being in the Times means my mom reads about it, and yours might too.

Posted in Uncategorized

Dueling GPSes: Garmin vs. Android G1

For the last 2.5 years I’ve been using a Garmin GPSmap 60CSx device to record my travels. It’s not bad overall, but sometimes if I don’t replace the micro-SD card just right it doesn’t record my tracks.

Recalling that my Android G1 phone has GPS, I installed MyTracks from the online Android app store.

Yesterday I took both devices for a spin together, and they recorded the same paths (within a few feet), with more differences while walking or indoors than while driving. Today I tried a comparison bike ride, which the Android app flunked by only seeing satellites intermittently from the pocket where I was carrying it. I’ll have to try again with the phone away from my body.
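“Within a few feet” can be quantified by pairing up simultaneous fixes from the two tracks and taking the worst haversine gap. A sketch with invented sample points (Earth’s radius in feet is approximate):

```python
import math

def haversine_ft(lat1, lon1, lat2, lon2):
    """Great-circle distance between two fixes, in feet."""
    R = 20_902_000  # Earth's mean radius in feet, roughly
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2)
         * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))

# Fixes from the two devices at the same moments (made-up coordinates,
# differing by about 0.0001 degrees -- a few tens of feet):
garmin = [(37.4419, -122.1430), (37.4430, -122.1410)]
g1     = [(37.4419, -122.1431), (37.4431, -122.1410)]
worst = max(haversine_ft(a, b, c, d)
            for (a, b), (c, d) in zip(garmin, g1))
```

Real logs would first need the two tracks resampled onto a common time base, since the devices don’t record fixes at the same instants.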

Assuming that works, the real difference between the two devices is cultural.

The Garmin is a classic device from a hardware company. It’s marketed to runners, bicyclists, boaters, hikers, and other outdoorists. The battery compartment, display, and data ports are protected by rubber fittings: you can use it in the rain. On the other hand, the software sucks, and it’s hooked to a business model that makes maps from Garmin the only ones you can install.

The G1 is an open smartphone with a GPS receiver in it. The MyTracks software is pretty good, and the mapping comes from Google Maps on the device. This is the classic computer-platform style: the platform vendor stands back and lets third parties compete to make the best app in each niche. On the other hand, the hardware is Just a Phone: as suggested above, I have my doubts about the antenna, and there’s certainly no special provision for the elements.

Historical evidence from analogous market situations strongly suggests that the open platform model will win. Hardware-company culture will suck up energy by trying to segment the market with too many SKUs (have a look through Garmin’s web site to see what I mean), and software improvement will always be an afterthought as they try to move boxes through channels as diverse as Best Buy, Crutchfield, REI, and auto parts stores.

Android GPS software vendors, on the other hand, have only one channel to worry about: the app store, where all they need to state clearly is which devices their apps run on. This frees them to concentrate on acquiring and making interesting use of GPS data. They have no physical boxes to move (the hardware vendors take that risk).

On the hardware side, the niche for rain-resistance can be satisfied by accessory makers making rubber sleeves for entire devices or by hardware vendors wrapping rugged housings around electronics designed and built by another company that knows how. In neither case does the hardware outfit need to know that the customer wants to do GPS: it’s sufficient just to know that they anticipate getting rained on for whatever reason.

The same arguments apply to iPhone, of course, but I don’t have one of those to write about.

Posted in Uncategorized