Information in Rotation

Personal Geotagging: Data wrangling

Dan Rabin — Sun, 01 Nov 2015 20:21:20 +0000

I recently learned that the process of cleaning datasets so that they can really be used is called “data wrangling”.

At first, I thought that the main data wrangling task in personal geotagging was going to be cleaning the GPS data. Sometimes the GPS is too coarse, sometimes it’s off because not enough satellites were in view, sometimes the GPS was just not turned on when I took the photo.

But, in the course of resetting my cameras’ clocks to standard time today, I realized that I’m going to have to wrangle the timestamps on the photos as well. I had thought that one-hour increments due to daylight savings and time-zone travel would be the main problem, but now I see that the minutes matter, too. Camera clocks don’t sync with the network or with GPS (at least not on my relatively ancient cameras), and I can move a significant distance in the three or seven minutes by which the clock has drifted away from GPS time.

I’m going to have to be able to apply short time offsets to the timestamps of all photos taken by a given camera on a given day. Setting the correction is going to require precisely recognizing at least one key location in each batch of photos in order to compute the offset.
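Here is a minimal sketch of that correction in Python, assuming the photo timestamps are already loaded as `datetime` values. The function names and the reference times are made up for illustration; reading and writing the actual photo metadata is left out.

```python
from datetime import datetime, timedelta


def clock_offset(camera_time: datetime, gps_time: datetime) -> timedelta:
    """Return the amount to add to camera timestamps so they match GPS time."""
    return gps_time - camera_time


def correct_batch(timestamps, offset):
    """Apply one offset to every timestamp from one camera on one day."""
    return [t + offset for t in timestamps]


# Hypothetical reference shot: the camera stamped it 14:03:10, but the GPS
# track shows I reached that recognizable spot at 14:06:25.
reference_camera = datetime(2015, 10, 31, 14, 3, 10)
reference_gps = datetime(2015, 10, 31, 14, 6, 25)
offset = clock_offset(reference_camera, reference_gps)

# Hypothetical batch of timestamps from the same camera and day.
batch = [
    datetime(2015, 10, 31, 14, 3, 10),
    datetime(2015, 10, 31, 15, 47, 0),
]
corrected = correct_batch(batch, offset)
```

A single offset per camera per day assumes the clock doesn't drift noticeably within the day, which seems safe when the drift is a few minutes accumulated over months.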

That’s for past photos. For the future, I can get in the habit of either correcting the clock frequently or of snapping a reference shot with my cellphone along with each set of camera shots.

Space-time is tricky.

Personal geotagging: note 1

Dan Rabin — Thu, 24 Sep 2015 05:06:34 +0000

The common item that lets you find a photo’s location in a GPS track is the time it was taken. In database terminology, time is the join column.

Since I’m starting with photos that are already timestamped, I want to be able to look up their timestamps in some sort of GPS database to see whether there’s a recorded time/location event near that time. Unfortunately, my initial perusal of a couple of geo-extensions to well-known databases (SpatiaLite, PostGIS) shows a generous helping of geometry-oriented query capabilities but no obvious time-query capabilities.

It could be that time-indexing is so simple compared with spatial indexing that what I’m looking for is just swamped in the documentation, or it could be that I’ll have to dig some more.
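To make the lookup concrete, here is a rough in-memory sketch of the "join on time" idea in Python. It does not use SpatiaLite or PostGIS; it assumes the track is already loaded as a time-sorted list of `(time, lat, lon)` tuples, and the function name, the 30-second tolerance, and the sample coordinates are all invented for illustration.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def nearest_fix(track, when, tolerance=timedelta(seconds=30)):
    """Return the fix closest in time to `when`, or None if none is close enough.

    `track` must be a list of (time, lat, lon) tuples sorted by time.
    """
    # Rebuilt on every call for simplicity; cache this list for real use.
    times = [fix[0] for fix in track]
    i = bisect_left(times, when)
    # The closest fix is either just before or just after the insertion point.
    candidates = track[max(i - 1, 0):i + 1]
    if not candidates:
        return None
    best = min(candidates, key=lambda fix: abs(fix[0] - when))
    return best if abs(best[0] - when) <= tolerance else None


# Hypothetical track with one fix every ten seconds.
track = [
    (datetime(2015, 9, 23, 10, 0, 0), 37.4419, -122.1430),
    (datetime(2015, 9, 23, 10, 0, 10), 37.4421, -122.1432),
    (datetime(2015, 9, 23, 10, 0, 20), 37.4423, -122.1435),
]
photo_time = datetime(2015, 9, 23, 10, 0, 13)
match = nearest_fix(track, photo_time)
```

The binary search does the same job an ordinary ordered index on a timestamp column would do in a database, which may be why the spatial documentation doesn't dwell on it.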

Personal data: geotagging photos

Dan Rabin — Wed, 02 Sep 2015 06:54:20 +0000

I have over 50,000 photos I’d like to tag with the locations where they were taken.

In a perfect world, this would be a relatively simple matter of looking up each photo’s creation date and time, finding that date and time in my years of GPS logs, and associating my GPS location for that date and time with the photo.

The world isn’t perfect. I took a lot of photos before I got a GPS device. Sometimes the device’s batteries give out or I forget to turn it on. There may be time zone and daylight-savings issues that make nonsense of the date-and-time link between the photo and the location, and I need to find and fix them.

Even if I manage to bring perfection to my existing data sets, I had better design a smooth-enough workflow for my future photos that I’ll actually keep up, and of course I’ll need to correct future mistakes and omissions.

Further, tagging the photos really becomes useful if I put them on maps, so I need a tool to make located map icons that point to the photos. I already tag my photos with broad categories, and I’d like to be able to vary the icon according to the category, or to sort the categories into separate map layers.
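One possible shape for the map data, sketched in Python: group the located photos into one GeoJSON FeatureCollection per category, so that each category can become its own layer or get its own icon. GeoJSON is my assumption here, not a decision; the function name, the dictionary keys, and the sample photos are hypothetical.

```python
import json
from collections import defaultdict


def photos_to_layers(photos):
    """Group located photos into one GeoJSON FeatureCollection per category."""
    layers = defaultdict(lambda: {"type": "FeatureCollection", "features": []})
    for photo in photos:
        feature = {
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {
                "type": "Point",
                "coordinates": [photo["lon"], photo["lat"]],
            },
            "properties": {"path": photo["path"], "category": photo["category"]},
        }
        layers[photo["category"]]["features"].append(feature)
    return dict(layers)


# Hypothetical photos that have already been tagged with a location.
photos = [
    {"path": "2015/img_0001.jpg", "category": "birds", "lat": 37.4, "lon": -122.1},
    {"path": "2015/img_0002.jpg", "category": "architecture", "lat": 37.5, "lon": -122.2},
    {"path": "2015/img_0003.jpg", "category": "birds", "lat": 37.6, "lon": -122.3},
]
layers = photos_to_layers(photos)
birds_geojson = json.dumps(layers["birds"], indent=2)
```

Keeping the category in each feature's properties leaves both options open: a map tool can either load each collection as a separate layer or merge them and pick the icon from the property.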

These are the basic goals. I’ll be blogging more as I design and build this thing.

Finding data, not just information, on the Web

Dan Rabin — Tue, 17 May 2011 06:10:58 +0000

Matthew Hurst at Data Mining points to his experimental site, d8taplex, a site for exploring data sets found on the web. The site currently has only 50,000 data sets from a few countries, with a limited set of visualizations and analyses available. He’s working on visual design, not on scaling up to full web search.

The experiment looks promising, and I hope he finds a way to hook up the high-quality graphics to a good indexing of available data sets Web-wide.

Martin Odersky founds Scala-based startup

Dan Rabin — Thu, 12 May 2011 16:07:49 +0000

Peter Delevett reports in the San José Mercury-News that Martin Odersky (with whom I did my Ph.D. research at Yale) is starting up a company to serve the Scala programming language, which he developed.

[via TechMeme]

Slower than paper

Dan Rabin — Mon, 09 May 2011 20:01:57 +0000

Palo Alto’s Peninsula Creamery operates two restaurants with identical menus. The one downtown uses traditional paper order pads; the one at the Stanford Shopping Center uses a bulky portable electronic gadget.

Yesterday I ate at the shopping center branch, and I noticed the server hunting for the button as I reeled off options (“wheat toast”, “hash browns”). The process was distinctly slower than with paper.

The custom of using selection lists in computer interfaces derives from work showing that it’s easier for humans to select from a list than to remember a command, but maybe they’re both slower than having the server scrawl the brief code that the diner has been using for years.

Economist: Information Overload

Dan Rabin — Wed, 03 Mar 2010 18:33:38 +0000

The Economist has a special pull-out section on information overload this week. It’s a useful non-technical overview of where things are.

Thoughts on XML namespaces from James Clark

Dan Rabin — Sat, 02 Jan 2010 09:06:34 +0000

Influential XML personage James Clark has posted a very carefully thought-out essay on XML namespaces.

Everybody loves to bash XML namespaces, but this essay is the most careful and dispassionate I have seen to date. You should read the whole thing if you deal with XML (or even if you just like to read the product of clear thinking).

Here’s a key quote:

I would claim that the aspect of XML Namespaces that causes pain is the URI/prefix duality: the thing that occurs in the document (the prefix + local name) is not the same as the thing that is semantically significant (the namespace URI + local name). As soon as you accept this duality, I believe you are doomed to a significant extra layer of complexity.
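A small illustration of that duality (mine, not from Clark’s essay), using Python’s standard `xml.etree.ElementTree`. The two documents and the `http://example.com/ns` URI are made up; they differ only in the prefix they bind to the same namespace URI.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents: different prefixes, same namespace URI.
doc_a = '<a:book xmlns:a="http://example.com/ns"><a:title>X</a:title></a:book>'
doc_b = '<b:book xmlns:b="http://example.com/ns"><b:title>X</b:title></b:book>'

root_a = ET.fromstring(doc_a)
root_b = ET.fromstring(doc_b)

# The parser reports the semantically significant name (URI + local name),
# not the prefix that appeared in the text.
tags_a = [root_a.tag] + [child.tag for child in root_a]
tags_b = [root_b.tag] + [child.tag for child in root_b]

print(root_a.tag)        # {http://example.com/ns}book
print(tags_a == tags_b)  # True
```

The text of the two documents is different, yet a namespace-aware parser treats their element names as identical. Anything that works on the text level (search, diff, copy and paste) has to cope with that gap.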

New York Times article on mining local government data

Dan Rabin — Mon, 07 Dec 2009 04:54:29 +0000

The New York Times has an article on mining local-government data for unforeseen purposes.

Nothing new here, but its being in the Times means my mom reads about it, and yours might too.

Dueling GPSes: Garmin vs. Android G1

Dan Rabin — Mon, 07 Dec 2009 04:01:49 +0000

For the last 2.5 years I’ve been using a Garmin GPSmap 60CSx device to record my travels. It’s not bad overall, but sometimes if I don’t replace the micro-SD card just right it doesn’t record my tracks.

Recalling that my Android G1 phone has GPS, I installed MyTracks from the online Android apps store.

Yesterday I took both devices for a spin together, and they recorded the same paths (within a few feet), with more differences while walking or indoors than while driving. Today I tried a comparison bike ride, which the Android app flunked by only seeing satellites intermittently from the pocket where I was carrying it. I’ll have to try again with the phone away from my body.

Assuming that works, the real difference between the two devices is cultural.

The Garmin is a classic device from a hardware company. It’s marketed to runners, bicyclists, boaters, hikers, and other outdoorists. The battery compartment, display, and data ports are protected by rubber fittings: you can use it in the rain. On the other hand, the software sucks, and it’s hooked to a business model that makes maps from Garmin the only ones you can install.

The G1 is an open smartphone with a GPS receiver in it. The MyTracks software is pretty good, and the mapping comes from Google Maps on the device. This is the classic computer platform style: the platform vendor stands back and lets third parties compete to make the best app in each niche. On the other hand, the hardware is Just a Phone: as suggested above, I have my doubts about the antenna, and there’s certainly no special provision for the elements.

Historical evidence from analogous market situations strongly suggests that the open platform model will win. Hardware-company culture will suck up energy by trying to segment the market with too many SKUs (have a look through Garmin’s web site to see what I mean), and software improvement will always be an afterthought as they try to move boxes through channels as diverse as Best Buy, Crutchfield, REI, and auto parts stores.

Android GPS software vendors, on the other hand, only have one channel to worry about: the app store. They only have to worry about being able to clearly state which devices their apps run on. This frees them to concentrate on acquiring and making interesting use of GPS data. They have no physical boxes to move (the hardware vendors take that risk).

On the hardware side, the niche for rain-resistance can be satisfied by accessory makers making rubber sleeves for entire devices or by hardware vendors wrapping rugged housings around electronics designed and built by another company that knows how. In neither case does the hardware outfit need to know that the customer wants to do GPS: it’s sufficient just to know that they anticipate getting rained on for whatever reason.

The same arguments apply to iPhone, of course, but I don’t have one of those to write about.

Good talk on scaling data services

Dan Rabin — Wed, 04 Nov 2009 00:42:58 +0000

Werner Vogels of Amazon talks about availability and consistency at their kind of scale. What he says resonates with my experience from working at another big Web company.

¡Data Liberation, si!

Dan Rabin — Mon, 14 Sep 2009 20:03:22 +0000

Google [disclosure: a former employer] introduces the Data Liberation Front to ensure that all its services have simple data export functionality.

Brad Fitzpatrick says:

What does product liberation look like? Said simply, a liberated product is one which has built-in features that make it easy (and free) to remove your data from the product in the event that you’d like to take it elsewhere.

Note the emphasis on free and easy. Online services tend to stop at possible, which falls short of easy by the proverbial Simple Matter of Programming.

Facebook, for example, has a programming interface for each kind of information you store that lets a Facebook application extract it into a file. It’s possible for me to write an application that does so for everything I have on Facebook, but that isn’t the same as having a predefined procedure that archives my entire Facebook presence into a standard file format that most social networking sites can import from.

Good luck to Google in this effort! I myself would go further in asking that the entire behavior (as well as data) of my social-network presence be standardizable and portable, as Ramón Cáceres [disclosure: personal friend] and his collaborators are proposing with their work on virtual individual servers.

Genome data at NCBI

Dan Rabin — Sun, 13 Sep 2009 23:27:47 +0000

The National Center for Biotechnology Information (U.S.) has a nice online viewer for the genomes of many organisms, including Homo sapiens.

The human genome has just over 3 billion base pairs in about 25 thousand genes. This is a large enough data set that it gets algorithms of its own.

Geoff Nunberg on Google Books metadata

Dan Rabin — Thu, 03 Sep 2009 16:39:30 +0000

Linguist Geoff Nunberg comments on the poor general quality of metadata in Google Books, and why that’s a problem.

It’s a tough problem: if you do things (like scanning entire libraries) at Google-scale, you just can’t pay attention to the details. One partial way out (which Geoff mentions) is to allow users to submit corrections, as Google Maps does for positions of placemarks.

The article addresses a number of important points about the provenance and usefulness of metadata, and Google employees provide some great comments and discussion.

(Via Brad DeLong).

Making public data APIs is a business now

Dan Rabin — Mon, 06 Jul 2009 14:50:59 +0000

Jon Udell blogs about a company that builds interfaces to public-sector data.

Udell points out, quite rightly, that

“Give us the data” is an easy slogan to chant. And there’s no doubt that much good will come from poking through what we are given. But we also need to have ideas about what we want the data for, and communicate those ideas to the providers who are gathering it on our behalf.