New York City government seeks data miners

Written by Dan Rabin on June 29th, 2009

Sewell Chan and Patrick McGeehan report today in the New York Times that the New York City government is out to make its piles of public data actually usable:

In what is planned to become an annual competition known as NYC Big Apps, the city will make available about 80 data sets from 32 city agencies and commissions. The winners of the competition will get a cash prize, recognition at a dinner with the mayor, and marketing opportunities.

One has to be wary of competitions: they can be a way of trying to get some work for free, or a sign that the project doesn’t have realistic funding behind it. On the other hand, this one suggests that the sponsor wants to tap a wider range of imagination than it would get with the usual contracting process.

Dinner with the mayor!

Hat tip: "tootie" at reddit.

Disk drive reliability in detail

Written by Dan Rabin on June 29th, 2009

I tend to get abstract and philosophical about data here, and it’s good to have an occasional splash in the cold water of how the stuff gets stored.

Jon Elerath’s article on hard disk reliability in the June 2009 Communications of the ACM (may require ACM login) gives a lot of detail about the different kinds of disk storage error and appropriate countermeasures. He gets down to the level of things that scratch vs. things that smear (kind of like H.P. Lovecraft for the archivally-minded).

The main takeaway is that there are two kinds of failure. Obvious crash failures, like the bearing getting wobbly or servo tracks being trashed, make the drive stop working, and you rebuild from something you still trust. Then there are insidious, quiet read/write failures that you can only counteract with a policy of "scrubbing" drives proactively, that is, periodically reading everything back to catch latent errors before you actually need the data.
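To make the scrubbing idea concrete, here is a minimal file-level sketch in Python. It is only an analogy: real scrubbing is normally done at the block or RAID level by the storage system, not by a script like this. The function names (`file_digest`, `build_manifest`, `scrub`) are my own inventions for illustration. The idea is to record a checksum for every file while the data is still trusted, then periodically re-read everything and compare.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Record a digest for every regular file under root (hypothetical helper)."""
    root = Path(root)
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def scrub(root, manifest):
    """Re-read every file in the manifest; return names that are missing or changed."""
    root = Path(root)
    damaged = []
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file() or file_digest(path) != expected:
            damaged.append(name)
    return damaged
```

Anything `scrub` reports would then be restored from a copy you still trust. Note that a sketch like this cannot tell quiet corruption from a legitimate edit, so it only makes sense for data that is supposed to stay put, such as an archive.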

Now that I buy storage by the 1.5 terabytes/spindle, I really should do something to dissuade Those Whose Names Are Random from assimilating my data to the Outer Abyss of Maximum Entropy. If you should happen to find some of your goats missing, better not to ask…

The two cultures

Written by Dan Rabin on June 24th, 2009

Jon Stokes has an excellent description of the two contrasting philosophies of information management in his comparison of the Palm Pre and the iPhone.

He names the two approaches “structure-and-browse” and “collect-and-query”. I feel like I’ve been groping for these terse descriptions for years!

OpenStreetMap constructs maps from GPS tracks!

Written by Dan Rabin on February 21st, 2007

Sources and uses of digital information are in-scope for this blog, and a great example just showed up in my RSS reader today.

OpenStreetMap is a wiki-like project to build a world map using contributed GPS tracks [OpenGeoData pointed me there]. Their map of Baghdad is here.

This project is truly a product of the early 21st century: it requires GPS satellites, cheap but accurate GPS receivers, the World Wide Web, inexpensive computers with fast color graphics, and so forth.

And like all modern geographic applications, it also exploits a special property of GPS’s information domain: everyone agrees on the meaning of geographical location; only dates and times have a similar level of standardization. In relational-database terminology, this means that any table with a date or location column has a meaningful join with any other table that has one.
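Here is a small sketch of what such a join looks like, using Python's built-in `sqlite3` module. The two tables and all of their contents are made up for illustration; the only thing they share is a date column, and that is enough to combine them.

```python
import sqlite3

# Two unrelated, hypothetical data sets that happen to share a date column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE street_repairs (day TEXT, street TEXT);
    CREATE TABLE rainfall (day TEXT, millimetres REAL);
    INSERT INTO street_repairs VALUES
        ('2007-02-19', 'Elm St'),
        ('2007-02-21', 'Oak Ave');
    INSERT INTO rainfall VALUES
        ('2007-02-19', 12.5),
        ('2007-02-20', 0.0),
        ('2007-02-21', 3.0);
""")

# Join on the shared date: how much did it rain on each repair day?
rows = conn.execute("""
    SELECT r.day, r.street, w.millimetres
    FROM street_repairs AS r
    JOIN rainfall AS w ON w.day = r.day
    ORDER BY r.day
""").fetchall()

for row in rows:
    print(row)
```

In practice the two sides still have to agree on granularity and time zone for dates, and location joins usually need some tolerance instead of exact equality, but the point stands: no shared ID scheme is required.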

This doesn’t work with most data. I’ve had driver’s licenses in four U.S. states, but you can’t aggregate my driving record from the state records because they all use different ID numbering schemes (nice for my privacy in this case).

Also noteworthy is the fact that GPS information can be used to put a time dimension into maps, since we can tell when the street is used as well as where it is. There are some very pretty examples at Cabspotting.

Information Patterns: series introduction

Written by Dan Rabin on February 2nd, 2007

Every time a new data format spec hits my inbox, I get a little twinge of dread.

Such documents are often enormous. They’re written in standardese (often badly). They’re usually written by committees. They go through a maze of twisty little revisions, all different.

But worst of all, they often bury their novelty in a sea of details that resemble those in the last spec I reviewed.

I’d like to do for data formats and other information representations what the Gang of Four book does for programs: call out and label the patterns that come up over and over again so that I can classify details into bigger chunks for mental processing.

You can expect to see several different kinds of post in this series:

  • Case studies. I have to look at lots of actual data formats in order to discern the patterns!
  • Data format patterns. Most posts will be about patterns I find in data formats…
  • Information usage patterns. …but some posts will be about how information is generated, stored, and used.
  • Other. I’ll probably think of some other topics as well.

I expect to look at simple cases, such as comma-separated values, as well as fiendishly complex cases, such as PDF. Programming-language syntaxes are fair game; database index disk structures are right out. In between, I’ll draw the boundary as interest dictates.

This series will be open-ended as long as people keep inventing data formats faster than I can look at them.

Quick addendum about the Chandler Repository

Written by Dan Rabin on January 31st, 2007

If my brief article about the Chandler Repository caught your interest, you might want to check out Andi’s blog, in which he discusses some of the design and implementation issues.

Accounts of software design from the time of implementation are incredibly valuable. Maybe someday we’ll have a search tool that lets you enter your implementation ideas and gives you back the accounts of people who have tried them before.

The Chandler Repository

Written by Dan Rabin on January 30th, 2007

I spent a few hours yesterday in the company of Andi Vajda, lead developer of the data repository component of the Open Source Applications Foundation’s Chandler project. We talked about the technical details of the repository.

The Chandler repository is an object database with some interesting design features:

  • All links between objects are bidirectional. Andi’s main motivation for this choice was to be able to guarantee to clients that references aren’t dangling without having to implement a garbage collector.
  • All objects have universally unique identifiers (UUIDs) to make equality testing trivial.
  • Repositories are versioned. Aside from the usual rollback capabilities this enables, client code can inspect a particular object as it was in any extant version.
  • In addition to a conventional notion of class inheritance, the Chandler repository supports a notion of “cloud inheritance” for merging data schemas when copying objects from one database to another.
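To illustrate the first two points, here is a toy Python sketch of objects with UUID identity and bidirectional links. This is not Chandler's actual API; the `Item` class and its methods are hypothetical, and the real repository is surely more involved. The sketch only shows why mirrored links make dangling references avoidable without a garbage collector: an object being deleted can follow its own links to find everyone who refers to it.

```python
import uuid


class Item:
    """A toy object with a UUID identity and bidirectional links."""

    def __init__(self, name):
        self.uuid = uuid.uuid4()
        self.name = name
        # Maps uuid -> Item; every entry is mirrored on the other side.
        self.links = {}

    def __eq__(self, other):
        # Equality is just UUID comparison, whatever else the object holds.
        return isinstance(other, Item) and self.uuid == other.uuid

    def __hash__(self):
        return hash(self.uuid)

    def link(self, other):
        """Link two items; both ends record the connection."""
        self.links[other.uuid] = other
        other.links[self.uuid] = self

    def delete(self):
        """Detach this item by clearing the mirrored entry on every linked item."""
        for other in list(self.links.values()):
            del other.links[self.uuid]
        self.links.clear()


a, b, c = Item("a"), Item("b"), Item("c")
a.link(b)
a.link(c)
a.delete()
# Neither b nor c still refers to a, and no collector had to go looking.
```

The cost of this design is that every link takes two updates instead of one, which seems a reasonable trade for the guarantee.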

The repository layer is cleanly separated from the layer of Chandler that knows the semantics of calendar items, tasks, and so forth. That layer is responsible for mapping its items to repository items. Interchange of personal information is carried out at this higher layer, not by the repository. There’s an interesting bit of Python metaprogramming by Philip Eby that hides some of the mapping complexity from the application-level programmer.

I’m going to have a look at the source to see how this works. There are popular application-building tools that carry out object/relational mappings (moving between the objects that are convenient to use in code and the representations that are convenient in the usual relational database), but this is the first time I’ve come face-to-face with an object/object mapping tool. I’ll report back when I’m better informed.

[Thanks to Andi for fact-checking this post!]

Scott Rosenberg’s Dreaming in Code [book pointer]

Written by Dan Rabin on January 30th, 2007

Scott Rosenberg’s Dreaming in Code is the best journalistic portrayal of software development that I’ve ever read.

The romantic cliché of the lone introverted genius shaping masterpieces through many midnights of unfathomable incantations is mercifully absent. Rosenberg follows the Open Source Applications Foundation’s Chandler project through several years of development, from initial impetus to its milestone 0.6 release. We see the process as it actually is: as a highly social undertaking in which people pass through the project, and the project passes through people’s lives. The developers have families, pets, outside interests; they also have passions (often conflicting) about technology and the process of creation.

Dreaming in Code is much more than a simple chronicle: Rosenberg delves deeply into the history of software development and the frustration it causes for its participants and customers as the results never seem to improve even as the underlying hardware undergoes the most rapid progress of any technology ever.

Issues of data representation, storage, and synchronization are front and center in Dreaming in Code, all carefully explained by the author in terms that make sense to the non-practitioner while remaining recognizable to us professionals (he’s really, really good at this).

I might give this book to my mom to read.

[Disclosure: I've known Andi Vajda, one of the developers portrayed in the book, for about twenty years, and count him as a friend.]