Information in Rotation

Good talk on scaling data services

Posted on 2009-11-03 by Dan Rabin

Werner Vogels of Amazon talks about availability and consistency at their kind of scale. What he says resonates with my experience from working at another big Web company.

Posted in Uncategorized

¡Data Liberation, si!

Posted on 2009-09-14 by Dan Rabin

Google [disclosure: a former employer] introduces the Data Liberation Front to ensure that all its services have simple data export functionality.

Brad Fitzpatrick says:

What does product liberation look like? Said simply, a liberated product is one which has built-in features that make it easy (and free) to remove your data from the product in the event that you’d like to take it elsewhere.

Note the emphasis on free and easy. Online services tend to stop at possible, which falls short of easy by the proverbial Simple Matter of Programming.

Facebook, for example, has a programming interface for each kind of information you store that lets a Facebook application extract it into a file. It’s possible for me to write an application that does so for everything I have on Facebook, but that isn’t the same as having a predefined procedure that archives my entire Facebook presence into a standard file format that most social networking sites can import from.

Good luck to Google in this effort! I myself would go further in asking that the entire behavior (as well as data) of my social-network presence be standardizable and portable, as Ramón Cáceres [disclosure: personal friend] and his collaborators are proposing with their work on virtual individual servers.

Posted in Uncategorized

Genome data at NCBI

Posted on 2009-09-13 by Dan Rabin

The National Center for Biotechnology Information (U.S.) has a nice online viewer for the genomes of many organisms, including Homo sapiens.

The human genome has just over 3 billion base pairs in about 25 thousand genes. This is a large enough data set that it gets algorithms of its own.

Posted in Areas of application, Biology

Geoff Nunberg on Google Books metadata

Posted on 2009-09-03 by Dan Rabin

Linguist Geoff Nunberg comments on the poor general quality of metadata in Google Books, and why that’s a problem.

It’s a tough problem: if you do things (like scanning entire libraries) at Google-scale, you just can’t pay attention to the details. One partial way out (which Geoff mentions) is to allow users to submit corrections, as Google Maps does for positions of placemarks.

The article addresses a number of important points about the provenance and usefulness of metadata, and Google employees provide some great comments and discussion.

(Via Brad DeLong).

Posted in Information Philosophy, Information usage patterns, Metadata | Tagged Metadata

Making public data APIs is a business now

Posted on 2009-07-06 by Dan Rabin

Jon Udell blogs about a company that builds interfaces to public-sector data.

Udell points out, quite rightly, that

“Give us the data” is an easy slogan to chant. And there’s no doubt that much good will come from poking through what we are given. But we also need to have ideas about what we want the data for, and communicate those ideas to the providers who are gathering it on our behalf.

Posted in Data and society

New York City government seeks data miners

Posted on 2009-06-29 by Dan Rabin

Sewell Chan and Patrick McGeehan report today in the New York Times that the New York City government is out to make its piles of public data actually usable:

In what is planned to become an annual competition known as NYC Big Apps, the city will make available about 80 data sets from 32 city agencies and commissions. The winners of the competition will get a cash prize, recognition at a dinner with the mayor, and marketing opportunities.

One has to be wary of competitions: they can be a way of trying to get some work for free, or a sign that the project doesn’t have realistic funding behind it. On the other hand, it shows that the sponsor wants to tap a wider range of imagination than it would get with the usual contracting process.

Dinner with the mayor!

Hat tip: "tootie" at reddit.

Posted in Data and society

Disk drive reliability in detail

Posted on 2009-06-29 by Dan Rabin

I tend to get abstract and philosophical about data here, and it’s good to have an occasional splash in the cold water of how the stuff gets stored.

Jon Elerath’s article on hard disk reliability in the June 2009 Communications of the ACM (may require ACM login) gives a lot of detail about the different kinds of disk storage error and appropriate countermeasures. He gets down to the level of things that scratch vs. things that smear (kind of like H.P. Lovecraft for the archivally-minded).

The main takeaway is that there are two classes of failure. Obvious crash failures, like a bearing getting wobbly or servo tracks being trashed, make the drive stop working, and you rebuild from something you still trust. Then there are insidious, quiet read/write failures that you can only counteract with a policy of proactively "scrubbing" drives.
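To make the scrubbing idea concrete, here is a minimal file-level sketch in Python. It is only an analogue: the scrubbing discussed in the article happens at the drive or array level, whereas this version records a checksum for every file and later re-reads everything to catch silent changes. The function names (sha256_of, build_manifest, scrub) and the choice of SHA-256 are my own assumptions, not anything from the article.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Record a checksum for every file under root (hypothetical helper)."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            manifest[os.path.relpath(path, root)] = sha256_of(path)
    return manifest


def scrub(root, manifest):
    """Re-read every recorded file; return the ones that are missing or changed."""
    damaged = []
    for rel_path, expected in sorted(manifest.items()):
        path = os.path.join(root, rel_path)
        if not os.path.exists(path) or sha256_of(path) != expected:
            damaged.append(rel_path)
    return damaged
```

The point of running scrub on a schedule is that a quiet failure is found while a trusted copy still exists to rebuild from, rather than at the moment you finally need the file.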

Now that I buy storage by the 1.5 terabytes/spindle, I really should do something to dissuade Those Whose Names Are Random from assimilating my data to the Outer Abyss of Maximum Entropy. If you should happen to find some of your goats missing, better not to ask…

Posted in Storing data

The two cultures

Posted on 2009-06-24 by Dan Rabin

Jon Stokes has an excellent description of the two contrasting philosophies of information management in his comparison of the Palm Pre and the iPhone.

He names the two approaches “structure-and-browse” and “collect-and-query”. I feel like I’ve been groping for these terse descriptions for years!

Posted in Information Philosophy, Information usage patterns

Chris Anderson: One size metadata doesn’t fit all

Posted on 2007-03-21 by Dan Rabin

Misfits of Metadata

Chris Anderson of The Long Tail has an important post about how the metadata used in some music-listening applications doesn’t satisfy the listeners’ needs:

[...] classical is a genre that the one-size-fits-all music aggregators such as iTunes don’t handle particularly well. They’re oriented around pop music, with its artist, album, track data format. Meanwhile classical music organizes around composer, conductor, performer, soloist

He also voices my exact peeve about how jazz is treated:

However, neither of them does a very good job with Jazz, where the individual musicians are often more meaningful than the band.

Yup. No reasonable cataloguer of jazz recordings separates “Thelonious Monk Trio” from “Thelonious Monk Quartet” from “Thelonious Monk”. At the same time, it’s important to be able to locate all appearances of Thelonious Monk, regardless of whether he was the leader of the session (note that “leader” and “session” are appropriate terms in jazz discography, but not for pop or classical).

When your only tool is a hammer…

I can’t help but wonder if the problems Chris calls out in iTunes come from the poor selection of data tools in most applications programmers’ toolkits. Relational databases, the current orthodox storage technique, favor using one or more tables, each consisting of records having the same selection of attributes. There are hacks you can use to simulate having, say, jazz tracks and pop tracks in the same Tracks relation, but hacks and simulations tend to twist one’s code, so most programmers resist going there.
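One such hack, sketched here with Python's built-in sqlite3 module, is a single wide table in which every genre's attributes become nullable columns. The table layout, column names, and sample rows are illustrative assumptions of mine, not a description of how iTunes or any real application stores its data.

```python
import sqlite3

OPTIONAL_COLUMNS = ["artist", "album", "composer", "conductor", "leader"]


def build_tracks_table():
    """Create one 'tracks' relation that holds pop, classical, and jazz rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE tracks (
            id        INTEGER PRIMARY KEY,
            genre     TEXT NOT NULL,
            title     TEXT NOT NULL,
            artist    TEXT,   -- used by pop rows only
            album     TEXT,   -- used by pop rows only
            composer  TEXT,   -- used by classical rows only
            conductor TEXT,   -- used by classical rows only
            leader    TEXT    -- used by jazz rows only
        )
    """)
    # Placeholder sample data; only the Monk name comes from the post.
    conn.executemany(
        "INSERT INTO tracks (genre, title, artist, album, composer, conductor, leader)"
        " VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            ("pop", "Example Pop Song", "Example Band", "Example Album", None, None, None),
            ("classical", "Example Symphony", None, None, "Example Composer", "Example Conductor", None),
            ("jazz", "Example Trio Tune", None, None, None, None, "Thelonious Monk"),
        ],
    )
    return conn


def count_unused_cells(conn):
    """Count optional cells left NULL because they do not apply to the row's genre."""
    expr = " + ".join(f"({col} IS NULL)" for col in OPTIONAL_COLUMNS)
    (total,) = conn.execute(f"SELECT SUM({expr}) FROM tracks").fetchone()
    return total
```

In this sample, count_unused_cells reports that 10 of the 15 optional cells are empty, and there is still nowhere to put a jazz session's sidemen without adding a second table. That is the kind of twisting the paragraph above has in mind.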

An XML database in every toolbox!

We don’t really have to live this way anymore. With the popularity of XML for data interchange, the tools ecology has given us a variety of XML database systems. The XML data model has the flexibility to represent varying record structures: in fact, it has much more flexibility than we need for the purpose!

Heretical as it may seem to put the cart of an interchange format before the horse of data abstraction, the XML situation is very useful in practice, at least for databases of moderate size. The W3C has come up with the XPath and XML Query specifications that provide excellent query mechanisms for data represented in the XML model. XML Query in particular is designed to look somewhat familiar to the hardened SQL user. There’s data typing taken from the XML Schema datatype recommendation as well.
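As a small illustration of varying record structures, here is a sketch using Python's standard xml.etree.ElementTree module, which implements only a limited subset of XPath (full XPath or XML Query would need a separate engine). The element names, attribute names, and sample tracks are all made up for the example; they are not a proposed schema.

```python
import xml.etree.ElementTree as ET

# Each <track> carries only the fields that make sense for its genre.
CATALOG = """
<catalog>
  <track genre="pop">
    <title>Example Pop Song</title>
    <artist>Example Band</artist>
    <album>Example Album</album>
  </track>
  <track genre="classical">
    <title>Example Symphony</title>
    <composer>Example Composer</composer>
    <conductor>Example Conductor</conductor>
    <soloist>Example Soloist</soloist>
  </track>
  <track genre="jazz">
    <title>Example Trio Tune</title>
    <session leader="Thelonious Monk">
      <musician instrument="piano">Thelonious Monk</musician>
      <musician instrument="bass">Example Bassist</musician>
    </session>
  </track>
  <track genre="jazz">
    <title>Example Sideman Tune</title>
    <session leader="Example Leader">
      <musician instrument="tenor sax">Example Leader</musician>
      <musician instrument="piano">Thelonious Monk</musician>
    </session>
  </track>
</catalog>
"""


def titles_featuring(root, name):
    """Titles of every track on which the named musician appears, leader or not."""
    titles = []
    for track in root.findall("./track"):
        musicians = track.findall(".//musician")
        if any(m.text == name for m in musicians):
            titles.append(track.findtext("title"))
    return titles


def titles_led_by(root, name):
    """Titles of tracks whose session lists the named musician as leader."""
    return [
        track.findtext("title")
        for track in root.findall("./track")
        if track.find("session") is not None
        and track.find("session").get("leader") == name
    ]


root = ET.fromstring(CATALOG.strip())
```

With this catalog, titles_featuring(root, "Thelonious Monk") finds both jazz tracks, while titles_led_by(root, "Thelonious Monk") finds only the trio tune, and the pop and classical records sit in the same collection without any empty placeholder fields.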

Better nails

Anyhow, let’s learn to design with a more flexible hammer, and maybe we’ll be able to hit a wider class of nails, rather than our users’ thumbs!

March is International Runaway Metaphor Month.

Posted in Areas of application, Information Philosophy, Information usage patterns, Metadata, Musical

OpenStreetMap constructs maps from GPS tracks!

Posted on 2007-02-21 by Dan Rabin

Sources and uses of digital information are in scope for this blog, and a great example just showed up in my RSS reader today.

OpenStreetMap is a wiki-like project to build a world map using contributed GPS tracks [OpenGeoData pointed me there]. Their map of Baghdad is here.

This project is truly a product of the early 21st century: it requires GPS satellites, cheap but accurate GPS receivers, the World Wide Web, inexpensive computers with fast color graphics, and so forth.

And like all modern geographic applications, it also exploits a special property of GPS’s information domain: everyone agrees on the meaning of geographical location; only dates and times have a similar level of standardization. In relational-database terminology, this means that any table with a date or location column has a meaningful join with any other.

This doesn’t work with most data. I’ve had driver’s licenses in four U.S. states, but you can’t aggregate my driving record from the state records because they all use different ID numbering schemes (nice for my privacy in this case).
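Here is a minimal sketch of the date-column join described above, using Python's built-in sqlite3 module. The two tables, their column names, and all the values are invented for illustration; the only thing they share is a day column in ISO format.

```python
import sqlite3


def tracks_joined_with_rainfall():
    """Join two unrelated data sets purely on their shared date column."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gps_tracks (day TEXT, street TEXT, track_count INTEGER);
        CREATE TABLE rainfall   (day TEXT, millimetres REAL);

        INSERT INTO gps_tracks VALUES
            ('2007-02-19', 'Example Street', 12),
            ('2007-02-20', 'Example Street', 5);

        INSERT INTO rainfall VALUES
            ('2007-02-19', 0.0),
            ('2007-02-20', 8.5),
            ('2007-02-21', 1.0);
    """)
    rows = conn.execute("""
        SELECT g.day, g.street, g.track_count, r.millimetres
        FROM gps_tracks AS g
        JOIN rainfall AS r ON g.day = r.day
        ORDER BY g.day
    """).fetchall()
    conn.close()
    return rows
```

Neither table was designed with the other in mind, yet the join is meaningful because both sides agree on what a date is. The driver's license case has no such shared key: each state's ID column means something different, so there is no column to put in the ON clause. A location join works on the same principle, though in practice coordinates would likely need a distance tolerance rather than exact equality.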

Also noteworthy is the fact that GPS information can be used to put a time dimension into maps, since we can tell when the street is used as well as where it is. There are some very pretty examples at Cabspotting.

Posted in Areas of application, Geographic, Information Philosophy, Information usage patterns