Werner Vogels of Amazon talks about availability and consistency at their kind of scale. What he says resonates with my experience from working at another big Web company.
Google [disclosure: a former employer] introduces the Data Liberation Front to ensure that all its services have simple data export functionality.
Brad Fitzpatrick says:
What does product liberation look like? Said simply, a liberated product is one which has built-in features that make it easy (and free) to remove your data from the product in the event that you’d like to take it elsewhere.
Note the emphasis on free and easy. Online services tend to stop at possible, which falls short of easy by the proverbial Simple Matter of Programming.
Facebook, for example, has a programming interface for each kind of information you store that lets a Facebook application extract it into a file. It’s possible for me to write an application that does so for everything I have on Facebook, but that isn’t the same as having a predefined procedure that archives my entire Facebook presence into a standard file format that most social networking sites can import from.
Good luck to Google in this effort! I myself would go further in asking that the entire behavior (as well as data) of my social-network presence be standardizable and portable, as Ramón Cáceres [disclosure: personal friend] and his collaborators are proposing with their work on virtual individual servers.
The human genome has just over 3 billion base pairs in about 25 thousand genes. This is a large enough data set that it gets algorithms of its own.
Udell points out, quite rightly, that
“Give us the data” is an easy slogan to chant. And there’s no doubt that much good will come from poking through what we are given. But we also need to have ideas about what we want the data for, and communicate those ideas to the providers who are gathering it on our behalf.
Sewell Chan and Patrick McGeehan report today in the New York Times that the New York City government is out to make its piles of public data actually usable:
In what is planned to become an annual competition known as NYC Big Apps, the city will make available about 80 data sets from 32 city agencies and commissions. The winners of the competition will get a cash prize, recognition at a dinner with the mayor, and marketing opportunities.
One has to be wary of competitions: they can be a way of trying to get some work for free, or a sign that the project doesn’t have realistic funding behind it. On the other hand, a competition shows that the sponsor wants to tap a wider range of imagination than it would get from the usual contracting process.
Dinner with the mayor!
Hat tip: "tootie" at reddit.
I tend to get abstract and philosophical about data here, and it’s good to have an occasional splash in the cold water of how the stuff gets stored.
Jon Elerath’s article on hard disk reliability in the June 2009 Communications of the ACM (may require ACM login) gives a lot of detail about the different kinds of disk storage error and appropriate countermeasures. He gets down to the level of things that scratch vs. things that smear (kind of like H.P. Lovecraft for the archivally-minded).
The big takeaway is that there are obvious crash failures, like a bearing getting wobbly or servo tracks being trashed: these make the drive stop working, and you rebuild from something you still trust. And then there are insidious, quiet read/write failures that you can only counteract with a policy of "scrubbing" drives proactively.
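The logic of scrubbing is worth spelling out: rather than waiting for a read to fail, you periodically read every block and compare it against a stored checksum, repairing any silent corruption from a replica while the replica is still good. A minimal sketch (the block layout, `scrub` function, and in-memory "replica" are all my own illustration, not anything from Elerath's article):

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def scrub(blocks, checksums, replica):
    """Proactively read every block; on a checksum mismatch,
    repair the block from a known-good replica."""
    repaired = []
    for i, data in enumerate(blocks):
        if checksum(data) != checksums[i]:
            blocks[i] = replica[i]  # silent corruption found: rewrite it
            repaired.append(i)
    return repaired

# Demo: block 1 suffers quiet bit-rot between writes and the next read.
replica = [b"alpha", b"beta", b"gamma"]
blocks = list(replica)
sums = [checksum(b) for b in blocks]
blocks[1] = b"bet4"  # corrupted on the platter, no error reported
assert scrub(blocks, sums, replica) == [1]
assert blocks == replica
```

The point of doing this on a schedule is that the repair only works while you still have an uncorrupted copy somewhere; wait too long and both copies can rot.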
Now that I buy storage by the 1.5 terabytes/spindle, I really should do something to dissuade Those Whose Names Are Random from assimilating my data to the Outer Abyss of Maximum Entropy. If you should happen to find some of your goats missing, better not to ask…
Jon Stokes has an excellent description of the two contrasting philosophies of information management in his comparison of the Palm Pre and the iPhone.
He names the two approaches “structure-and-browse” and “collect-and-query”. I feel like I’ve been groping for these terse descriptions for years!
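The contrast is easy to show in miniature. In structure-and-browse, the user files each item into a hierarchy up front and later navigates back down it; in collect-and-query, everything lands in one flat pool and structure is imposed at lookup time by a search. A toy sketch (the contact data and field names are invented for illustration):

```python
# Structure-and-browse: items live in a fixed hierarchy the
# user built, and retrieval means walking down that tree.
contacts_by_folder = {
    "work":   {"alice": "alice@example.com"},
    "family": {"bob": "bob@example.com"},
}
email = contacts_by_folder["work"]["alice"]  # must remember the folder

# Collect-and-query: everything goes into one undifferentiated
# pool, and retrieval is a query over attributes.
contacts = [
    {"name": "alice", "tag": "work",   "email": "alice@example.com"},
    {"name": "bob",   "tag": "family", "email": "bob@example.com"},
]
hits = [c["email"] for c in contacts if c["name"] == "alice"]

assert email == "alice@example.com"
assert hits == ["alice@example.com"]
```

The trade-off is the familiar one: browsing rewards the filing effort you put in, querying spends that effort at read time instead.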
Sources and uses of digital information are in-scope for this blog, and a great example just showed up in my RSS reader today.
This project is truly a product of the early 21st century: it requires GPS satellites, cheap but accurate GPS receivers, the World Wide Web, inexpensive computers with fast color graphics, and so forth.
And like all modern geographic applications, it also exploits a special property of GPS’s information domain: everyone agrees on the meaning of geographical location; only dates and times have a similar level of standardization. In relational-database terminology, this means that any table with a date or location column has a meaningful join with any other.
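To make the join claim concrete: because latitude/longitude mean the same thing to everyone, two datasets gathered by completely different parties line up on the coordinate columns with no ID-mapping step. A sketch using SQLite (the `potholes` and `cab_pickups` tables are hypothetical examples of mine, not real city data sets):

```python
import sqlite3

# Two independently collected tables that happen to share a
# universally understood key: geographic location.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE potholes    (lat REAL, lon REAL, severity TEXT);
CREATE TABLE cab_pickups (lat REAL, lon REAL, fare REAL);
INSERT INTO potholes    VALUES (40.7128, -74.0060, 'bad');
INSERT INTO cab_pickups VALUES (40.7128, -74.0060, 12.5);
""")

# The join needs no shared ID scheme -- the coordinates ARE the key.
rows = conn.execute("""
    SELECT p.severity, c.fare
    FROM potholes p
    JOIN cab_pickups c ON p.lat = c.lat AND p.lon = c.lon
""").fetchall()

assert rows == [('bad', 12.5)]
```

(In practice you would join on proximity rather than exact equality, but the principle is the same.) Contrast this with the driver's-license case below, where each state's private ID scheme makes the analogous join impossible.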
This doesn’t work with most data. I’ve had driver’s licenses in four U.S. states, but you can’t aggregate my driving record from the state records because they all use different ID numbering schemes (nice for my privacy in this case).
Also noteworthy is the fact that GPS information can be used to put a time dimension into maps, since we can tell when the street is used as well as where it is. There are some very pretty examples at Cabspotting.