Information in Rotation » Metadata

Geoff Nunberg on Google Books metadata

Dan Rabin — Thu, 03 Sep 2009 16:39:30 +0000

Linguist Geoff Nunberg comments on the poor general quality of metadata in Google Books, and why that’s a problem.

It’s a tough problem: if you do things (like scanning entire libraries) at Google-scale, you just can’t pay attention to the details. One partial way out (which Geoff mentions) is to allow users to submit corrections, as Google Maps does for positions of placemarks.

The article addresses a number of important points about the provenance and usefulness of metadata, and Google employees provide some great comments and discussion.

(Via Brad DeLong).

Chris Anderson: One size metadata doesn’t fit all

Dan Rabin — Wed, 21 Mar 2007 17:04:23 +0000

Misfits of Metadata

Chris Anderson of The Long Tail has an important post about how the metadata used in some music-listening applications doesn’t satisfy listeners’ needs:

[...] classical is a genre that the one-size-fits-all music aggregators such as iTunes don’t handle particularly well. They’re oriented around pop music, with its artist, album, track data format. Meanwhile classical music organizes around composer, conductor, performer, soloist

He also voices my exact peeve about how jazz is treated:

However, neither of them does a very good job with Jazz, where the individual musicians are often more meaningful than the band.

Yup. No reasonable cataloguer of jazz recordings separates “Thelonious Monk Trio” from “Thelonious Monk Quartet” from “Thelonious Monk”. At the same time, it’s important to be able to locate all appearances of Thelonious Monk, regardless of whether he was the leader of the session (note that “leader” and “session” are appropriate terms in jazz discography, but not for pop or classical).

When your only tool is a hammer…

I can’t help but wonder if the problems Chris calls out in iTunes come from the poor selection of data tools in most applications programmers’ toolkits. Relational databases, the current orthodox storage technique, favor using one or more tables, each consisting of records having the same selection of attributes. There are hacks you can use to simulate having, say, jazz tracks and pop tracks in the same Tracks relation, but hacks and simulations tend to twist one’s code, so most programmers resist going there.

An XML database in every toolbox!

We don’t really have to live this way anymore. With the popularity of XML for data interchange, the tools ecology has given us a variety of XML database systems. The XML data model has the flexibility to represent varying record structures: in fact, it has much more flexibility than we need for the purpose!

Heretical as it may seem to put the cart of an interchange format before the horse of data abstraction, the XML situation is very useful in practice, at least for databases of moderate size. The W3C has come up with the XPath and XML Query specifications, which provide excellent query mechanisms for data represented in the XML model. XML Query in particular is designed to look somewhat familiar to the hardened SQL user. There’s data typing taken from the XML Schema datatype recommendation as well.
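Here is a rough sketch of what that flexibility buys, using the limited XPath subset in Python’s standard xml.etree.ElementTree module rather than a real XML database. The element names (recording, leader, musician, composer, soloist) are my own invention for illustration, not any standard schema, and the three records are just sample data.

```python
import xml.etree.ElementTree as ET

# Sample catalog: jazz and classical records carry different child elements,
# yet live side by side in the same collection. Element names are hypothetical.
CATALOG = """
<catalog>
  <recording kind="jazz">
    <title>Monk's Dream</title>
    <leader>Thelonious Monk</leader>
    <musician instrument="piano">Thelonious Monk</musician>
    <musician instrument="tenor saxophone">Charlie Rouse</musician>
  </recording>
  <recording kind="jazz">
    <title>Bags' Groove</title>
    <leader>Miles Davis</leader>
    <musician instrument="trumpet">Miles Davis</musician>
    <musician instrument="piano">Thelonious Monk</musician>
  </recording>
  <recording kind="classical">
    <title>Goldberg Variations</title>
    <composer>Johann Sebastian Bach</composer>
    <soloist instrument="piano">Glenn Gould</soloist>
  </recording>
</catalog>
"""


def recordings_featuring(root, name):
    """Titles of recordings on which `name` plays, whether or not as leader.

    The name is spliced into the path unescaped, so this sketch only works
    for names without apostrophes.
    """
    path = f".//recording[musician='{name}']"
    return [rec.findtext("title") for rec in root.findall(path)]


def recordings_led_by(root, name):
    """Titles of recordings whose session leader is `name`."""
    path = f".//recording[leader='{name}']"
    return [rec.findtext("title") for rec in root.findall(path)]


def recordings_with_composer(root):
    """Titles of recordings that have a composer element at all."""
    return [rec.findtext("title") for rec in root.findall(".//recording[composer]")]


root = ET.fromstring(CATALOG)
print(recordings_featuring(root, "Thelonious Monk"))
print(recordings_led_by(root, "Thelonious Monk"))
print(recordings_with_composer(root))
```

The point is that “all appearances of Thelonious Monk” and “sessions led by Thelonious Monk” are separate one-line queries, and the classical record needs no dummy leader or musician fields to sit in the same catalog.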

Better nails

Anyhow, let’s learn to design with a more flexible hammer, and maybe we’ll be able to hit a wider class of nails, rather than our users’ thumbs!

March is International Runaway Metaphor Month.

Should metadata be stored in the file it describes? Jon Udell wonders…

Dan Rabin — Wed, 21 Feb 2007 04:20:09 +0000

In “Who’s got the tag? Database truth versus file truth, part 3”, Jon Udell contrasts the Microsoft Vista and Mac OS X ways of associating metadata tags with image files: Vista tends to store them in the image files, and Mac OS X tends to leave the files untouched and use a separate database to store the tags (or at least Jon was under this impression).

There’s a great discussion about the relative advantages of the two approaches on the blog. Basically, storing the tags in the file makes the association harder to lose as you move the file around, and storing the tags separately avoids modifying the user’s data file. Neither one is obviously in accord with the user’s intention in all cases.

I think the issue has whole extra layers of subtlety. We perceive metadata that is stored within a data file as being what Jon Udell calls “file truth”. Since there’s only one set of metadata stored in the file, it becomes the One True Metadata. On the other hand, metadata stored in a separate database reads as the opinion of the maintainer of the database. This is exactly what social bookmarking systems such as del.icio.us do: each attribution of a tag to a URL is also associated with the user making that attribution.

A pluralistic society requires a separate metadatabase!

This isn’t just another engineering tradeoff, though. The truth about “file truth” is that it’s still an opinion: the opinion of the last agent to modify the metadata within the file. When there’s One True Metadata, we can only represent disagreements by obliterating the last guy’s assertion.

Imagine trying to tag a scan of a photo taken at your parents’ wedding of someone you don’t recognize. You think it’s Dad’s college roommate, but your sister thinks it’s Mom’s second cousin. You have one “person depicted” slot: do you fight over it? Do you leave it blank and explain the situation in a semantically bland catch-all description field? Or do you each tag it as you will in your respective databases?

Not only is it unrealistic to allow for only one true description of a file, it’s also time we stopped regarding metadata as lost forever just because it’s not stored in the file. We could set up a distributed database that works like Gracenote’s CD identification database, but for all files instead of just music files. As with CDs, the lookup key for a file can be generated by anyone who possesses the file (by applying a secure hash), but the particular metadata obtained depends on which tagger’s part of the repository is consulted. It’s all doable, and it would eliminate blogstorms about how evil application X erases user metadata.
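To make the idea concrete, here is a minimal in-memory sketch, assuming SHA-256 as the secure hash and a plain dictionary standing in for the distributed repository. The class TagRepository, its method names, and the person_depicted field are all hypothetical; nothing here describes how Gracenote or any existing service actually works.

```python
import hashlib
from collections import defaultdict


def file_key(data: bytes) -> str:
    """Lookup key that anyone holding the file's bytes can compute."""
    return hashlib.sha256(data).hexdigest()


class TagRepository:
    """Toy stand-in for a shared metadata store: key -> tagger -> fields."""

    def __init__(self):
        self._tags = defaultdict(dict)

    def assert_tag(self, key, tagger, field, value):
        """Record one tagger's opinion without touching anyone else's."""
        self._tags[key].setdefault(tagger, {})[field] = value

    def lookup(self, key, tagger):
        """Metadata for a file as seen by one particular tagger."""
        return dict(self._tags.get(key, {}).get(tagger, {}))

    def all_opinions(self, key, field):
        """Every tagger's value for one field of one file."""
        return {
            tagger: fields[field]
            for tagger, fields in self._tags.get(key, {}).items()
            if field in fields
        }


# Placeholder bytes standing in for the scanned wedding photo.
photo = b"placeholder bytes for the scanned wedding photo"
key = file_key(photo)

repo = TagRepository()
repo.assert_tag(key, "me", "person_depicted", "Dad's college roommate")
repo.assert_tag(key, "sister", "person_depicted", "Mom's second cousin")

print(repo.lookup(key, "me"))
print(repo.all_opinions(key, "person_depicted"))
```

The image file is never modified, moving or renaming it doesn’t change its key, and the disagreement about who is in the photo is stored as two attributed assertions rather than one overwritten slot. One caveat of keying on a content hash is that editing the file’s bytes (even re-saving the image) produces a different key.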

Metadata Drift

Dan Rabin — Sun, 28 Jan 2007 17:14:22 +0000

Mark Dominus has an interesting post in which he does some serious software archaeology trying to discover how and when a piece of Unix filesystem metadata called “ctime” changed from being “creation time” to representing “change time”.

Mark’s post got my attention because it gives a detailed look at a case of what I term metadata drift: the tendency of metadata properties to change their meaning under the pressure of either

clients using the field that will give them the results they need (even if the semantic fit is poor), or
implementers following their convenience rather than the defined intent of the item.

I see this all the time in the music metadata that iTunes pulls off of CDDB. I do it myself for classical recordings, where I want to record both the composer’s name and the performer’s. I want classical music indexed primarily by composer, but rock and jazz by performer. The iTunes software makes you pick one or the other, so I abuse the album field to capture the performer’s name.

CDDB contributors make different choices under this pressure, so I spend a fair amount of time when ripping CDs editing metadata. This is just an inconvenience for me, but the ctime issue that Mark Dominus investigates can have serious consequences if, say, a revision-control system makes the wrong assumption about the semantics of “ctime” on a particular file system.
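You can watch the “change time” reading of ctime in action with a few lines of Python. This is only a sketch, assuming a Unix-like system, where changing a file’s permissions updates its ctime but not its modification time. Python’s os.stat has historically reported a creation time in st_ctime on Windows, which is the same drift showing up in a library API. The function name stat_around_chmod is my own.

```python
import os
import tempfile


def stat_around_chmod():
    """Return os.stat results taken before and after a permissions change."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"contents")
        path = f.name
    try:
        before = os.stat(path)
        # Change only the file's metadata (its mode), not its contents.
        os.chmod(path, 0o644)
        after = os.stat(path)
        return before, after
    finally:
        os.remove(path)


before, after = stat_around_chmod()

# The contents were not touched, so the modification time stays the same.
print("mtime unchanged:", after.st_mtime_ns == before.st_mtime_ns)

# On Unix-like systems ctime may have advanced, because the inode changed.
# Whether a difference is visible depends on the filesystem's timestamp
# resolution, so no particular output is promised here.
print("ctime advanced:", after.st_ctime_ns > before.st_ctime_ns)
```

A tool that reads ctime as “when this file was created” would be misled by something as small as a chmod, which is exactly the kind of wrong assumption about semantics that worries me.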