In “Who’s got the tag? Database truth versus file truth, part 3”, Jon Udell contrasts the Microsoft Vista and Mac OS X ways of associating metadata tags with image files: Vista tends to store them in the image files themselves, while Mac OS X tends to leave the files untouched and use a separate database to store the tags (or at least Jon was under this impression).
There’s a great discussion about the relative advantages of the two approaches on the blog. Basically, storing the tags in the file makes the association harder to lose as you move the file around, and storing the tags separately avoids modifying the user’s data file. Neither one is obviously in accord with the user’s intention in all cases.
I think the issue has whole extra layers of subtlety. We perceive metadata that is stored within a data file as being what Jon Udell calls “file truth”. Since there’s only one set of metadata stored in the file, it becomes the One True Metadata. On the other hand, metadata stored in a separate database reads as the opinion of the maintainer of the database. This is exactly what social bookmarking systems such as del.icio.us do: each attribution of a tag to a URL is also associated with the user making that attribution.
A pluralistic society requires a separate metadatabase!
This isn’t just another engineering tradeoff, though. The truth about “file truth” is that it’s still an opinion—the opinion of the last agent to modify the metadata within the file. When there’s One True Metadata, we can only represent disagreements by obliterating the last guy’s assertion.
Imagine trying to tag a scan of a photo taken at your parents’ wedding of someone you don’t recognize. You think it’s Dad’s college roommate, but your sister thinks it’s Mom’s second cousin. You have one “person depicted” slot: do you fight over it? Do you leave it blank and explain the situation in a semantically bland catch-all description field? Or do you each tag it as you will in your respective databases?
Not only is it unrealistic to allow for only one true description of a file, it’s also time we stopped regarding metadata as lost forever just because it’s not stored in the file. We could set up a distributed database that works like Gracenote’s CD identification database, but for all files instead of just music files. As with CDs, the lookup key for a file can be generated by anyone who possesses the file (by applying a secure hash), but the particular metadata obtained depends on which tagger’s part of the repository is consulted. It’s all doable, and it would eliminate blogstorms about how evil application X erases user metadata.
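To make that a little more concrete, here is a rough sketch in Python of what I have in mind. It is only an illustration under some loud assumptions: the repository is a plain in-memory dictionary rather than anything distributed, SHA-256 stands in for "a secure hash", and the names `file_key`, `TagRepository`, `assert_tag`, `lookup` and `opinions` are all made up for this example. Nothing here reflects how Gracenote or any real service actually works.

```python
import hashlib
from collections import defaultdict


def file_key(data: bytes) -> str:
    """Derive a lookup key from the file's contents alone.

    Anyone holding the same bytes gets the same key. SHA-256 is an
    assumption; any secure hash would do.
    """
    return hashlib.sha256(data).hexdigest()


class TagRepository:
    """Hypothetical in-memory stand-in for the shared metadata repository."""

    def __init__(self):
        # file key -> tagger -> field -> value
        self._claims = defaultdict(lambda: defaultdict(dict))

    def assert_tag(self, tagger: str, key: str, field: str, value: str) -> None:
        """Record one tagger's opinion without touching anyone else's."""
        self._claims[key][tagger][field] = value

    def lookup(self, key: str, tagger: str) -> dict:
        """Return the metadata for a file as seen by one particular tagger."""
        return dict(self._claims.get(key, {}).get(tagger, {}))

    def opinions(self, key: str, field: str) -> dict:
        """Return every tagger's value for one field of one file."""
        return {
            tagger: fields[field]
            for tagger, fields in self._claims.get(key, {}).items()
            if field in fields
        }


# The wedding photo: two people, two opinions, one untouched file.
scan = b"placeholder bytes standing in for the scanned photo"
key = file_key(scan)

repo = TagRepository()
repo.assert_tag("me", key, "person depicted", "Dad's college roommate")
repo.assert_tag("sister", key, "person depicted", "Mom's second cousin")

mine = repo.lookup(key, "me")
everyone = repo.opinions(key, "person depicted")
```

The point of keying on the hash is that the association survives moving or renaming the file, and the point of keying on the tagger is that my sister and I never have to overwrite each other. One obvious limitation of this sketch: any edit to the file's bytes changes the key, so a real system would need some way to link versions of the same file.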