Security/identity · 2008-12-03

Where should data live?

George Fletcher provides interesting commentary on a good social-web discussion by Om Malik. The issue: whether aggregation and federation of data are opposed or complementary.

George says:

[F]or aggregation to work in the “open web”, it must be able to access my data wherever I’ve chosen to place it.

I agree. If we’re looking to empower people, it’s not realistic to insist that all their information live in a single place. Just as inventing a new identifier type isn’t sufficient to eliminate all the various identifiers we already have in our lives — there are good reasons, not just legacy reasons, to have more than one — solving the problem of storing everything in one place isn’t sufficient to eliminate all the places information about us is stored. Here are two non-legacy reasons.

First, you should be able to choose (as George says) where to store information you created, and you might have reasons for choosing different hosts for information that has different levels of sensitivity, needs for high-availability access, needs for fine-grained access control specific to certain data types, etc. There needs to be an option not just to import/export everything en masse from one competing hosting environment to another, but also to tolerate multiple sources of data at once. It almost feels anti-web to prefer an architecture that requires everything to live together on one server.

Second, it doesn’t make sense to throw information about you for which you’re not authoritative (like your credit score, or really any reputation data) into one aggregation pile; you can’t control the value, but it’s still “yours” in lots of other senses. You might have the right to track who sees it, but a live copy shouldn’t reside in your one big database where you have write access. (I liked Gerry Gebel’s insight around this: Try to think of any application that relies on data from elsewhere as being stateless with respect to that data. Another way to think about it is that you want to achieve a sort of “first normal form”, where information properly lives wherever its authoritative source chooses it to live.)

George notes that if you can authorize a relying party to get the data from whatever your preferred source is, you can get the best of both worlds. What gets aggregated is some of the data provisioning, usage, and auditing, but not the actual residence of the data.
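
To make that concrete, here is a minimal Python sketch of the idea, not anyone's actual implementation. Everything in it is an assumption made up for illustration: the `SourceRegistry` class, its method names, the `lender.example` and `shop.example` relying parties, and the two dictionaries standing in for separate hosts. The point it tries to show is that the broker keeps only pointers, permissions, and an audit trail, while the values stay with their authoritative sources.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class SourceRegistry:
    """Hypothetical broker: knows where each attribute lives, never stores its value."""
    sources: Dict[str, Callable[[], str]] = field(default_factory=dict)
    grants: Set[Tuple[str, str]] = field(default_factory=set)
    audit_log: List[Tuple[str, str, bool]] = field(default_factory=list)

    def register(self, attribute: str, fetch: Callable[[], str]) -> None:
        # Record how to reach the authoritative source, not the data itself.
        self.sources[attribute] = fetch

    def authorize(self, relying_party: str, attribute: str) -> None:
        # The person grants one relying party access to one attribute.
        self.grants.add((relying_party, attribute))

    def fetch(self, relying_party: str, attribute: str) -> str:
        allowed = (relying_party, attribute) in self.grants
        # Every attempt is logged, so auditing is aggregated in one place.
        self.audit_log.append((relying_party, attribute, allowed))
        if not allowed:
            raise PermissionError(f"{relying_party} may not read {attribute}")
        # Read through to the source on every call; nothing is cached here.
        return self.sources[attribute]()


# Two stand-in hosts (assumptions): one chosen by the person, one authoritative elsewhere.
photo_host = {"profile_photo": "photo-v1"}
credit_bureau = {"credit_score": "700"}

registry = SourceRegistry()
registry.register("profile_photo", lambda: photo_host["profile_photo"])
registry.register("credit_score", lambda: credit_bureau["credit_score"])
registry.authorize("lender.example", "credit_score")

first = registry.fetch("lender.example", "credit_score")
credit_bureau["credit_score"] = "720"  # the authoritative source changes the value
second = registry.fetch("lender.example", "credit_score")

try:
    registry.fetch("shop.example", "credit_score")  # never authorized
    denied = False
except PermissionError:
    denied = True
```

Because the registry reads through on every request, the second fetch sees the bureau's updated value without any copy having to be synchronized, and the unauthorized attempt still leaves a trace in the audit log.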

I’ve become convinced that multi-sourced data access is a requirement for the core permissioned data sharing issue that’s common to identity, VRM, and social networking use cases.

I happened to do a webcast yesterday that describes the VRM proposition — you can watch the recording if you register for a free account — and I went into a bit of detail about the technical requirements I see, along with reviewing some of the architectures at our disposal for achieving them. (The good news is, there are already several…) [UPDATE: Slides are now available here.] I think I need to start adding this requirement to my list.