Monday, May 30, 2005

Danny Ayers is still valiantly defending the Semantic Web.

But frankly, all the alleged rebuttals are just shooting at a straw man of their own making.

The basic Clay Shirky critique of the SW is that the pain outweighs the potential benefits, and so it's not going to work. Instead, we're going to get machine-readable markup by small, self-interested increments rather than using the W3C solution. Two years on, that assertion looks pretty strongly backed up by events.

Shirky illustrated this generic complaint with two more specific criticisms :

1) that the SW was trying to build a monolithic ontology.

2) that the main touted benefit of the SW is that, because every semantic item has a unique URI, it should be possible to translate between different documents referring to the same things, and therefore combine the data they contain, producing inferences or "joins" between information in different places. And that this, in practice, will be too hard to be useful.
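
To see what such a "join" would look like, here's a minimal sketch in Python, using the (third-party) rdflib library. The vocabulary and URIs are invented for illustration :

from rdflib import Graph, URIRef, Namespace, Literal

# Hypothetical vocabulary and identifiers, invented for illustration.
EX = Namespace("http://example.org/vocab/")
person = URIRef("http://example.org/people/john-smith")

# Document one says what the person is called...
doc1 = Graph()
doc1.add((person, EX.name, Literal("John Smith")))

# ...while document two, written elsewhere, says who employs him.
doc2 = Graph()
doc2.add((person, EX.employer, Literal("Acme Corp")))

# The promised "join" : because both documents use the same URI,
# their triples can simply be pooled into one graph...
merged = Graph()
for triple in doc1:
    merged.add(triple)
for triple in doc2:
    merged.add(triple)

# ...and queried together : everything known about this person.
for _, prop, value in merged.triples((person, None, None)):
    print(prop, value)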

The rebuttals tend to focus on these two claims.

Rebuttals to the first argue that because there are different rival vocabularies or "ontologies" available, the SW is far from trying to build a monolithic ontology.

Rebuttals to the second try to argue that either

a) yes, that's the idea, and there's good precedent in, say, relational databases, where joining tables is the core business;

or

b) no, that wasn't what the SW was meant to be about at all.

Of course, Shirky rather over-egged the critique of syllogisms. And so pointing out that they happen in relational databases is a useful corrective. But this doesn't, as I'll try to show in a moment, actually save the SW project.

So let's take each of the rebuttal responses and look at them.

First, that there is no monolithic ontology. Well, if you take "ontology" in its W3C technical sense, as a formal description of part of the world and the relations between the things it contains, then that's true. Each SW "ontology" is allowed to define its own things and relations. And the W3C doesn't try to force everyone to use the same one.

But at a deeper level, there most certainly is an attempt to put all the things in the world into a single scheme. That is, everything has to have a URI. And URIs, by definition, need to uniquely individuate things.

Two things with different URIs have different identities in the SW, regardless of their context. While two things with the same URI are the same, regardless of context. If you look at Shirky's more recent obsession with tagging and folksonomies you'll see that he's discovering a contrasting world of useful meta-data that's being created without need for such unique identifiers.

In this sense, SW does demand a certain basic adherence to a universal standard that other, apparently more successful, markup schemes are not relying on.

I'll postpone the second claim, that "joins" in relational databases are proof that the syllogism is valid, for a couple of minutes. Here I'll just ask if anyone knows of good examples of such joining being done in the wild using RDF. (Genuinely interested to hear of good, popular applications of this.)

More common is the "rebuttal" that argues Shirky is wrong because making joins between different documents is not what RDF is really about.

Which naturally raises the question : so what is the alleged benefit then?

Here's what it seems to be, according to the counter-argument Danny linked this time.

Unlike vanilla XML, RDF vocabularies can be freely mixed together in data without prior agreement. So you often see ad-hoc combinations of Dublin Core, RSS1, MusicBrainz, RDF-calendar, FOAF, Wordnet, thesaurus, Geo-info etc etc frequently deployed together, despite the fact that the creators of those various vocabularies barely knew each other. This strikes me as the height of loosely-coupled pragmatism rather than a wide-eyed effort to build a monolithic universal category system.


In other words, that we can mix different information from different vocabularies in the same document without danger of ambiguity.
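
Here's roughly what that looks like in practice - a sketch in Python/rdflib, where the document URI is invented but Dublin Core and FOAF are the real vocabularies :

from rdflib import Graph, URIRef, Namespace, Literal

# Two real, independently created vocabularies.
DC = Namespace("http://purl.org/dc/elements/1.1/")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")

post = URIRef("http://example.org/blog/sw-rant")  # invented URI

g = Graph()
g.add((post, DC.title, Literal("A Semantic Web rant")))
g.add((post, FOAF.maker, URIRef("http://example.org/people/phil")))

# The namespace URIs keep the terms unambiguous, even though the
# vocabularies' creators never coordinated with each other.
for s, p, o in g:
    print(s, p, o)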

And this gives the key to what the SW really is, and why I think that it's not all that useful.

What's really going on here is a discussion about the unit, or context, of semantics - a debate between some sort of atomism and some sort of holism.

There's long been discussion in philosophy of language about what defines the meaning of text. What's the "unit" that defines semantics? Is meaning a property of words or of sentences? Or of larger contexts, of languages or cultures? There's an analogous problem in genetics, often called the unit of selection problem. Does the evolutionary selective pressure act upon - i.e. do we give a semantic interpretation to - the individual gene? Or does the gene only have an effect and meaning in the context of the whole body?

For a long time, I've been puzzled by what exactly is so good about the ability to mix vocabularies in a single document. Let's consider the situation where I have a document mixing data from vocabularies V1 and V2. Now clearly, this document is meant to communicate between two programs, P1 and P2, which need to understand ideas from both V1 and V2. In other words, if P1 can produce the document, and P2 can consume it, then both P1 and P2 should really know about the kinds of things that V1 and V2 can describe.

But if both P1 and P2 need to know about these ideas, then they can choose whatever protocol they like to exchange them. They derive no great benefit from keying into a widely published vocabulary.

This is a discussion made concrete by Dave Winer's RSS 2.0. If both feed producers and feed consumers need to know about authors and published dates and posts etc., then any file format which can represent these things is viable. (And the simpler the better.)

The only story I could ever imagine that made sense of the claimed virtue of mixing externally defined vocabularies was that a program, P3, might not know about a particular file format (eg. a syndication feed) but might nevertheless know about the Dublin Core vocabulary, and could therefore extract the Dublin Core data from an RSS 1.0 feed and do something useful with it.
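
Concretely, the story goes something like this sketch (Python with rdflib again; the feed fragment is invented, but RSS 1.0 really is RDF, so it parses as plain RDF/XML) :

from rdflib import Graph, Namespace

DC = Namespace("http://purl.org/dc/elements/1.1/")

# An invented fragment of an RSS 1.0 feed.
feed = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rss="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rss:item rdf:about="http://example.org/blog/post1">
    <rss:title>Some post</rss:title>
    <dc:creator>John Smith</dc:creator>
  </rss:item>
</rdf:RDF>"""

g = Graph()
g.parse(data=feed, format="xml")

# P3 knows nothing about syndication, only about Dublin Core.
for item, _, creator in g.triples((None, DC.creator, None)):
    print(item, "was created by", creator)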

To me, that looked absurd. It's analogous to the old joke about counting sheep by counting the legs and dividing by four. P3 doesn't know what a syndication document is but it can work out what it "means" by knowing what the sub-document vocabularies mean. And then it's supposed to do something useful with it?

I think this story really throws into relief what the Semantic Web is about, and what the arguments are all about.

The (capital S) Semantic Web is a bet that the appropriate unit of semantics is the Vocabulary or Ontology.

Anti-Semantic Web arguments are really assertions that this isn't the proper or most appropriate unit of semantics.


Let's suppose I have a string "John Smith". Is its meaning more crucially defined by attaching it to a global vocabulary, or is its meaning more crucially defined by its context, such as the document that contains it?

You can, of course, derive meaning from both contexts but, goes the Dave Winer argument, the document is normally sufficient context, so why pay for anything else?

The genetics analogy is instructive here. Even hardcore "atomists" or gene-centred theorists have to accept that the body plays an important role, and they've introduced the term "vehicle of selection" to cover it.

In the same way, you can take the Winer argument as being that documents are the main "vehicles of semantics", whereas the SW camp are essentially the "atomists" here (pun not intended by me :-). The individual atoms have their meaning fixed by the uniqueness of the identifier (URI), and their "type" given by the ontology.

The idea that the document is a sufficient vehicle seems to be gaining traction, as the concept of micro-formats becomes more widespread. Essentially, hype about micro-formats is nothing more than an increasing number of people waking up and getting Winer's insight : "we don't need to be intimidated by this Semantic Web. It's not going to happen, or at least not soon enough to be worth waiting for. Let's create something where semantics are fixed by the local context of the document and the programs that use it, rather than a global context."

The second main front of the war against W3C atomism is tagging. In this case, there are two things that fix the meaning of tags : the natural language of the users, and, once again, the local context defined by which application they're in. This markup is created by non-technical users, who naturally aren't in a position to formally define an ontology or RDF schema before adding their markup. But they do have the shared standard of their natural language which they can hang their markup on. Here the contexts are wider than the scope of the W3C's formally defined vocabularies.

OK, quick summary :

The argument over the Semantic Web is all about "semantics" and what most appropriately binds tokens in documents to their meaning. The W3C bet is that individual atoms - given unique identity via URIs, and types selected from global ontologies - are the best model for this. Opponents say there are better ways.

Two prominent fronts have opened up where rival representations are challenging the SW :

  • the "documents are vehicles of semantics" view, of which the argument between RSS 2.0 and Atom is the most prominent example, though other micro-formats are also skirmishes on this front.

  • the "human behaviour" model, where the semantics of tokens is bound by users and derived from their resemblance to words in everyday language. This is the tagging / folksonomic story. Here the "unit" or vehicle is the cultural practice.

Now, to get back to those relational databases. Joining within the database is easy, because the database is also a unit of semantics. It's the local context from which all the items derive their meaning. On the other hand, importing and exporting from one database to another is traditionally hard, because that crosses the frontier of semantic definition.
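
For contrast, here's how trivially a join falls out inside a single database - a sketch using Python's standard sqlite3 module, with invented tables and data :

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One schema, one local context : "id" means the same thing in
# every table that refers to it.
cur.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE posts (author_id INTEGER, title TEXT)")
cur.execute("INSERT INTO people VALUES (1, 'John Smith')")
cur.execute("INSERT INTO posts VALUES (1, 'My first post')")

# The join itself is the easy part, because both tables already
# share the database's semantics.
for name, title in cur.execute(
        "SELECT people.name, posts.title "
        "FROM people JOIN posts ON people.id = posts.author_id"):
    print(name, "wrote", title)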

Obviously advocates of RDF see this and think "if only we had a globally fixed semantics, this would be easy". But if getting things between one database and another is hard, defining good global standards is harder. And in practice it is NOT happening much.

And it's not a response to this to say that the SW allows a plurality of rival ontologies which anyone can invent. Or that lots of people are inventing them. Either there is a single standard (as with the near ubiquity of the Dublin Core) and inter-op is possible, or there isn't and inter-op isn't. But SW defenders often gloss over this, touting the two contradictory benefits of plurality and compatibility as twin virtues - as if you can have them at the same time.

In most cases, the benefits of defining the semantics globally rather than "vertically" within the application domain are marginal.

But it might, just, have been worthwhile if the cost wasn't so high due to the whole W3C implementation of the SW being so FUCKING botched!

Everyone seems to agree that XML-RDF is a bad design.

An XML serialization of RDF triples should have looked like this :


<rdf:statement>
  <rdf:subject>URI</rdf:subject>
  <rdf:predicate>URI</rdf:predicate>
  <rdf:object>URI</rdf:object>
</rdf:statement>
...


Everything else was just premature optimization.

But looking at this, something even more fundamental increasingly bothers me. Why did URIs have to look like URLs?

URLs describe both an online document and a transport protocol. URIs are nothing but unique labels for things which might or might not be documents, and which might or might not be accessible over the internet.

I would, for example, be delighted to know whether friends at xmlns:foaf="https://xmlns.com/foaf/0.1/" are different from friends at xmlns:foaf="http://xmlns.com/foaf/0.1/". It's probably specified somewhere but I can't find the answer.
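
Empirically, at least, the tools just compare the strings (a quick check with Python's rdflib) :

from rdflib import URIRef

# URIs are compared character by character, so the scheme matters.
a = URIRef("http://xmlns.com/foaf/0.1/name")
b = URIRef("https://xmlns.com/foaf/0.1/name")
print(a == b)  # False - to the tooling, two unrelated identifiers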

Basing URIs on URLs is, in retrospect, crazy. It's like deciding houses should be backwardly compatible with cars and have the same shape of door, even though cars need to move and houses don't. Or, more charitably, it reminds one of the early days of cinema, which tried to apply the lighting techniques of theatre.

URLs and URIs are two different genres of reference-making. And attempting to make them look similar has confused thousands of potential users. Marginally better would have been something like qualified names for classes in Java (eg. com.nooranch.myVocab.greeting). But even here there are some strange, unwanted notions. Such as a clumsy attempt to classify the type of institution using the top-level domain such as ".com" or ".org". Why should it matter what the "type" of the organization is? Or the country that it comes from? Shouldn't we be suspicious that two vocabularies which define the same tags, but sit at .co.uk and .br, are treated as inherently different things?

Of course, there's a reason why URIs are URLs. Sometimes there really are things you want to get at over the web. And you need to find a real URL to get them. But this is an example of that damned premature optimization in action.

This is just one of many examples of why, in the final analysis, the W3C's implementation of the SW smells so bad. And why programmers with a sense of design aesthetics run a mile when they see it. RDF is pitched as an extremely high-level meta-language which can describe almost anything, yet in practice it's riddled with premature implementation commitments : to web protocols, to XML standards etc. It's this mismatch between the claims for generality and the awkward, intrusive implementation details that looks ugly and is so off-putting.

Hand-rolled XML doesn't have this problem. Sure, it's inflexible, local, situated. But it feels appropriate to the scale of the problem. Micro-formats too. And maybe there are notations for the SW which you can reason about at a level of abstraction appropriate to the problem you're trying to solve. Though given the URIs == URLs commitment they clearly don't escape entirely.

And this, I suggest, is unfixable. Even if you dump XML-RDF (which I suspect people within the community, who've invested (hundreds of?) thousands of hours of work in it, won't do) you can't dump the URI. That's the core commitment of the W3C's SW. And that's an eternal, embarrassing reminder of the implementation leaking into something that was meant to be abstract. And it's what people cringe over when they see RDF and complain that "name-spaces are complicated".

Wow, this turned into a long rant ... quick summing up. In pure form, the SW is a hypothesis as to what's the "right" unit to fix the semantics of tokens in documents. And its value depends on that theory being right. Two rival notions of the correct unit of semantics seem to be thriving, and possibly showing that the SW hypothesis is wrong.

In practice, the SW looks ugly and off-putting because it failed to successfully distance itself from certain implementation details, as would befit the level of abstraction it aspires to. And this has left it with an awkward legacy of confusingness and complexity which is hard to fix.

Failure isn't inevitable. The SW may still be bulldozed through with enough hard work by those with sufficient ideological commitment and/or money. But the rivals are thriving because they are cheap, simple and immediately useful. And history tends to favour such things in the long term.
