Wednesday, September 14, 2005

I always figured that at least RDF was pretty sorted when it came to handling well understood data like who wrote what book. This always seemed to me to be the ur-application for which the SemWeb was created (and the shape into which everything else has to be squashed.)

According to Ian Davis's crisis of faith, even that gets complex pretty quickly.

Danny Ayers is on the defence case.

This reminds me that I've been thinking about the SemWeb again.

My current project is something I'm calling "SystemSketch" - a tool for easy authoring of stories about systems, i.e. sequences of interactions between the various players in some kind of system diagram.

As a data-structure it essentially boils down to a graph of nodes and relations between them, plus a timeline which is a list of events that occur between the different nodes.
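A minimal sketch of that shape, assuming nothing about SystemSketch's real internals: the class and field names below (`Sketch`, `Relation`, `Event`, `add_event`) are hypothetical, and the banking example is only illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Relation:
    source: str  # id of the node the relation starts from
    target: str  # id of the node the relation points to
    label: str


@dataclass
class Event:
    start: float
    finish: float
    source: str  # id of the node that acts
    target: str  # id of the node acted upon
    description: str


@dataclass
class Sketch:
    nodes: Dict[str, str] = field(default_factory=dict)  # node id -> label
    relations: List[Relation] = field(default_factory=list)
    timeline: List[Event] = field(default_factory=list)

    def add_node(self, node_id: str, label: str) -> None:
        self.nodes[node_id] = label

    def _check(self, *node_ids: str) -> None:
        # Relations and events may only refer to nodes that already exist.
        for node_id in node_ids:
            if node_id not in self.nodes:
                raise KeyError("unknown node: " + node_id)

    def relate(self, source: str, target: str, label: str) -> None:
        self._check(source, target)
        self.relations.append(Relation(source, target, label))

    def add_event(self, start, finish, source, target, description) -> None:
        self._check(source, target)
        self.timeline.append(Event(start, finish, source, target, description))
        # Keep the timeline ordered by start time.
        self.timeline.sort(key=lambda e: e.start)

    def story(self) -> List[str]:
        # One line per event, in timeline order.
        return [
            "[{}-{}] {} -> {}: {}".format(
                e.start, e.finish, self.nodes[e.source], self.nodes[e.target], e.description
            )
            for e in self.timeline
        ]


sketch = Sketch()
sketch.add_node("depositor", "Depositor")
sketch.add_node("bank", "Bank")
sketch.relate("depositor", "bank", "has an account at")
sketch.add_event(0, 1, "depositor", "bank", "deposits cash")
for line in sketch.story():
    print(line)
```

Note that nothing here knows what a node means. The only constraints are structural: events point at existing nodes, and the timeline is ordered.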

This data sounds so like the kind of triple-graph which is the core of RDF that I felt obliged to go and see whether I shouldn't be using it in some way.
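To make the resemblance concrete, here is a rough sketch of the same kind of data flattened into subject-predicate-object triples, using plain Python tuples rather than any RDF library. The predicate names (`from`, `to`, `start`, `finish`, `hasAccountAt`) are made up for illustration and come from no real vocabulary. A relation fits in one triple, but an event has more than two participants, so it becomes a node of its own with several triples hanging off it.

```python
def event_to_triples(event_id, source, target, start, finish):
    # An event is an n-ary fact, so it gets its own identifier
    # and one triple per attribute.
    return [
        (event_id, "type", "Event"),
        (event_id, "from", source),
        (event_id, "to", target),
        (event_id, "start", start),
        (event_id, "finish", finish),
    ]


def match(triples, s=None, p=None, o=None):
    # Return every triple that agrees with the given pattern;
    # None acts as a wildcard.
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# A plain relation between two nodes is a single triple.
triples = [("depositor", "hasAccountAt", "bank")]
triples += event_to_triples("event1", "depositor", "bank", 0, 1)

# Which events start from the depositor?
print(match(triples, p="from", o="depositor"))
```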

And, once again, I come away confused and frustrated. Not with the basic concepts of RDF, but with how I can engage with this thing to solve my problem.

I have many requirements that the SemWeb seems to promise to meet. I want graph-shaped data. I want an open data file format which can be read and interpreted by different SystemSketch players, or even by other co-operating programs that only want to query one aspect of my data.

And yet, there's a great mismatch between what they offer and what I want.

When I start drafting my XML file format, I have nodes in the graph, and I have events. And, of course, the events refer to the nodes. In my program internally I've toyed with treating these as object references or some kind of Id. And I can easily make a UID that's internal to my data-file.
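For illustration only, a file along these lines might look like the sample below. The element and attribute names (`sketch`, `node`, `event`, `from`, `to`) are assumptions, not the actual SystemSketch format. The ids are meaningful only inside the file, and the loader resolves each event's references against the nodes declared in the same document.

```python
import xml.etree.ElementTree as ET

# Hypothetical file layout: ids such as "n1" are internal to this file.
SAMPLE = """<sketch>
  <nodes>
    <node id="n1" label="Depositor"/>
    <node id="n2" label="Bank"/>
  </nodes>
  <timeline>
    <event start="0" finish="1" from="n1" to="n2">deposits cash</event>
  </timeline>
</sketch>"""


def load(xml_text):
    root = ET.fromstring(xml_text)
    # Map each file-internal id to its human-readable label.
    labels = {n.get("id"): n.get("label") for n in root.iter("node")}
    events = []
    for e in root.iter("event"):
        src, dst = e.get("from"), e.get("to")
        if src not in labels or dst not in labels:
            raise ValueError("event refers to an undeclared node")
        events.append(
            (float(e.get("start")), float(e.get("finish")),
             labels[src], labels[dst], (e.text or "").strip())
        )
    return labels, events


labels, events = load(SAMPLE)
for start, finish, src, dst, text in events:
    print(start, finish, src, dst, text)
```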

But this isn't SemWeb. What I should do is try to make these IDs meaningful outside my program.

But here I hit the obvious problem: I don't know what these nodes represent. I have several applications in mind. I want to try to describe the current political scandal in the Brazilian parliament (a tortuous story of representatives being paid bribes to vote with the government, alleged meetings, guaranteed loans and suitcases full of money); I want to be able to describe some of the interactions in the atmosphere that cause global warming; I want to be able to describe some of the dynamics of the fractional reserve banking system.

And, of course, I want users to be able to describe any system they want.

Within my program I can't see how there's much scope for any semantics.

What I am imagining is a family of applications whose commonality is syntactic. They all involve a graph, and a timeline of events. But beyond this, there's little semantic commitment I want to make.

There is some. For example, the events in the timeline all have a start and finish time, which is presumed to be a time in every case. But one application may want a date, while another might be working in nano-seconds, so it's hard to see what interesting semantic constraints one would really like to add there. There's not a lot of value to be added by trying to tie the "time" attribute of the events into any widely recognised ontology.

And this has led me to the bigger insight. Computers are symbol processing machines. All the hits of computing. All the mega-applications. All the applications that have actually made a difference in the world. They are all examples of this same principle : identify an interesting syntactic commonality and create a tool to support working with it. Without worrying what the data means.

Word processors? Sure, they just push characters around. But they don't care if you're writing love-letters or the communist manifesto. Spread-sheets? Excel doesn't know if it's a wedding list or book-keeping for a small business. Relational databases? The web? Blogging tools? PhotoShop? Java?

All of them are widely adopted, generic solutions, because they make interesting syntactic generalizations but are semantically uncommitted.

Now, as with the time attribute I described above, it's not that there are no semantics in these applications. Spreadsheets and RDBMSs and Java recognise the difference between words and numbers; newer word-processors will correct your grammar and spelling.

But in all these cases, the semantic support is either at a very generic level (data-types), or secondary to the core functionality. Often added as macros and plugins.

There is software which contains more semantic commitment. Software to help do tax-returns is probably the most widely known. But this is an exception. Most of the software written that's full of semantic commitments has the following properties : it's bespoke, very expensive, and very boring. We're talking in-house infrastructure for giant corporations, whose business-practices are hard-wired into their internal applications.

And even here, there's a move to pull the semantics out into generic business rule databases.

Maybe a better way to see it is that every piece of software has a mixture of syntactic and semantic constraints. And it aims to capture a sweet-spot : sufficient generality to be useful in various situations to various users, with sufficient constraint to reduce the complexity of the user's task.

And this leads us to question what the SemWeb is.

The SemWeb community are happily coming up with a set of super-generic languages. As representation schemes they fulfil the generality requirement. And, slowly, some super-general software, like SPARQL, is appearing to do certain kinds of query processing.

But at this level of generality, they don't offer me any advantage over the tools I already have to build data-structures: standard XML or RDBMS interface tools.

As a programmer, I am searching for new useful balances between generality and constraint. And my strategy is to search for syntactic commonalities, while leaving the semantics uncommitted. This is the strategy that gave us the word-processor and spreadsheet and web.

What the SemWeb seems to be is an offer of an alternative approach. "Hey! Instead of searching for syntactic commonalities and constraints, let's start by searching for semantic ones. You can stick to using our very generic tools, but the value you add will be the extra semantic constraints you overlay."

The problem is, I don't know how to do this. And I don't think the SemWeb people have a very good idea of how to explain it. I suspect the current software development profession doesn't know how to find valuable general-application / automation sweet-spots based on semantic constraints. And so we don't know how to use the semantic web to build interesting software.

I guess the skills needed are pretty analogous to declarative / constraint based programming, which has always been hard to understand or teach. It pictures a world where the block-buster application isn't a spread-sheet (in virtue of capturing the common task of wanting to add up columns of numbers) but some sort of Prolog rule-base (in virtue of everyone using the same rules.)

The alternative thought is that the SemWeb doesn't have a place in its ecosystem for developers. That users are ultimately expected to do their own programming once the SemWeb general tools become accessible enough.

3 comments:

Frodo said...

I think it is your us-versus-them mentality that is holding you back. I used to hold exactly your opinion about RDF and over the years have turned a complete 180. Free your mind! I'll try to help you any way you wish. If you can provide more detail about your problem - not how you think RDF fits in with it - then I could see if RDF really does fit in, because if you can do it with relational modeling in an RDBMS then it is rather trivial to model it with RDF.

phil jones said...

Thanks Jimmy.

I am serious about this. I can send you an example of the very simple XML I'm using for my file which may give you an idea how I'm approaching my problem. What's the best way to get in touch?

My email : interstar@gmail.com

Frodo said...

Check your mail.