Labelling working group primer

The technical and policy difficulties of adding PICS labels to websites are discussed in detail in the paper delivered to the workshop on content labelling at WWW2004 (see [WWW]). The WWW workshop itself and subsequent discussions have been instructive in laying the groundwork for possible approaches.

One thing is plain: a solution based on the W3C’s Resource Description Framework (RDF) is more likely to succeed in gaining widespread acceptance than the current alternatives, certainly more so than PICS.

Conversations held with content providers make it plain that some sort of de-referencing system is required. That is, it should be possible to make everything point to the same central location, then deliver the appropriate label from there based on a single set of rules.

This is explored in the WWW2004 conference paper in which it is suggested that something like a Policy Reference File as defined for P3P could be used. Discussions with experts at the conference suggest that whilst that works as a human picture, it’s not quite the best way to proceed technically. The better solution is to have a number of “labels” (descriptions) and then use software to point to the right one for the resource in question. What follows attempts to show a possible implementation of this, beginning with the diagram below.

It’s worth emphasising that this is a possible starting point for further discussion and is not presented as a fait accompli.

The process step by step

Step 1

This step is easy: the user requests a page, normally through the browser. The only note of caution is that if the filter or other processor is not the browser, the requested URI may need to be captured during the GET request.

Step 2

The page is returned and includes a pointer to the location of the label/description. This could be done by means of a Link/REL tag within the page, but since the idea is that everything should point at the same location, irrespective of its eventual label, server configuration is the most efficient route.

This is a critical step as this is what we will be asking content providers to include in the headers of their content. This is the one-time server configuration job. An HTTP response header might be simply:

Link: ; /="/"; rel="meta"

or an HTML link might be:

Step 3

The client (filter or other device) should check whether it has a copy of the file at the designated location in its cache. If not, fetch it. If multiple pointers are included the client should collect all available information.
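The cache check described above can be sketched as follows. This is a minimal illustration rather than a proposed implementation: the class name MetadataCache, the injected fetch function and the example locations are all assumptions made for the sketch, and a real client would also need to honour HTTP cache expiry.

```python
from typing import Callable, Dict, Iterable, List


class MetadataCache:
    """Keeps one copy of each metadata document, keyed by its location.

    Hypothetical helper: `fetch` is any function that takes a location
    and returns the document text (for example a thin wrapper around an
    HTTP GET). It is injected so the sketch stays self-contained.
    """

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._documents: Dict[str, str] = {}

    def get(self, location: str) -> str:
        # Fetch only if there is no cached copy for this location.
        if location not in self._documents:
            self._documents[location] = self._fetch(location)
        return self._documents[location]

    def collect(self, locations: Iterable[str]) -> List[str]:
        # A page may carry several pointers; gather every document.
        return [self.get(location) for location in locations]
```

Because every page on a site points at the same location, the fetch happens once and each later page request is answered from the cache.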

The server should respond to requests for metadata by sending back a single document. That’s what servers are designed to do. The document could be either of two possibilities, depending on the content provider’s choice.

  1. An XML/RDF instance that includes a set of RDF descriptions plus rules for resolving which one to use for a given URI.

    This is the key option and is explored in detail in the example below.

    OR

  2. A simple RDF instance specifically about the URI. This would be used directly by the client and provides a way for content providers to effectively override/bypass the dynamic system of option 1, if used.

In the short term, the LWG needs to consider a third possible response by the server to offer a “fix for PICS.”

Step 4

If the metadata document returned is one that includes a set of RDF descriptions and rules for applying them, the client will need to do a little processing to extract the relevant data, but this will be a trivial task (see the example).

Step 5

The filter blocks or allows the content of the requested page based on a comparison of the relevant metadata and the user rules.
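As a rough sketch of that comparison, assume the label has been reduced to a mapping from descriptor name to 0 or 1, and that the user rules are simply a set of descriptors the user wants blocked. Both the representation and the descriptor names used here are hypothetical; they are not taken from the ICRA vocabulary.

```python
from typing import Dict, Set


def is_blocked(label: Dict[str, int], blocked_descriptors: Set[str]) -> bool:
    """Return True if the label declares any descriptor the user blocks.

    Assumption for this sketch: a value of 1 means the descriptor is
    declared for the resource, and a missing descriptor counts as 0.
    """
    return any(label.get(name, 0) == 1 for name in blocked_descriptors)


# Hypothetical label for a chat page and a user who blocks unmoderated chat.
chat_label = {"unmoderated_chat": 1, "violence": 0}
print(is_blocked(chat_label, {"unmoderated_chat"}))
```

How a filter should treat a resource with no label at all is a policy choice that this sketch does not address.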

Why this approach?

The approach outlined here uses RDF exactly as intended, with the notable exception of the first option in step 3, which is the critical one for this discussion. There are several reasons for promoting this:

  1. Configuring servers to point everything at the same resource is easy, likewise writing the same Link tag into page templates.
  2. Whilst it is possible to make the server point to a different description depending on the URI, this puts additional load on the server, which is always unpopular, especially as the information generated will not be used by everyone.
  3. It’s possible now to make servers issue the correct PICS label depending on which files are being served, but fewer content providers do so than ICRA would like. What’s possible is not always what’s wanted.
  4. ICRA would meet fierce resistance to the idea of installing any additional software on professional systems. This solution requires none, not even for the control of labelling policies.
  5. By supplying all the labels in a single document along with the rules for their application, the document can be served once, if requested, and cached.
  6. The server is simply asked to deliver a document – which servers are designed to do very efficiently.
  7. Processing only takes place by agents specifically wanting the information. Performance for other users is unaffected except for the additional bytes necessary to carry the pointer shown in step 2.

Example of how a single RDF/XML document might contain a set of labels and rules for their application

What follows is offered as a demonstration of a possible approach. It needs discussion and further thought but might act as a starting point. The listing passes as a well-formed XML document and an RDF description; it appears to be something a client could process easily, whether for child protection or broader purposes, but is it workable?

Imagine that the Link/HTTP response header from every resource on a server points to an RDF/XML file like this:[XML]



  
    contains
    "/chat/"
    rating2
  
  
    matches
    /.*\/filmreview\/action.*/
    rating3
  
  
    rating1
  

   

     
      The default label that declares 
      "None of the above" in all categories
      1
      1
      1
      1
      1
     

     
      The label for the 
      chat area of the site
      1
      0
      1
      1
      1
     

     
      A label for reviews of action films.
      There may or may not be nudity but violence and weapon 
      promotion is a given
      1
      1
      0
      1
      1
      1
      1
      1
      1
      1
      1
      1
      1
      1
     
   

The first part uses simple XML to declare a series of rules for applying the descriptions that are given in the second part. The correct label for the given URI would be extracted from the document using an XPath function. These have the general form

function [value 1, value 2]

For example the function “contains” returns true if the string in value 1 contains the string in value 2. We have all the necessary information available for this. Value 1 is always the URI we’re trying to find a label for and the argument element in the XML forms the string/expression for value 2.

The associated “use” element will point the client to the relevant ID within the RDF section of the document.

Rules can be tested sequentially until there is a match. If we get to the end and haven’t found a match, use the default label. So in this example, any URI that contains the string “/chat/” will be given rating2 – which declares unmoderated Chat according to the ICRA vocabulary (see [ICRA]).

Any URI that matches the regular expression /.*\/filmreview\/action.*/ will be given rating3. Anything else will get rating1.
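The sequential test can be sketched in a few lines. The sketch below is an illustration only: it assumes the rules have already been read out of the XML into (function, argument, use) tuples, it uses Python's re module as a stand-in for the XPath "matches" function (the two regular expression dialects are not identical), and the names Rule, applies and resolve_label are invented for the example.

```python
import re
from typing import List, Tuple

# (function, argument, use): one tuple per rule, in document order.
Rule = Tuple[str, str, str]

# The two rules from the example, with the default label kept separately.
RULES: List[Rule] = [
    ("contains", "/chat/", "rating2"),
    ("matches", r".*/filmreview/action.*", "rating3"),
]
DEFAULT_LABEL = "rating1"


def applies(function: str, uri: str, argument: str) -> bool:
    """Evaluate one rule: value 1 is the URI, value 2 is the argument."""
    if function == "contains":
        return argument in uri
    if function == "matches":
        return re.search(argument, uri) is not None
    raise ValueError(f"unsupported function: {function}")


def resolve_label(uri: str, rules: List[Rule] = RULES,
                  default: str = DEFAULT_LABEL) -> str:
    """Test the rules in order and return the ID of the first match."""
    for function, argument, use in rules:
        if applies(function, uri, argument):
            return use
    # No rule matched, so fall back to the default label.
    return default


print(resolve_label("http://www.example.org/chat/room1"))
```

Because the rules are tested in order, a URI that satisfies more than one rule receives the label of the first rule that matches, so the order in which a content provider lists the rules matters.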

Further investigation and discussion are required about whether XPath/XQuery is the way to go here, but it looks promising and has the advantage of being a recognised method.

A document such as this would be retrieved once, cached by the client, and referred to for each resource that pointed to it.

How would it work? – a content provider’s point of view

Adding Link tags or configuring servers to point to a central resource is a) very easy and b) something that only the content provider has the authority to do. So their first task is to make this happen – and it’s not a huge one, especially if we don’t expect everything to be labelled.

They then need to categorise their content and decide which descriptors apply to those categories. In the example above there were 3: the chat area, the action film reviews and “everything else.”

The single file that contains the labels and the rules for their application could be written manually, but doing so is not straightforward and we should provide a tool that makes it easy. The tool doesn’t need to communicate with the network of servers; it just needs to generate or edit a single text file. The tool will probably work on a local copy of the file that can then be uploaded in a simple operation to a single server.

Further discussion points

  1. Does everything have to be labelled? What would be the implications of making some exceptions, for example, script files and style sheets?
  2. Irrespective of the final proposal, how can backwards compatibility with PICS be achieved?

References and links

[WWW] WWW2004 paper is at http://www.icra.org/press/www2004/

[ICRA] Current ICRA vocabulary/codes is at http://www.icra.org/faq/decode/

[XML] Example XML/RDF file is available at http://www.icra.org/press/labellingWG/labelling.xml

Information about XML, RDF, XQuery etc, can all be found on the W3C website.

Independent working group members

WebHost Automation

Kingston Communications.

Blogwise.com.
