Outcomes of technical meeting held 9th July 2004

The meeting was well attended by members of the LWG and other interested parties, namely:

Apologies received from:

Statement of intent

The meeting agreed to work to define “A standard metadata framework to support child protection, selection and display of content across web technologies, business to business channels, peer to peer and content production.”

Requirements

The meeting discussed requirements for such a system, and agreed on a number of desirable features including:

support hierarchical labels, meaning that a single label should be able to cover all resources within a defined domain, subdomain, path etc.
be optimized for high volume and low bandwidth.
allow a standard vocabulary to be extended in a standard way
support cascading labels, meaning that a label applied at domain level can be overridden at document level.
be applicable to broader content metadata, i.e.
1. be usable in a machine-machine environment
2. support personalization
3. support discovery/notification needs
4. increase the visibility of the metadata and the resources it describes.
5. be applicable to RSS feeds
support third party label bureaux/web services
be easy to use in a variety of contexts, such as:
1. multi-user environments such as blogwise.com
2. through server configuration
3. content creation/manipulation/delivery tools
4. amateur websites
5. filtering software
support many delivery methods, such as streaming media, EPG and TV in the home (NB: Timed Text is likely to be an important technology here).
be extensible by any industry, including the adult industry, to meet its requirements.
be functional within versions of HTML earlier than 4.01, even if this technically breaks the Doc Type.

RDF/XML

The meeting agreed that RDF, with its serialization in XML, was the best available technology to achieve this. Many requirements are already met by RDF; however, a new layer would be necessary to meet all the requirements in full.

Generalized labelling schema

Any labelling schema would need to support a common set of attributes including:

provenance (who created the label – a key issue in RDF)
time to live (how long a description can be cached)
creation date
expiry date
track-back notifications
times of day during which the label should be applied
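
As an illustration only, the attributes above could be carried alongside each label and checked by a filter roughly as follows. This is a minimal sketch: the class name, field names and functions are assumptions made for the example and are not part of any schema agreed at the meeting. Track-back notifications are left out.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional, Tuple


@dataclass
class LabelMeta:
    # All field names are hypothetical, chosen to mirror the attribute list.
    creator: str                        # provenance: who created the label
    created: datetime                   # creation date
    expires: Optional[datetime]         # expiry date (None means no expiry)
    ttl: timedelta                      # time to live of a cached copy
    active_hours: Optional[Tuple[time, time]] = None  # times of day the label applies


def cache_is_fresh(meta: LabelMeta, fetched_at: datetime, now: datetime) -> bool:
    """A cached description may be reused until its time to live has elapsed."""
    return now - fetched_at <= meta.ttl


def label_applies(meta: LabelMeta, now: datetime) -> bool:
    """The label applies if it has not expired and 'now' is inside its time window."""
    if meta.expires is not None and now >= meta.expires:
        return False
    if meta.active_hours is None:
        return True
    start, end = meta.active_hours
    t = now.time()
    if start <= end:
        return start <= t < end
    # The window wraps past midnight, e.g. 21:00 to 06:00.
    return t >= start or t < end
```

A label whose window is 21:00 to 06:00 would then apply at 23:00 and at 05:00 but not at midday.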

Solutions

The meeting discussed various possible solutions, two of which it was agreed should be tested.

Candidate solution 1

When labelling a set of resources, an RDF instance should be created that includes one or more labels, each identified by a unique ID, as shown below.

The label for the chat area of the site, see www.icra.org/decode/ for explanation of rating Chatrooms on this site are unmoderated. They may, but are not known to, contain potentially offensive language 1 0 1 1 1
A label for reviews of action films. There may or may not be nudity but violence and weapon promotion is a given, see www.icra.org/decode/ for explanation of rating codes Reviews of action films are likely to contain descriptions of violence and to promote weapon use 1 1 0 1 1 1 1 1
A label for video clips Film clips on this site have been chosen so as not to contain any potentially offensive material 1 1 1 1 1 U 1 1 1 1 1

Listing 1: Multiple labels in a single RDF instance, each identified by a unique ID

Either through an HTTP Response Header or an HTML Link tag, resources then point to the appropriate label. An example link tag is shown below:

In RDF terms the graph is:

Figure 1 RDF graph for the relationship between a resource and #r2

This solution has the following features:

Each resource carries a pointer which points unambiguously to a label. To apply “r2” to any resource, for example, simply include a pointer to it. Similar pointers would point to rating 1, 3 etc.
As the label is held separately, it only needs to be retrieved once and cached. Once this has been done, the pointer will effectively be treated as the label by filters.
Which resources point to which labels is controlled by the server configuration, content management system or manually by the webmaster.
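
The "retrieve once, then cache" behaviour described above might look like this inside a filter. It is a sketch under stated assumptions: the pointer is taken to be a URL whose fragment identifies the label, and the URL, the IDs other than "r2", the label texts and the fake_fetch function are all invented for the example.

```python
from urllib.parse import urldefrag


def fake_fetch(doc_url):
    """Hypothetical stand-in for downloading and parsing an RDF instance.

    Returns a mapping of label ID to label; the contents are illustrative.
    """
    return {
        "r1": "chat label",
        "r2": "action film review label",
        "r3": "video clip label",
    }


class LabelCache:
    """Resolves pointers such as '<document URL>#r2', fetching each document once."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable: document URL -> {label ID: label}
        self._docs = {}          # cache of already retrieved RDF instances
        self.fetch_count = 0     # number of real retrievals, for illustration

    def resolve(self, pointer):
        doc_url, label_id = urldefrag(pointer)
        if doc_url not in self._docs:
            self._docs[doc_url] = self._fetch(doc_url)
            self.fetch_count += 1
        # Returns None if the document holds no label with that ID.
        return self._docs[doc_url].get(label_id)
```

After the first call, any resource pointing into the same RDF instance is resolved from the cache, so the pointer alone is enough for the filter to act on.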

Candidate solution 2

This is an extension of candidate solution 1 and is NOT a replacement for it. The two can work side by side, each likely to find favour in different situations.

The key feature of this approach is that the RDF instance includes both the labels and the rules for their application. All resources, irrespective of how they should be labelled, can now point to the same location.

Listing 2 is a repeat of Listing 1 except that each label includes a new term, such as “/chat/”. A filter would match the URI of the resource against the patterns given in each label – in more technical terms, it would query the data – and then apply the first label for which a match was found.

The label for the chat area of the site, see www.icra.org/decode/ for explanation of rating Chatrooms on this site are unmoderated. They may, but are not known to, contain potentially offensive language “/chat/” 1 0 1 1 1
A label for reviews of action films. There may or may not be nudity but violence and weapon promotion is a given, see www.icra.org/decode/ for explanation of rating codes Reviews of action films are likely to contain descriptions of violence and to promote weapon use “/filmreview/action/” 1 1 0 1 1 1 1 1
A label for video clips Film clips on this site have been chosen so as not to contain any potentially offensive material “/videoclips/” 1 1 1 1 1 U /.*/ 1 1 1 1 1 1

Listing 2 Labels and rules in a single RDF instance
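
The first-match rule can be sketched as below. The patterns are those that appear in Listing 2; the label names, the decision to match against the path only, and the use of regular-expression search are assumptions made for the example, since the meeting did not fix the matching semantics.

```python
import re
from urllib.parse import urlparse

# Ordered (pattern, label name) pairs. The label names are hypothetical.
RULES = [
    (r"/chat/", "chat"),
    (r"/filmreview/action/", "action-film-reviews"),
    (r"/videoclips/", "video-clips"),
    (r".*", "default"),   # catch-all rule, so it must come last
]


def label_for(uri, rules=RULES):
    """Return the name of the first label whose pattern matches the URI path."""
    path = urlparse(uri).path
    for pattern, label_name in rules:
        if re.search(pattern, path):
            return label_name   # stop at the first match and go no further
    return None
```

Because evaluation stops at the first match, the order of the rules matters: a path such as /videoclips/chat/ would receive the chat label here, and the catch-all pattern only applies when nothing earlier has matched.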

All resources would include either an HTTP response header or a link tag similar to this:

Features of this approach

It allows a single individual to take responsibility for labelling content, even if they do not control it directly. This is important for large organisations with widely dispersed personnel and infrastructure.
The “labelling operator” can create rules that say “everything on this website should have rating 1 except URLs beginning with A, containing B, ending with C or matching pattern D.” They just need access to a single file, not the whole network.
Once the file has been downloaded, a filter can decide whether to block or allow a resource without fetching it. This can lead to an important increase in speed.
It is very easy to configure servers, a CMS or a web page template to include the same Link tag or HTTP response header with all resources.

Other points discussed

Conflicting descriptions are the area with the most potential for problems. If one RDF description says that a resource has the colour Red whilst another RDF description says that the same resource has the colour Green, for our purposes we need a set of rules that gives a definite answer to the question “what colour is it?”

Requirement 4 (cascading labels) suggests that it should be the “last value given”. However, the RDF community at present would be just as likely to say that the resource was “reddy-green.”

The notion of a default description for a domain, and of working through an RDF instance like Listing 2 until a match is found and then going no further, is well understood in many other areas of computer science but has not yet been applied to RDF.
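
The difference between the two behaviours can be shown with a small sketch. Descriptions are modelled here as plain dictionaries rather than RDF graphs, and both function names are invented for the example; it illustrates the policy question and is not a proposal for how RDF tools should behave.

```python
def rdf_style_merge(*descriptions):
    """Keep every stated value: conflicting statements simply accumulate."""
    merged = {}
    for description in descriptions:
        for prop, value in description.items():
            merged.setdefault(prop, set()).add(value)
    return merged


def last_value_wins(*descriptions):
    """Cascading policy: a later description overrides an earlier one."""
    merged = {}
    for description in descriptions:
        merged.update(description)
    return merged
```

Given one description saying the colour is Red followed by another saying it is Green, the first function reports both colours, whereas the second gives the single answer Green, which is the kind of definite result a filter needs.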

XHTML 2.0

Current proposals for XHTML 2.0 might include further very useful methods of pointing to labels. The ability to include an about attribute within the link tag would make it possible to offer a set of link tags that used the same RDF instance to label resources other than those carrying the pointers. For example, an HTML page might include a link tag that labels an image thus:

The same proposals suggest the addition of new attributes to all tags, such as property and resource. These do not directly facilitate the labelling of, for example, hyperlinks or images, by adding a pointer to a label from within their tag. That would need the addition of an attribute such as meta. Whilst this might be a very useful addition to HTML for our purposes, it would require significant backing and the timetable for XHTML 2.0 is far from clear. Therefore it was decided not to make this approach part of the short term work.

Another idea being discussed for XHTML 2.0 is new rules for meta tags. Again, it was decided to leave these out of the present discussion but the significant future potential is noted.

Banner ads and third party content

Banner ads pose a particular problem. This is because whilst the request is made to a known domain, the reply – the ad – may come from any domain.

There does not appear to be an easy answer to this, except, of course, that banner ads should be labelled. The portal operators at the meeting suggested that the problem is manageable so long as a filter set to block unlabelled pages would block only the banner ad (there would be time enough to contact the online ad-agency without losing impressions on the primary page). This is the case with ICRAplus now, for example. If a labelled page has unlabelled content, only the unlabelled content is blocked from an otherwise complete page. The bigger problem would be if a single piece of unlabelled content were to block the whole page but this does not obtain.
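
The behaviour the portal operators described, for a filter set to block unlabelled content, can be sketched as follows. The function name, the data shapes and the example URLs are assumptions made for illustration and do not describe how ICRAplus is implemented.

```python
from typing import List, Optional, Tuple


def filter_page(
    page_label: Optional[str],
    components: List[Tuple[str, Optional[str]]],
) -> Tuple[bool, List[str]]:
    """Decide what to show when unlabelled content is to be blocked.

    components is a list of (URL, label or None) for embedded content
    such as images and banner ads. Returns (show_page, blocked_urls).
    """
    if page_label is None:
        # The primary page itself is unlabelled, so nothing is shown.
        return False, []
    # A labelled page is shown; only its unlabelled components are blocked.
    blocked = [url for url, label in components if label is None]
    return True, blocked
```

With a labelled page containing one unlabelled banner ad, the page is shown and only the ad is withheld.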

Where are labels applied?

The discussion of banner ads, the labelling of individual tags and the requirement that labels can be processed in a machine-machine environment all raise the more general question of where in the chain labels should be applied. The most efficient and easiest point at which labels can be added is usually at the point of content creation, but this may not hold true or be possible in all cases.

Test cases

Files that provide test cases for the two candidate solutions have been created and can be seen at www.icra.org/RDF/testcases/.

Next steps

Members of the labelling working group and others are now invited to test the ideas discussed above. The outcome of those tests will inform the future direction of the work to be undertaken.
