Processing RDF contentLabels

This document presents a flow chart describing a possible processing route for labels. The word “label” is used to mean the RDF class contentLabel as defined in the draft label ontology.

For processing purposes, labels fall into two categories:

  1. Generic, meaning labels that include application rules;
  2. Specific, meaning those without any application rules.

A key element in the label-processing client is the label list. This should be held in memory by the client to avoid repeated requests for the same labels. Labels should be stored long term, respecting time to live and expiry attributes (these are not specific to the label ontology but are general XML attributes).
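
The expiry check can be sketched as follows. This is a minimal illustration only: the class name CachedInstance, its fields, and the expression of time to live as a number of seconds are all assumptions, since this document does not say how those attributes are encoded.

```python
import time
from dataclasses import dataclass


@dataclass
class CachedInstance:
    """Hypothetical cache entry for one RDF instance."""
    url: str
    labels: list
    fetched_at: float   # seconds since the epoch when the RDF was fetched
    ttl_seconds: float  # assumed form of the time-to-live value

    def is_fresh(self, now=None):
        # An entry may be reused until its time to live has elapsed.
        now = time.time() if now is None else now
        return now < self.fetched_at + self.ttl_seconds


entry = CachedInstance("http://exampleA.org/labels.rdf", [],
                       fetched_at=1000.0, ttl_seconds=3600.0)
```

A client following this sketch would re-fetch the RDF instance whenever is_fresh() returns False.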

Not only the label(s) but also the URL of the RDF instance containing them should be stored. The label list will therefore be of the form shown in Figure 1 below.

URL of RDF instance                 Label(s)
http://exampleA.org/labels.rdf      3 generic labels
http://exampleB.org/labels.rdf      4 generic labels
http://exampleC.org/labels.rdf#1    1 specific label

Figure 1 The general structure of the label list

In the examples shown, http://exampleA.org/labels.rdf is the URL of an RDF instance containing three generic labels, and http://exampleB.org/labels.rdf points to four generic labels. However, http://exampleC.org/labels.rdf#1 points to a specific label. The fragment at the end of the URL gives a hint that the label being pointed to is specific, but it is the absence of any application rules that defines a label as specific. If application rules are found in the label, those rules must still be respected even if the URL of the link that pointed to them contains a fragment identifier.
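
One possible in-memory shape for the label list is sketched below. The Label class, its fields, the label names, and the representation of an application rule as a (type, value) pair are hypothetical and are not taken from the label ontology. The point illustrated is that a label is classed as specific by the absence of rules, not by the fragment in its URL.

```python
from dataclasses import dataclass, field


@dataclass
class Label:
    """Hypothetical stand-in for a parsed contentLabel."""
    name: str
    application_rules: list = field(default_factory=list)

    def is_specific(self):
        # Specific means "no application rules"; a fragment identifier
        # in the linking URL is only a hint.
        return not self.application_rules


# Maps the URL of an RDF instance to its labels, kept in the order
# in which they appear in that instance.
label_list = {
    "http://exampleA.org/labels.rdf": [
        Label("A1", [("contains", "exampleA.org/one")]),
        Label("A2", [("contains", "exampleA.org/two")]),
        Label("A3", [("contains", "exampleA.org")]),
    ],
    "http://exampleC.org/labels.rdf#1": [Label("C1")],
}
```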

This allows content providers to include application rules that limit the scope of their labels. If the fragment identifier itself were taken to mean that the label pointed to was specific, and the client therefore ignored application rules, any content provider could point to any label on any domain. Whilst there is no reason to make this impossible, it should also be possible for content providers to restrict their labels to specified domains if they wish to do so.

There may be some benefit in defining a term in the label ontology that declares a label to be specific or generic. This would be analogous to the generic true|false flag in PICS and may improve processing efficiency.

The order in which the labels retrieved from a given URL are stored must be the same as in the original RDF instance. The order in which the sets of labels are stored (and therefore parsed) is, however, unimportant. In other words, the order of entries within the right-hand column of each row in Figure 1 should be preserved, but the order in which the rows are processed is arbitrary. Content providers should be aware of this if they choose to make their labels available in more than one RDF instance.

Part 1: From initial request to conditional fetching of the resource

Figure 2 From initial request to conditional fetch

Some processing is possible before the resource is fetched, since the client may already hold a generic label that blocks it. This is an important opportunity to save time.

The currentLabel is a variable that is always null at the beginning of each HTTP request. Its value may be set if the requested URL matches an application rule in the label list. However, it may subsequently be overwritten by later processing.
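
The pre-fetch check might look like the sketch below. It assumes labels are held as (name, rules) pairs, that only the rule types beginsWith and contains exist, and that the first match found is used; the function names are illustrative, and the details of Figure 2 may differ.

```python
def rules_match(url, rules):
    """True if the URL satisfies every (type, value) application rule."""
    tests = {"beginsWith": url.startswith, "contains": lambda v: v in url}
    return all(tests[kind](value) for kind, value in rules)


def match_before_fetch(requested_url, label_list):
    """Return the first cached generic label that applies, else None."""
    for labels in label_list.values():   # row order is unimportant
        for name, rules in labels:       # order within a row is preserved
            # Labels without rules are specific and cannot match here.
            if rules and rules_match(requested_url, rules):
                return name
    return None


label_list = {
    "http://exampleA.org/labels.rdf": [
        ("A1", [("contains", "exampleA.org/adult")]),
        ("A2", [("contains", "exampleA.org")]),
    ],
    "http://exampleC.org/labels.rdf#1": [("C1", [])],
}

# currentLabel starts as None for every request.
current_label = match_before_fetch("http://exampleA.org/news", label_list)
```

If rows from more than one RDF instance match the same URL, the result of this sketch depends on row order, which is one reason content providers need to take care when spreading labels across several instances.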

Part 2: Selecting the correct label once the resource has been fetched

When a resource is fetched from the web, the client makes a note of all link tags, whether in the (X)HTML head section or the HTTP response headers, that point to RDF instances. The client should then work through that list of retrieved links to RDF in the order in which they were received.

If the link is already in the label list and is to a specific label, the currentLabel should be updated. If the link is to a generic label or set of generic labels, the client should work through the labels associated with that URL. The first match should cause the currentLabel to be updated.

If the link is not held in the cache, the RDF to which it points must be fetched. Clients should note the URLs of RDF instances whose data is not relevant to their purpose so that they can avoid fetching unnecessary data more than once. The “role” attribute proposed for XHTML 2.0 is likely to be relevant on this point in future and may mean that some RDF instances can be ignored without being fetched.

Again, if a link points to a specific label, the currentLabel should be updated and the URL/label pair added to the label list. If the link is to a generic label or set of generic labels, the client should work through the labels associated with that URL. The first match should cause the currentLabel to be updated.

It is possible that a single RDF instance will contain several labels, both specific and generic. The client should store these in the label list after constructing the correct linkage URL. For specific labels, the URL should end with the relevant fragment identifier; for generic labels, there should be no fragment identifier. Such caching of labels should not affect the currentLabel.
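
The construction of linkage URLs can be sketched as follows. This assumes each label carries an identifier usable as a fragment and is held as an (identifier, rules) pair; cache_instance is a hypothetical helper, not part of any defined API.

```python
from urllib.parse import urldefrag


def cache_instance(instance_url, labels, label_list):
    """Store the labels of one RDF instance under their linkage URLs."""
    base, _fragment = urldefrag(instance_url)  # drop any fragment from the link
    for label_id, rules in labels:             # document order is preserved
        if rules:
            # Generic labels share the URL of the instance, no fragment.
            key = base
        else:
            # Specific labels are stored under URL#identifier.
            key = f"{base}#{label_id}"
        label_list.setdefault(key, []).append((label_id, rules))


label_list = {}
cache_instance(
    "http://exampleA.org/labels.rdf#2",
    [("1", [("contains", "exampleA.org")]),
     ("2", []),
     ("3", [("beginsWith", "http://")])],
    label_list,
)
```

Note that the function only fills the label list; in keeping with the text, it never touches the currentLabel.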

Each link to an RDF instance triggers a turn through the processing loop. Each successive turn may overwrite the currentLabel. Ultimately it is for the content provider to ensure that the correct label is applied to the content by following the rules set out here.
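
The processing loop can be sketched roughly as below. This is a simplification under stated assumptions: labels are (name, rules) pairs, a label with no rules always applies, fetch_rdf is a hypothetical callable that returns the labels a link points to, and fetched labels are stored under the link URL exactly as given.

```python
def rules_match(url, rules):
    """True if every (type, value) rule holds; vacuously true with no rules."""
    tests = {"beginsWith": url.startswith, "contains": lambda v: v in url}
    return all(tests[kind](value) for kind, value in rules)


def select_label(requested_url, links, label_list, fetch_rdf, current_label=None):
    """Work through the links in order; later links may overwrite earlier ones."""
    for link in links:
        if link not in label_list:
            label_list[link] = fetch_rdf(link)  # fetch once, then reuse
        for name, rules in label_list[link]:
            if rules_match(requested_url, rules):
                current_label = name            # first match for this link
                break
    return current_label


remote = {
    "http://exampleB.org/labels.rdf": [("B1", [("contains", "exampleB.org")])],
    "http://exampleC.org/labels.rdf#1": [("C1", [])],
}
fetched = []


def fetch_rdf(link):
    fetched.append(link)
    return remote[link]


label_list = {}
links = ["http://exampleB.org/labels.rdf", "http://exampleC.org/labels.rdf#1"]
result = select_label("http://exampleB.org/page", links, label_list, fetch_rdf)
```

Here both links apply to the requested URL, so the label from the second link overwrites the one set by the first.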

Figure 3 Main processing stage. NB: see text for important detail.

Part 3: Applying the current label

The final stage of the process is straightforward.

Figure 4 Applying the currentLabel

If there is a value for the currentLabel, it should be applied to the requested URL, i.e. the resource should be allowed or blocked in accordance with the user’s rules.

If there is no value for the currentLabel, ICRA makes a distinction based on the MIME type of the resource. Client designers are invited to offer users an option that blocks access to “unlabelled web pages.” This option to block or allow should be applied to all unlabelled resources with the MIME type text/html. For other MIME types, the expectation is that the resource will be allowed by default unless the user has chosen to block all unlabelled web resources.

This does NOT mean that resources with MIME types other than text/html should always be allowed since it is perfectly possible to label all types of content. If a label is present it should be respected, irrespective of the resource’s MIME type.
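
The decision described in this part can be sketched as a single function. The function name, the two option flags, and the user_allows callable standing in for the user's rules are all hypothetical.

```python
def allow_resource(current_label, mime_type, user_allows,
                   block_unlabelled_pages=False, block_all_unlabelled=False):
    """Return True to allow the resource, False to block it."""
    if current_label is not None:
        # A label is respected whatever the MIME type.
        return user_allows(current_label)
    if block_all_unlabelled:
        return False
    # Ignore parameters such as "; charset=utf-8".
    media_type = mime_type.split(";")[0].strip().lower()
    if media_type == "text/html":
        return not block_unlabelled_pages
    # Other unlabelled MIME types are allowed by default.
    return True


def user_allows(label):
    # Hypothetical user rule: block anything labelled "adult".
    return label != "adult"
```

For example, with block_unlabelled_pages set, an unlabelled text/html page is blocked while an unlabelled image is still allowed, and a labelled image is judged by the user's rules alone.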

Processing Application Rules

The draft label ontology allows a single contentLabel to include multiple applicationRules. For example, these two rules:

beginsWith “http://”
contains “example.org”

are likely to be commonplace since together they cover both http://www.example.org and http://example.org. A client should simply AND multiple applicationRules so that all conditions must be satisfied for a label to be applicable to a given URL.
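
A minimal sketch of ANDing application rules follows. It assumes rules are held as (type, value) pairs and handles only the two rule types shown above; label_applies is an illustrative name.

```python
def label_applies(url, rules):
    """True only if the URL satisfies every application rule."""
    for kind, value in rules:
        if kind == "beginsWith":
            ok = url.startswith(value)
        elif kind == "contains":
            ok = value in url
        else:
            raise ValueError(f"unsupported rule type: {kind}")
        if not ok:
            return False  # one failed condition is enough to reject
    return True


rules = [("beginsWith", "http://"), ("contains", "example.org")]
```

With these rules, both http://www.example.org/ and http://example.org/ are matched, while https://example.org/ is rejected because it fails the beginsWith condition.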