Processing rules for ICRA RDF labels

Introduction

The ICRA system is designed to allow providers to label their content quickly, efficiently and flexibly. In order to ensure predictable results for users, the way in which different filters process labels must be consistent. The recommended processing rules are set out below to enable content providers to assess the most convenient method of labelling their own material.¹

A reference module and more detailed technical information are also available for filter manufacturers, including details of how self-labelling data can be used alongside third-party data sources.

For a given resource (a page, an image etc.) there are three possible sources of a label².

It may be possible to deduce a label by processing data already held in the filter's cache (memory). Such labels are referred to below as Type 1.
A resource may include a link to data that contains rules that can be followed to identify a label. Labels identified in data sources linked from the resource itself are referred to below as labels of Type 2.
A resource may include a direct link to a label. Labels identified by a direct link from a resource are referred to below as labels of Type 3.

As detailed below, filters SHOULD assign increasing priority to each of these sources.
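The ordering of the three sources can be sketched as a small enumeration. This is only an illustration: the names `LabelType`, `CACHE`, `LINKED_DATA` and `DIRECT_LINK` are hypothetical and are not taken from the ICRA reference module.

```python
from enum import IntEnum


class LabelType(IntEnum):
    """Label sources; a larger value means a higher priority (hypothetical names)."""

    CACHE = 1        # Type 1: deduced from data already held in the filter's cache
    LINKED_DATA = 2  # Type 2: identified via a data source linked from the resource
    DIRECT_LINK = 3  # Type 3: linked directly from the resource


def preferred(a: LabelType, b: LabelType) -> LabelType:
    """Return whichever of two label types carries the higher priority."""
    return max(a, b)
```

Using an integer-backed enumeration means ordinary comparisons express the priority rule directly.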

Before retrieving the resource

Figure 1: Flow diagram of the processing to be carried out prior to any request to the internet.

If the filter already has a label for the requested URL in its cache, a Type 1 label is immediately available. If that label was originally retrieved from the same site as the requested URL, it SHOULD be treated as a Type 2 label.

If the label data in cache was retrieved from a different website, it remains a Type 1 and the resource should be fetched and checked for further data.
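A minimal sketch of this classification follows. It assumes that "same site" can be approximated by comparing host names, which is an assumption made for illustration and not a definition taken from the rules. The function name `cached_label_type` is hypothetical.

```python
from urllib.parse import urlsplit


def cached_label_type(requested_url: str, label_source_url: str) -> int:
    """Classify a cached label as Type 1 or Type 2.

    Assumption: two URLs belong to the same site when their host names match.
    """
    requested_host = urlsplit(requested_url).hostname
    source_host = urlsplit(label_source_url).hostname
    # A URL without a host can never be shown to share a site with another.
    if requested_host is None or source_host is None:
        return 1
    return 2 if requested_host == source_host else 1
```

For example, a label cached from `http://example.org/labels.rdf` would be treated as Type 2 for `http://example.org/page.html` but as Type 1 for a page on any other host.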

For clarity:

A Type 1 label MUST NOT be used to block access to a URL before it has been fetched.
A Type 2 label MAY be used to block access to a URL before it has been fetched.

There are several reasons for this, but they boil down to the idea that labels linked from a resource are "closer" to the content provider than labels that may have been published by someone with little or no connection with the described content. This is extended in the next section, where further priority is given to labels that are linked from the resource itself.

Type 1 labels should not be confused with third-party labelling. If a filter is configured to request labels from a third-party source, such as an online database or a content analyzer, the filter will handle that data separately. Type 2 labels take precedence over Type 1 labels purely in the context of self-labelling.

If there is no data in the filter's cache, the resource at the URL MUST be fetched.
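The rules for this stage can be sketched as a single decision function. The function name, its parameters and the string results are hypothetical. `label_rejected` stands for "the user's settings would block content described by this label", a check the rules above do not define.

```python
from typing import Optional


def prefetch_action(cached_type: Optional[int], label_rejected: bool) -> str:
    """Decide what to do before any request is made to the internet.

    cached_type is None when the cache holds no label, otherwise 1 or 2.
    """
    if cached_type is None:
        return "fetch"  # no cached data: the resource MUST be fetched
    if cached_type == 2 and label_rejected:
        return "block"  # a Type 2 label MAY block access before fetching
    # A Type 1 label MUST NOT block before fetching, so the resource is
    # fetched and checked for further data.
    return "fetch"
```

This sketch chooses to block whenever a Type 2 label permits it. Since the rule is only a MAY, a filter could equally fetch the resource first.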

Figure 2: Flow diagram of the processing rules for identifying the correct label after the resource has been fetched.

Identifying the correct label

If the resource includes links to label data, it may be necessary to fetch and process that data. (Remember that labels are always held separately, never actually within the resource itself.)

If the resource includes a link to a specific label, this is classed as a Type 3. Since this is the highest priority in the hierarchy, once a Type 3 label is available, no further processing is necessary to identify the correct label to use for this resource.

However, clients SHOULD check any host restrictions. Clearly a label should only be recognised as valid if the resource pointing to it is from the declared host(s). If no host restrictions are declared, a client MAY accept the label.
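A host restriction check might look like the sketch below. It assumes that restrictions are available as a plain list of host names and that an exact, case-insensitive match is sufficient. Both are assumptions made for illustration, and the function name `label_valid_for` is hypothetical.

```python
from typing import Iterable
from urllib.parse import urlsplit


def label_valid_for(resource_url: str, declared_hosts: Iterable[str]) -> bool:
    """Check whether a label may be applied to the resource that links to it."""
    allowed = {host.lower() for host in declared_hosts}
    if not allowed:
        return True  # no restrictions declared: the client MAY accept the label
    resource_host = urlsplit(resource_url).hostname  # lower-cased by urlsplit
    return resource_host is not None and resource_host in allowed
```

A label restricted to `example.org` would then be rejected when it is linked from a page on `other.example.net`.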

The priority given to Type 3 labels is the crucial step that allows a content provider to work with the notion of a default label with local overrides.

If the resource carries a link to the same data source that has already been processed to identify a Type 2 label, clearly no further processing is necessary; the correct label has already been identified.

However, if a link points to a different data source from the one already used to derive a Type 2 label, the new data SHOULD be processed. This is because it is possible to include any number of data files on a site and it can be assumed that the one linked to from a given resource is the one the content provider intended to be used.

If no links are present in the resource then clearly the only information available is that which was available before the resource was fetched.

If multiple labels of the same Type are available, this is an error on the part of the content provider. The filter MAY use any of them but, purely for reasons of efficiency, will normally simply use the first one found of a given type.
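Selection among the available labels can be sketched as follows: the highest Type wins and, within a Type, the first label found is kept. The `(label_type, label)` tuple layout and the function name `select_label` are hypothetical.

```python
from typing import List, Optional, Tuple


def select_label(candidates: List[Tuple[int, str]]) -> Optional[Tuple[int, str]]:
    """Pick the label to use from (label_type, label) pairs in discovery order."""
    best: Optional[Tuple[int, str]] = None
    for label_type, label in candidates:
        # The strict comparison keeps the first label found of a given Type.
        if best is None or label_type > best[0]:
            best = (label_type, label)
    return best
```

In practice a filter could stop searching as soon as a Type 3 label is found, since no further processing is needed at that point.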

Acting on the label

At this stage, the filter has checked whether a label is available, and, if there are multiple labels, has selected the correct one.

If there is no label available for a given resource, ICRA recommends that, by default, the filter allows it except when it is an (X)HTML web page. Whether unlabelled (X)HTML web pages are blocked or allowed should be under user control.

The reasoning behind this is that if a page is labelled, the author probably intended the label to cover all the elements within the page. They may even be unaware that for each image, external script file and style sheet, a separate request is made to the internet. It is less usual for an image to be accessed directly, i.e. without being displayed within a page or without the user finding the image by following a link from an (X)HTML page.

Finally, whilst it is easy for all webmasters to include link tags in an (X)HTML page, linking something like an image to a data source requires the server to be configured. This is generally the preserve of the professional.
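The recommended default for unlabelled resources can be sketched as below. Recognising an (X)HTML page by its media type is an assumption made for illustration. The function name and the `block_unlabelled_pages` setting are hypothetical.

```python
def allow_unlabelled(content_type: str, block_unlabelled_pages: bool) -> bool:
    """Decide whether a resource with no label should be allowed."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    is_page = media_type in {"text/html", "application/xhtml+xml"}
    if is_page:
        return not block_unlabelled_pages  # left under user control
    return True  # images, scripts, style sheets etc. are allowed by default
```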
