Quatro – a metadata platform for trustmarks

The Quatro project has applied semantic web technologies to trustmark schemes and quality labels. Drawing on past and original research, the project has defined a vocabulary that can be used by any trustmark scheme (TMS) and a technical platform to deliver the trustmarks in a format that can be processed by semantic web agents.

Trustmark schemes have been established in many parts of the world. Some are online versions of existing schemes; others have been developed specifically for the web. Two notable areas of interest are trustmarks designed to give consumers confidence in eCommerce operations and those that indicate that medical information has been peer reviewed. Operators of both types of TMS are among the partners in the Quatro project¹.

In all cases encountered, the model is essentially the same: a website is submitted for review by the TMS. If the site meets the TMS criteria it is allowed to show a logo. If a user clicks on the logo, a database is interrogated and the current record for that site is displayed, usually showing information such as the date on which the site was last reviewed. Despite the presence of a hyperlink that links to a database record, trustmarks are designed solely to be read by humans and not machines. As a result of Quatro, they will be available to both.

A significant amount of research has been done into trustmarks, particularly in Europe². It has focussed on how trustmark schemes operate and what benefits they confer on users and on the websites carrying them. One such project in 2001³ produced a list of criteria that any trustmark scheme would be likely to use when assessing a website. Quatro has used that list as a starting point to create a vocabulary.

At the time of writing, the details are being finalised but the vocabulary is likely to include elements such as:

Uses clear and intelligible language
Claims are substantiated
Publishes Contact Information
Responds to queries
Does not encourage children to enter inappropriate websites

The complete vocabulary will be made available on the Quatro project website⁴ and elsewhere, both as a plain text document and an RDF schema. The plain text version is due end April/early May 2005 with the RDF schema shortly after that. It will be available for free usage by any trustmark scheme as they see fit.

Trustmark schemes will, of course, continue to devise their own criteria. However, where those criteria are equivalent to those in the Quatro schema, use of common elements offers some distinct advantages.

Firstly, a trustmark that is machine readable and uses common descriptors will be interpreted more easily by semantic web tools than one that uses purely proprietary elements. If a user agent is configured to look for Trustmark A but finds a site that is accredited by Trustmark B, at least the common elements will be recognised, even if those specific to Trustmark B are not. The incentive for content providers to gain accreditation for their material is therefore enhanced if the TMS uses at least some of the common descriptor set.

Secondly, a common set of elements makes it possible to apply machine-learning techniques to the difficult area of ensuring that an accredited site continues to meet the TMS criteria. A machine cannot tell whether an e-mail sent to an eCommerce operator will be responded to within a given time, but it can detect that a contact route is still provided six months after the site was last reviewed by a human, even if the nature of the contact route changes.

For example, a site may offer a simple mailto link for contact but subsequently change this to a web form. Content analysis by machine learning will continue to recognise this as a contact route. Likewise, a document that is properly referenced is relatively easy for a machine to identify. If a TMS includes the criterion that all medical documents are properly referenced and a new medical document is added without such references, it can be detected and the TMS alerted that the site needs re-checking.
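
The contact-route check can be illustrated with a deliberately simple heuristic. The sketch below is not machine learning and is not the Quatro implementation; it is an assumed stand-in that treats a mailto link, or a form containing a free-text area, as evidence of a contact route. The class name, function name and sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class ContactRouteDetector(HTMLParser):
    """Records which kinds of contact route appear in a page (heuristic only)."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.routes = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and (attributes.get("href") or "").lower().startswith("mailto:"):
            self.routes.add("mailto")
        elif tag == "form":
            self.in_form = True
        elif tag == "textarea" and self.in_form:
            # Assumption: a form with a free-text area is a contact form.
            self.routes.add("form")

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def contact_routes(html):
    """Return the set of contact route kinds found in the given HTML."""
    detector = ContactRouteDetector()
    detector.feed(html)
    return detector.routes


# The same site at review time and six months later (hypothetical markup).
at_review = '<p><a href="mailto:info@example.org">Contact us</a></p>'
later = ('<form action="/contact" method="post">'
         '<textarea name="message"></textarea><input type="submit"></form>')

for snapshot in (at_review, later):
    routes = contact_routes(snapshot)
    print("contact route present:", bool(routes), sorted(routes))
```

A monitoring agent built along these lines would alert the TMS only when the set of routes becomes empty, not when the route merely changes form.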

On both counts the use of a common vocabulary offers commercial advantages to trustmark scheme operators by increasing the value of the labels for content providers and end-users.

In its simplest form, a trustmark would be a series of elements encoded in much the same way as any other metadata. However, a trustmark will generally apply not to a single resource but to a group of resources, such as all those on a particular website. This presents a problem for RDF, which is based on a single URI as a subject. The same problem arises in content labelling for other purposes such as child protection.

Project partners’ experience of working with PICS⁵ has been informative in devising a schema for RDF Content Labels. A set of documents produced under the aegis of the Quatro project and other activities in Europe and Japan gives use cases, test data and a full description of the schema⁶. Essentially the system allows for a single description to be applied to any number of resources. This can be done in two ways. Firstly a resource can be linked directly to a description using a tag such as:

The RDF instance, labels.rdf, would include a description – a content label – with an rdf:ID of “label1.”
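
As a minimal sketch of how an agent might follow such a link, the Python below assumes a hypothetical HTML link element with rel="meta" whose href is labels.rdf#label1. The exact markup, the rel value and the example.org address are assumptions for illustration, not taken from the Quatro documentation. The code splits the reference into the RDF instance to fetch and the rdf:ID of the label to look up inside it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LabelLinkFinder(HTMLParser):
    """Collects href values of <link rel="meta"> elements (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and attributes.get("rel") == "meta" and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def find_label_references(page_url, html):
    """Return (rdf_instance_url, label_id) pairs for each label link in the page."""
    finder = LabelLinkFinder()
    finder.feed(html)
    references = []
    for href in finder.hrefs:
        absolute = urljoin(page_url, href)       # resolve relative to the page
        instance_url, label_id = urldefrag(absolute)  # split off the #fragment
        references.append((instance_url, label_id))
    return references


# Hypothetical page carrying a direct link to one content label.
page = ('<html><head><link rel="meta" href="labels.rdf#label1" '
        'type="application/rdf+xml" /></head></html>')
print(find_label_references("http://www.example.org/news/index.html", page))
```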

However, the real power of the system comes from the second method – a simple rule set. All resources on a content management system or server can include a common link or HTTP response header that points to a single RDF instance. It is likely that this file will be under the control of the content provider’s editorial department rather than a production centre. Data in the RDF instance will allow an agent to take the URI of a particular resource and apply the rules that then lead to the correct content label.
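
The idea of taking a resource's URI and applying rules to reach a label can be sketched as follows. The rule format used here (URL path prefixes mapped to label IDs) and all names are assumptions made for illustration; the actual rule vocabulary is defined in the RDF Content Labels documentation and may differ.

```python
from urllib.parse import urlparse

# Hypothetical rule set, as it might be read from the single RDF instance:
# each rule maps a URL path prefix to the ID of a content label.
RULES = [
    ("/kids/", "label_children"),
    ("/shop/", "label_ecommerce"),
]


def label_for(uri, rules=RULES):
    """Return the label ID of the first rule matching the URI's path, or None."""
    path = urlparse(uri).path
    for prefix, label_id in rules:
        if path.startswith(prefix):
            return label_id
    return None


for uri in (
    "http://www.example.org/kids/games.html",
    "http://www.example.org/shop/basket",
    "http://www.example.org/about.html",
):
    print(uri, "->", label_for(uri))
```

Because every resource points to the same RDF instance, changing which label applies to a section of the site means editing one rule rather than every page.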

Using this method, a trustmark operator, for instance, would be able to accredit a limited portion of a website or a suite of web properties. For ICRA’s child-centred labelling system⁷, it allows content providers to apply different labels to different resources on their network. Further uses quickly become apparent, such as film classification or applying a single set of management information to a large collection of resources.

The label schema supports three basic “types” of description:

A content label – a class whose properties provide the description. This is the one used by the Quatro and ICRA labelling schemes.
A classification – a class that itself provides a description such as “Suitable for persons aged 12 years and over”
Management Information – a class whose properties would typically include the DC metadata set, Creative Commons licence etc.

An important component of the RDF Content Labels schema is the idea of defaults and overrides. An RDF instance can declare global, default descriptions that are then overridden if a rule leads to a label of the same type. In other words, one might declare a website to be published by the Example Content Production Company with unrestricted copyright as default management information. However, a different set of management information would override this if the “Madrid” section of the site were published by España Example with all rights reserved. Classifications and Content Labels can be overridden in the same way but act independently of each other.
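
The default-and-override behaviour can be sketched in a few lines. This is an illustration under stated assumptions, not the schema's own processing model: the dictionary keys, the path-prefix rule format and the site-wide default classification are hypothetical, while the publisher names follow the example above.

```python
# Default descriptions declared once for the whole site, keyed by description type.
DEFAULTS = {
    "management_information": {
        "publisher": "Example Content Production Company",
        "rights": "unrestricted",
    },
    # Assumed site-wide classification, included to show type independence.
    "classification": "Suitable for persons aged 12 years and over",
}

# Hypothetical rules: a path prefix and the descriptions that apply beneath it.
RULES = [
    ("/madrid/", {
        "management_information": {
            "publisher": "España Example",
            "rights": "all rights reserved",
        },
    }),
]


def descriptions_for(path, defaults=DEFAULTS, rules=RULES):
    """Start from the defaults, then let matching rules override by type."""
    result = dict(defaults)
    for prefix, overrides in rules:
        if path.startswith(prefix):
            # Only the description types supplied by the rule are replaced;
            # every other type keeps its default.
            result.update(overrides)
    return result


print(descriptions_for("/madrid/news.html"))
print(descriptions_for("/london/news.html"))
```

In the Madrid section the management information changes while the classification is untouched, which is the sense in which the description types act independently of each other.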

By the time of the conference, Quatro will be approaching the end of its first year. Both the vocabulary and technical platform will be published with implementation by two trustmark schemes and ICRA. Work is now underway to develop applications to make use of the machine-readable labels. These are:

A browser-independent helper application that will recognise semantic web data where present on websites and provide a visual interpretation. A user will therefore be able to see that a site has a trustmark whether or not the actual trustmark logo is visible to them.

A wrapper for search results that will indicate the presence of trustmarks and/or other metadata on the websites listed. This will be available for inspection by clicking an icon adjacent to the relevant result.

The applications will use common code elements to identify the labels and use relevant methods to attempt to gain trust in them. These include automated database look-up and machine-learning based content analysis. The first application sits on an end-user’s computer, the second is an option for search engines.

Phil Archer, with contributions from Quatro project members

11 April 2005

[1] The Quality and Content Description project is co-funded by the European Union’s Safer Internet Programme. Partners in alphabetical order: Coolwave, ECP.NL, ERCIM, ICRA, IQUA, NCSR “Demokritos,” Pira International, University of Milan, Web Mèdica Acreditada. Full details on the project website: www.quatro-project.org
[2] See, for example, http://europa.eu.int/information_society/activities/sip/docs/pdf/reports/qual_lab_bkgd.pdf.
[3] The UNICE – BEUC e-Confidence project. The final report, published 22/10/01, is available from www.beuc.org but is more easily found at www.quatro-project.org/unice-beuc/eConfidence.pdf
[4] www.quatro-project.org/vocabulary
[5] PICS, the Platform for Internet Content Selection. See www.w3.org/PICS/
[6] The RDF Content Labels documentation: http://www.w3.org/2004/12/q/doc/rdf-contentlabels.html
[7] ICRA, see www.icra.org
