Overview

These modules integrate the Thomson Reuters Calais web service into the Drupal platform. The Calais Web Service automatically creates rich semantic metadata for the content you submit, in well under a second. Using natural language processing, machine learning and other methods, Calais analyzes your document and finds the entities within it. But Calais goes well beyond classic entity identification and returns the facts and events hidden within your text as well. The web service is free for commercial and non-commercial use, but requires registration to obtain an API Key.

At their core, these modules allow you to automatically tag your data. However, the Calais Web Service takes this a step further by identifying not only the terms in your content, but also the context and relevancy of those terms. Many services can tell you that IBM was mentioned in your content, but no other service identifies that IBM is an Organization, then disambiguates all the various references to IBM (International Business Machines, etc.), and finally tells you with a scoring mechanism that your content is more about IBM than any other term identified. It is a truly innovative service, powered by incredible technology and extremely dedicated people.

Now what?

Okay, so you have this incredibly powerful semantic tagging service. What are some things you can actually do with it?

You can have Calais process all of your content automatically as it is saved by your users, which will generate a rich set of tags for your content. You can then create site tag clouds (using Tagadelic) to provide a glimpse into the topics being written about on the site. You could also have Calais process incoming feeds that are created as nodes via FeedAPI. This provides you with a fully automated site solution for content aggregation. The options are limitless.

Do you have more ideas? Submit them and we'll add them here.

Under the covers: Using The API

The Calais API module offers a straightforward API for module developers. You can either use one of two function calls to access the Calais Web Service, or use the Calais object, which wraps the service in an easy-to-use object-oriented API.

Configuration

Before anything can be done with the Calais Web Service, you must first register for an API Key. Once the registration process is complete, access the Site Configuration > Calais API Settings page. Enter your API key in the text field, and specify a few other settings. (You can read the Calais documentation to learn what the settings mean.)

API Functions

Now that you have configured your API Key in the Calais API Settings, you can access the Calais Web Service using one of two functions. The final parameter to both of these functions is a list of input parameters to the Calais Web Service. These parameters are provided via an associative array, and the key names should equal the input parameter names from the Calais Documentation.

calais_api_analyze($content, $parameters = array())

Calais will analyze your content, stripping any HTML and scripts as necessary.

calais_api_analyze_xml($title, $body, $date, $parameters = array())

This will analyze your content by first packaging it as XML and then submitting it to Calais. $title and $body should contain the content that will be processed by Calais. $date is used to resolve relative date mentions (e.g., "yesterday") when such mentions appear in Calais' events and facts. If no date is provided, relative dates will be resolved based on today's date.
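
As a rough sketch of how the XML variant might be called from a custom module: the helper name example_calais_analyze_story() is hypothetical, the format expected for the date string is not specified here, and the calculateRelevanceScore key is borrowed from the constructor example later on this page. The function_exists() guard only keeps the snippet inert when the Calais API module is not enabled.

```php
<?php
/**
 * Hypothetical helper: sends a story to Calais through the function-based API.
 *
 * Returns the keyed array of Calais metadata, or NULL when the Calais API
 * module is not enabled.
 */
function example_calais_analyze_story($title, $body, $date) {
  if (!function_exists('calais_api_analyze_xml')) {
    return NULL;
  }
  // Keys mirror the Calais input parameter names; this override is illustrative.
  $parameters = array('calculateRelevanceScore' => 'true');
  // $date anchors relative mentions such as "yesterday" to the document date.
  return calais_api_analyze_xml($title, $body, $date, $parameters);
}
```

For a single block of raw text or HTML, calais_api_analyze($content, $parameters) would be called in the same way.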

Calais API Object

The function-based API is really just a wrapper for the Calais object. This encapsulates all interaction with the Calais Web Service and returns processed results. You can also use the $calais->rdf attribute to get a copy of the actual RDF returned from the service if alternative processing is desired.

Constructor

To use the Calais API object, you must first create an instance of it. This can be done simply with:

$calais = new Calais();

Optionally, you can provide a set of parameter overrides to the Calais constructor. These parameters should be provided as an associative array and the key names should equal the input parameter names from the Calais Documentation.

$calais = new Calais(array('calculateRelevanceScore' => 'false', 'allowDistribution' => 'true'));

analyze($content)

Calais will analyze your raw content, stripping any HTML and scripts as necessary.

analyzeXML($title, $body, $date)

This will analyze your content by first packaging it as XML and then submitting it to Calais. $title and $body should contain the content that will be processed by Calais. $date is used to resolve relative date mentions (e.g., "yesterday") when such mentions appear in Calais' events and facts. If no date is provided, relative dates will be resolved based on today's date.
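
The object-based calls might be combined as follows. This is only a sketch: the helper name is hypothetical, the allowDistribution override is purely illustrative, and the class_exists() guard is there so the snippet does nothing when the Calais API module is absent.

```php
<?php
/**
 * Hypothetical helper: analyzes content with the Calais object and returns
 * both the processed metadata and the raw RDF.
 */
function example_calais_analyze_with_rdf($title, $body, $date) {
  if (!class_exists('Calais')) {
    return NULL;
  }
  // Same override style as the constructor example above.
  $calais = new Calais(array('allowDistribution' => 'false'));
  $keywords = $calais->analyzeXML($title, $body, $date);
  // The rdf attribute holds the unprocessed response for custom parsing.
  return array('keywords' => $keywords, 'rdf' => $calais->rdf);
}
```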

API Return Values

The API functions and objects return an associative array of Calais metadata. Each key is an Entity name (or EventsFacts), and each value is a CalaisMetadata object representing the collection of data around that particular piece of Calais metadata. Each CalaisMetadata object has an array of $terms, which are CalaisTerm objects representing the Entity or EventsFacts.
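
Walking that structure could look something like the sketch below. The helper name is hypothetical, and the "value" property on CalaisTerm is an assumption: only the terms array is described on this page, so check the class definition in the Calais API module for the actual property names.

```php
<?php
/**
 * Hypothetical helper: flattens Calais results into term names per entity type.
 *
 * Assumes each CalaisTerm exposes its text in a public "value" property.
 */
function example_calais_term_names(array $keywords) {
  $names = array();
  foreach ($keywords as $entity_type => $metadata) {
    $names[$entity_type] = array();
    // $metadata->terms is the array of CalaisTerm objects described above.
    foreach ($metadata->terms as $term) {
      $names[$entity_type][] = $term->value;
    }
  }
  return $names;
}
```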

The Holy Grail: Relevant Automated Tagging

The basic Calais module makes use of the API detailed previously to deliver rich semantic tagging to your Nodes. When content is created or updated on your site, that content will be sent to the Calais Web Service for analysis. The Entities, Events, and Facts that are identified in your content will then be applied. Each entity type is created as a Drupal Vocabulary and each entity term is created as a Taxonomy Term within the specific Vocabulary. This applies very detailed contextual tagging to your nodes automatically, easing the burden of producing richly tagged content.

Configuration

The Calais configuration, accessed at Calais Node Settings within Site Configuration, provides a variety of configuration options which are applied globally with specific overrides applicable on a per Content Type basis.

Global Configuration

The Global Calais Entities allow you to specify system-wide which Calais Vocabularies apply to your Content Types. There is a wealth of Vocabularies available, but not all of them might be relevant to your domain, so just uncheck the Vocabularies that you wish to exclude.

Specific Content Type Processing Settings

At the Content Type level (Page, Book, Story, etc.) you specify how that content type is processed by the module.

Calais Processing
Determines if these nodes are analyzed via Calais, and if so, how that process is implemented. It ranges from not processing at all, to automatically processing on every update. Select the option that is most appropriate for your site and the level of involvement required.
Allow Calais Searching
Overrides the setting at the API level. Indicates whether future searches can be performed on the extracted metadata by Calais.
Allow Calais Distribution
Overrides the setting at the API level. Indicates whether the extracted metadata can be distributed by Calais.
Relevancy Threshold
Calais can provide a relevance score with each uniquely identified entity. The relevance capability detects the importance of each unique entity and assigns a relevance score in the range 0-1 (1 being the most relevant and important). The threshold set here limits which entity terms apply: only terms with a relevance equal to or greater than the threshold are displayed or automatically associated.
Use Calais Global Entity Defaults
When selected, the Vocabularies chosen in the Global Calais Entities section will be used for this specific content type. However, you can override the associated Vocabularies for this particular content type.
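
The effect of the Relevancy Threshold setting can be pictured with a small filter like the one below. This is not the module's own code: the helper name is hypothetical, and it assumes each term object carries a numeric "relevance" property holding the 0-1 score.

```php
<?php
/**
 * Hypothetical helper: keeps only terms whose relevance meets the threshold.
 *
 * Assumes each term object carries a numeric "relevance" property in the
 * range 0-1.
 */
function example_calais_filter_by_relevance(array $terms, $threshold) {
  $kept = array();
  foreach ($terms as $term) {
    // "Equal or greater" matches the description of the setting.
    if ((float) $term->relevance >= (float) $threshold) {
      $kept[] = $term;
    }
  }
  return $kept;
}
```

With a threshold of 0.5, for instance, a term scored 0.5 or 0.9 would be kept and a term scored 0.2 would be dropped.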

Using the tagging and suggestions

For content types that have Calais processing enabled, a new tab titled Calais is added to each node upon creation. This tab lists all of the associated entity Vocabularies. If the content type was set up for automated association, you will see the Vocabularies filled with the Calais terms (and all suggestions will be selected). However, if manual association via suggestions was configured, you will be presented with a listing of the Calais Vocabularies with suggestions beneath the fields. Clicking on a suggestion will highlight it and insert that term into the Vocabulary field. Clicking an already highlighted suggestion will remove it from the Vocabulary field. When your taxonomy associations are finished, click Save at the bottom of the form to save your selections.

Hooks

The Calais module provides hooks that allow modules to gain access to the processing chain and modify the functionality by either pre- or post-processing the node and keywords during the Calais processing.

hook_calais_preprocess(&$node, &$keywords)

This hook gets called before any manual or automatic term association takes place. This allows for node or keyword modifications. The calais tagmod module makes use of this hook to modify keywords before processing.

hook_calais_postprocess(&$node, &$keywords)

This hook gets called after term association takes place. This allows for node or keyword modifications after processing.
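
A minimal preprocess implementation might look like this. The module name "mymodule" is hypothetical, and the sketch assumes $keywords is keyed by Calais entity type in the same way as the API return value; the entity type dropped here is only an illustration.

```php
<?php
/**
 * Implements hook_calais_preprocess() for a hypothetical module "mymodule".
 *
 * Assumes $keywords is keyed by Calais entity type, like the API return value.
 */
function mymodule_calais_preprocess(&$node, &$keywords) {
  // Illustrative choice: drop one entity type before terms are associated.
  unset($keywords['Currency']);
}
```

Because both arguments are passed by reference, changes made inside the hook are visible to the rest of the processing chain.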

Bulk Processing

The Calais Bulk Processing option, under Site Configuration, provides a mechanism to process all Nodes of a certain Content Type for Automatic Term Association. It allows you to specify a Relevance Threshold for bulk processing that might be different from your normal configured settings.

On execution, this uses the Batch API to process all nodes of the selected content type, submitting them individually to Calais for analysis and automated term association. This is an incredibly valuable piece of functionality for sites with large archives of content.
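
For readers unfamiliar with the Batch API, a batch definition with one operation per node has roughly the shape sketched below. This is not the module's own code: the helper and both callback names are placeholders invented for illustration.

```php
<?php
/**
 * Hypothetical helper: builds a Batch API definition with one operation per node.
 *
 * The callback names are placeholders, not functions shipped with the module.
 */
function example_calais_bulk_batch(array $nids) {
  $operations = array();
  foreach ($nids as $nid) {
    // Each operation is array(callback, arguments).
    $operations[] = array('example_calais_bulk_process_node', array($nid));
  }
  return array(
    'title' => 'Calais bulk processing',
    'operations' => $operations,
    'finished' => 'example_calais_bulk_finished',
  );
}
```

In Drupal, a definition like this would typically be handed to batch_set() from a form submit handler, and each operation callback would load its node and submit it to Calais.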

What is next...

The sky is the limit. Some short-term goals are RDFa integration, more end-user-focused functionality such as geomapping, and taking advantage of new Calais functionality as it rolls out. Please download the module and give it a try, and don't forget to drop us a line and let us know what you think.

Happy tagging.


Comments

stevebeckam:

According to recent reports, "Google is now harvesting semantic metadata." Every webmaster should be aware of this and make use of this section.

Ramiszaro:

This is indeed valuable. Credit goes to Google for making webmasters aware of this update.