Harvesting

After installing and configuring the Drupal Toolkit, the next important step is to begin harvesting metadata records for import, using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH for short).

OAI-PMH

The purpose of OAI-PMH is to transfer large numbers of records, in XML format, from one computer (a "data repository") to another. A data repository can support different metadata "formats" and may expose independently harvestable "sets".

The OAI-PMH supports six different information requests or "verbs":

  • Identify: requests information about the repository
  • GetRecord: requests one record by its identifier
  • ListIdentifiers: requests information about the available record identifiers
  • ListMetadataFormats: requests information about the available metadata formats or XML schemas
  • ListSets: requests information about the available sets or collections
  • ListRecords: requests information about metadata records
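
Every verb is sent as a query parameter appended to the repository's base URL. The following Python sketch illustrates how such requests could be built; the base URL, record identifier, and metadata prefix are made up for illustration only.

```python
from urllib.parse import urlencode

# Hypothetical repository base URL, for illustration only.
BASE_URL = "http://repository.example.org/oai"

def build_request(base_url, verb, **params):
    """Combine a base URL with an OAI-PMH verb and optional parameters."""
    query = urlencode(dict(verb=verb, **params))
    return f"{base_url}?{query}"

# The six OAI-PMH verbs as request URLs:
print(build_request(BASE_URL, "Identify"))
print(build_request(BASE_URL, "GetRecord",
                    identifier="oai:example.org:1", metadataPrefix="oai_dc"))
print(build_request(BASE_URL, "ListIdentifiers", metadataPrefix="oai_dc"))
print(build_request(BASE_URL, "ListMetadataFormats"))
print(build_request(BASE_URL, "ListSets"))
print(build_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc"))
```

Note that GetRecord and ListRecords require a metadataPrefix parameter naming the metadata format to return.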

Harvester Configuration

There are two steps to configure a harvest:

  • Set up a data repository
  • Set up a harvest schedule for the data repository

Repositories

Using the OAI Harvester module, a site administrator must first register an existing data repository by name and base URL. This is done by navigating to the form at admin/xc/harvester/repository/add.

The base URL specifies the Internet host, port, and path of an HTTP server acting as a repository, without any parameters. It is combined with a request verb, passed in the "verb" parameter, to form a valid OAI-PMH request. Keep in mind that the base URL on its own does not return a valid response. Therefore, to check whether your base URL is valid, open the following URL in your browser: [base URL]?verb=Identify.

After saving the repository, the OAI Harvester module detects and saves all the necessary information, such as the descriptive information about the data provider, the list of metadata formats and sets, and the availability of the server (checked using the Identify verb).

Harvest schedules

After adding a repository, the next step is to create a harvest schedule. The purpose of a harvest schedule is to set the properties of what, when and how to harvest. This harvest schedule can then be launched either automatically, using the operating system's cron or task scheduling capability to run Drupal's cron file, or manually, by invoking harvest from the user interface.

The most useful way is to configure Drupal cron; see the instructions at http://drupal.org/cron. With this approach, the OAI Harvester module's implementation of the cron hook examines whether a schedule needs to run, based on the current time and the timing settings on the harvest schedule. The module also has mechanisms in place to prevent concurrent harvests from the same harvest schedule -- even when the time it takes to harvest overlaps multiple scheduled occurrences -- by detecting when the harvest schedule is already running. If all requirements are met, harvesting starts; otherwise nothing happens. As mentioned earlier, a harvest can also always be launched manually, with the same checks applied to prevent concurrent harvests.
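
The due-time check and the concurrency guard described above can be sketched as follows. All field and function names here are illustrative, not the module's actual implementation.

```python
import time

# Hypothetical schedule record; field names are illustrative only.
schedule = {
    "frequency_seconds": 3600,   # hourly
    "last_run": 0,               # Unix timestamp of the previous run
    "running": False,            # lock flag to prevent concurrent harvests
}

def cron_should_harvest(schedule, now=None):
    """Return True when the schedule is due and no harvest is in progress."""
    now = now if now is not None else time.time()
    if schedule["running"]:
        return False  # a previous (possibly long) harvest is still active
    return now - schedule["last_run"] >= schedule["frequency_seconds"]

def run_harvest(schedule, now=None):
    """Acquire the lock, harvest, then record the run time and release."""
    now = now if now is not None else time.time()
    if not cron_should_harvest(schedule, now):
        return False
    schedule["running"] = True
    try:
        pass  # ... perform the actual OAI-PMH requests here ...
    finally:
        schedule["last_run"] = now
        schedule["running"] = False
    return True
```

A manual harvest from the user interface would go through the same run_harvest-style gate, which is why it cannot collide with a cron-launched one.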

Creating or modifying a harvest schedule involves the following:

  1. Select a data repository
  2. Select the frequency of the harvest (hourly, daily, weekly)
  3. Select a date range for when the harvest should be launched automatically, using Drupal's cron hook
  4. Enter a name for the schedule
  5. Select set (optional) and metadata format (required)
  6. Select an XML parsing method
  7. Select post harvest steps to prepare metadata for search
  8. Select options for performance optimization

Performance

Depending on the system, harvesting may take a very long time, so a few options are available solely to speed up the harvesting process.

First, there is the choice of whether to use robust but slow DOM (Document Object Model) methods or much quicker regular expression methods for XML parsing. DOM parsing is well known, and its behavior is clear and predictable. Regular expression parsing has a slight chance of being less predictable; however, so far we have had no problems with it.
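
The difference between the two approaches can be illustrated with a minimal sketch; the XML below is a simplified stand-in for a real OAI-PMH response, not the actual wire format.

```python
import re
import xml.etree.ElementTree as ET

# A minimal, simplified OAI-PMH-style response (illustrative only).
xml = """<records>
  <record><identifier>oai:example.org:1</identifier></record>
  <record><identifier>oai:example.org:2</identifier></record>
</records>"""

def parse_dom(xml):
    """Robust but slower: build a full tree and walk it."""
    root = ET.fromstring(xml)
    return [rec.findtext("identifier") for rec in root.findall("record")]

def parse_regex(xml):
    """Faster but more fragile: pull values out with a pattern match."""
    return re.findall(r"<identifier>(.*?)</identifier>", xml)

# On well-formed input, both methods yield the same identifiers.
assert parse_dom(xml) == parse_regex(xml)
```

The regex approach skips tree construction entirely, which is where the speed comes from; the trade-off is that it depends on the exact textual shape of the XML.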

Second, the choice of which, if any, post-harvest processes to run and in which order. Currently, the two post-harvest processes are:

  • Indexing metadata with Solr
  • Creating Drupal nodes from metadata

In order for metadata records to be searchable and to make use of faceting and other next-generation catalog features, they must first be indexed with Solr. In addition, for tight integration of metadata records as content in Drupal, nodes must be created to represent them. These tasks can be enabled, disabled, and reordered, providing four options:

  1. Option a: Create nodes, but do not index with Solr
  2. Option b: Index with Solr, do not create nodes
  3. Option c: Index with Solr first, then create nodes
  4. Option d: Create nodes first, then index with Solr

Third, there is the choice of whether to use normal INSERT statements after every single record is processed, or LOAD DATA statements after all records have been processed. The latter is faster, but it requires a global (not database-level) privilege in MySQL; if you choose it, be very careful with privilege settings. Because this level of privilege is not always available, the traditional, slower INSERT command can be used as well; it requires only database-level privileges and is therefore also more secure.

If you have millions of records and reducing the harvesting time is important, consider using LOAD DATA INFILE. However, if you have only a few records to harvest, or the duration of the harvest is not important, use the INSERT command.
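
The contrast can be sketched as follows: per-record INSERT statements versus one CSV file loaded in a single statement at the end. The table and column names here are hypothetical, not the module's real schema.

```python
import csv
import io

# Hypothetical harvested records: (identifier, title) pairs.
records = [("oai:example.org:1", "Title one"),
           ("oai:example.org:2", "Title two")]

def insert_statements(records):
    """One INSERT (and one database round trip) per record."""
    return ["INSERT INTO metadata (identifier, title) VALUES (%s, %s)"
            for _ in records]  # each executed with that record's values

def build_csv(records):
    """Batch alternative: write all records to CSV once, load once."""
    buf = io.StringIO()
    csv.writer(buf).writerows(records)
    return buf.getvalue()

# The single statement issued after the last record; LOAD DATA INFILE
# needs the global FILE privilege in MySQL, as noted above.
LOAD_SQL = ("LOAD DATA INFILE '/tmp/metadata.csv' INTO TABLE metadata "
            "FIELDS TERMINATED BY ',' (identifier, title)")
```

With millions of records, replacing millions of round trips with one bulk load is where the time is saved.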

Caching

The raw XML responses from a data repository can be cached if so desired. This is useful if your Internet connection is slow and you would like to harvest the exact same set of records again: you can re-run the harvesting process without accessing the Internet.

For the most part, this feature is useful only in the testing phase. It could be misleading to leave cached XML on a production server, because it does not reflect changes on the OAI-PMH server. If you previously cached responses and would like to remove them, change the setting by editing the harvest schedule, and delete the cache directory by following the "Clear Cache" link for the particular harvest schedule. The cached files are stored in two folders under oaiharvester_http_cache within Drupal's default file directory.

Further Configuration

Another useful feature is the ability to limit the number of OAI-PMH requests. If you limit the requests, the harvester will not fetch all the available information; instead, it will only fetch as much as is provided within that number of requests. Leaving this field empty or equal to 0 ("zero"), the default value, is the same as having no limit. Use another value only for testing purposes. Keep in mind that the number of records per request is controlled by the data provider; the only thing controlled with this setting is the number of requests. For example, if Repository A provides 10 records per request and Repository B provides 1000 records per request, limiting the number of requests to 10 will return 100 records from Repository A and 10,000 records from Repository B.
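
The arithmetic behind the example is simply batch size times request limit, capped by what the repository actually holds. A small sketch, with the total-records figure chosen arbitrarily:

```python
def expected_records(records_per_request, request_limit, total_available):
    """Records returned under a request cap. The per-request batch size
    is set by the data provider, not by the harvester."""
    if request_limit in (None, 0):   # empty or zero means no limit
        return total_available
    return min(records_per_request * request_limit, total_available)

# The example from the text, with a 10-request limit:
assert expected_records(10, 10, 50_000) == 100       # Repository A
assert expected_records(1000, 10, 50_000) == 10_000  # Repository B
```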

As with caching, this feature is useful primarily in the testing phase. Fetching all records may be time consuming, so if you only want to test the Drupal Toolkit or to see the indexing process taking place, you can start with only a limited number of records.

Harvesting Process

For testing and checking the Drupal Toolkit, it is important to understand what has happened during the harvest. The following information will give you a greater understanding of what occurs behind the scenes with the harvesting process.

First, the harvester submits a request to the repository and receives a response in XML format. Second, the XML is parsed according to the parsing method selected in the harvest schedule. Third, the harvester iterates over each record in the response and calls a hook (see details in the developer's guide) to process the record. If there are more records on the server, the harvester uses an OAI-PMH parameter called a "resumptionToken" to request more records. At the beginning and end of each request, as well as of the entire harvesting process, particular hooks are called to notify modules of what is going on.
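
The resumptionToken loop can be sketched as follows. Canned responses stand in for the repository here; a real harvester would issue HTTP requests instead.

```python
# Canned responses keyed by resumptionToken (None = first request).
responses = {
    None:    {"records": ["rec1", "rec2"], "resumptionToken": "page2"},
    "page2": {"records": ["rec3"], "resumptionToken": None},
}

def harvest_all(fetch):
    """Follow resumptionTokens until the repository signals completion."""
    harvested, token = [], None
    while True:
        response = fetch(token)          # one ListRecords-style request
        harvested.extend(response["records"])
        token = response["resumptionToken"]
        if token is None:                # no token: the last page was sent
            return harvested

assert harvest_all(responses.get) == ["rec1", "rec2", "rec3"]
```

The per-request and per-harvest hooks mentioned above would fire at the top and bottom of this loop.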

One important module is the XC Harvester Bridge module, included in the Drupal Toolkit, which implements these hooks to provide necessary functionality linking the harvester to other modules in the Drupal Toolkit, such as the Metadata module, which stores metadata into Drupal, and the Solr module, which indexes metadata with Solr. Let us refer to this module as the "bridge" module.

First, the bridge module creates an internal object to handle the metadata record. Second, the module examines whether this record is meant to be deleted or not. If it is marked for deletion, the bridge module handles the deletion accordingly. If it is not marked for deletion, the bridge module checks whether the record already exists. If so, it performs an update; otherwise, a new metadata record is created. An important thing to remember throughout this process is the harvest schedule's storage configuration. If you selected INSERT statements, the metadata is created immediately. However, if you selected LOAD DATA statements and CSV files, the metadata is not saved immediately; instead, it is saved after the last OAI request. To recap: updated and deleted records are always handled at the time the record is processed, but the creation of new records may be postponed until all record processing is complete, depending on the harvest schedule's configuration.
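
The decision logic described above can be sketched like this; the function and field names are illustrative, not the bridge module's real API.

```python
def process_record(record, existing_ids, use_load_data, deferred):
    """Decide what to do with one harvested record (illustrative names).

    Deletes and updates always happen immediately; creation of new
    records is deferred to a CSV batch when LOAD DATA is selected.
    """
    if record["deleted"]:
        return "delete"
    if record["id"] in existing_ids:
        return "update"
    if use_load_data:
        deferred.append(record)   # written out after the last OAI request
        return "deferred-create"
    return "create"

deferred = []
existing = {"oai:1"}
assert process_record({"id": "oai:1", "deleted": True},  existing, True,  deferred) == "delete"
assert process_record({"id": "oai:1", "deleted": False}, existing, True,  deferred) == "update"
assert process_record({"id": "oai:2", "deleted": False}, existing, True,  deferred) == "deferred-create"
assert process_record({"id": "oai:2", "deleted": False}, existing, False, deferred) == "create"
```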

Post-Harvest Processes

After fetching the last record from the data provider and processing all new, updated, and deleted records, the result is a collection of "raw" records. These records cannot yet be searched, displayed, printed, commented on, etc. The post-harvest steps of node creation and indexing with Solr must then be performed.

Node creation

As the node is the core concept of Drupal, representing a piece of content, Drupal provides a handful of functions to handle nodes, such as adding comments, printing, defining internal structure, and so forth. The Drupal Toolkit respects this convention and therefore creates nodes as well, to represent metadata records. Currently, this is fully implemented only for XC Schema records, turning FRBR manifestation-level records into Drupal nodes.

Solr indexing

Furthermore, to provide next-generation catalog functionality, the metadata is indexed into Solr. One of the difficulties with this process is that metadata records are commonly in XML, and therefore certain elements may have attributes or other elements as children. Adding to the complexity, not all attributes characterize the element or record in the same way. For this and other reasons, indexing metadata into Solr must be controlled carefully.

First, hierarchical data structures, although handled well by relational databases, are not so easily handled in Solr, which (as of now) only allows documents of key-value pairs, with no operation like SQL's JOIN command. Therefore, metadata record structures are "flattened" and combined with related records. A good analogy would be to say that a collection of XC schema records becomes similar to a MARC record. To do this, the Solr module iterates over all manifestations, locating all parents -- the expression-level record and the parents of those records, the work-level records. It then merges the metadata from these records into a single Solr document.
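
The flattening step can be sketched as a merge of a manifestation with its expression- and work-level parents. The records and field names below are made up for illustration.

```python
# Hypothetical XC-style records; in reality these live in the database.
work          = {"id": "w1", "subject": "Databases"}
expression    = {"id": "e1", "parent": "w1", "language": "eng"}
manifestation = {"id": "m1", "parent": "e1", "title": "Intro to SQL"}

def flatten(manifestation, expression, work):
    """Merge a manifestation with its expression- and work-level parents
    into one flat, MARC-like document suitable for a single Solr doc."""
    doc = {}
    for level in (work, expression, manifestation):  # child values win
        for key, value in level.items():
            if key not in ("id", "parent"):  # drop structural bookkeeping
                doc[key] = value
    return doc

assert flatten(manifestation, expression, work) == {
    "subject": "Databases", "language": "eng", "title": "Intro to SQL"}
```

The resulting flat document is what actually gets sent to Solr, one per manifestation.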

In Solr we store different types of information:

  • Mandatory fields
  • Selected fields
  • Facets
  • Generated fields

Mandatory fields are common to all records. These fields are the most important fields, and they store values only for the base-level record (the manifestation), not for its parents.

  • OAI identifier
  • Node ID
  • Node Type
  • Metadata ID
  • Metadata Type
  • Source ID
  • Type
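
For illustration, a single flattened Solr document might carry mandatory values like the following. The exact Solr field names and values here are assumptions, not the module's actual schema.

```python
# Hypothetical mandatory fields of one Solr document (names illustrative).
mandatory = {
    "oai_identifier": "oai:example.org:1",
    "node_id": 42,                    # the Drupal node representing it
    "node_type": "xc_metadata",
    "metadata_id": 7,
    "metadata_type": "manifestation", # values are for the base-level record
    "source_id": 1,                   # the repository it was harvested from
    "type": "record",
}

assert len(mandatory) == 7  # one value per mandatory field listed above
```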

Selected fields of the metadata are fields "worth indexing" for search capability. For example, if you would like to search titles, you need to select the title field for indexing. Each selected schema field may become the origin of more than one Solr field. For more information, read the XC Index documentation.

Facets can be created in several ways on the administrator's interface. What is important to understand is that facets are just normal fields in Solr. There are only two restrictions. First, they must store the field value, not any other text. Second, in most cases they must use phrase indexing. For more information, read the XC Index documentation.

Generated fields are fields created during Solr indexing. For example, a field called text contains all values of all fields of the merged metadata. It is the basis of a general search, since it contains all the information of the MARC-like record; it is the default search field when a user does not specify a particular field to search. Another generated field is the metadata field, which is a serialized version of the MARC-like record. The module uses it when displaying search result lists or full records. It is a stored, but not searchable, field.

Naming convention for Solr fields

In Solr, many metadata field names are automatically transformed from the schema field names. This helps with indexing. One very important thing to take into consideration is indexing metadata fields that require XML namespaces. Since the colon character (":") is reserved in Solr, the Solr module automatically replaces the colon with two underscore characters. These fields are dynamic fields, as it would be difficult to predict and register all possible fields in advance.

This follows how Solr's dynamic fields work: a suffix added to the end of a field name automatically tells Solr what type of field it is. The modules apply two rules when translating back and forth between schema and Solr field names. First, dcterms:title indexed as text becomes dcterms__title_t. Second, the same field indexed as a phrase becomes dcterms__title_s.
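
The two rules can be expressed as a pair of small conversion functions, sketched here under the assumption that the type suffix is always the final underscore-separated part.

```python
def solr_field_name(schema_field, phrase=False):
    """Map a namespaced schema field to a dynamic Solr field name:
    ':' is reserved in Solr, so it becomes '__'; the suffix tells
    Solr the field type (_t = text, _s = string/phrase)."""
    return schema_field.replace(":", "__") + ("_s" if phrase else "_t")

def schema_field_name(solr_field):
    """Translate back: strip the type suffix and restore the colon."""
    base = solr_field.rsplit("_", 1)[0]
    return base.replace("__", ":")

assert solr_field_name("dcterms:title") == "dcterms__title_t"
assert solr_field_name("dcterms:title", phrase=True) == "dcterms__title_s"
assert schema_field_name("dcterms__title_t") == "dcterms:title"
```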

Guide maintainers

pkiraly