OAI Harvester database structure

Last updated on 30 April 2025

oaiharvester_providers table

Table that holds information about remote OAI-PMH data providers.

+-------------------------------+------------------+------+-----+----------------------+
| Field                         | Type             | Null | Key | Default              |
+-------------------------------+------------------+------+-----+----------------------+
| provider_id                   | int(10) unsigned | NO   | PRI | NULL                 |
| name                          | varchar(255)     | NO   |     |                      |
| server_name                   | varchar(255)     | NO   |     |                      |
| oai_provider_url              | tinytext         | NO   |     | NULL                 |
| description                   | text             | YES  |     | NULL                 |
| admin_email                   | varchar(255)     | NO   |     |                      |
| admin_phone                   | varchar(255)     | NO   |     |                      |
| is_service_ready              | tinyint(4)       | NO   |     | 0                    |
| created_at                    | datetime         | NO   |     | NULL                 |
| updated_at                    | datetime         | YES  |     | NULL                 |
| last_harvest_end_time         | datetime         | YES  |     | NULL                 |
| earliest_start_date_harvested | datetime         | YES  |     | NULL                 |
| last_harvest_date             | datetime         | YES  |     | NULL                 |
| last_validated                | datetime         | YES  |     | NULL                 |
| protocol_version              | varchar(10)      | YES  |     | NULL                 |
| granularity                   | varchar(22)      | YES  |     | YYYY-MM-DDThh:mm:ssZ |
| deleted_record                | varchar(12)      | YES  |     | NULL                 |
| has_identify                  | tinyint(4)       | NO   |     | 0                    |
| has_list_metadata_formats     | tinyint(4)       | NO   |     | 0                    |
| has_list_sets                 | tinyint(4)       | NO   |     | 0                    |
| records_added                 | int(11)          | NO   |     | 0                    |
| errors                        | int(11)          | NO   |     | 0                    |
| warnings                      | int(11)          | NO   |     | 0                    |
| user_id                       | int(10) unsigned | NO   |     | NULL                 |
| last_oai_request              | mediumtext       | YES  |     | NULL                 |
| oai_identifier                | mediumtext       | YES  |     | NULL                 |
| type                          | varchar(10)      | NO   |     | server               |
+-------------------------------+------------------+------+-----+----------------------+
provider_id
The primary identifier for the repository.
This field is referenced by oaiharvester_harvester_schedules table's provider_id field.
name
The name of the repository.
server_name
The name of the repository given by the provider itself.
oai_provider_url
URL of the server.
description
Textual description of the repository and the nature of its content.
admin_email
Email of server administrator.
admin_phone
Phone number of server administrator.
is_service_ready
Is the service ready to harvest?
created_at
When this provider was created.
updated_at
When this provider was updated.
last_harvest_end_time
Latest end date of a successful selective harvest.
earliest_start_date_harvested
Earliest start date of a successful selective harvest.
last_harvest_date
Last date when this provider was harvested.
last_validated
The last time the provider was validated.
protocol_version
The version of the OAI-PMH protocol that the provider supports.
granularity
The finest harvesting granularity supported by the repository (the pattern of from and until parameters). In the module we defined two constants: OAIHARVESTER_LONG_GRANULARITY for date/time level precision which is 'YYYY-MM-DDThh:mm:ssZ' and OAIHARVESTER_SHORT_GRANULARITY for date level precision which is 'YYYY-MM-DD'.
deleted_record
The manner in which the repository supports the notion of deleted records (no, transient, or persistent).
has_identify
Flag to indicate whether the Identify verb request was successful.
has_list_metadata_formats
Flag to indicate whether the ListMetadataFormats verb request was successful.
has_list_sets
Flag to indicate whether the ListSets verb request was successful.
records_added
The number of new records harvested from the provider across all harvest events.
records_added
The number of updated or deleted records which were harvested.
errors
The number of error messages.
warnings
The number of warning messages.
user_id
The user who added this provider.
last_oai_request
The last OAI-PMH request.
oai_identifier
Content of the Identify verb’s oai-identifier element. It includes scheme, repositoryIdentifier, delimiter, sampleIdentifier keys.
type
Repository type. Possible values are 'server', which means a real OAI-PMH server (our constant in the module is OAIHARVESTER_PROVIDERTYPE_SERVER), and 'cache', which means that the URL is a directory in the file system with the cached files containing XML responses from a data provider (our constant is OAIHARVESTER_PROVIDERTYPE_CACHE).
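
The two granularity patterns described for the granularity field translate directly into date formats for the from and until parameters. The following is only a sketch in Python (the module itself is written in PHP); the helper name format_oai_date and the strftime mapping are illustrative assumptions, not part of the module, and naive datetimes are assumed to be in UTC already.

```python
from datetime import datetime, timezone

# Values mirror the two module constants described above.
OAIHARVESTER_LONG_GRANULARITY = 'YYYY-MM-DDThh:mm:ssZ'
OAIHARVESTER_SHORT_GRANULARITY = 'YYYY-MM-DD'

# Hypothetical mapping from a granularity pattern to a strftime format.
_STRFTIME_FORMATS = {
    OAIHARVESTER_LONG_GRANULARITY: '%Y-%m-%dT%H:%M:%SZ',
    OAIHARVESTER_SHORT_GRANULARITY: '%Y-%m-%d',
}


def format_oai_date(moment, granularity):
    """Format a datetime for a from/until parameter (hypothetical helper)."""
    if moment.tzinfo is not None:
        # The trailing Z denotes UTC, so convert aware datetimes first.
        moment = moment.astimezone(timezone.utc)
    return moment.strftime(_STRFTIME_FORMATS[granularity])


moment = datetime(2025, 4, 30, 13, 5, 9)
print(format_oai_date(moment, OAIHARVESTER_LONG_GRANULARITY))
print(format_oai_date(moment, OAIHARVESTER_SHORT_GRANULARITY))
```

A provider whose granularity column holds the short pattern would get day-level values only, so the time part of the datetime is dropped in that case.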

oaiharvester_formats table

Information about the harvestable metadata formats. This table contains all formats from all data providers.

+-----------------+------------------+------+-----+---------+
| Field           | Type             | Null | Key | Default |
+-----------------+------------------+------+-----+---------+
| format_id       | int(10) unsigned | NO   | PRI | NULL    |
| name            | varchar(255)     | YES  |     |         |
| namespace       | tinytext         | YES  |     | NULL    |
| schema_location | tinytext         | YES  |     | NULL    |
+-----------------+------------------+------+-----+---------+
format_id
The primary identifier of a metadata format. This field is auto incremented.
This field is referenced by oaiharvester_harvest_schedule_steps table's format_id field and oaiharvester_formats_to_providers table's format_id field.
name
The name of the supported format.
namespace
The namespace of the metadata format.
schema_location
The URL of the schema location.

oaiharvester_formats_to_providers table

The connector table between the OAI data provider (oaiharvester_providers) and the harvestable metadata formats (oaiharvester_formats).

+--------------------+------------------+------+-----+---------+
| Field              | Type             | Null | Key | Default |
+--------------------+------------------+------+-----+---------+
| id                 | int(10) unsigned | NO   | PRI | NULL    |
| format_id          | int(10) unsigned | NO   |     | NULL    |
| provider_id        | int(10) unsigned | NO   |     | NULL    |
+--------------------+------------------+------+-----+---------+
id
The identifier of the format-data provider binding. This field is auto incremented.
format_id
The id of the metadata format. This field is a reference to oaiharvester_formats table's format_id field.
provider_id
The id of the data provider. This field is a reference to oaiharvester_providers table's provider_id field.
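
To illustrate how the connector table is meant to be read, here is a sketch that lists the formats bound to one provider. It uses Python with an in-memory SQLite database rather than the module's MySQL tables; the table and column names come from this page, but the column list is trimmed, and the sample rows and the function name formats_of_provider are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
-- Trimmed-down copies of the tables documented above (sample data only).
CREATE TABLE oaiharvester_providers (provider_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE oaiharvester_formats (format_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE oaiharvester_formats_to_providers (
  id INTEGER PRIMARY KEY, format_id INTEGER NOT NULL, provider_id INTEGER NOT NULL);
INSERT INTO oaiharvester_providers VALUES (1, 'Example repository');
INSERT INTO oaiharvester_formats VALUES (1, 'oai_dc'), (2, 'marcxml');
INSERT INTO oaiharvester_formats_to_providers (format_id, provider_id)
  VALUES (1, 1), (2, 1);
""")


def formats_of_provider(conn, provider_id):
    """Return the names of the formats bound to a provider (hypothetical helper)."""
    rows = conn.execute(
        """SELECT f.name
           FROM oaiharvester_formats f
           JOIN oaiharvester_formats_to_providers fp ON fp.format_id = f.format_id
           WHERE fp.provider_id = ?
           ORDER BY f.format_id""",
        (provider_id,),
    )
    return [row[0] for row in rows]


print(formats_of_provider(conn, 1))
```

The same join shape applies to the set-provider binding, with set_id in place of format_id.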

oaiharvester_sets table

Information about the harvestable sets. This table contains all sets from all data providers. This table is populated and updated automatically by the oaiharvester module from the response to OAI-PMH's ListSets verb.

+-----------------+------------------+------+-----+---------+
| Field           | Type             | Null | Key | Default |
+-----------------+------------------+------+-----+---------+
| set_id          | int(10) unsigned | NO   | PRI | NULL    |
| display_name    | varchar(255)     | NO   |     |         |
| description     | text             | YES  |     | NULL    |
| set_spec        | varchar(255)     | NO   |     |         |
| is_provider_set | int(10) unsigned | NO   |     | 1       |
| is_record_set   | int(10) unsigned | NO   |     | 0       |
+-----------------+------------------+------+-----+---------+
set_id
The unique identifier of the harvestable set.
display_name
The displayable name of the set.
description
A short description of the set’s content.
set_spec
The OAI standard’s set specification.
is_provider_set
Flag to indicate whether it is a provider set.
is_record_set
Flag to indicate whether it is a record set.

oaiharvester_sets_to_providers table

The connector table between the OAI data provider (oaiharvester_providers) and the harvestable sets (oaiharvester_sets).

+---------------------+------------------+------+-----+---------+
| Field               | Type             | Null | Key | Default |
+---------------------+------------------+------+-----+---------+
| id                  | int(10) unsigned | NO   | PRI | NULL    |
| set_id              | int(11)          | YES  |     | NULL    |
| provider_id         | int(11)          | YES  |     | NULL    |
+---------------------+------------------+------+-----+---------+
id
The identifier of a set-provider pair. This field is auto incremented.
set_id
The identifier of a set to harvest.
provider_id
The identifier of a data provider to harvest from.

oaiharvester_harvester_schedules table

Table that holds the harvesting schedule information.

+----------------------+----------------------+------+-----+---------+
| Field                | Type                 | Null | Key | Default |
+----------------------+----------------------+------+-----+---------+
| harvest_schedule_id  | int(10) unsigned     | NO   | PRI | NULL    |
| schedule_name        | varchar(255)         | NO   |     |         |
| provider_id          | int(11)              | NO   |     | NULL    |
| recurrence           | varchar(255)         | NO   |     |         |
| minute               | tinyint(4)           | NO   |     | NULL    |
| hour                 | tinyint(4)           | NO   |     | NULL    |
| day_of_week          | tinyint(4)           | NO   |     | NULL    |
| start_date           | datetime             | YES  |     | NULL    |
| end_date             | datetime             | YES  |     | NULL    |
| notify_email_address | varchar(255)         | NO   |     |         |
| parsing_mode         | varchar(10)          | YES  |     | dom     |
| is_cacheable         | tinyint(4)           | NO   |     | 0       |
| max_request          | smallint(5) unsigned | NO   |     | 0       |
| created_by           | varchar(255)         | YES  |     |         |
| created_date         | datetime             | NO   |     | NULL    |
| status               | varchar(20)          | YES  |     | passive |
| skip_main_task       | tinyint(4)           | NO   |     | 0       |
+----------------------+----------------------+------+-----+---------+
harvest_schedule_id
The primary identifier for a schedule. This is an auto incremented field. This field is referenced by oaiharvester_harvest_schedule_steps table's schedule_id field.
schedule_name
The name of the schedule.
provider_id
ID of the harvested data provider. This field is a reference to oaiharvester_providers table's provider_id field.
recurrence
Cron expression of the launch times.
minute
The minute part of the time.
hour
The hour part of the time.
day_of_week
The day of week part of the time.
start_date
The earliest date of the harvestable records.
end_date
The latest date of the harvestable records.
notify_email_address
The email address to send notifications to.
parsing_mode
The mode of parsing OAI responses. There are two modes: 'dom' and 'regex'.
is_cacheable
Does the harvester cache responses?
max_request
The maximum number of OAI-PMH requests. 0 means no limit. Use only for testing purposes.
created_by
The user who created the schedule.
created_date
The time when this schedule was created.
status
The state of the schedule. Possible values: 'active', which means that the schedule is currently running, and 'passive', which means that the schedule is not running now. The module defines two constants for these values, OAIHARVESTER_STATUS_ACTIVE and OAIHARVESTER_STATUS_PASSIVE respectively. The default value is 'passive'.
skip_main_task
Skip the main task, and run only the additional steps (if any).

oaiharvester_harvest_queue table

Information about the actual harvest schedules. This table behaves as a queue, so the selected harvests will be put into it until the task has been finished.

+---------------------+---------------------+------+-----+---------+
| Field               | Type                | Null | Key | Default |
+---------------------+---------------------+------+-----+---------+
| harvest_id          | int(10) unsigned    | NO   | PRI | NULL    |
| inserted_at         | int(10) unsigned    | NO   |     | NULL    |
| provider_url        | varchar(255)        | NO   |     |         |
| set_name            | varchar(255)        | YES  |     |         |
| metadata_prefix     | varchar(255)        | NO   |     |         |
| from_date           | varchar(255)        | YES  |     |         |
| until_date          | varchar(255)        | YES  |     |         |
| start_time          | datetime            | YES  |     | NULL    |
| end_time            | datetime            | YES  |     | NULL    |
| harvest_schedule_id | int(10) unsigned    | YES  |     | NULL    |
| status              | tinyint(3) unsigned | YES  |     | NULL    |
| parsing_mode        | varchar(10)         | YES  |     | dom     |
+---------------------+---------------------+------+-----+---------+
harvest_id
The identifier of the queue row. This is an auto incremented field.
inserted_at
When this item was created.
provider_url
The base URL of OAI service.
set_name
The actual set to harvest.
metadata_prefix
The actual metadata format to harvest.
from_date
The selective harvesting’s from parameter (the modification date of the oldest record to harvest).
until_date
The selective harvesting’s until parameter (the modification date of the latest record to harvest).
start_time
The start time of harvesting.
end_time
The end time of harvesting.
harvest_schedule_id
The identifier of the corresponding schedule.
status
The status of the queue item: passive (=0) or active (=1).
parsing_mode
The mode of parsing OAI responses. There are two modes: dom and regex.
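
The queue columns correspond closely to the arguments of an OAI-PMH ListRecords request (metadataPrefix, set, from, until). As a sketch of that correspondence, and not the module's actual code, the Python function below builds a request URL from a queue row given as a dictionary. The function name list_records_url and the sample row are assumptions; it also assumes that provider_url contains no query string of its own.

```python
from urllib.parse import urlencode


def list_records_url(row):
    """Build a ListRecords URL from a queue row (hypothetical helper)."""
    params = {'verb': 'ListRecords', 'metadataPrefix': row['metadata_prefix']}
    # Optional arguments: (OAI-PMH argument, queue column).
    optional = (('set', 'set_name'), ('from', 'from_date'), ('until', 'until_date'))
    for argument, column in optional:
        value = row.get(column)
        if value:  # Skip NULL and empty values.
            params[argument] = value
    return row['provider_url'] + '?' + urlencode(params)


sample_row = {
    'provider_url': 'http://example.com/oai',
    'set_name': 'math',
    'metadata_prefix': 'oai_dc',
    'from_date': '2025-01-01',
    'until_date': '',
}
print(list_records_url(sample_row))
```

Empty or NULL set_name, from_date and until_date values are simply left out, which matches the optional status of those arguments in the protocol.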

oaiharvester_harvest_schedule_steps table

The connector table between the OAI schedules (oaiharvester_harvester_schedules), the formats (oaiharvester_formats) and the metadata sets (oaiharvester_sets). It also gives information about the last run of each step.

+---------------------------+------------------+------+-----+---------+
| Field                     | Type             | Null | Key | Default |
+---------------------------+------------------+------+-----+---------+
| step_id                   | int(10) unsigned | NO   | PRI | NULL    |
| schedule_id               | int(11)          | NO   |     | NULL    |
| format_id                 | int(11)          | NO   |     | NULL    |
| set_id                    | int(11)          | YES  |     | NULL    |
| last_ran                  | datetime         | YES  |     | NULL    |
+---------------------------+------------------+------+-----+---------+
step_id
The identifier of the schedule step.
schedule_id
The identifier of a schedule. This is a reference to oaiharvester_harvester_schedules table's harvest_schedule_id field.
format_id
The identifier of a format to harvest. This is a reference to oaiharvester_formats table's format_id field.
set_id
The identifier of a set to harvest. This is a reference to oaiharvester_sets table's set_id field.
last_ran
The time when the last harvest started. Note: if you set the "oaiharvester_skip_recording_last_ran" Drupal variable to TRUE (such as $conf['oaiharvester_skip_recording_last_ran'] = TRUE in the sites/default/settings.php), this value won't be saved, and next time it will harvest all records again.
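
Because set_id is nullable in this table while format_id is not, resolving the steps of a schedule needs an inner join to the formats and an outer join to the sets. The sketch below shows this with Python and an in-memory SQLite database. The table and column names come from this page, the column lists are trimmed, and the sample rows and the function name steps_of_schedule are illustrative. Reading a NULL set_id as "no set restriction" is an assumption.

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.executescript("""
-- Trimmed-down copies of the tables documented above (sample data only).
CREATE TABLE oaiharvester_formats (format_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE oaiharvester_sets (set_id INTEGER PRIMARY KEY, set_spec TEXT);
CREATE TABLE oaiharvester_harvest_schedule_steps (
  step_id INTEGER PRIMARY KEY, schedule_id INTEGER NOT NULL,
  format_id INTEGER NOT NULL, set_id INTEGER, last_ran TEXT);
INSERT INTO oaiharvester_formats VALUES (1, 'oai_dc');
INSERT INTO oaiharvester_sets VALUES (7, 'math');
INSERT INTO oaiharvester_harvest_schedule_steps
  (schedule_id, format_id, set_id, last_ran)
  VALUES (1, 1, 7, '2025-04-01 02:00:00'), (1, 1, NULL, NULL);
""")


def steps_of_schedule(db, schedule_id):
    """Return (format name, set spec, last_ran) per step (hypothetical helper)."""
    return db.execute(
        """SELECT f.name, s.set_spec, st.last_ran
           FROM oaiharvester_harvest_schedule_steps st
           JOIN oaiharvester_formats f ON f.format_id = st.format_id
           LEFT JOIN oaiharvester_sets s ON s.set_id = st.set_id
           WHERE st.schedule_id = ?
           ORDER BY st.step_id""",
        (schedule_id,),
    ).fetchall()


for step in steps_of_schedule(db, 1):
    print(step)
```

The LEFT JOIN keeps steps without a set in the result. A plain JOIN on oaiharvester_sets would silently drop them.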

oaiharvester_batch table

Logs the harvester process. A record of this table is frequently referenced in the code as $saved_batch.

+-------------+-----------------------+------+-----+---------+
| Field       | Type                  | Null | Key | Default |
+-------------+-----------------------+------+-----+---------+
| id          | int(10) unsigned      | NO   | PRI | NULL    |
| sets        | mediumtext            | YES  |     | NULL    |
| reports     | mediumtext            | YES  |     | NULL    |
| schedule_id | mediumint(8) unsigned | YES  |     | 0       |
| status      | mediumtext            | YES  |     | NULL    |
| timestamp   | int(10) unsigned      | NO   |     | NULL    |
+-------------+-----------------------+------+-----+---------+
id
The identifier of the batch entry.
sets
The serialized array of operations. Each operation is an array with two values: the first element is the name of the function to call, the second element is the array of parameters to that function. An example:
array(
  array(
    'oaiharvester_harvest_as_batch',
    array('manual', 1, 'http://example.com/oai', 'xc', NULL, NULL, NULL,
          'regex', 1, 0, NULL)
  ),
  array(
    'xc_oaiharvester_bridge_load_csv_and_optimize',
    array(1, 1, 1)
  )
)
reports
The serialized content of reports. The report is an associative array with the following keys: results: the number of records harvested and the total time; tech_details: a list of technical information in HTML; actions: a list of actions to do after the harvest, like going to the schedule page or deleting the records.
schedule_id
The corresponding schedule identifier.
status
Information about the last operation in serialized format. It is an object with the following fields: operation_id: the identifier of the last set; function: the name of the last function; status: the current status of the operation, like 'FINISHED'; context: the Batch API's $context object.
timestamp
The time the record was created.
