OAI Harvester database structure
- oaiharvester_providers
- oaiharvester_formats
- oaiharvester_formats_to_providers
- oaiharvester_sets
- oaiharvester_sets_to_providers
- oaiharvester_batch
- oaiharvester_harvester_schedules
- oaiharvester_harvest_schedule_steps
- oaiharvester_harvest_queue
oaiharvester_providers table
Table that holds information about remote OAI-PMH data providers.
+-------------------------------+------------------+------+-----+----------------------+
| Field | Type | Null | Key | Default |
+-------------------------------+------------------+------+-----+----------------------+
| provider_id | int(10) unsigned | NO | PRI | NULL |
| name | varchar(255) | NO | | |
| server_name | varchar(255) | NO | | |
| oai_provider_url | tinytext | NO | | NULL |
| description | text | YES | | NULL |
| admin_email | varchar(255) | NO | | |
| admin_phone | varchar(255) | NO | | |
| is_service_ready | tinyint(4) | NO | | 0 |
| created_at | datetime | NO | | NULL |
| updated_at | datetime | YES | | NULL |
| last_harvest_end_time | datetime | YES | | NULL |
| earliest_start_date_harvested | datetime | YES | | NULL |
| last_harvest_date | datetime | YES | | NULL |
| last_validated | datetime | YES | | NULL |
| protocol_version | varchar(10) | YES | | NULL |
| granularity | varchar(22) | YES | | YYYY-MM-DDThh:mm:ssZ |
| deleted_record | varchar(12) | YES | | NULL |
| has_identify | tinyint(4) | NO | | 0 |
| has_list_metadata_formats | tinyint(4) | NO | | 0 |
| has_list_sets | tinyint(4) | NO | | 0 |
| records_added | int(11) | NO | | 0 |
| errors | int(11) | NO | | 0 |
| warnings | int(11) | NO | | 0 |
| user_id | int(10) unsigned | NO | | NULL |
| last_oai_request | mediumtext | YES | | NULL |
| oai_identifier | mediumtext | YES | | NULL |
| type | varchar(10) | NO | | server |
+-------------------------------+------------------+------+-----+----------------------+
- provider_id
- The primary identifier for the repository.
This field is referenced by the oaiharvester_harvester_schedules table's provider_id field.
- name
- The name of the repository.
- server_name
- The name of the repository given by the provider itself.
- oai_provider_url
- URL of the server.
- description
- Textual description of the repository, the nature of content.
- admin_email
- Email of server administrator.
- admin_phone
- Phone number of server administrator.
- is_service_ready
- Is the service ready to harvest?
- created_at
- When this provider was created.
- updated_at
- When this provider was updated.
- last_harvest_end_time
- Latest end date of a successful selective harvest.
- earliest_start_date_harvested
- Earliest start date of a successful selective harvest.
- last_harvest_date
- Last date when this provider was harvested.
- last_validated
- The last time the provider was validated.
- protocol_version
- The OAI-PMH protocol version the provider supports.
- granularity
- The finest harvesting granularity supported by the repository (the pattern of the from and until parameters). The module defines two constants: OAIHARVESTER_LONG_GRANULARITY for date/time level precision, which is 'YYYY-MM-DDThh:mm:ssZ', and OAIHARVESTER_SHORT_GRANULARITY for date level precision, which is 'YYYY-MM-DD'.
- deleted_record
- The manner in which the repository supports the notion of deleted records (no, transient, or persistent).
- has_identify
- Flag indicating whether the Identify verb request was successful.
- has_list_metadata_formats
- Flag indicating whether the ListMetadataFormats verb request was successful.
- has_list_sets
- Flag indicating whether the ListSets verb request was successful.
- records_added
- The number of new records harvested from the provider across all harvest events.
- records_added
- The number of updated or deleted records which were harvested.
- errors
- The number of error messages.
- warnings
- The number of warning messages.
- user_id
- The user who added this provider.
- last_oai_request
- The last OAI-PMH request.
- oai_identifier
- Content of the Identify verb’s oai-identifier element. It includes scheme, repositoryIdentifier, delimiter, sampleIdentifier keys.
- type
- Repository type. Possible values are 'server', which means a real OAI-PMH server (the module constant is OAIHARVESTER_PROVIDERTYPE_SERVER), and 'cache', which means that the URL is a directory in the file system holding cached files with XML responses from a data provider (the module constant is OAIHARVESTER_PROVIDERTYPE_CACHE).
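To illustrate how the two granularity values shape the from and until arguments, here is a minimal Python sketch that formats a timestamp at either precision. The constant names mirror the module's PHP constants described above. The `format_oai_date` helper is hypothetical and is not part of the module, and the sketch assumes that naive datetimes are already in UTC.

```python
from datetime import datetime, timezone

# These values mirror the module's PHP constants described above.
OAIHARVESTER_LONG_GRANULARITY = "YYYY-MM-DDThh:mm:ssZ"
OAIHARVESTER_SHORT_GRANULARITY = "YYYY-MM-DD"


def format_oai_date(moment, granularity):
    """Format a datetime as an OAI-PMH from/until value (hypothetical helper)."""
    # Convert timezone-aware values to UTC; naive values are assumed to be UTC.
    if moment.tzinfo is not None:
        moment = moment.astimezone(timezone.utc)
    if granularity == OAIHARVESTER_LONG_GRANULARITY:
        return moment.strftime("%Y-%m-%dT%H:%M:%SZ")
    if granularity == OAIHARVESTER_SHORT_GRANULARITY:
        return moment.strftime("%Y-%m-%d")
    raise ValueError("Unknown granularity: %r" % (granularity,))
```

A harvester would pick the pattern stored in the provider's granularity field, so a provider that only supports day-level precision never receives a time component.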
oaiharvester_formats table
Information about the harvestable metadata formats. This table contains all formats from all data providers.
+-----------------+------------------+------+-----+---------+
| Field | Type | Null | Key | Default |
+-----------------+------------------+------+-----+---------+
| format_id | int(10) unsigned | NO | PRI | NULL |
| name | varchar(255) | YES | | |
| namespace | tinytext | YES | | NULL |
| schema_location | tinytext | YES | | NULL |
+-----------------+------------------+------+-----+---------+
- format_id
- The primary identifier of a metadata format. This field is auto incremented.
This field is referenced by the oaiharvester_harvest_schedule_steps table's format_id field and the oaiharvester_formats_to_providers table's format_id field.
- name
- The name of the supported format.
- namespace
- The namespace of the metadata format.
- schema_location
- The URL of the schema location.
oaiharvester_formats_to_providers table
The connector table between the OAI data provider (oaiharvester_providers) and the harvestable metadata formats (oaiharvester_formats).
+--------------------+------------------+------+-----+---------+
| Field | Type | Null | Key | Default |
+--------------------+------------------+------+-----+---------+
| id | int(10) unsigned | NO | PRI | NULL |
| format_id | int(10) unsigned | NO | | NULL |
| provider_id | int(10) unsigned | NO | | NULL |
+--------------------+------------------+------+-----+---------+
- id
- The identifier of the format-data provider binding. This field is auto incremented.
- format_id
- The id of the metadata format. This field is a reference to oaiharvester_formats table's format_id field.
- provider_id
- The id of data provider. This field is a reference to oaiharvester_providers table's provider_id field.
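The connector table can be read with a plain join. The sketch below is an illustration only: it uses Python's built-in SQLite instead of MySQL, keeps just the columns needed for the join, and the sample rows and the `formats_for_provider` helper are invented for the example.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL tables; only the join columns are kept.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE oaiharvester_providers (provider_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE oaiharvester_formats (format_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE oaiharvester_formats_to_providers (
        id INTEGER PRIMARY KEY, format_id INTEGER, provider_id INTEGER);
    -- Invented sample data.
    INSERT INTO oaiharvester_providers VALUES (1, 'Example repository');
    INSERT INTO oaiharvester_formats VALUES (1, 'oai_dc');
    INSERT INTO oaiharvester_formats VALUES (2, 'marcxml');
    INSERT INTO oaiharvester_formats_to_providers (format_id, provider_id) VALUES (1, 1);
    INSERT INTO oaiharvester_formats_to_providers (format_id, provider_id) VALUES (2, 1);
""")


def formats_for_provider(connection, provider_id):
    """Return the names of the formats bound to a provider (hypothetical helper)."""
    rows = connection.execute(
        """
        SELECT f.name
        FROM oaiharvester_formats_to_providers AS fp
        JOIN oaiharvester_formats AS f ON f.format_id = fp.format_id
        WHERE fp.provider_id = ?
        ORDER BY f.format_id
        """,
        (provider_id,),
    )
    return [row[0] for row in rows]
```

The same pattern should apply to oaiharvester_sets_to_providers, with set_id in place of format_id.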
oaiharvester_sets table
Information about the harvestable sets. This table contains all sets from all data providers. It is populated and updated automatically by the oaiharvester module from the response to OAI-PMH's ListSets verb.
+-----------------+------------------+------+-----+---------+
| Field | Type | Null | Key | Default |
+-----------------+------------------+------+-----+---------+
| set_id | int(10) unsigned | NO | PRI | NULL |
| display_name | varchar(255) | NO | | |
| description | text | YES | | NULL |
| set_spec | varchar(255) | NO | | |
| is_provider_set | int(10) unsigned | NO | | 1 |
| is_record_set | int(10) unsigned | NO | | 0 |
+-----------------+------------------+------+-----+---------+
- set_id
- The unique identifier of the harvestable set.
- display_name
- The displayable name of the set.
- description
- A short description of the set’s content.
- set_spec
- The OAI standard’s set specification.
- is_provider_set
- Flag to indicate whether it is a provider set.
- is_record_set
- Flag to indicate whether it is a record set.
oaiharvester_sets_to_providers table
The connector table between the OAI data provider (oaiharvester_providers) and the harvestable sets (oaiharvester_sets).
+---------------------+------------------+------+-----+---------+
| Field | Type | Null | Key | Default |
+---------------------+------------------+------+-----+---------+
| id | int(10) unsigned | NO | PRI | NULL |
| set_id | int(11) | YES | | NULL |
| provider_id | int(11) | YES | | NULL |
+---------------------+------------------+------+-----+---------+
- id
- The identifier of a set-provider pair. This field is auto incremented.
- set_id
- The identifier of a set to harvest.
- provider_id
- The identifier of a data provider to harvest from.
oaiharvester_harvester_schedules table
Table that holds the harvest schedule information.
+----------------------+----------------------+------+-----+---------+
| Field | Type | Null | Key | Default |
+----------------------+----------------------+------+-----+---------+
| harvest_schedule_id | int(10) unsigned | NO | PRI | NULL |
| schedule_name | varchar(255) | NO | | |
| provider_id | int(11) | NO | | NULL |
| recurrence | varchar(255) | NO | | |
| minute | tinyint(4) | NO | | NULL |
| hour | tinyint(4) | NO | | NULL |
| day_of_week | tinyint(4) | NO | | NULL |
| start_date | datetime | YES | | NULL |
| end_date | datetime | YES | | NULL |
| notify_email_address | varchar(255) | NO | | |
| parsing_mode | varchar(10) | YES | | dom |
| is_cacheable | tinyint(4) | NO | | 0 |
| max_request | smallint(5) unsigned | NO | | 0 |
| created_by | varchar(255) | YES | | |
| created_date | datetime | NO | | NULL |
| status | varchar(20) | YES | | passive |
| skip_main_task | tinyint(4) | NO | | 0 |
+----------------------+----------------------+------+-----+---------+
- harvest_schedule_id
- The primary identifier for a schedule. This is an auto incremented field. This field is referenced by oaiharvester_harvest_schedule_steps table's schedule_id field.
- schedule_name
- The name of the schedule.
- provider_id
- ID of the harvested data provider. This field is a reference to oaiharvester_providers table's provider_id field.
- recurrence
- Cron expression of the launch times.
- minute
- The minute part of the time.
- hour
- The hour part of the time.
- day_of_week
- The day of week part of the time.
- start_date
- Minimal date of the harvestable records.
- end_date
- Maximal date of the harvestable records.
- notify_email_address
- The email address to send notifications to.
- parsing_mode
- The mode of parsing OAI responses. There are two modes: 'dom' and 'regex'.
- is_cacheable
- Does the harvester cache responses?
- max_request
- The maximum number of OAI-PMH requests. 0 means no limit. Use only for testing purposes.
- created_by
- The user who created the schedule.
- created_date
- The time when this schedule was created.
- status
- The state of the schedule. Possible values are 'active', which means that the schedule is currently running, and 'passive', which means that the schedule is not running now. The module defines the constants OAIHARVESTER_STATUS_ACTIVE and OAIHARVESTER_STATUS_PASSIVE for these values. The default value is 'passive'.
- skip_main_task
- Skip the main task and run only the additional steps (if any).
oaiharvester_harvest_queue table
Information about the actual harvest schedules. This table behaves as a queue, so the selected harvests will be put into it until the task has been finished.
+---------------------+---------------------+------+-----+---------+
| Field | Type | Null | Key | Default |
+---------------------+---------------------+------+-----+---------+
| harvest_id | int(10) unsigned | NO | PRI | NULL |
| inserted_at | int(10) unsigned | NO | | NULL |
| provider_url | varchar(255) | NO | | |
| set_name | varchar(255) | YES | | |
| metadata_prefix | varchar(255) | NO | | |
| from_date | varchar(255) | YES | | |
| until_date | varchar(255) | YES | | |
| start_time | datetime | YES | | NULL |
| end_time | datetime | YES | | NULL |
| harvest_schedule_id | int(10) unsigned | YES | | NULL |
| status | tinyint(3) unsigned | YES | | NULL |
| parsing_mode | varchar(10) | YES | | dom |
+---------------------+---------------------+------+-----+---------+
- harvest_id
- The identifier of the queue row. This is an auto incremented field.
- inserted_at
- When this item was created.
- provider_url
- The base URL of OAI service.
- set_name
- The actual set to harvest.
- metadata_prefix
- The actual metadata format to harvest.
- from_date
- The selective harvesting’s from parameter (the modification date of the oldest record to harvest).
- until_date
- The selective harvesting’s until parameter (the modification date of the latest record to harvest).
- start_time
- The start time of harvesting.
- end_time
- The end time of harvesting.
- harvest_schedule_id
- The identifier of the corresponding schedule.
- status
- The status of the queue item: passive (=0) or active (=1).
- parsing_mode
- The mode of parsing OAI responses. There are two modes: dom and regex.
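The queue columns map naturally onto the arguments of an OAI-PMH ListRecords request. The following Python sketch shows one way to assemble such a request URL from a queue row. The `build_list_records_url` helper is hypothetical, the row is represented as a plain dict, and the sketch assumes that set_name holds the set specification the provider expects in the set argument.

```python
from urllib.parse import urlencode


def build_list_records_url(queue_row):
    """Build an OAI-PMH ListRecords URL from a queue row dict (hypothetical helper)."""
    params = [
        ("verb", "ListRecords"),
        ("metadataPrefix", queue_row["metadata_prefix"]),
    ]
    # set, from and until are optional OAI-PMH arguments, so empty values are skipped.
    for argument, column in (("set", "set_name"),
                             ("from", "from_date"),
                             ("until", "until_date")):
        value = queue_row.get(column)
        if value:
            params.append((argument, value))
    return queue_row["provider_url"] + "?" + urlencode(params)
```

Note that `urlencode` percent-encodes reserved characters, so a date/time value such as 2012-01-01T00:00:00Z appears with its colons escaped in the resulting URL.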
oaiharvester_harvest_schedule_steps table
The connector table between the OAI schedules (oaiharvester_harvester_schedules), the formats (oaiharvester_formats) and the metadata sets (oaiharvester_sets). Gives information about the last run of the current step.
+---------------------------+------------------+------+-----+---------+
| Field | Type | Null | Key | Default |
+---------------------------+------------------+------+-----+---------+
| step_id | int(10) unsigned | NO | PRI | NULL |
| schedule_id | int(11) | NO | | NULL |
| format_id | int(11) | NO | | NULL |
| set_id | int(11) | YES | | NULL |
| last_ran | datetime | YES | | NULL |
+---------------------------+------------------+------+-----+---------+
- step_id
- The identifier of a schedule step.
- schedule_id
- The identifier of a schedule. This is a reference to oaiharvester_harvester_schedules table's harvest_schedule_id field.
- format_id
- The identifier of a format to harvest. This is a reference to oaiharvester_formats table's format_id field.
- set_id
- The identifier of a set to harvest. This is a reference to oaiharvester_sets table's set_id field.
- last_ran
- The time the last harvest started. Note: if you set the "oaiharvester_skip_recording_last_ran" Drupal variable to TRUE (for example with $conf['oaiharvester_skip_recording_last_ran'] = TRUE in sites/default/settings.php), this value won't be saved, and the next run will harvest all records again.
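The effect of last_ran on the next run can be sketched as follows. This is an illustration under stated assumptions, not the module's code: it assumes that a stored last_ran value becomes the from argument of the next harvest and that an empty value triggers a full harvest. Both helper names are hypothetical, and steps are represented as plain dicts.

```python
from datetime import datetime


def next_from_date(last_ran):
    """Return the 'from' value for the next run, or None for a full harvest."""
    # Assumption: with no recorded last run, every record is harvested again.
    if last_ran is None:
        return None
    return last_ran.strftime("%Y-%m-%dT%H:%M:%SZ")


def record_last_ran(step, started_at, skip_recording=False):
    """Return a copy of the step with last_ran set, unless recording is skipped."""
    # skip_recording plays the role of oaiharvester_skip_recording_last_ran.
    if skip_recording:
        return step
    updated = dict(step)
    updated["last_ran"] = started_at
    return updated
```

With skip_recording enabled the step keeps its empty last_ran, so every run behaves like the first one, which can be convenient while testing a schedule.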
oaiharvester_batch table
Logs the harvester process. A record of this table is frequently referenced in the code as $saved_batch.
+-------------+-----------------------+------+-----+---------+
| Field | Type | Null | Key | Default |
+-------------+-----------------------+------+-----+---------+
| id | int(10) unsigned | NO | PRI | NULL |
| sets | mediumtext | YES | | NULL |
| reports | mediumtext | YES | | NULL |
| schedule_id | mediumint(8) unsigned | YES | | 0 |
| status | mediumtext | YES | | NULL |
| timestamp | int(10) unsigned | NO | | NULL |
+-------------+-----------------------+------+-----+---------+
- id
- The identifier of the batch entry.
- sets
- The serialized array of operations. Each operation is an array with two values: the first element is the name of the function to call, the second element is the array of parameters to that function. An example:
array(
  array(
    'oaiharvester_harvest_as_batch',
    array('manual', 1, 'http://example.com/oai', 'xc', NULL, NULL, NULL, 'regex', 1, 0, NULL)
  ),
  array(
    'xc_oaiharvester_bridge_load_csv_and_optimize',
    array(1, 1, 1)
  )
)
- reports
- The serialized content of reports. The report is an associative array with the following keys: results (the number of records harvested and the total time), tech_details (a list of technical information in HTML), and actions (a list of actions to take after the harvest, such as going to the schedule page or deleting the records).
- schedule_id
- The corresponding schedule identifier.
- status
- Information about the last operation in serialized format. It is an object with the following fields: operation_id (the identifier of the last set), function (the name of the last function), status (the current status of the operation, such as 'FINISHED'), and context (the Batch API's $context object).
- timestamp
- The time the record was created.
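The shape of the sets field, a list of pairs holding a function name and its parameters, lends itself to a simple dispatch loop. The Python sketch below mimics that structure to show how such a list could be processed. It is not the Drupal Batch API: the registry, the two stub functions and `run_operations` are all hypothetical, and the meaning of the individual parameters is not documented on this page.

```python
def run_operations(operations, registry):
    """Call each (function_name, parameters) pair in order and collect the results."""
    results = []
    for function_name, parameters in operations:
        function = registry[function_name]  # raises KeyError for unknown names
        results.append(function(*parameters))
    return results


# Stubs standing in for the real PHP functions.
def fake_harvest(*parameters):
    return "harvest %s" % parameters[2]


def fake_load(*parameters):
    return "load with %d parameters" % len(parameters)


# Same structure as the example in the description of the sets field.
operations = [
    ("oaiharvester_harvest_as_batch",
     ["manual", 1, "http://example.com/oai", "xc", None, None, None, "regex", 1, 0, None]),
    ("xc_oaiharvester_bridge_load_csv_and_optimize", [1, 1, 1]),
]
registry = {
    "oaiharvester_harvest_as_batch": fake_harvest,
    "xc_oaiharvester_bridge_load_csv_and_optimize": fake_load,
}
```

Calling `run_operations(operations, registry)` runs the stubs in list order. Storing the function by name rather than by reference is what allows the list to be serialized into a text column.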