After spending some time cleaning up cruft in our database, I was thinking it might be helpful to add more standardization, error checking, and restriction to allowed values for some of the biblio fields. In particular, ISBNs and ISSNs have standardized formats; DOIs might be another candidate, but they don't appear to have a standard maximum length.

This would need to happen in a few places:

  • The node add/edit form, both in the maxlength of the form fields and in validating the form submission
  • In the database, when installing the schema and as a schema update
  • In the importing process

So it's not a trivial change, but I'd be willing to work on some patches to get it going if people think it's a worthwhile improvement.
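
To make the idea concrete, here is a rough sketch of the kind of check-digit validation meant above. It is written in Python purely for illustration, not as module code, and the function names are made up for this example. It relies on the standard formats: an ISBN is 10 or 13 digits and an ISSN is 8 digits once hyphens and spaces are stripped, with "X" allowed as the final check character of an ISBN-10 or ISSN.

```python
import re


def normalize(value: str) -> str:
    """Drop hyphens and whitespace; upper-case a trailing 'x' check character."""
    return re.sub(r"[\s-]", "", value).upper()


def _mod11_ok(s: str) -> bool:
    """Weighted mod-11 check shared by ISBN-10 and ISSN (weights len(s) down to 1)."""
    digits = [10 if c == "X" else int(c) for c in s]
    weights = range(len(s), 0, -1)
    return sum(w * d for w, d in zip(weights, digits)) % 11 == 0


def is_valid_isbn10(value: str) -> bool:
    s = normalize(value)
    return bool(re.fullmatch(r"[0-9]{9}[0-9X]", s)) and _mod11_ok(s)


def is_valid_isbn13(value: str) -> bool:
    s = normalize(value)
    if not re.fullmatch(r"[0-9]{13}", s):
        return False
    # Weights alternate 1, 3, 1, 3, ...; the weighted sum must be divisible by 10.
    return sum((3 if i % 2 else 1) * int(c) for i, c in enumerate(s)) % 10 == 0


def is_valid_issn(value: str) -> bool:
    s = normalize(value)
    return bool(re.fullmatch(r"[0-9]{7}[0-9X]", s)) and _mod11_ok(s)
```

The same checks could run on form submission and during import, so a bad value gets flagged wherever it enters the database.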

Comments

rjerome

By all means, patch away!

I believe you are right regarding the DOIs: AFAIK only the "10." part is standard; the rest is variable, including the length. The following is from the DOI Handbook (http://doi.org/handbook_2000/appendix_1.html):

4. Format and Characteristics of the DOI

The DOI is composed of the prefix and the suffix. Within the prefix are the Directory Code <DIR> and the Registrant Code <REG>. The suffix is made up of the DOI Suffix String <DSS>.

The syntax of the DOI string is: <DIR>.<REG>/<DSS>

There is no practical limit on the length of a DOI string, or any of its components (the Handle System allows strings of up to 4 GB; under UTF-8 encoding each ASCII character takes one byte, hence in ASCII encoding a DOI may be approx 4 billion characters).

Characters 'a' - 'z' and 'A' - 'Z' in the DOI string are case insensitive (e.g. 10.123/ABC is identical to 10.123/AbC). These characters in the DOI string are converted to upper case upon registration and resolution. If a DOI were registered as 10.123/ABC, then 10.123/abc would resolve it and a later attempt to register 10.123/AbC would be rejected with an error message stating that the DOI was already in existence. Comparison of two DOIs (to decide if they match or not) should be done by first converting all characters 'a' - 'z' in DOI strings to upper case, followed by octet-by-octet comparison of the entire DOI string.
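
The comparison rule in that paragraph could be sketched roughly as follows. This is Python for illustration only, and doi_key and same_doi are hypothetical names, not anything from the handbook or the module. Only ASCII 'a' - 'z' are upper-cased, and the result is compared byte by byte; UTF-8 is assumed as the encoding.

```python
# Map only ASCII 'a'-'z' to 'A'-'Z'; every other character is left untouched.
_ASCII_UPPER = {ord(c): ord(c) - 32 for c in "abcdefghijklmnopqrstuvwxyz"}


def doi_key(doi: str) -> bytes:
    """Return the byte string used for comparison (UTF-8 is an assumption here)."""
    return doi.translate(_ASCII_UPPER).encode("utf-8")


def same_doi(a: str, b: str) -> bool:
    """Octet-by-octet comparison after upper-casing ASCII letters."""
    return doi_key(a) == doi_key(b)
```

Using a translation table instead of str.upper() keeps non-ASCII characters as they are, which matches the 'a' - 'z' wording above.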

4.1 DOI Character Set

Legal characters are the legal graphic characters of Unicode. This specifically excludes the control character ranges 0x00-0x1F and 0x80-0x9F, which are therefore not valid characters for DOI strings, and will never be present in DOI conformant systems. Reserved characters, if any, are listed in the following descriptions of the prefix and suffix.
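
A minimal sketch of a character check based on that paragraph, in Python for illustration and with a hypothetical function name. It only rejects the two control ranges named above; it does not try to decide what else counts as a "legal graphic character" of Unicode, so treat it as a lower bound rather than a full validator.

```python
def has_no_excluded_control_chars(doi: str) -> bool:
    """True if no character falls in 0x00-0x1F or 0x80-0x9F."""
    return not any(
        0x00 <= ord(c) <= 0x1F or 0x80 <= ord(c) <= 0x9F
        for c in doi
    )
```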

4.2 Prefix

<DIR> Directory Code (required)

See Appendix A for all valid values for the Directory Code. The Maintenance Agency is responsible for updating the list of valid values. The Directory Code is numeric; currently the only valid value is <DIR>=10.

<REG> Registrant's Code (required)

Separated from <DIR> by ".". This is assigned to the Registrant by the International DOI Foundation.

DOI Prefix Character Set

Any character within the DOI Character Set as defined above.

<DIR> and <REG> are assigned by the International DOI Foundation.

4.3 Suffix

<DSS> DOI Suffix String (required)

This is assigned by the Registrant.

DOI Suffix Character Set

Any character within the DOI Character Set as defined above, with the exception that the Suffix cannot start with */ where * is any single character. This is reserved for future use. The DSS is case insensitive.
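
Putting the syntax rules together, a structural check might look something like the sketch below. It is Python for illustration, and DoiParts and parse_doi are hypothetical names. It assumes the prefix ends at the first "/" and the Directory Code ends at the first ".", which the excerpt implies but does not state outright. It accepts only <DIR>=10 and rejects suffixes of the reserved form */.

```python
from typing import NamedTuple, Optional


class DoiParts(NamedTuple):
    directory: str   # <DIR>
    registrant: str  # <REG>
    suffix: str      # <DSS>


def parse_doi(doi: str) -> Optional[DoiParts]:
    """Split <DIR>.<REG>/<DSS>; return None if the structure is not acceptable."""
    # Assumption: the first "/" separates prefix from suffix.
    prefix, slash, suffix = doi.partition("/")
    # Assumption: the first "." separates <DIR> from <REG>.
    directory, dot, registrant = prefix.partition(".")
    if not (slash and dot):
        return None
    if directory != "10" or not registrant or not suffix:
        return None
    # Suffixes starting with "*/" (any single character, then "/") are reserved.
    if len(suffix) >= 2 and suffix[1] == "/":
        return None
    return DoiParts(directory, registrant, suffix)
```

For example, parse_doi("10.123/ABC") returns the three parts, while "11.123/ABC" and "10.123/a/bc" both come back as None.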

john bickar

OK, I'll take a crack at this once I get some other stuff cleared off my plate.

Do you see any benefit in doing error checking on DOI for the "10." prefix?

rjerome

"Do you see any benefit in doing error checking on DOI for the '10.' prefix?"

That's already being done, at least if you enter it by hand in the DOI lookup field.

liam morland

Status: Active » Closed (outdated)

This version is no longer maintained. If this issue is still relevant to the Drupal 7 version, please re-open and provide details.