Maxlength (and other standardization) of ISBN, ISSN, (and DOI?) fields [#703894]

After spending some time cleaning up cruft in our database, I was thinking it might be helpful to implement more standardization, error checking, and restriction to allowed values for some of the biblio fields. In particular, ISBNs and ISSNs have standardized formats; DOIs might be another candidate, but it doesn't look like they follow a maximum-length standard.

This would need to happen in a few places:

In the node add/edit form, both in the maxlength of the form fields and in validating the form submission
In the database, when installing the schema and as a schema update
In the importing process

So it's not a trivial change, but I'd be willing to work on some patches to get it going if people think it's a worthwhile improvement.
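
To make the idea concrete, here is a rough sketch of the kind of check the form validation and import steps could share. It is written in Python purely for illustration (the module itself is not Python), and the function names are hypothetical. It relies only on the published check-digit rules: ISBN-10 and ISSN use a weighted sum modulo 11 with "X" standing for 10, and ISBN-13 uses alternating weights of 1 and 3 modulo 10.

```python
import re


def clean_identifier(raw: str) -> str:
    """Drop hyphens and whitespace and upper-case a trailing 'x'."""
    return re.sub(r"[\s-]", "", raw).upper()


def _digit_value(ch: str) -> int:
    # 'X' is only ever allowed as the final check character and means 10.
    return 10 if ch == "X" else int(ch)


def is_valid_isbn10(raw: str) -> bool:
    s = clean_identifier(raw)
    if not re.fullmatch(r"[0-9]{9}[0-9X]", s):
        return False
    # Weights run 10, 9, ..., 1 from left to right.
    total = sum((10 - i) * _digit_value(c) for i, c in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(raw: str) -> bool:
    s = clean_identifier(raw)
    if not re.fullmatch(r"[0-9]{13}", s):
        return False
    # Weights alternate 1, 3, 1, 3, ... from left to right.
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(s))
    return total % 10 == 0


def is_valid_issn(raw: str) -> bool:
    s = clean_identifier(raw)
    if not re.fullmatch(r"[0-9]{7}[0-9X]", s):
        return False
    # Weights run 8, 7, ..., 1 from left to right.
    total = sum((8 - i) * _digit_value(c) for i, c in enumerate(s))
    return total % 11 == 0
```

With the hyphens stripped before storage, the stored values would be at most 10 or 13 characters for an ISBN and 8 for an ISSN; whether to store the hyphenated or the stripped form is a design choice this sketch does not settle.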

Comments

Comment #1

rjerome commented 3 February 2010 at 21:57

By all means, patch away!

I believe you are right regarding DOIs: AFAIK only the "10." part is standard; the rest is variable, including the length. The following is from the DOI Handbook (http://doi.org/handbook_2000/appendix_1.html):

4. Format and Characteristics of the DOI

The DOI is composed of the prefix and the suffix. Within the prefix are the Directory Code <DIR> and the Registrant Code <REG>. The suffix is made up of the DOI Suffix String <DSS>.

The syntax of the DOI string is: <DIR>.<REG>/<DSS>

There is no practical limit on the length of a DOI string, or any of its components (the Handle System allows strings of up to 4 GB; under UTF-8 encoding each ASCII character takes one byte, hence in ASCII encoding a DOI may be approx 4 billion characters).

Characters 'a' - 'z' and 'A' - 'Z' in the DOI string are case insensitive (e.g. 10.123/ABC is identical to 10.123/AbC). These characters in the DOI string are converted to upper case upon registration and resolution. If a DOI were registered as 10.123/ABC, then 10.123/abc would resolve it and a later attempt to register 10.123/AbC would be rejected with an error message stating that the DOI was already in existence. Comparison of two DOIs (to decide if they match or not) should be done by first converting all characters 'a' - 'z' in DOI strings to upper case, followed by octet-by-octet comparison of the entire DOI string.
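
The comparison rule in the paragraph above could be sketched roughly as follows. This is an illustration in Python, not anything from the handbook or the module; the function names are hypothetical, and encoding to UTF-8 before the byte comparison is an assumption. Only ASCII 'a' to 'z' is upper-cased, since a general upper() call would also alter non-ASCII characters.

```python
def normalize_doi(doi: str) -> bytes:
    """Upper-case ASCII 'a'-'z' only, then return the bytes to compare."""
    upper = "".join(
        chr(ord(c) - 32) if "a" <= c <= "z" else c  # leave other characters alone
        for c in doi
    )
    # Assumption: UTF-8 is the encoding used for the octet-by-octet comparison.
    return upper.encode("utf-8")


def same_doi(a: str, b: str) -> bool:
    """True when two DOI strings match under the rule quoted above."""
    return normalize_doi(a) == normalize_doi(b)
```

For a database, one option would be to store the normalized form (or index on it) so that duplicate detection does not depend on how the DOI was typed.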

4.1 DOI Character Set

Legal characters are the legal graphic characters of Unicode. This specifically excludes the control character ranges 0x00-0x1F and 0x80-0x9F, which are therefore not valid characters for DOI strings, and will never be present in DOI conformant systems. Reserved characters, if any, are listed in the following descriptions of the prefix and suffix.
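
As a partial illustration of the character-set rule, the two excluded control ranges can be tested directly. This Python sketch (hypothetical function name) checks only those two ranges; it does not attempt to decide whether every remaining character counts as a "legal graphic character" of Unicode.

```python
def has_excluded_control_chars(doi: str) -> bool:
    """True if the string contains a code point in 0x00-0x1F or 0x80-0x9F."""
    return any(
        0x00 <= ord(c) <= 0x1F or 0x80 <= ord(c) <= 0x9F
        for c in doi
    )
```

A check like this would mostly catch stray tabs and newlines picked up during copy and paste or import.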

4.2 Prefix

<DIR> Directory Code (required)

See Appendix A for all valid values for the Directory Code. The Maintenance Agency is responsible for updating the list of valid values. The Directory Code is numeric; currently the only valid value is <DIR>=10.

<REG> Registrant's Code (required)

Separated from <DIR> by ".". This is assigned to the Registrant by the International DOI Foundation.

DOI Prefix Character Set

Any character within the DOI Character Set as defined above.

<DIR> and <REG> are assigned by the International DOI Foundation.

4.3 Suffix

<DSS> DOI Suffix String (required)

This is assigned by the Registrant.

DOI Suffix Character Set

Any character within the DOI Character Set as defined above, with the exception that the Suffix cannot start with */ where * is any single character. This is reserved for future use. The DSS is case insensitive.
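
Putting the quoted syntax together, a minimal structural check might look like the sketch below. It is Python for illustration only, with hypothetical names, and it makes two assumptions that the excerpt does not state outright: the prefix ends at the first "/" and the Directory Code ends at the first ".". It enforces the parts the excerpt does state: the Directory Code must be 10, the Registrant Code and suffix are required, and the suffix may not start with any single character followed by "/".

```python
from typing import NamedTuple, Optional


class DoiParts(NamedTuple):
    directory: str   # <DIR>
    registrant: str  # <REG>
    suffix: str      # <DSS>


def parse_doi(doi: str) -> Optional[DoiParts]:
    """Split <DIR>.<REG>/<DSS>; return None if the structure is not valid."""
    # Assumption: the first "/" separates prefix from suffix.
    prefix, slash, suffix = doi.partition("/")
    if not slash or not suffix:
        return None
    # Assumption: the first "." separates <DIR> from <REG>.
    directory, dot, registrant = prefix.partition(".")
    if not dot or directory != "10" or not registrant:
        return None
    # Reserved form: one arbitrary character followed by "/".
    if len(suffix) >= 2 and suffix[1] == "/":
        return None
    return DoiParts(directory, registrant, suffix)
```

Since there is no length limit to enforce, a structural check along these lines is probably the most that DOI validation can offer.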