I have successfully imported over 100 000 bibliographic records from a combination of a MARC dump file and a comma-separated list of bib ids, item ids and other info exported from Millennium.

This required the modifications presented in the attached patch: in short, I refactored the function millennium_import_update_item very slightly and moved part of it into a new function, millennium_import_update_bibrecord.

You can pass a MARC record directly -- in the same textual format that the WebPAC offers -- as an argument to the latter function (in addition to the relevant ids).

This makes the function callable from other code, such as code which reads a text file with the dumps. It would of course allow other sources for the information as well. I will also attach an example file importer which calls this new function.
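For illustration, calling the new function might look roughly like this (a sketch only -- the actual parameter list is defined in the attached patch, and the ids and filename here are made up):

<?php
// Hypothetical caller -- parameter names/order are assumptions; see
// the attached patch for the real signature.
$bnumber   = 'b1000235';
$inumbers  = array('i1002352', 'i1023525');
$marc_text = file_get_contents('record.txt'); // WebPAC-style textual MARC

// Update (or create) the node for this one bibliographic record.
millennium_import_update_bibrecord($bnumber, $inumbers, $marc_text);
?>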

And by the way: it seems Drupal handles this number of nodes just fine, assuming you use Apache Solr instead of the core search.

This modification in no way affects the harvesting functionality; it just introduces another public function to be called.

The complete metadata flow I currently have is this: MARC + list from Millennium -> Marc4J -> a single text file with ids and MARC -> Millennium file importer (new code) -> Millennium Integration + Drupal database.


Comments

tituomin’s picture

FileSize
4.34 KB

tituomin’s picture

FileSize
2.68 KB

Here is sample code for a simple file importer. It reads files containing many records, all in the following format:

bnumber
inumber
inumber
inumber
inumber
*MARC RECORD AS PLAIN TEXT*
*newline*

Please note that this code isn't finished yet. (For example, there is no UI, and the filename isn't parameterized yet.) It also requires that you have the correctly formatted file in the first place -- for this I have a small Marc4J-based program. (Will publish if needed.)

Should I make another (sub)project for the file importer or could we include it as a (sub)module inside Millennium Integration?

Note: the file importer should use stream_get_line() instead of fgets().
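Roughly, the reading loop could then look like this (a sketch: the filename is still hard-coded, and the millennium_import_update_bibrecord() arguments are assumed to be as described in the opening post):

<?php
$handle = fopen('dump.txt', 'r');

while (($bnumber = stream_get_line($handle, 65536, "\n")) !== false) {
  $bnumber = trim($bnumber);
  if ($bnumber === '') {
    continue; // tolerate stray blank lines between records
  }
  $inumbers = array();
  $marc_lines = array();
  while (($line = stream_get_line($handle, 65536, "\n")) !== false) {
    $line = rtrim($line, "\r");
    if ($line === '') {
      break; // a blank line terminates the record
    }
    // Item numbers come first; everything after them is the MARC text.
    if (empty($marc_lines) && preg_match('/^i\d+$/', $line)) {
      $inumbers[] = $line;
    }
    else {
      $marc_lines[] = $line;
    }
  }
  millennium_import_update_bibrecord($bnumber, $inumbers, implode("\n", $marc_lines));
}
fclose($handle);
?>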

janusman’s picture

I love this idea, this could even be an included (optional) module...! Will review soon.

tituomin’s picture

I think this might take some burden off web harvesting, because harvesting would then be used mainly for newly arrived records. A bonus is that the same MARC-parsing code serves both methods -- the mappings and metadata processing remain the same.

janusman’s picture

Could you also post a sample dump.txt file and how to get it?

tituomin’s picture

FileSize
3.48 KB

Okay, here's a sample with data from Project Gutenberg. (The id numbers are not valid, just written by hand.)

The first record looks like this:

b1000235
i1002352
i1023525
LEADER 00374     2200121   4500
042 1  dc
100 1  Younghusband, G. J.|d
245 14 The Story of the Guides /|cYounghusband, G. J.
500 1  Project Gutenberg
506 1  Freely available.
516 1  Electronic text
830 1  Project Gutenberg|v16808
856 1  |uhttp://www.gutenberg.org/etext/16808|uhttp://www.gutenberg.org/license

tituomin’s picture

FileSize
386.25 KB

...and here's my Marc4J-based Java program which produces the files. Essentially, the program just parses the two different files (the standard binary MARC dump and the comma-separated list of ids). I have overridden two of Marc4J's string output functions to make the output match the WebPAC's MARC display.

The instructions are included, but the fields in the text file should be the following:
"RECORD #(BIBLIO)","001","020|a","RECORD #(ITEM)"

(020 is not currently used; 001 is used to verify that the records in the two files are indeed the same.)
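In PHP terms, consuming one line of that list is straightforward (a sketch; the filename is hypothetical):

<?php
// Read the Millennium list export line by line; fgetcsv() handles the
// quoted fields. 001 is later compared against the 001 field decoded
// from the MARC dump to confirm both exports describe the same record.
$list = fopen('list.txt', 'r');
while (($row = fgetcsv($list)) !== false) {
  list($bnumber, $field001, $isbn, $inumber) = $row;
  // ... verify $field001, then process the record ...
}
fclose($list);
?>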

tituomin’s picture

The key is that the two files have the same records in the same order. This works out of the box with Millennium: you just have to output the same list of records in the two different formats.

janusman’s picture

Status: Needs review » Active

Patch from #1 committed.

Leaving open for new code as a contrib module... one thing is the importer, another the extractor (which we could just leave up to the administrators). Perhaps a simple form that asks for:
* A MARC file with N records.
* A tab-separated file with N lines: bib numbers in one column, comma-separated item numbers in another.
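A first sketch of such a form (Drupal 6 Form API; the function name is hypothetical, and the validate/submit handlers are omitted):

<?php
function millennium_fileimport_form() {
  $form = array();
  // File uploads need a multipart form.
  $form['#attributes']['enctype'] = 'multipart/form-data';
  $form['marc_file'] = array(
    '#type' => 'file',
    '#title' => t('MARC file'),
    '#description' => t('Binary MARC (ISO 2709) export with N records.'),
  );
  $form['ids_file'] = array(
    '#type' => 'file',
    '#title' => t('Record numbers file'),
    '#description' => t('Tab-separated file with N matching lines: bib number, then comma-separated item numbers.'),
  );
  $form['submit'] = array(
    '#type' => 'submit',
    '#value' => t('Import'),
  );
  return $form;
}
?>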

tituomin’s picture

Alright! Thanks for the commit.

Just to make sure I understand: do you think it's better to have the whole process achievable through Drupal forms? Since the first part of the flow (extract) is Java-based, PHP would have to call the Java program on the server. This might be possible via the PHP functions exec() or system() (a serious security threat is possible here...).

This would require that the jar files be distributed with Millennium Integration.

Or, would it be okay if only the latter part, the import, were Drupal-based? In this case, the form would only ask for one file, since the two files would already have been combined.

janusman’s picture

I think it's far easier (and more common) for Millennium users to simply make up a list and then export it as the required MARC, item numbers and record numbers (maybe into three files). The extra step of using Java to put one after the other seems like a requirement not many will understand how to carry out, and I'm thinking it's simple enough to do in PHP inside the module instead of requiring the extra Java "steps".
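For what it's worth, the pairing step in pure PHP could be as simple as walking both exports in lockstep (a sketch; filenames are hypothetical, and it leans on ISO 2709 terminating every record with the byte 0x1D):

<?php
$dump = file_get_contents('export.mrc');          // binary MARC export
$records = array_filter(explode("\x1D", $dump));  // one element per record

$ids = fopen('export.txt', 'r');                  // tab-separated list
foreach ($records as $raw) {
  $cols = explode("\t", rtrim(fgets($ids), "\r\n"));
  $bnumber = $cols[0];
  $inumbers = explode(',', $cols[1]);
  // ... decode $raw and import it under $bnumber / $inumbers ...
}
fclose($ids);
?>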

tituomin’s picture

There are two separate questions here:

  1. Merging the separate files into one file (debatable)
  2. Parsing the MARC dump

My thinking went like this: since Java and Marc4J are already required -- at least that's what I thought -- to parse the standard *binary* MARC dump exported from Millennium (2), then it's not a big deal to make the Java step even more useful by:

  1. Merging the files into one file
  2. Verifying that the files "fit together" and do not belong to different record sets by mistake (by comparing MARC fields and list output).

Of course, I'm perfectly willing to leave these extra features out, if that's better. This was not the main point of the Java step.

But I might be mistaken that Marc4J is the only way to go; I would appreciate more info on this. I must say I don't know Millennium that well. Is it possible to output the MARC records in textual format (the same format the WebPAC outputs)? Or is there a PHP library for parsing binary MARC data? Of course, the first option would be the best.

tituomin’s picture

To be precise, by binary I mean ISO 2709: http://en.wikipedia.org/wiki/ISO_2709
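To illustrate what decoding involves, here is a minimal walk over one record's structure (a sketch only -- a real decoder must also handle indicators, the subfield delimiter 0x1F, and character encodings):

<?php
// An ISO 2709 record = 24-byte leader + directory + data. The leader
// stores the base address of the data at positions 12-16; directory
// entries are 12 bytes each (3-byte tag, 4-byte length, 5-byte start)
// and end at the first field terminator 0x1E.
function iso2709_fields($record) {
  $base = (int) substr($record, 12, 5);
  $dir_end = strpos($record, "\x1E");
  $fields = array();
  for ($pos = 24; $pos < $dir_end; $pos += 12) {
    $tag   = substr($record, $pos, 3);
    $len   = (int) substr($record, $pos + 3, 4);
    $start = (int) substr($record, $pos + 7, 5);
    // The stored length includes the trailing field terminator.
    $fields[] = array($tag, rtrim(substr($record, $base + $start, $len), "\x1E"));
  }
  return $fields;
}
?>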

janusman’s picture

Well, there is already some code available in the MARC module we could build upon or reuse (maybe by requiring that module for this feature?)... but I have yet to export some MARC directly from Millennium and see if that code can parse it.

Currently I do not have access to the university library's Millennium interface (I switched to another job a year ago), so I'm asking them for a sample file. Or you could upload a (small) file here for testing purposes. It'd be better if the file included diacritics (ñäáà, etc.) and other probably-troublesome characters =)

tituomin’s picture

Good to know about the MARC module. Perhaps this functionality could somehow even be delegated to that module, since decoding standard MARC is a more general problem than integrating with Millennium.

However, looking at the current state of the MARC module (see http://drupal.org/node/427722 ), I'm not sure it's mature enough. It seems that the File_MARC PEAR package is currently the most up-to-date PHP library for decoding binary MARC files, but the MARC module is using an earlier version that is no longer developed.

BTW, I'd prefer we use the Project Gutenberg MARC dump at http://www.cucat.org/library/pgmarc.mrc.zip for testing (avoiding any possible licensing/copyright issues at this point).
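If someone wants to kick File_MARC's tires on that dump, its documented basic usage is about this simple (assuming the PEAR package is installed and the zip is extracted to pgmarc.mrc):

<?php
require 'File/MARC.php';

$marc = new File_MARC('pgmarc.mrc');
while ($record = $marc->next()) {
  // File_MARC_Record's string form is a human-readable display,
  // roughly comparable to the WebPAC text format.
  print $record . "\n";
}
?>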

I must say I kind of understand the preference for simple dependencies and reliance on one language (PHP). But on the other hand, reimplementing the same uninteresting features (ISO 2709 decoding) over and over in different languages seems to duplicate the efforts of the open source library community, effort that could be spent on something more interesting. At a glance, not having made an in-depth comparison, the Java-based Marc4J does seem quite mature. And it is very fast (I've tried!): decoding a 160 000 record ISO 2709 dump file took less than 10 minutes, whereas parsing the same data in textual format in PHP takes hours (granted, it's a different task).

I think any realistic, high-performance library web app can't rely on PHP alone -- even Drupal.org depends on Apache Solr for scalable search.

But I'm willing to try out File_MARC when I have the time. The benefit of Marc4J is that it works right now, and it only requires the JVM plus two jar files -- not overly complex to me.

(Edited to clarify the distinction between parsing and decoding ISO 2709.)