Recovering XML attributes from terms [#807562]

Hi. When I import the following fragment of SKOS (as part of a complete file):

Homophobic bullying

false
3
42e7344ace6b57697a8fdcb22fa9f1e3

And try to grab it in the term pre-save hook to do further processing on it, it ends up looking like this:

[prefLabel] => Array
(
[0] => Homophobic bullying
)

[scopeNote] => Array
(
[0] => false
[1] => 3
[2] => 42e7344ace6b57697a8fdcb22fa9f1e3
)

I need to switch on the label="[...]" attribute, but it's not showing up for me at that point.

I'm kinda surprised attributes aren't preserved... am I missing something obvious?

Comment	File	Size	Author
#14	term_edit_screen-with_taxonomy_enhancer_fields.png	62.44 KB	dman
#12	ian-multiple_vocabs.png	21.23 KB	dman
#12	taxonomy_xml-retain_bnodes.patch	1.15 KB	dman
#11	ian3.xml_.txt	1.53 KB	dotton
#5	NS-Role.xml_.txt	44.46 KB	dotton

Comments

Comment #1

dotton commented 24 May 2010 at 09:42

Apologies, the XML tags got eaten. Let's see if this works:

<skos:prefLabel>Homophobic bullying</skos:prefLabel>
<zthes:termCategory/>
<skos:scopeNote label="browseRoot">false</skos:scopeNote>
<skos:scopeNote label="displayOrder">3</skos:scopeNote>
<skos:scopeNote label="globallyUniqueId">42e7344ace6b57697a8fdcb22fa9f1e3</skos:scopeNote>
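
As a sanity check that the attributes do exist at the XML level, a generic XML parser returns them without trouble. A minimal Python sketch (standard library only, nothing to do with taxonomy_xml or ARC; the wrapper element and the zthes namespace URI are placeholders added only so the fragment parses):

```python
import xml.etree.ElementTree as ET

SKOS = "http://www.w3.org/2004/02/skos/core#"

# The <wrapper> element is added only to make the fragment well-formed.
# The zthes namespace URI is a placeholder, not the real one.
FRAGMENT = """
<wrapper xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:zthes="http://example.org/zthes#">
  <skos:prefLabel>Homophobic bullying</skos:prefLabel>
  <zthes:termCategory/>
  <skos:scopeNote label="browseRoot">false</skos:scopeNote>
  <skos:scopeNote label="displayOrder">3</skos:scopeNote>
  <skos:scopeNote label="globallyUniqueId">42e7344ace6b57697a8fdcb22fa9f1e3</skos:scopeNote>
</wrapper>
"""

def scope_notes_by_label(xml_text):
    """Map each scopeNote's label attribute to its text content."""
    root = ET.fromstring(xml_text)
    notes = {}
    for note in root.iter("{%s}scopeNote" % SKOS):
        notes[note.get("label")] = note.text
    return notes

notes = scope_notes_by_label(FRAGMENT)
print(notes)
```

So the information is in the file; the question is only whether it survives the RDF parsing step.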

Comment #2

dman commented 24 May 2010 at 13:16

Well, this is the first time I've ever seen 'scope notes' with an additional 'label' parameter. I'd be interested in seeing the documentation for that. All the inputs we've been able to try only occasionally had one 'scope note' which - if anything - fit best when put into the 'description'. In all the extant examples, a "scope note" is a "scope note" - literal, no modifier.
SKOS already supports 7 unique types of 'note' that can be added to a term. Subdividing that collection down with new extensions seems to be expressing something that is specific to the implementation rather than the spec. Not that it should be unavailable of course, just that there is no useful way of handling it if it's found.

Generally, there has been no way to store anything beyond the basic term attributes that Drupal actually supports because ... where would it be stored? The idea would be for taxonomy_enhancer to try and retain it, but there's not been much call for that.

The data array you are seeing is a simple 'flattening' of the XML structure - in a way that can most easily be recovered if there is a place to put it.
The flattening happens within taxonomy_xml_convert_triples_to_sorted_objects() where the complex RDF triples are compressed into attribute values that can actually be used later. I don't even know what an attribute on a statement triple would look like in ARC terms. We so far support literals, URIs, or bnode values. I don't know what the new data would be.
There is code in there to try and support multiple nodes of the same type but with different 'lang' values (because 'lang' is encoded in an available way) - although there is nowhere to save those variations either. It's probably possible to index things even more there, in a similar way, although 'label' has never been seen before now.
The full collection of complex RDF triple objects is discarded as soon as the known attributes have been extracted from it, mainly due to the memory stress of thousands of terms.
Anyway, try dumping the triples before or during taxonomy_xml_convert_triples_to_sorted_objects() and see if you can see where that unexpected attribute ended up.
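
To make the flattening concrete, here is a rough Python sketch of the idea. It is not the module's actual PHP code; the triple layout, the "_:term1" subject and the function name are hypothetical. Each triple contributes its object to a per-predicate list on its subject, so anything that is not itself part of a triple (an XML attribute, say) has nowhere to go:

```python
from collections import defaultdict

def flatten_triples(triples):
    """Group (subject, predicate, object) triples into
    {subject: {predicate_local_name: [objects, ...]}}."""
    terms = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        # Keep only the local name after the last '#' or '/'.
        local = predicate.replace("#", "/").rsplit("/", 1)[-1]
        terms[subject][local].append(obj)
    return {s: dict(p) for s, p in terms.items()}

SKOS = "http://www.w3.org/2004/02/skos/core#"
triples = [
    ("_:term1", SKOS + "prefLabel", "Homophobic bullying"),
    ("_:term1", SKOS + "scopeNote", "false"),
    ("_:term1", SKOS + "scopeNote", "3"),
    ("_:term1", SKOS + "scopeNote", "42e7344ace6b57697a8fdcb22fa9f1e3"),
]

flat = flatten_triples(triples)
print(flat["_:term1"]["scopeNote"])
```

The three scopeNote values end up in one list, which matches the shape of the array in the original report.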

Comment #5

dotton commented 24 May 2010 at 16:40

Status	File	Size
new	NS-Role.xml_.txt	44.46 KB

To be honest, we're flailing a bit with regard to how to approach this. The information architecture experts are giving me various takes on SKOS, and my job is to wring the correct data out of them. We had a home-grown tool that did what taxonomy_xml does, but I really don't want to go down that route again.

The XML attributes don't turn up in $triples at any point, as far as I can tell. I'm leaning towards this being something Arc just doesn't do, but I'm hoping I'm wrong.

Attached is the current incarnation of one of our taxonomy files. We're using a zthes namespace where a label attribute on <zthes:termNote> should be completely valid, but I'm still not getting the attributes back.

At this point, any ideas would be gratefully received. Even a firm confirmation that Arc won't return attributes would be nice.

Comment #6

dman commented 24 May 2010 at 23:14

Well, I've never seen any RDF-XML files that used attributes (beyond the formal 'lang', 'about' and 'resource'). And thinking about how RDF is only triples, and how attributes on an XML node would be 'modifiers' on a simple triple (unsupported) or "a statement about a statement" (possible, but requires each statement to be a thing)... I'm pretty sure you just don't do arbitrary attributes in RDF-XML. Really.

The W3C RDF Validator is unhappy with that input, and the ARC parser also seems not to retain those attributes. If those two don't support what you are trying to do, then buggered if I'm expected to!

BUT the road for you is to abandon that incorrect attribute annotation method, and just use subclassing and a few of your own labels.

Instead of:

<zthes:termNote label="globallyUniqueId">NSRole-a89fed8a6b798f1f336f463341967724</zthes:termNote>

You may as well go:

<zthes:globallyUniqueId>NSRole-a89fed8a6b798f1f336f463341967724</zthes:globallyUniqueId>

... seeing as you are using your own namespace anyway (which is fine) you don't have to pack stuff that doesn't fit into a 'term_note'. Because that does not fit in a 'term note'.
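
If the files come out of a tool you can't easily change, that rewrite can be done mechanically. A minimal Python sketch (standard library only), assuming every label value is a valid XML element name; the zthes namespace URI and the wrapper element are placeholders:

```python
import xml.etree.ElementTree as ET

ZTHES = "http://example.org/zthes#"  # placeholder namespace URI (assumption)
ET.register_namespace("zthes", ZTHES)

def promote_labels(root):
    """Rename <zthes:termNote label="X"> to <zthes:X>, in place."""
    for el in list(root.iter("{%s}termNote" % ZTHES)):
        label = el.attrib.pop("label", None)
        if label:
            el.tag = "{%s}%s" % (ZTHES, label)
    return root

doc = ET.fromstring(
    '<wrapper xmlns:zthes="%s">'
    '<zthes:termNote label="globallyUniqueId">'
    'NSRole-a89fed8a6b798f1f336f463341967724'
    '</zthes:termNote>'
    '</wrapper>' % ZTHES
)
promote_labels(doc)
result = ET.tostring(doc, encoding="unicode")
print(result)
```

An XSLT would do the same job; the point is only that the label moves from an attribute into the element name.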

As for

<zthes:termNote label="category" vocab="National Strategies: Initiative" identifier="003e4ac3dd2409155ad51dba7e1d955c">What Works Well</zthes:termNote>

Well, refactoring that into

<skos:Concept rdf:nodeID="a89fed8a6b798f1f336f463341967724">
...
<zthes:category>
<zthes:vocab>National Strategies: Initiative</zthes:vocab>
<zthes:identifier>003e4ac3dd2409155ad51dba7e1d955c</zthes:identifier>
<zthes:label>What Works Well</zthes:label>
</zthes:category>
</skos:Concept>

.. would make that legal, parseable RDF again. That runs through the validator and ARC safely and correctly for us. That nested complex type there becomes a local bnode, we can deal with that. It even carries through correctly (if slightly flattened) to the term predicates where you'd be able to pick it up later.

So there is hope, but validate your input first before expecting other systems to understand what you are trying to do.

PS, your usage of the zthes 'namespace' is incorrect. Need to end it with a hash or a slash.

Comment #7

dotton commented 25 May 2010 at 10:59

That strikes me as weird, as the transport layer is XML, not "XML with some bits taken out". Reading W3C specs is akin to chewing concrete, but I'm seeing lots of examples of namespaces in the RDF/XML recommendation ("This document defines an XML syntax for RDF called RDF/XML in terms of Namespaces in XML..."). But my understanding of how XML namespaces work is likely wrong.

However, you've given me exactly the information I need, and an example of how to make it work - thanks very much for helping us out.

(Out of curiosity, where is that hash/slash thing defined? I'm finding lots of examples in RDF/XML documents with a trailing hash/slash, but the XML spec seems to say the namespace should be an IRI (RFC 3987), which doesn't require a trailing #/. Eg:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ex="http://example.org/stuff/1.0/">

is from Example 10 here: http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax-datatyped-literals )

Comment #8

dman commented 25 May 2010 at 10:48

RDF/XML is a specific dialect of XML. The extra bits you have put in are still validating XML, but they are not valid RDF, and it's only RDF we are reading here.
The "transport layer" as you put it may be XML, but just adding things to XML doesn't mean it will be consumed properly at the other end, any more than adding a <sarcasm> tag or attribute to XHTML will make browsers recognize it later. You can even namespace it properly, and therefore make it validate. You can even hope it will be maintained unharmed on its trip through the "transport layer"... but you can't make it get interpreted or mean anything to the consumer.

The way you've extended it is legal XML - inasmuch as it doesn't break anything, and the namespaces are (almost) used correctly to add your own stuff. But it has no meaning outside of your own application. It has become your own dialect, woven into or working outside of the rules of RDF, and any XML/RDF parser is built to ignore things it doesn't recognize. Arbitrary attributes are not recognized as having any meaning, or even as fitting the XML/RDF schema (which is indeed "XML with bits taken out", or at least "XML with some rules").

Re namespaces, the example

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ex="http://example.org/stuff/1.0/">

Follows the convention precisely.
A node rdf:type resolves to the full URI http://www.w3.org/1999/02/22-rdf-syntax-ns#type and all is well and normal. 'type' is an item defined within a namespace '...22-rdf-syntax-ns'

As you will see from the W3C validator, your

<rdf:RDF xmlns:zthes="http://www.k-int.com/schema/zthes-1.2.xsd"  >
<zthes:thesNote label="ownersId">National Strategies: Role</zthes:thesNote>
</rdf:RDF>

Resolves to an element of type http://www.k-int.com/schema/zthes-1.2.xsdthesNote
From the semantics apparent here, there is no relationship between zthes-1.2.xsd and zthes-1.2.xsdthesNote, they are sibling identifiers if anything.
This is not actually illegal and pretty much can continue to work as a valid stand-alone URI. It's just ugly, awkward, and really looks like someone screwed up somewhere. It has implications on resolution and retrieval (as does the choice between slash and hash) and this one probably can't be resolved in any meaningful way, whereas the machine-discoverable utility of

<rdf:RDF xmlns:zthes="http://www.k-int.com/schema/zthes-1.2.xsd#"  >
<zthes:thesNote label="ownersId">National Strategies: Role</zthes:thesNote>
</rdf:RDF>

would be entirely clear.

As a point of definition (from the pedantic-web group) the resource itself - the schema or ontology identifier (eg when used in rdfs:isDefinedBy) - need not have the trailing slash/hash, but when used as a namespace to be prefixed - as you are doing here - it had better have it.
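
The resolution rule is plain string concatenation of the namespace URI and the local name, which is why the trailing character matters. A small Python sketch of just that rule (the function name is made up for illustration):

```python
from urllib.parse import urldefrag

def expand_qname(namespace_uri, local_name):
    """Build the full URI of a prefixed name by plain concatenation."""
    return namespace_uri + local_name

rdf_type = expand_qname("http://www.w3.org/1999/02/22-rdf-syntax-ns#", "type")
without_hash = expand_qname("http://www.k-int.com/schema/zthes-1.2.xsd", "thesNote")
with_hash = expand_qname("http://www.k-int.com/schema/zthes-1.2.xsd#", "thesNote")

print(rdf_type)
print(without_hash)
print(with_hash)

# With the hash, dropping the fragment leads back to the schema document.
# Without it, the URI has no fragment and no visible link to the schema.
print(urldefrag(with_hash).url)
print(urldefrag(without_hash).url)
```

With the hash in place, a client that strips the fragment is left with the URI of the schema itself; without it, there is nothing to strip.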

The example I gave above is not a formal recommendation, just an OTTOMH formulation of how you are supposed to be able to express complex, multiple types when all you've really got to work with is triple-statements. So you are going to have to massage it a bit from there.

Comment #9

dotton commented 25 May 2010 at 11:49

dman, rather than taking up any more of your time (and thank you, you've been very helpful) do you know of a halfway decent book on RDF?

We're going to try using skos:Collection, and if that doesn't work we'll try an XSLT to transform valid Zthes XML into something the Arc parser can understand (broadly, what you suggested).

Comment #10

dman commented 25 May 2010 at 12:15

Sorry, I don't know what book to recommend, and any one that's been printed is probably somewhat dated already :-B

What I learnt was from a few years of XSL getting me deep into XML and namespaces, then a few more years of semantic web stuff taking apart lots of half-baked schemas and implementations of emerging specifications and made-up XML dialects until RDF became the target (OMFG it's been 5 years already!)

Anything I think was worth making a note of I jammed into my documentation for this module, so maybe there will be some reading in my bibliography round-up in the distro. Start there I guess.

I've looked at zthes, even made a small protocol client to experiment with it, but couldn't find enough (publicly available) resources published with it to be worth building support for, and as there was no solid XML formulation of it out there either (at the time I looked two years ago) it looked like it was on the way to retirement. I incorporate some of the Z39.19 spec in the supported terms within taxonomy_xml - just in case.

I guess I need to update the docs a little more to reflect advances made in SKOS in the last few years...

Comment #11

dotton commented 26 May 2010 at 11:50

Status	File	Size
new	ian3.xml_.txt	1.53 KB

Ok, sorry to revive this, but I really think this one is a bug.

The attached file is a minimal test case (a single term), and passes W3C's validation. When our Zthes tags look like this:

<zthes:termCategory>
<zthes:termCategoryVocab>National Strategies: Initiative</zthes:termCategoryVocab>
<zthes:termCategoryIdentifier>9fa25beeb12403b74def50940f444bb2</zthes:termCategoryIdentifier>
</zthes:termCategory>

We see zthes:termCategoryIdentifier in the parsed term object:

stdClass Object
(
  [predicates] => Array
    (
[...]
      [termCategory] => Array
        (
          [_:genid1] => _:genid1
        )

      [termCategoryIdentifier] => Array
        (
          [0] => 9fa25beeb12403b74def50940f444bb2
        )
    )
[...]
)

But when we swap the order of zthes:termCategoryVocab and zthes:termCategoryIdentifier, we see zthes:termCategoryVocab:

stdClass Object
(
  [predicates] => Array
    (
[...]
      [termCategory] => Array
        (
          [_:genid1] => _:genid1
        )

      [termCategoryVocab] => Array
        (
          [0] => National Strategies: Initiative
        )
    )
[...]
)

That is, we only see the last child of zthes:termCategory. Any thoughts?

Comment #12

dman commented 26 May 2010 at 14:17

Status	File	Size
new	taxonomy_xml-retain_bnodes.patch	1.15 KB
new	ian-multiple_vocabs.png	21.23 KB

Frankly, I'm surprised you are seeing that much :-)
There are a few things at play here.

First, that sample file did not really validate on the W3C thing ... though it gave it a good go.
Looking at that result (especially the graph) you'll see that only one of the termCategory subnodes got through. And that's the behaviour you are seeing.
I got the same when debugging via ARC. Only the zthes:termCategoryVocab, not the zthes:termCategoryIdentifier came out as triples we could look at.

So there's still a little way to go there.

Second, the reason you are even seeing the

      [termCategoryVocab] => Array
        (
          [0] => National Strategies: Initiative
        )

is a fluke, due to the very partial bnode support. (bnodes are anonymous, internal lumps of structured data)
Until now, the only bnodes we've had to deal with were single elements that were structured for no real reason, and my interpreter just grabs them and flattens them to get a string result out of them. That was a work-around for some (IIRC) Freebase input and worked only enough to get what we could predict from it.
I know this is not full, structured object support - because we've never encountered structured objects like this.

There is a chance - though I'm not betting on it - that if you fix the validation you may see the other value turn up too.

But for a robust solution...
The schema you are aiming at (from what I can deduce from your samples) builds in the potential for there to be more than one termCategory element, each with its own properties. This is a fine structure, but it is a few steps beyond what we've had to handle until now.
Here is what (I think) the model you are aiming at is expected to look like:

(see attached image: ian-multiple_vocabs.png)

- I've added a second termCategory object to illustrate the potential. I assume there may be more than one, because you are choosing to structure/nest it. Otherwise you'd just be adding the simple string attributes at the concept level, right?

You are currently NOT just setting $term->termCategoryVocab so it looks like there is a reason to be having

$term->termCategory[x]->termCategoryVocab

I've dealt with systems (MeSH) where one term may be in several vocabs, so this is a reasonable architecture.

Now that, as I say, is an eminently sensible approach.
But it's just not supported at the consumer end, because we have nowhere to put that sort of data!

Ideally, I'm guessing you want to see a PHP object something like

            [_:ID:10e0cca523fb6d710a92618de74886df] => stdClass Object
                (
                    [predicates] => Array
                        (
                            [type] => http://www.w3.org/2004/02/skos/core#Concept
                            [inScheme] => http://skeyn.com/schemes/EDVOC/
                            [prefLabel] => Functional skills lead in consortium
                            [termCategory] => Array  (
                                    [_:genid1] =>  Array  (
                                           termCategoryVocab => National Strategies: Initiative
                                           termCategoryIdentifier => 9fa25beeb12403b74def50940f444bb2
                                    )
                                    [_:genid2] =>  Array  (
                                           termCategoryVocab => Another Vocab Name
                                           termCategoryIdentifier => XXXXXXXXXXXXXXXXXXXX
                                    )
                                )
                            [termBrowseRoot] => true
                            [termNoteGloballyUniqueId] => 10e0cca523fb6d710a92618de74886df
                        )
                    [type] => http://www.w3.org/2004/02/skos/core#Concept
                )
        )

And that's one way of looking at it,
but 'predicates' are not going to contain structured data, so that's not going to happen.
That's why the predicate value(s) you will see return only 'genid1' etc, as they are pointers to the real structured object.

(Results are currently being returned as arrays of (usually) one item, because the spec doesn't prevent us from having more than one 'label' or anything, and it's tidier to always assume an array so we can deal with conflicts later.)

Unfortunately for you right now, those structured bnodes are being filtered out early on in the process. (because, as I said, they've never been useful)

We can probably patch the taxonomy_xml_rdf_parse() routine to retain them a little bit longer - at the cost of a bunch of memory. Most modifications I've made in the last year have been towards scaling for thousands of terms, so it may be a bit stingy in what it discards.
... looks like taxonomy_xml_convert_triples_to_sorted_objects() is where the flattening happens...

...maybe this patch will reveal the missing data for your inspection!
The attached patch lets you access:

            [_:ID:10e0cca523fb6d710a92618de74886df] => stdClass Object
                (
                    [bnodes] => Array
                        (
                            [termCategory] => Array  (
                                    [_:genid1] =>  Array  (
                                           termCategoryVocab => National Strategies: Initiative
                                           termCategoryIdentifier => 9fa25beeb12403b74def50940f444bb2
                                    )
                                    [_:genid2] =>  Array  (
                                           termCategoryVocab => Another Vocab Name
                                           termCategoryIdentifier => XXXXXXXXXXXXXXXXXXXX
                                    )
                                )
                        )
                )

(Because I'm not going to pollute the 'predicates' array with structured objects right now. May break things)
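
Roughly, the regrouping amounts to the following. This is a Python sketch of the idea only; the patch itself is PHP and differs in detail, and the triple layout, the bare predicate names and the function name here are assumptions. Blank-node ids that appear as objects on the term are looked up and their own properties are nested under them:

```python
def nest_bnodes(triples, subject):
    """Return {predicate: {bnode_id: {property: value}}} for one subject."""
    # Index every subject's properties so blank nodes can be looked up.
    props = {}
    for s, p, o in triples:
        props.setdefault(s, {})[p] = o
    nested = {}
    for s, p, o in triples:
        # A blank-node object is an "_:" id that is itself described by triples.
        if s == subject and o.startswith("_:") and o in props:
            nested.setdefault(p, {})[o] = props[o]
    return nested

TERM = "_:ID:10e0cca523fb6d710a92618de74886df"
triples = [
    (TERM, "prefLabel", "Functional skills lead in consortium"),
    (TERM, "termCategory", "_:genid1"),
    ("_:genid1", "termCategoryVocab", "National Strategies: Initiative"),
    ("_:genid1", "termCategoryIdentifier", "9fa25beeb12403b74def50940f444bb2"),
    (TERM, "termCategory", "_:genid2"),
    ("_:genid2", "termCategoryVocab", "Another Vocab Name"),
    ("_:genid2", "termCategoryIdentifier", "XXXXXXXXXXXXXXXXXXXX"),
]

bnodes = nest_bnodes(triples, TERM)
print(bnodes["termCategory"]["_:genid1"])
```

Literal values such as prefLabel are left out of the result, since they already fit in 'predicates'.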

You STILL have to adjust your schema so that it will get through the validator though!
FWIW, My version for testing looked like this (abridged):

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:zthes="http://skeyn.com/schemes/EDVOC/null#" xml:base="http://skeyn.com/schemes/EDVOC/">
  <skos:Concept >
    <skos:prefLabel>Functional skills lead in consortium</skos:prefLabel>
    <zthes:termCategory zthes:termCategoryIdentifier="9fa25beeb12403b74def50940f444bb2" zthes:termCategoryVocab="National Strategies: Initiative"></zthes:termCategory>
    <zthes:termCategory zthes:termCategoryIdentifier="Anotheridentifier" zthes:termCategoryVocab="Another Vocab Name"></zthes:termCategory>
  </skos:Concept>
</rdf:RDF>

I know that's not precisely what I suggested above - but I did say it was OTTOMH. Attributes seem to pass the parser if they are correctly namespaced. Freetext complained about other things I couldn't be bothered to fix... though your current version could be made to work too, I think.

Comment #13

dotton commented 27 May 2010 at 10:32

Wow. Thanks for this, it's far more than I expected.

Ok, this is roughly what's going on at this end, more as an FYI than a plea for help:

We have an external tool that manages vocabularies. It slaps a name on each vocabulary, and a GUID on each term. Everything that sits outside Drupal uses these GUIDs to identify terms, so Drupal has to understand them too (a lot of those external tools do CRUD operations on terms via HTTP callbacks).

We do a fair amount with vocabularies - we use them for navigation, populating Q+A wizards, even describing other vocabularies and tags (our tags can be tagged). Those structures sometimes need additional metadata, which is all keyed off of the GUID field.

Our plan was to load all this additional metadata onto a SKOS file, write term_postsave hooks which would consume it, and build up all the additional db structures we rely on.

So, that's the use-case. Personally, I think if you're going to have a hook that fires for each term, it makes sense for there to be a way to pass extra metadata to that hook. Whether you think that's worth supporting is another question :) I can completely see, now, that what we're trying to do and what Arc is trying to do are two different things - neither is wrong, they're just incompatible.

Our current plan is to do one more XSLT tweak and use your patch - thank you once again for all your help.

(I really need to submit that term_postsave hook as a patch - it's more useful than the presave for us, as by that point you have the tid).

Comment #14

dman commented 27 May 2010 at 12:22

Status	File	Size
new	term_edit_screen-with_taxonomy_enhancer_fields.png	62.44 KB

Extra metadata, yes. But we'd only ever thought of that extra metadata as attribute-value pairs (y'know, RDF triples) so this extra structure was out of scope.
You'll see there was no trouble getting your termNoteGloballyUniqueId or termBrowseRoot into the array.

I've started porting to Drupal 7, because that's got fully fieldable terms etc and a place to actually deal with this sort of annotation. In D5, D6 it was/is a bit tacked on.

Internally, taxonomy_xml does believe in maintaining a record of an external GUID for all its imported terms, so there is more scope for what you describe. However, it (and RDF) assumes that that GUID is a URI. This is how it can successfully 'update' over old imports.
All the batch jobs and partial relinking that taxonomy_xml does (in larger imports, via HTTP services etc) are indexed through a 'uri' field that is serialized by taxonomy_enhancer, or as an owl:sameAs relationship managed by the rdf module. (both methods are/were highly unstable, so it's not really out there as a feature)
It has also supported URNs as external identifiers (LSIDs for the Life Sciences, etc.).
So the current structure is indeed built around remaining in sync with any external applications that maintain their own lists... It's (stable development of that API) just not really out there because there's not a lot of freely-available resources to talk to, so each dev project has not had a lot of scope for re-use outside of the individual cases.

Re postsave and hooks, they are just there for folk to find uses for. I'm not sure myself :-)
Though they are currently a little tacked on and duplicated throughout the code, because they were never exactly architected. You are right that post-save, after you've got the tid, is a lot more use. I may try to consolidate those hooks back into the base taxonomy_xml lib if I can. D7.

FWIW, this is what I see when I import (my modified version of) your file:

(see attached image: term_edit_screen-with_taxonomy_enhancer_fields.png)

That is using taxonomy_enhancer to serialize the extra fields, and edit_term and rdf.module for flavour.

Also - while you are working on that output, rdf:nodeID seems to produce a localized id that's only good for the duration of the parse and describes the XML node that contains the data. Not the one you want. rdf:ID is the identifier of the concept being described, and is probably what you mean. If you wanted to refer to that term in that document somewhere on the web, it could be
http://skeyn.com/schemes/EDVOC/ian3.xml#10e0cca523fb6d710a92618de74886df but only if it was

  <skos:Concept rdf:ID="10e0cca523fb6d710a92618de74886df">

and not

  <skos:Concept rdf:nodeID="10e0cca523fb6d710a92618de74886df">

(I think that confusion is also what was causing some 'invalid name' complaints from that parser. I can't explain beyond that, I'm just deducing from example and trials here myself. W3C and ARC put a _ in front of your one when they find it. That means it's only being used for parsing by the tool, not for reference.)
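
The difference can be sketched in a few lines of Python. This illustrates the two identifier forms only and is not how ARC is implemented; the function names are invented, and the document URI is the hypothetical one from above:

```python
from urllib.parse import urljoin

def resolve_rdf_id(document_uri, rdf_id):
    """An rdf:ID names the resource <document URI>#<id>."""
    return urljoin(document_uri, "#" + rdf_id)

def resolve_node_id(rdf_node_id):
    """An rdf:nodeID only labels a blank node local to one document."""
    return "_:" + rdf_node_id

doc_uri = "http://skeyn.com/schemes/EDVOC/ian3.xml"
guid = "10e0cca523fb6d710a92618de74886df"

global_name = resolve_rdf_id(doc_uri, guid)
local_name = resolve_node_id(guid)
print(global_name)
print(local_name)
```

One further caveat, as far as I know: both rdf:ID and rdf:nodeID values have to be XML names, and an XML name cannot start with a digit, so a GUID like 10e0cca5... may need a letter prefix before a strict parser accepts it. That would also fit the 'invalid name' complaints.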
