RFC: XMLnode

By pbryan on 29 Aug 2005 at 18:54 UTC

I am considering creating a new module for Drupal. I would be interested in any feedback the Drupal development community may have.

XMLnode is a module that will allow node content to be expressed as XML, then transformed to XHTML through XSLT. This allows flexible authoring, storage and consistent presentation of arbitrarily complex content.

In the XMLnode configuration, XML document types can be specified and associated with an XSLT filter. When an XML node is requested, XMLnode module detects the document type and filters the content through the associated XSLT.
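A minimal sketch of the lookup described above, written in Python for brevity (an actual Drupal module would be PHP). It assumes the "document type" is identified by the name of the root element. The mapping table, the stylesheet file names, and the function names are hypothetical and are not part of any existing module.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from document type (root element name) to stylesheet.
STYLESHEETS = {
    "article": "article-to-xhtml.xsl",
    "recipe": "recipe-to-xhtml.xsl",
}


def document_type(xml_text):
    """Return the root element's local name, ignoring any namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as "{uri}local".
    return root.tag.rsplit("}", 1)[-1]


def stylesheet_for(xml_text):
    """Look up the stylesheet associated with a node's document type."""
    return STYLESHEETS.get(document_type(xml_text))
```

A node whose type has no entry returns None here, which a module could treat as "display the content untransformed" or as a validation error.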

XSLT transformation output is cached so that a large volume of requests does not force nodes to be transformed again on every request.
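One way such a cache could be keyed, sketched in Python with an in-memory dictionary standing in for a real cache table. The key is derived from both the XML and the stylesheet, so editing either one naturally produces a miss. The names cache_key and cached_transform are hypothetical, and the transform argument is a placeholder for whatever XSLT processor is used.

```python
import hashlib

_cache = {}  # in-memory stand-in for a real cache table


def cache_key(xml_text, stylesheet_text):
    """Key depends on both inputs, so editing either one yields a new key."""
    digest = hashlib.sha256()
    digest.update(xml_text.encode("utf-8"))
    digest.update(b"\0")  # separator between the two inputs
    digest.update(stylesheet_text.encode("utf-8"))
    return digest.hexdigest()


def cached_transform(xml_text, stylesheet_text, transform):
    """Run transform(xml_text, stylesheet_text) only on a cache miss."""
    key = cache_key(xml_text, stylesheet_text)
    if key not in _cache:
        _cache[key] = transform(xml_text, stylesheet_text)
    return _cache[key]
```

This sketch never evicts old entries; a real cache would need an expiry policy.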

I am also considering a subsequent module that allows an XForms implementation to drive authoring of XML node documents in Drupal's create content form, allowing fill-in-the-blanks style XML content creation.

I'm guessing a common question could be: why not just use Flexinode? Flexinode is a nice tool, but has some limitations. Some reasons for XMLnode:

- XML format is well defined, and many tools support authoring, presentation and transformation.
- Flexinode has a fairly constrained method of defining presentation. Different XML types can be transformed in various ways, based on the defined stylesheets.
- XML documents are more "future proof" and portable than more obscure formats. One can deploy content on another platform in the future with far less transformation of content.
- Ease of transformation to other content media, such as PDF, using existing technologies such as XSL-FO.

I'd be interested in hearing comments about whether I am barking up the wrong tree, whether this would be of value to anyone else, whether someone is already undertaking such a project, etc.

Thanks in advance for your feedback.

Comments

Sounds great!

gjost commented 31 August 2005 at 18:09

You asked for feedback, so here goes:

Are you going to use Drupal filters for this? If so, it might be nice for the XMLnode filters to NOT appear in the filters list under each textarea...

It would be great if the administrator could create a library of XSLT filters, which would be available in a select box on the XMLnode edit page.

To take it to another level...
- Make it possible to run the XML through multiple XSLT filters.
- Allow creation of multi-XSLT chains, and make THESE appear in the filters list on the node edit page.

Pretty soon we'll have a PHP version of Cocoon... within Drupal! :)

Re: Sounds great!

pbryan commented 2 September 2005 at 19:14

gjost wrote:

Are you going to use Drupal filters for this? If so, it might be nice for the XMLnode filters to NOT appear in the filters list under each textarea...

So far, my plan is to make it a different content type. I imagine it would require unfiltered, arbitrary XML markup.

It would be great if the administrator could create a library of XSLT filters, which would be available in a select box on the XMLnode edit page.

Hmm. So, one would select the presentation to use when inputting the content. Interesting. Can you think of examples where you may want a particular type of content to be transformed one way or another when inputting it, rather than a more global XML document type - XSLT mapping?

To take it to another level...
- Make it possible to run the XML through multiple XSLT filters.
- Allow creation of multi-XSLT chains, and make THESE appear in the filters list on the node edit page.

Wow, chains. How about this:

If an XSLT filter produces XML output, its document type is looked up in the XSLT mapping, and if a mapping is found, the output runs through that template, and so on...
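The chaining rule above can be sketched as a loop, shown here in Python. Plain callables stand in for XSLT templates, the document type is assumed to be the root element's tag, and the guard against cycles and the max_steps limit are additions of this sketch, not something proposed in the thread. All names are hypothetical.

```python
import xml.etree.ElementTree as ET


def root_name(text):
    """Return the root element tag, or None if text is not well-formed XML."""
    try:
        return ET.fromstring(text).tag
    except ET.ParseError:
        return None


def run_chain(xml_text, mapping, max_steps=10):
    """Keep transforming while the current document type has a mapping."""
    seen = set()
    for _ in range(max_steps):
        doc_type = root_name(xml_text)
        if doc_type not in mapping or doc_type in seen:
            break  # no mapping, not XML, or a loop in the mapping
        seen.add(doc_type)
        xml_text = mapping[doc_type](xml_text)
    return xml_text
```

The chain ends as soon as a step produces a type with no mapping, which would normally be the final XHTML. Without the seen set, two templates that map to each other's types would keep running until max_steps is reached.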

Pretty soon we'll have a PHP version of Cocoon... within Drupal! :)

Yikes!

This sounds very interesting

Boris Mann commented 31 August 2005 at 18:36

I know of some desktop XML tools that it might be interesting to expose this to remotely (that is, before you go the XForms route). Think of it like desktop blogging, except for structured XML formats.

Very cool!

Go for more generality

puregin commented 31 August 2005 at 19:00

I've been thinking along the same lines. However, all the world's not XML. It would be nice to have the same kinds of filtering/caching/editor extension mechanisms available for any kind of specialized content - e.g., TeX/LaTeX, various graphical/drawing formats, etc.

I'd envisioned setting things up so that nodes would store content in various representations, and have associated proxies for display and other uses/interactions.

Other nodes can store elements such as stylesheets, macros, scripts, or other ancillary content.

By the way, this is how the DocBook XML/PDF generation stuff I'm working on does things, so I'd be very happy to see these kinds of mechanisms formalized.

The default case could be a text node, with an HTML proxy.

By the way, are the results of filters currently cached? If so, where and how?

--
puregin

puregin wrote: I've been

pbryan commented 2 September 2005 at 19:25

puregin wrote:

I've been thinking along the same lines. However, all the world's not XML. It would be nice to have the same kinds of filtering/caching/editor extension mechanisms available for any kind of specialized content - e.g., TeX/LaTeX, various graphical/drawing formats, etc.

That sounds interesting. If we can generalize without adding too much complexity, I'm game. Can you think of a generalized transformation methodology or framework that could be used to support arbitrary content types and not just XML?

[snip]

By the way, this is how the DocBook XML/PDF generation stuff I'm working on does things, so I'd be very happy to see these kinds of mechanisms formalized.

Could you point me to your code? I'd be very interested in ensuring that I don't reinvent any wheels and can dovetail with other development efforts.

By the way, are the results of filters currently cached? If so, where and how?

I'm not yet up to speed on Drupal's caching mechanism. Perhaps someone else from the development team can comment.

Flexinode appears to "cache" its transformation by writing XHTML into the node, so that when it's displayed, the node content is dumped right from the node table.

I'm not a huge fan of that approach. I would prefer the result of the transformation be cached transiently.

I've had similar thoughts.

styro commented 31 August 2005 at 23:05

But initially at least I was thinking along the lines of using some sort of 'external content' module to read XHTML files from disk. The XSLT would be done separately by some sort of build process (e.g. Ant), then uploaded to the server. The reason for that was mainly that reinventing Cocoon/Forrest/Lenya stuff seemed a little ambitious for me :)

Drupal would then cache them to allow them to be searched etc, and would somehow check for updates occasionally. Drupal will still be running the web site, forums, user logins, access control, search etc etc - it's just we would no longer have to create the content in Drupal.

We intend to start doing more single source publishing / content reuse, and are looking into using DITA (http://dita-ot.sourceforge.net/ , http://xml.coverpages.org/dita.html) as the native XML format for our content. Drupal and DITA seem to be quite compatible in their approaches to information architecture.

My ideas involved the addition of two modules: one for DITA Topics, which would be a Drupal node-like module, and one for DITA Maps, which would be an extension of the Drupal book module.

I'd be keen to help out where I can with your ideas, and improve my module development skills :)

Re: I've had similar thoughts.

pbryan commented 2 September 2005 at 19:28

Thanks for your thoughts.

I'd be keen to help out where I can with your ideas, and improve my module development skills :)

I plan on solidifying some design and putting together a brief roadmap next week, after which, I'll need all the help I can get!

More thoughts about architecture and different XML formats

styro commented 3 September 2005 at 00:42

Actually the more I think about how DITA works, the more I think it would require the external build process and wouldn't be very easy to get Drupal doing the transforms internally.

DocBook-like nodes should work fine with your approach though. I'm no DocBook expert, but I seem to recall that DocBook documents are quite self contained and that the transforms are a single pass type deal with a single (if very large) stylesheet (I could be wrong about that though - corrections welcome). That approach should suit storing the XML and XSLT in Drupal and caching the results.

DITA on the other hand has all kinds of cross referencing between topics, reusing of bits of content from other topics, different specialisations of topic types, directories full of transforms and a two pass build process. It's all very tricky and flexible, and usually handled by the Ant jobs included with it. I'd hate to try and reproduce all that build logic in Drupal.

What I'm getting at, is that ideally I'd prefer a solution that allowed external processes to do the XHTML conversion step, and using Drupal modules to work with that external content - eg a bit like what the XStatic module (from the Linux Journal) was supposed to do. This approach would probably also allow for LaTeX and other non XML content as well - although probably in a different way than puregin was thinking about.

I did find gjost's idea of chaining transforms as filters intriguing though :)

Drupal + DITA = Big Win

puregin commented 15 November 2005 at 19:47

Styro, I agree that Drupal + DITA has huge potential, and that the two are well aligned in some ways.

There are lots of different ways in which one could integrate Drupal and DITA - from simple import/export, to setting up a purpose-built CMS built on a fixed DITA specialization, to building a complete collaborative authoring and review system for DITA.

The flexibility of DITA presents a challenge in terms of implementing workflow, user interaction, and integration with the processing toolchains. A general, integrated authoring system/workbench would require some way of authoring DITA topic and domain specializations (integrated XML editors + DITA in XML Schema? specialized UIs and relax-ng?), support for conditional processing, tools for such things as map authoring, support for dynamically generated maps (e.g., maps generated from feeds, search results, taxonomy searches, groups, ...)

Drupal offers lots of entry points, though. For example, as Boris pointed out, there are desktop XML editors that could be integrated via XMLRPC. Taxonomy is a synergy with DITA metadata.

DITA could be the 'killer app' for Drupal, and vice versa...

--
puregin

Yep, many approaches

styro commented 16 November 2005 at 00:13

I suppose Drupal has many different purposes also. Usually with most installations these are all combined to a certain extent.

There's:

Content Creation
Content Management
Content Publishing
Site Management
Community Management

etc

I haven't done any work recently on DITA or Drupal (too many other things getting in the way), but my less ambitious approach was to only use Drupal for the last three tasks and take care of the first two on the LAN with desktop XML editors and ??? (3. Profit. hehe)

DITA just seemed too complex for someone of my experience to get Drupal to handle the first two tasks. If someone manages that though - my hat is off to them :)

There are some good ideas being floated around here, and I realise that helping out when they get off the ground a bit more would be a better use of my efforts than doing my own thing.

--
Anton

Thanks!

pbryan commented 2 September 2005 at 19:28

Thanks everyone for your comments so far. They have been very helpful in directing my thoughts.

I'm ready to launch into this again

puregin commented 15 November 2005 at 19:22

Looks like we've some good ideas, insightful comments, and various bits of code happening.

I think it would be good if we could arrange to hop onto Drupal IRC, and see if we can come up with a plan to coordinate some action.

Contact me via my contact form http://drupal.org/user/9170/contact
and let's try to set something up.

Djun

Hm, sounds like something I built yesterday.

dman commented 27 September 2005 at 13:44

Well, as I mentioned in passing recently, I've got a similar project going on.

I'm doing it single-pass import xhtml via xsl to nodes, rather than anything fully dynamic.
But I've already got a cocoon-style pipeline (including content scrapers and translators that kick in depending on source URL, source regexps or source xpaths) that enable me to look at a stream and decide which xsl filters to pass it through until I get to a data structure I can map into a node.
(I do need to smooth the user admin of that part out still, but the process is sound)

This pipeline can obviously take any XML input. I have XSL converters for RSS and RecipeML that work just as effectively as HTML.

I'm currently starting on the structural side of things.

Points on the current discussion:

XML is for moving stuff about and between places - almost never a suitable format for internal storage.
Export to XML, Import from XML, but for Gawds sakes don't try to use it as a native format.
What's all this talk about using external processors when PHP has workable XSL already? Granted, the differences between PHP4 & 5 are annoying (as are the diffs between 4.3 and 4.4) but they work fine after a poke at php.ini!

My intention is a little different from yours. I just want to get info in and compatible with Nodes. From a variety of sources I want to massage it to be homogeneous.
If I wanted a general-purpose database that could store arbitrary content, (which is what you'd end up with) I'd roll my own.

I hope to have my module polished up enough to demo in a week or so. I'm still learning about the hooks and internals that I can plug into - which is taking me longer than writing my own, but much tidier.

Can anybody suggest a good reason for me to keep working with Tomas Mandys import-export schema? It seems a bit arbitrary. That's what I go into to import, but is anyone else actually using it?

Is portability a real reason to avoid using the php xsl modules? I know they are not enabled by default, but surely web hosts are getting more XML-savvy this year?

.dan.

_{.dan. is the New Zealand Drupal Developer working on Government Web Standards}

One teeny answer :)

styro commented 28 September 2005 at 01:10

What's all this talk about using external processors when PHP has workable XSL already? Granted, the differences between PHP4 & 5 are annoying (as are the diffs between 4.3 and 4.4) but they work fine after a poke at php.ini!

To me, the reason for wanting an external processor is because you already need an external processor for existing or intended workflow.

For an organisation whose only publishing medium is their Drupal website, then yes, there doesn't seem to be a problem getting Drupal to do their transforms internally.

But if Drupal is just one place for your content to end up (e.g. it also goes into online help in a software product, other web applications, PDFs for printed collateral, etc.), it's much better for your internal build/publishing systems to just send the final HTML content to Drupal than to replicate in Drupal everything you've already done elsewhere.

For those tracking this thread

dman commented 23 January 2006 at 13:20

My new Import HTML module is XML based.

Announcement and discussion here

It uses XSL to extract content from existing web pages, but can easily do the same to non-XHTML input.

.dan.

http://www.coders.co.nz/

_{.dan. is the New Zealand Drupal Developer working on Government Web Standards}

Why not store XML?

bmargulies commented 21 February 2006 at 20:57

We can't be the only people in creation that really use an XML markup as the reference version of documents, then use XSL to get to HTML or whatever else we need. Thus, for us, storing XML in nodes and using a filter to yield the HTML for display is just fine. Just doing this as a plain old filter module is, as discussed, clumsy due to the desire for support of different stylesheets.

I may create a rather specialized form of such a filter just to get some work done while waiting for a more general facility to emerge.

node_xml

peterx commented 4 October 2005 at 10:43

I can understand everyone wanting something different from XML. I am looking at reusing code from another project to add XML to Drupal. The code stores the input in a separate table, say node_xml, and a directory for documents. node_xml is transformed to node when the XML is loaded or edited. The XML in node is then transformed again on presentation.

The second transformation applies dynamic aspects of the site. The first transformation converts the source format to a common format and applies some transformations from modules. As an example the input from Abiword can be transformed to the common format on input. If the Abiword document is edited then the input is transformed again.

The node_xml table handles the management of input.

Multiple transformations are ok when performed on input. If there are multiple transformations performed on output then some of the transformations could be merged in various ways to reduce the overhead. I used to do it with the PHP 4 XSLT and have not yet tried it with the PHP 5 XSL extension.

I look forward to seeing your module when it is ready for testing.

http://petermoulding.com/technology/content_management_systems/drupal/

petermoulding.com/web_architect

Pros and Cons of separate xml_node

puregin commented 15 November 2005 at 19:29

Interesting idea - having a separate node table for XML nodes makes some things easier. But it looks like one would lose the integration with many of the other things Drupal does with nodes - search, feeds, revisions, structures such as books, taxonomy, etc...

On the balance I think I'd want to look pretty hard at making things work with the existing node table, if at all possible.

Djun

--
puregin

cool

moshe weitzman commented 15 October 2005 at 00:34

sounds interesting ... note that you need not do any caching in your module. drupal has a filter cache which will auto-cache the rendered version of your node. all you have to do is implement this using the filter system.

looking for the cache discussion nodes.

peterx commented 19 October 2005 at 05:09

The Drupal cache sounds nice. There are modules that render items for every page view, which means the Drupal cache has an override to not cache some nodes. I believe the cache is automatically turned off for all nodes when you are logged in.

A separate XMLcache could be a good idea or perhaps the XML cache could tell the Drupal cache when to cache a node and when to update a node.

When I test with Drupal cache I find cases where it seems to cache stuff that has changed and does not cache stuff that remains the same. There are discussions on cache that suggest the developers tried to build a really smart cache but dropped back to a simpler cache because there were too many variables. Caching is easier if a node process tells the cache when a change occurs.

Nodes are cached by language to allow for multiple languages. Nodes are cached by user to allow for user choosing different themes. Node 123 is stored for user Brian as Brian/en/123 when Brian chooses English as the language. If the XML processor then changes node 123 the cache has to delete every version of node 123. I think that is why cache is switched off for logged in users.
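The invalidation problem in the paragraph above can be illustrated with a toy cache in Python. It follows the Brian/en/123 keying as described in this comment, not Drupal's actual cache code, and the class and method names are hypothetical.

```python
class PageCache:
    """Toy cache keyed by (user, language, node id), as in Brian/en/123."""

    def __init__(self):
        self._entries = {}

    def set(self, user, lang, nid, html):
        self._entries[(user, lang, nid)] = html

    def get(self, user, lang, nid):
        return self._entries.get((user, lang, nid))

    def invalidate_node(self, nid):
        """Drop every cached variant of one node; return how many were removed."""
        stale = [key for key in self._entries if key[2] == nid]
        for key in stale:
            del self._entries[key]
        return len(stale)
```

Invalidation has to scan every key because the node id is only one part of the key. That cost grows with the number of users and languages, which fits the suggestion that caching is simply switched off for logged-in users.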

http://drupal.org/node/13503 Menu cache is not locale aware
http://drupal.org/node/19298 Alternative caching for high traffic websites
http://drupal.org/node/23797 Cache exclusion

XSLT transformations could merge multiple XML documents into one node, which would mean the XML cache could save components of the node.

Take the example of a document that is transformed from RTF to XML and then transformed from XML to XHTML in a node. You change the XSLT for the XML to XHTML transformation. You then want to transform the XML to XHTML without redoing all the RTF to XML transformations. If the XML is cached then you can run just the XML to XHTML transformations.
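A rough Python sketch of that two-stage arrangement: the intermediate XML is cached per document, and the final XHTML is cached per document and stylesheet, so changing the stylesheet reruns only the second stage. The converter functions are passed in as placeholders, the class name is hypothetical, and the sketch assumes the source document itself does not change (handling edits would need an extra invalidation step).

```python
class TwoStagePipeline:
    """Cache the intermediate XML so a stylesheet change reruns only stage two."""

    def __init__(self, to_xml, to_xhtml):
        self.to_xml = to_xml        # stage one: source (e.g. RTF) -> XML
        self.to_xhtml = to_xhtml    # stage two: (XML, stylesheet) -> XHTML
        self._xml = {}              # doc id -> intermediate XML
        self._xhtml = {}            # (doc id, stylesheet) -> XHTML

    def render(self, doc_id, source, stylesheet):
        # Stage one runs at most once per document.
        if doc_id not in self._xml:
            self._xml[doc_id] = self.to_xml(source)
        # Stage two runs once per (document, stylesheet) pair.
        key = (doc_id, stylesheet)
        if key not in self._xhtml:
            self._xhtml[key] = self.to_xhtml(self._xml[doc_id], stylesheet)
        return self._xhtml[key]
```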

http://petermoulding.com/technology/content_management_systems/drupal/

petermoulding.com/web_architect

Instead of a node type ...

bmargulies commented 21 February 2006 at 21:02

I think you should consider giving all nodes a MIME content-type. The content format is completely orthogonal from all the existing node/flexinode characteristics. Well, in the flexinode case, each field of the node would have a content type.

Input filters could then be grouped by the content type they accept.
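As a small illustration of grouping filters by the content type they accept, here is a Python sketch of a registry. The filter names and MIME types are made up for the example, and nothing here corresponds to an existing Drupal API.

```python
from collections import defaultdict

# Hypothetical registry: MIME type -> names of filters that accept it.
_filters = defaultdict(list)


def register_filter(name, accepts):
    """Record that a filter accepts the given MIME content type."""
    _filters[accepts].append(name)


def filters_for(content_type):
    """Filters to offer on the edit form for a node of this content type."""
    return list(_filters.get(content_type, []))
```

With this shape, the edit form for a node would list only the filters registered for that node's content type, which also addresses the earlier concern about XML filters showing up under every textarea.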

logical

sulleleven commented 10 March 2006 at 14:19

that would be interesting to have this option.
maybe useful inside the CCK as well.

i also think that the template engine (phptemplate) should handle arbitrary output formats of nodes.... xml flavors etc... simply by creating a new template file in the theme directory and calling it in a url.

DITA XML and Drupal

coupet commented 31 March 2006 at 16:35

DITA defines an XML architecture for designing, writing, managing, and publishing many kinds of information in print and on the Web.

DITA is an architecture for creating topic-oriented, information-typed content that can be reused and single-sourced in a variety of ways. It is also an architecture for creating new information types and describing new information domains based on existing types and domains.

From the above quotes, there is no mention of the word storage.

Apache is bandwidth limited, PHP is CPU limited, and MySQL is memory limited.