Add XPath query Feeds Tamper plugin. [#1459870]

This is a new plugin for Feeds Tamper to run XPath queries on a certain set of data.

We faced a problem where an XML element contained HTML, but the HTML was entity-encoded. This made it difficult to use Feeds XPath Parser, because it wouldn't recognize the embedded elements.

So the solution was to choose the proper element with Feeds XPath Parser (or Feeds Querypath parser), decode it with Feeds Tamper and then use XPath queries again to get the proper body, image, category etc. from the html encoded swamp of characters.
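A minimal sketch of the decode-then-query idea in plain PHP, outside of Feeds. The function name example_query_encoded_html() and the sample markup are made up for illustration; this is not the code in the attached plugin, and it assumes the decoded content can be treated as forgiving HTML rather than strict XML.

```php
<?php

/**
 * Hypothetical helper: decodes an entity-encoded HTML string and runs an
 * XPath query against the result. Returns the matched values as strings.
 */
function example_query_encoded_html($encoded, $query) {
  // Step 1: turn &lt;p&gt; back into <p>, and so on.
  $html = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');
  if (trim($html) === '') {
    return array();
  }

  // Step 2: parse leniently as HTML and keep libxml warnings quiet.
  $previous = libxml_use_internal_errors(TRUE);
  $doc = new DOMDocument();
  $doc->loadHTML($html);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  // Step 3: run the query and collect element text or attribute values.
  $xpath = new DOMXPath($doc);
  $nodes = $xpath->query($query);
  if ($nodes === FALSE) {
    return array();
  }
  $values = array();
  foreach ($nodes as $node) {
    $values[] = trim($node->textContent);
  }
  return $values;
}

// Made-up sample of an encoded field value.
$encoded = '&lt;p id="text"&gt;Body text&lt;/p&gt;&lt;img src="photo.jpg" /&gt;';
$body = example_query_encoded_html($encoded, "//p[@id='text']");
$image = example_query_encoded_html($encoded, '//img/@src');
```

The same decoded string can be queried several times, once per target field (body, image, category and so on).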

I tried to generalize it as much as possible to remove any custom solutions we had for our own use. Sponsored by Mearra

Comment	File	Size	Author
	xpath_query.inc_.zip	1.42 KB	mErilainen


Comments

Comment #1

derekw commented 10 August 2012 at 23:25

After working on the same problem (image embedded in feed description, html entities encoded) for two days, I found your xpath_query plugin for feeds tamper.

Would you post the details of your feeds_tamper for your use case?

Here's mine that's not working (error). Not sure if the HTML is not getting decoded first or what.

PARSER:
xpathparser:5 Image

Xpath query: description/text()

TAMPER:
xpathparser:5 -> Image Description

HTML entity decode
Xpath query //img[1]/@src
(Also tried Xpath query //img/@src[1], all sorts of "contains raw" variations, etc.)

Warning: DOMDocument::loadXML() [domdocument.loadxml]: Premature end of data in tag img line 1 in Entity, line: 5 in feeds_tamper_xpath_query_callback() (line 59 of /home/tradelin/public_html/sites/all/modules/feeds_tamper/plugins/xpath_query.inc).

Here's the fdebug output from the Feeds Xpath Parser:

<img src='http://assets.bizjournals.com/nashville/blog/genericschoolNSH*100.jpg?v=1'>The longer the school funding debate in Sumner County drags on, the tougher it becomes for those trying to lure companies to town. Companies look for strong, stable school systems when deciding where to relocate or expand, said James Fenton, executive director of the Gallatin Economic Development Agency. "Right now just Google Sumner and it's popping up," Fenton said. "If it drags on, then the headlines are going to have a very negative impact on us. If everyone can come together quickly and move…<div class="feedflare"> <a href="http://feeds.bizjournals.com/~ff/industry_22?a=PV4pryx-1WA:EdH0TiDHrgg:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/industry_22?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/industry_22?a=PV4pryx-1WA:EdH0TiDHrgg:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/industry_22?i=PV4pryx-1WA:EdH0TiDHrgg:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/industry_22?a=PV4pryx-1WA:EdH0TiDHrgg:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/industry_22?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/industry_22?a=PV4pryx-1WA:EdH0TiDHrgg:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/industry_22?i=PV4pryx-1WA:EdH0TiDHrgg:gIN9vFwOqvQ" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/industry_22?a=PV4pryx-1WA:EdH0TiDHrgg:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/industry_22?i=PV4pryx-1WA:EdH0TiDHrgg:F7zBnMyn0Lo" border="0"></img></a> </div>
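One thing that may matter with markup shaped like the output above: in XPath, //img[1] matches every img that is the first img child of its own parent, so it can return the feedflare images as well as the lead image, while (//img)[1] matches only the first img in the whole document. The sketch below uses shortened, made-up markup with the same shape and parses it as HTML; whether the plugin keeps only the first result of a query is not something this example assumes.

```php
<?php

// Made-up markup with the same shape as the fdebug output: one lead image
// followed by linked images inside a "feedflare" div.
$html = '<img src="first.jpg">Story text'
  . '<div class="feedflare">'
  . '<a href="#a"><img src="flare1.gif"></a> '
  . '<a href="#b"><img src="flare2.gif"></a>'
  . '</div>';

$previous = libxml_use_internal_errors(TRUE);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);
$xpath = new DOMXPath($doc);

/**
 * Hypothetical helper: flattens a node list into an array of string values.
 */
function example_values(DOMNodeList $nodes) {
  $values = array();
  foreach ($nodes as $node) {
    $values[] = $node->nodeValue;
  }
  return $values;
}

// First img child of each parent: the lead image plus one per link.
$per_parent = example_values($xpath->query('//img[1]/@src'));

// First img in document order only.
$first_only = example_values($xpath->query('(//img)[1]/@src'));
```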

Comment #2

twistor commented 14 August 2012 at 04:38

Awesome, I'll get to this soon.

Comment #3

mErilainen commented 14 August 2012 at 16:22

Not sure where the problem is, looks like you still have some html entities in the output of fdebug.
I have used xpath selectors such as //p[@id='text'], but your case is different, so the first selector seems correct.

Comment #4

richsky commented 5 September 2012 at 13:16

I would love to see this one working! Do you think I could use it to match images in something like this:

<xmldoc>
<xmlelement>
<mykeyreturnedwithfeedsxpathparser>
&lt;h1&gt;My title&lt;/h1&gt;
&lt;p&gt;My text &lt;img src="myimage1.jpg" /&gt;&lt;p&gt;
&lt;p&gt;Other text &lt;img src="myimage2.jpg" /&gt;&lt;p&gt;
</mykeyreturnedwithfeedsxpathparser>
</xmlelement>
</xmldoc>

So far I could not, using this query:

//img/@src

Comment #5

mErilainen commented 6 September 2012 at 10:28

I had exactly the same kind of document.
- First I use Feeds XPath XML Parser to select the field from an XML document and save it to a temporary target.
  * In your case the context would be "//xmldoc" and the selector for the encoded mess "mykeyreturnedwithfeedsxpathparser".
- Then I HTML decode that in Feeds Tamper.
- Then I use the Copy element value plugin to save it to whatever field I want to map it to. (This is not required if you only need to extract one value from the encoded mess; you can map the field straight to your desired field in Drupal.)
- Then I add the XPath tamper plugin to select the content with an XPath selector, because after decoding it should be proper XML / HTML.

I have used selectors such as
//a[@id='Original']/@href
and
//p[@id='text']
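The steps above can be sketched in plain PHP against the sample document from comment #4. This is an illustration of the flow, not what Feeds or the plugin runs internally. Two assumptions: the context is //xmlelement rather than //xmldoc, because the key is a direct child of xmlelement in that sample, and the field value is serialized with saveXML(), which is what the parser's getRaw() does, so it still carries entities before the decode step.

```php
<?php

// Sample document from comment #4, unchanged.
$xml = <<<'XML'
<xmldoc>
<xmlelement>
<mykeyreturnedwithfeedsxpathparser>
&lt;h1&gt;My title&lt;/h1&gt;
&lt;p&gt;My text &lt;img src="myimage1.jpg" /&gt;&lt;p&gt;
&lt;p&gt;Other text &lt;img src="myimage2.jpg" /&gt;&lt;p&gt;
</mykeyreturnedwithfeedsxpathparser>
</xmlelement>
</xmldoc>
XML;

// Stage 1: select the encoded field from the outer XML document.
$outer = new DOMDocument();
$outer->loadXML($xml);
$outer_xpath = new DOMXPath($outer);

$images = array();
foreach ($outer_xpath->query('//xmlelement') as $context) {
  $selector = 'mykeyreturnedwithfeedsxpathparser/text()';
  foreach ($outer_xpath->query($selector, $context) as $text) {
    // Serializing the node escapes the markup again (&lt;h1&gt;...).
    $encoded = $outer->saveXML($text);

    // Stage 2: HTML entity decode.
    $html = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

    // Stage 3: parse the decoded markup leniently and query it.
    $previous = libxml_use_internal_errors(TRUE);
    $inner = new DOMDocument();
    $inner->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $inner_xpath = new DOMXPath($inner);
    foreach ($inner_xpath->query('//img/@src') as $src) {
      $images[] = $src->nodeValue;
    }
  }
}
```

The inner document is parsed as HTML here because the sample paragraphs are not closed, which a strict XML parser would reject.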

Comment #6

twistor commented 7 September 2012 at 07:41

Status: Needs review » Needs work

I see a lot of issues with this. We will need a setting for XML vs HTML. Will need tests. I'm wondering if this should be supplied by the XPath Parser.
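To illustrate why an XML vs HTML setting could matter, here is a rough sketch, not a proposed patch. The function name example_load_markup() and the 'xml' / 'html' mode values are hypothetical. Strict XML parsing rejects an unclosed img tag, which matches the "Premature end of data in tag img" warning in comment #1, while the HTML parser accepts the same input.

```php
<?php

/**
 * Hypothetical loader with a mode switch.
 *
 * Returns a DOMDocument, or NULL if the markup could not be parsed.
 */
function example_load_markup($markup, $mode = 'html') {
  $previous = libxml_use_internal_errors(TRUE);
  $doc = new DOMDocument();
  $ok = ($mode === 'xml') ? $doc->loadXML($markup) : $doc->loadHTML($markup);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);
  return $ok ? $doc : NULL;
}

// Made-up markup: an unclosed <img>, as is common in feed descriptions.
$markup = "<img src='http://example.com/a.jpg'>Some text";

// Not well-formed XML, so this is NULL.
$as_xml = example_load_markup($markup, 'xml');

// The HTML parser tolerates the unclosed tag.
$as_html = example_load_markup($markup, 'html');
$xpath = new DOMXPath($as_html);
$src = $xpath->query('//img[1]/@src')->item(0)->nodeValue;
```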

Comment #7

MilanAK commented 16 May 2013 at 16:07

I have the same issue with the Feeds XPath Parser not being able to run queries on an RSS feed with encoded HTML, in particular "&lt;" and "&gt;" rather than < and >. Is there any workaround for this now? Is there any fix planned in the near future? Can I somehow use variable substitution in an XPath query to interpret the "&lt;" and "&gt;" correctly? There are about 10 fields in this description that I need to extract. Thanks very much.

<description><div class="field field-type-content-taxonomy field-field-institutions"> <div class="field-label"> <h3> Associated institutions:&nbsp; </h3> </div> <div class="field-items"> <div class="field-item odd"> Cincinnati Children&#039;s Hospital Medical Center </div> <div class="field-item even"> University of Cincinnati </div> </div> </div> <div class="field field-type-text field-field-url"> <div class="field-label"> <h3> . . . <description>

Comment #8

twistor commented 16 May 2013 at 18:57

The XPath parser can't parse encoded HTML. It's simply a blob to the parser. There's not really a way to fix it, which is why something like this is valid.
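A small sketch of what "a blob to the parser" means in practice. The feed fragment is made up, loosely modelled on the description in comment #7. To the XML parser the encoded markup is a single text node, so element queries inside it find nothing.

```php
<?php

// Made-up feed fragment with entity-encoded HTML in the description.
$feed = '<item><description>'
  . '&lt;div class="field-item"&gt;University of Cincinnati&lt;/div&gt;'
  . '</description></item>';

$doc = new DOMDocument();
$doc->loadXML($feed);
$xpath = new DOMXPath($doc);

// No <div> elements exist in the XML tree; count() evaluates to 0.
$div_count = $xpath->evaluate('count(//div)');

// <description> has no child elements either, only text.
$child_count = $xpath->evaluate('count(//description/*)');

// The text itself is the markup, as one plain string.
$text = $xpath->evaluate('string(//description)');
```

Getting at the div requires parsing that string a second time, which is the step the tamper plugin adds.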

Comment #9

MilanAK commented 16 May 2013 at 21:24

Thanks very much for your lightning-fast response. I'm a bit desperate for a quick fix. Might it be possible to do a two-stage RSS import somehow? The first stage would use XPath's XML parser to parse the XML, begin creating a node based on the XML fields, and put the blob of encoded HTML somewhere. The second stage would use XPath's HTML parser to read in that blob of HTML, parse it, and append fields to the node created in the first stage.

Just to be redundant, there's no way I can use variable substitution in an XPath query to parse the encoded HTML, correct? If I could just apply an XPath query on the unencoded HTML that shows up in the XPath query error....

Eventually, it would be great if there were a radio button (Parse encoded HTML?) when users choose the XPath XML parser. If that button is selected, a preprocessor converts &gt; to > and &lt; to <. I'd be happy to test that feature.

Comment #10

MilanAK commented 17 May 2013 at 23:23

In theory I think it should be possible to unencode the encoded HTML by replacing &lt; and &gt; with '<' and '>', respectively, by using variable substitution, intermediate fields and applying the XPath "translate" function twice. As a first pass, I've managed to get:

&lt;div class="field field-type-content-taxonomy field-field-institutions"&gt; &lt;div class="field-label"&gt; &lt;h3&gt; Associated institutions:&amp;nbsp; &lt;/h3&gt; &lt;/div&gt; &lt;div class="field-items"&gt; &lt;div class="field-item odd"&gt; Cincinnati Children

translated to:

<div cass="fied fied-ype-conen-axonomy fied-fied-insiuions"<g <div cass="fied-abe"<g <h3<g Associaed insiuions:<ampnbsp </h3<g </div<g <div cass="fied-iems"<g <div cass="fied-iem odd"<g Cincinnai Chidren

by applying:

translate('$field_description', string('&lt;'), '<').

Unfortunately, I'm losing "l" and "t" in the process, and &gt; is being translated to "<g". I'm going to see if the XPath replace function will do any better.
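The lost letters follow from how translate() is defined in XPath 1.0: it maps single characters, not substrings. Each character of the second argument is replaced by the character at the same position in the third argument, and characters with no counterpart are deleted. With '&lt;' and '<', that means & becomes <, while l, t and ; are removed everywhere. Note also that replace() is an XPath 2.0 function, and PHP's DOMXPath implements XPath 1.0. The sketch below uses a short made-up string and shows PHP's str_replace() for comparison, since it works on whole substrings.

```php
<?php

// A trivial document is enough; translate() only needs an XPath context.
$doc = new DOMDocument();
$doc->loadXML('<root/>');
$xpath = new DOMXPath($doc);

// Character mapping: '&' => '<'; 'l', 't' and ';' have no counterpart
// in the third argument, so they are deleted from the whole string.
$translated = $xpath->evaluate("translate('&lt;title&gt;', '&lt;', '<')");

// Substring replacement in PHP, for comparison.
$replaced = str_replace(array('&lt;', '&gt;'), array('<', '>'), '&lt;title&gt;');
```

Here $translated comes out as "<ie<g", the same kind of damage as in the output above, while $replaced is "<title>".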

Comment #11

MilanAK commented 21 May 2013 at 16:11

With guidance, I'd like to help with implementing this addition. I have a background in software development but am relatively new to Drupal and PHP. I do need to show results as soon as possible.

Comment #12

MilanAK commented 23 May 2013 at 21:00

I added the following three lines of code (after the MAK comment) to FeedsXPathParserXML.inc to replace &lt;, etc. with "<", etc. That is, I used the str_replace function to substitute literal HTML characters for the encoded entities in my input ($raw) RSS feed. What's the best way to make this patch available to others?

<?php

/**
 * @file
 * Provides the FeedsXPathParserXML class.
 */
class FeedsXPathParserXML extends FeedsXPathParserBase {

  /**
   * Implements FeedsXPathParserBase::setup().
   */
  protected function setup($source_config, FeedsFetcherResult $fetcher_result) {

    if (!empty($source_config['exp']['tidy'])) {
      $config = array(
        'input-xml' => TRUE,
        'wrap'      => 0,
        'tidy-mark' => FALSE,
      );
      // Default tidy encoding is UTF8.
      $encoding = $source_config['exp']['tidy_encoding'];
      $raw = tidy_repair_string(trim($fetcher_result->getRaw()), $config, $encoding);
    }
    else {
      $raw = $fetcher_result->getRaw();
    }
    
    /* MAK 052213 Unencode embedded HTML so that it can be parsed using XPath */
    $encoded_html = array("&lt;", "&gt;", "&quot;", "&amp;nbsp;", "&amp;#039;");
    $unencoded_html = array("<", ">", "\"", " ", "'");
    $raw = str_replace($encoded_html, $unencoded_html, $raw); 
    
    $doc = new DOMDocument();
    $use = $this->errorStart();
    $success = $doc->loadXML($raw);
    unset($raw);
    $this->errorStop($use, $source_config['exp']['errors']);
    if (!$success) {
      throw new Exception(t('There was an error parsing the XML document.'));
    }
    return $doc;
  }

  protected function getRaw(DOMNode $node) {
    return $this->doc->saveXML($node);
  }
}

Comment #13

mErilainen commented 8 January 2014 at 12:42

Replying to MilanAK: Why are you trying to do html entity decoding in code when there is a Feeds Tamper plugin for doing that? After decoding the html entities you should be able to use the plugin for selecting elements with XPath.