I am developing a site and part of it is aggregating info from other sites using the RSS feeds of those sites. I am not using the Feeds module since the migrate and migrate_plus offer more flexibility.

As it turns out, more that 50% of the sites that I read the RSS from (comming from WordPress platforms btw) have a problem (misconfiguration?) in that at the beginning of the RSS feed there is an empty line, which does not conform to the XML standard, which in turn makes the SimpleXML parser used in migrate_plus to fail. Otherwise the RSS's are correct.

I propose a simple modification of the SimpleXML.php to account for this types of situations, basically adding a trim to the simplexml_load_string function.

I think that is a minor code change that can make the XML parser more resilient and account for simple errors that are not in our reach.

Please find the patch attached and consider if it's useful to be added to the code.

Comments

devarch created an issue. See original summary.

devarch’s picture

StatusFileSize
new658 bytes
heddn’s picture

Status: Active » Needs work
Issue tags: +Needs tests

Could we add a test for this? NW for tests.

devarch’s picture

@heddn Should I create a test? I'm not very familiar with the Drupal testing framework, but the change proposed is really minor, adding a PHP trim to a string.

heddn’s picture

Yes, please do. Testing helps us make sure regressions don't come back into the code and mature the code base. There's already some testing on XML, so you will probably be able to extend something already there. For a quick guide, refer to https://www.drupal.org/docs/8/phpunit.

devarch’s picture

StatusFileSize
new9.48 KB

Ok, I have extended the simple_xml tests to cover for more cases. For now it covers the following cases:

  1. reduce single value (the already existing test case)
  2. 2 cases for non standard conforming xml's (the thing that we are trying to avoid in the present bug)
  3. 2 cases for broken xml's (non matching tags and tag not closed)
  4. one case when trying to parse a non xml file
  5. one case when trying to parse a non existing xml file

I have also modified a bit more the openSourceUrl method since the check for libxml errors should be done immediately after the call to simplexml_load_string. If the simplexml_load_string did not succeed, then $this->registerNamespaces($xml); would throw an error since a boolean is passed instead of an xml object. This is a bug and it happened always when reading a non existing xml or a broken one.

More, when this bug happened the migration error message (that a boolean was passed instead of an xml object) was strange and did not help to debug the actual problem (bad or non existing source xml).

devarch’s picture

StatusFileSize
new9.53 KB

I've updated the code for getting rid of coding standard errors. There are still some warnings but I don't know how to fix them.

devarch’s picture

StatusFileSize
new9.55 KB

Replaced expectException with setExpectedException. Both versions work locally, I'm not sure what's happening, but reading this issue I've done the modifications.

I hope this time works.

heddn’s picture

Issue tags: -Needs tests

Nice, really like the tests. Thanks for your contributions. Some minor feedback.

  1. +++ b/tests/src/Kernel/Plugin/migrate_plus/data_parser/SimpleXmlTest.php
    @@ -70,8 +61,123 @@ class SimpleXmlTest extends KernelTestBase {
    +    $url = $this->path . '/tests/data/simple_xml_reduce_single_value.xml';
    

    Can we use vfsStream instead of physical xml files? I think that would make for cleaner tests? We use it over in migrate_source_csv, if you are looking for an example.

  2. +++ b/tests/src/Kernel/Plugin/migrate_plus/data_parser/SimpleXmlTest.php
    @@ -70,8 +61,123 @@ class SimpleXmlTest extends KernelTestBase {
    +  public function testReadBrokenXml1() {
    ...
    +  public function testReadBrokenXml2() {
    

    How are these two differently broken? Let's be a little more descriptive, it will help our future selves so we won't have to refer to the xml files.

devarch’s picture

StatusFileSize
new9.82 KB

I've added documentation on what specific error each test is checking and I have renamed the BrokenXml1 and BrokenXml2 tests to be more self explanatory.

I don't think is useful in this case to use vfsStream, one should be able to extend and add xml tests very easy using an external XML editor, check and create XML errors or cases in it. I get that the test would be 'self contained' but the data folder already exists, there is a json test in there already, so I don't think is necessary to over-complicate something that should be straight forward.

devarch’s picture

Status: Needs work » Needs review
heddn’s picture

StatusFileSize
new7.2 KB
new10.38 KB

Nice work. Here's some more tweaks to enhance the tests.

heddn’s picture

Status: Needs review » Reviewed & tested by the community

Thanks for your contributions @devarch. This is looking good now.

  • heddn committed 589d02d on 8.x-4.x
    Issue #3046753 by devarch, heddn: Make XML parser more resilient
    
heddn’s picture

Status: Reviewed & tested by the community » Fixed

  • heddn committed c22a062 on 8.x-4.x
    Issue #3046753 by devarch, heddn: Make XML parser more resilient
    

  • heddn committed e5d638d on 8.x-4.x
    Issue #3046753 by devarch, heddn: Make XML parser more resilient
    

  • heddn committed 3c13a86 on 8.x-4.x
    Issue #3046753 by devarch, heddn: Make XML parser more resilient
    
devarch’s picture

Great, @heddn. Just have in mind that this issue is not just a feature request but also a bug fix in the openSourceUrl method as described in #6.

Committing this will resolve also bug 2974591 as I have described in comment #4 on that issue.

All in all, waiting for the 4.2 release.

heddn’s picture

Version: 8.x-4.1 » 8.x-4.x-dev

This should be in the latest release

2dareis2do’s picture

Its impressive to see how quickly this issue has been resolved and moved into the main release.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.