Drupal 8.0.5
search_api - 8.x-1.0-alpha12
search_api_solr - 8.x-1.0-alpha2
search_api_solr_attachments - 8.x-1.0-alpha2
Solr 5.5 as a service

Text searching seems to work correctly. However, once I add the "Search api attachments: File" field, indexing fails halfway through with "-338 items could not be indexed" in the UI, and the logs show:
DOMException: Invalid Character Error in DOMDocument->createElement() (line 432 of /var/www/html/mySite/vendor/symfony/serializer/Encoder/XmlEncoder.php).

The Solr, Apache, and PHP log files do not show any errors.

I'm no Drupal expert, but while doing some debugging within XmlEncoder.php's appendNode() function, I found that the value of $nodeName being passed into the function is "#text". I am guessing that the hash mark is invalid.
My question is: where is this value being determined, and how would I go about correcting it?

Any assistance greatly appreciated!

Comments

ksavoie created an issue. See original summary.

drunken monkey’s picture

Project: Search API » Search API Solr
Version: 8.x-1.0-alpha12 » 8.x-1.x-dev
Component: General code » Code

If you were able to add debugging code for figuring this out, maybe you can also get a backtrace?
However, I'm almost certain this belongs in the Solr module's issue queue, and it sounds like a problem with Solarium, the Solr library we're using.

grahl’s picture

We're also seeing this issue with alpha14/alpha3. I was not able to extract any more debugging information than what the original reporter supplied.

I switched the extraction method over to Tika (1.13) and it works fine.

mkalkbrenner’s picture

Does the error still occur?

grahl’s picture

I tried testing this again but keep running into issues when using search_api_attachments dev with the current search_api dev:
Error: Call to undefined method Drupal\search_api_solr\Plugin\search_api\backend\SearchApiSolrBackend::getSolrConnection()

Tika still works fine.

mkalkbrenner’s picture

Status: Active » Fixed
Related issues: +#2831801: Add support for extract to Connector API

I assume the issue is fixed already in dev versions. Re-open the issue if not.
Just read through #2831801: Add support for extract to Connector API

simon_h’s picture

I'm also getting this error: DOMException: Invalid Character Error in DOMDocument->createElement().
I have installed the latest version of search_api_solr and search_api_attachments.
Switching to Tika is not an option. Is anyone else still struggling with this?

mkalkbrenner’s picture

"I have installed the latest version of search_api_solr and search_api_attachments"

Latest releases or dev versions?

simon_h’s picture

I have installed the dev versions

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

ekes’s picture

Project: Search API Solr » Search API attachments

I believe this is actually a search_api_attachments issue. After it has received the response:


      // Execute the query.
      $result = $client->extract($query);
      $response = $result->getResponse();
      $json_data = $response->getBody();
      $array_data = Json::decode($json_data);
      // $array_data contains json array with two keys : [filename] that contains
      // the extracted text we need and [filename]_metadata that contains some
      // extra metadata.
      $xml_data = $array_data[$filepath];

So I'd say search_api_solr has done its job. Then search_api_attachments uses the Symfony XmlEncoder:

      $xmlencoder = new XmlEncoder();
      $dom_data = $xmlencoder->decode($xml_data, 'xml');
      // We need to get only what is in body tag.
      $dom_data = $dom_data['body'];
      $htmlencoder = $xmlencoder->encode($dom_data, 'xml');

Underlying the Symfony XmlEncoder is PHP's DOMDocument, which, when decoding, adds nodes with the name '#text'. I don't know why it does this or whether it can be avoided, but encoding the result directly breaks if you pass nodes with the name '#text'.
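A minimal sketch of the DOM-level behaviour described above, using only PHP's built-in DOM extension. The sample markup is made up for illustration, and this does not reproduce the exact path through Symfony's XmlEncoder; it only shows that DOMDocument reports text nodes under the name '#text' and refuses that same string as an element name.

```php
<?php
// Hypothetical sample input: a body with mixed text and element content.
$doc = new DOMDocument();
$doc->loadXML('<body>Hello <b>world</b></body>');

// Collect the node names DOMDocument reports for the children of <body>.
$names = [];
foreach ($doc->documentElement->childNodes as $child) {
  $names[] = $child->nodeName;
}

// Text nodes are exposed under the reserved name '#text', which is not a
// legal XML element name, so creating an element with it throws.
$error_code = NULL;
try {
  $doc->createElement('#text');
}
catch (DOMException $e) {
  $error_code = $e->getCode();
}
```

Here $names ends up containing '#text' for the leading text node, and $error_code holds DOM_INVALID_CHARACTER_ERR, which matches the "Invalid Character Error" in the log.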

ekes’s picture

Category: Support request » Bug report

ekes’s picture

Could one of the maintainers be so good as to re-open this issue? That would be better than starting a new one.

izus’s picture

Status: Closed (fixed) » Active

Thanks, ekes.
This is now reopened.
I hope a patch will be attached soon by a contributor.

ekes’s picture

As it's doing this just to get everything between <body> and </body> (in other words, to remove everything up to <body> and everything after </body>), would it not be much more efficient to just do some string replacement?

izus’s picture

Hi ekes,
I am for the "first fix it, then make it better" method ;) , so yes to a string replacement if it solves the issue. But let's have a good comment for it, and if someone has an idea for a better fix we can open a follow-up issue.

ekes’s picture

So I started writing something to extract the body and putting some examples into tests, until I realized that, for all the examples, strip_tags alone did all that was desired. The only thing that's left is the body and, if there is a title, the title.

ekes’s picture

Status: Active » Needs review

ekes’s picture

Status: Needs review » Needs work

Actually, why are we stripping the tags? Search indexing can use these. It is the body, and maybe the title from the header (perhaps things like tag metadata too, but that's another issue), that are wanted.

ekes’s picture

Status: Needs work » Needs review
FileSize
4.65 KB

I've been through the git log and I can't quite see the reason to remove tags, so I figure I'll post this. It maintains the tags and gets the body from the string. As a backup, it just strips tags if the content is not recognised for any reason.
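A rough sketch of the approach described here, not the attached patch itself: take the markup between <body> and </body> with a string match, and fall back to strip_tags when no body element is found. The function name extract_body_markup() is hypothetical and used only for illustration.

```php
<?php
/**
 * Hypothetical helper: returns the markup inside <body>...</body>.
 *
 * Falls back to plain text via strip_tags() when no body element is found.
 */
function extract_body_markup(string $xhtml): string {
  // Case-insensitive, and the "s" flag lets "." span line breaks.
  if (preg_match('/<body[^>]*>(.*)<\/body>/is', $xhtml, $matches)) {
    return trim($matches[1]);
  }
  // Unrecognised content: keep the text, drop any markup.
  return trim(strip_tags($xhtml));
}

$with_body = extract_body_markup(
  '<html><head><title>T</title></head><body><p>Hi</p></body></html>'
);
$without_body = extract_body_markup('<p>Loose</p> text');
```

Because this never round-trips the content through DOMDocument, no '#text' node names are produced, so the encoding failure cannot occur on this path.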

interX’s picture

Here's a reroll on the latest dev branch; it works on 1.0.0-beta2 as well.
The error happened for me with certain Word documents, but not all. The patch solves the error.

  • izus committed c22ce91 on 8.x-1.x authored by ekes
    Issue #2693661 by ekes, interX, mkalkbrenner, izus, grahl, ksavoie,...
izus’s picture

Status: Needs review » Fixed

#22 tested the patch and it is OK.
This is now merged and will be part of the next release.
Thank you all for your contributions: reporting, helping with clarification, coding... You are awesome contributors.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.