Drupal 8.0.5
search_api - 8.x-1.0-alpha12
search_api_solr - 8.x-1.0-alpha2
search_api_solr_attachments - 8.x-1.0-alpha2
Solr 5.5 as a service
Text searching seems to operate correctly. However, once I add the "Search api attachments: File" field, indexing fails halfway through with "-338 items could not be indexed" in the UI and the following in the logs:
DOMException: Invalid Character Error in DOMDocument->createElement() (line 432 of /var/www/html/mySite/vendor/symfony/serializer/Encoder/XmlEncoder.php).
Solr, apache, php log files do not show any errors.
I'm no Drupal expert, but while debugging within XmlEncoder.php's appendNode() function, I found that the value of $nodeName being passed into the function is "#text". I am guessing that the hash mark is invalid.
My question is: where is this value being determined, and how would I go about correcting it?
Any assistance greatly appreciated!
Comment | File | Size | Author |
---|---|---|---|
#21 | 693661-21-encode-decode-invalid-exception.patch | 4.6 KB | interX |
#20 | 2693661-20-encode-decode-invalid-exception.patch | 4.65 KB | ekes |
#17 | 2693661-17-encode-decode-invalid-exception.patch | 914 bytes | ekes |
Comments
Comment #2
drunken monkey: If you were able to add debugging code for figuring this out, maybe you can also get a backtrace?
However, I'm almost certain this belongs in the Solr module's issue queue, and it sounds like a problem with Solarium, the Solr library we're using.
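One way to capture the backtrace requested above is to temporarily wrap the failing call in a try/catch and log the exception's trace. This is a minimal sketch in plain PHP: `encode_node()` is a hypothetical stand-in for the real call site, not a function from Symfony or any of the modules involved.

```php
<?php
// Hypothetical stand-in for the code path that ends in createElement().
function encode_node(string $nodeName): DOMElement {
    $doc = new DOMDocument();
    // Throws DOMException when $nodeName is not a valid XML element name.
    return $doc->createElement($nodeName);
}

$trace = null;
try {
    encode_node('#text');
} catch (DOMException $e) {
    // The trace lists each frame that led to the exception.
    $trace = $e->getTraceAsString();
    error_log($e->getMessage() . "\n" . $trace);
}
```

The logged trace shows which caller supplied the invalid node name, which helps decide which module's issue queue the report belongs in.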
Comment #3
grahl: We're also seeing this issue with alpha14/alpha3. I was not able to extract any more debugging information than what the original reporter supplied.
I switched over to extraction method Tika with 1.13 and it works fine.
Comment #4
mkalkbrenner: Does the error still occur?
Comment #5
grahl: I tried testing this again but keep running into issues when using search_api_attachments dev with the current search_api dev:
Error: Call to undefined method Drupal\search_api_solr\Plugin\search_api\backend\SearchApiSolrBackend::getSolrConnection()
Tika still works fine.
Comment #6
mkalkbrenner: I assume the issue is fixed already in dev versions. Re-open the issue if not.
Just read through #2831801: Add support for extract to Connector API
Comment #7
simon_h: I'm also getting this error: DOMException: Invalid Character Error in DOMDocument->createElement().
I have installed the latest version of search_api_solr and search_api_attachments.
Switching to Tika is not an option. Is anyone else still struggling with this?
Comment #8
mkalkbrenner: Latest releases or dev versions?
Comment #9
simon_h: I have installed the dev versions.
Comment #11
ekes: I believe this is actually search_api_attachments. After it has got the response:
So I'd say search_api_solr has done its job. Then search_api_attachments uses the Symfony XmlEncoder:
Underlying the Symfony XmlEncoder is PHP's DOMDocument, which, when decoding, adds nodes with the name '#text'. Why it does this, and whether it can be avoided, I don't know. But directly encoding the result breaks if you pass nodes with the name '#text'.
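The behaviour described above can be reproduced with PHP's DOM extension alone. In the DOM, text nodes report "#text" as their nodeName, and "#" is not allowed in an XML element name, so using that name to create an element fails. The sketch below uses a made-up XML fragment and does not go through the Symfony XmlEncoder or either module.

```php
<?php
// Parse a small, made-up fragment, as a DOM-based decoder would.
$doc = new DOMDocument();
$doc->loadXML('<body>Hello <b>world</b></body>');

// Collect the node names of the root element's children.
// The leading text "Hello " is a text node named "#text".
$childNames = [];
foreach ($doc->documentElement->childNodes as $child) {
    $childNames[] = $child->nodeName;
}

// Re-using "#text" as an element name raises a DOMException.
$out = new DOMDocument();
$error = null;
try {
    $out->createElement('#text', 'Hello ');
} catch (DOMException $e) {
    $error = $e;
}

echo implode(', ', $childNames), "\n";
echo $error ? $error->getMessage() : 'no error', "\n";
```

A decoded structure that keeps '#text' keys therefore cannot be fed back to an encoder that turns every key into an element without first handling those keys.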
Comment #13
ekes: Could one of the maintainers please re-open this issue? That would be better than starting a new one.
Comment #14
izus: Thanks, ekes.
This is now reopened.
I hope a contributor will attach a patch soon.
Comment #15
ekes: As it's doing this just to get everything between <body> and </body> (or, put another way, to remove everything up to <body> and everything after </body>), would it not be much more efficient to just do some string replacement?
Comment #16
izus: Hi ekes,
I am for the "first fix it, then make it better" method ;), so yes to a string replacement if it solves the issue. But let's have a good comment for it, and if someone has an idea for a better fix, we can open a follow-up issue.
Comment #17
ekes: So I started writing something to extract the <body> and putting some examples into tests, until I realized that, for all the examples, strip_tags alone did all that was desired. The only thing that's left is the body and, if there is a title, the title.
Comment #19
ekes: Actually, why are we stripping the tags? Search indexing can use these. What is wanted is the body, and maybe the title from the header (perhaps also things like tag metadata, but that's another issue).
Comment #20
ekes: I've been through the git log and I can't quite see the reason to remove the tags, so I figure I'll post this. It maintains the tags and gets the body from the string, with a fallback that just strips tags if the content is unrecognised for any reason.
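The approach described above (take the body out of the response string, keep its inner tags, and fall back to strip_tags for unrecognised content) could look roughly like the sketch below. It is not the code from the attached patch: the function name `extract_body_markup()` and the sample strings are hypothetical.

```php
<?php
/**
 * Hypothetical helper: returns the markup inside <body>...</body>,
 * or the tag-stripped text when no body element is found.
 */
function extract_body_markup(string $response): string {
    // "i" ignores case; "s" lets "." match newlines in multi-line bodies.
    if (preg_match('~<body[^>]*>(.*)</body>~is', $response, $matches)) {
        return trim($matches[1]);
    }
    // Fallback for unrecognised content: drop all tags.
    return trim(strip_tags($response));
}

$xhtml = '<html><head><title>Doc</title></head>'
    . '<body><p>Hello <b>world</b></p></body></html>';

$body = extract_body_markup($xhtml);
$fallback = extract_body_markup('<div>No body element here</div>');
```

Because this works on the string directly, no DOM is built and no '#text' node names can reach an encoder.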
Comment #21
interX: Here's a reroll against the latest dev branch; it works on 1.0.0-beta2 as well.
The error happened for me with certain Word documents, but not all. The patch solves it.
Comment #22
interX: Here's a reroll against the latest dev branch; it works on 1.0.0-beta2 as well.
The error happened for me with certain Word documents, but not all. The patch solves it.
Comment #24
izus: #22 tested the patch and it is OK.
This is now merged and will be part of the next release.
Thank you all for your contributions: reporting, helping with clarification, coding... You are awesome contributors.