Indexing http://drupal.org/node/1212596/revisions/1562604/view caused our Solr server to throw "Error 400 Invalid UTF-8 character 0xffff." Indexing retries after the failure and gets stuck on that document.
For Drupal.org, I just edited #1212596 ("schema.xml contains a funny character that causes xml parsing error when viewing the file with Solr admin UI") to remove the offending character. Notice the node body is no longer truncated.
Maybe part of indexing should be stripping out bad characters.
Comment | File | Size | Author |
---|---|---|---|
#11 | apachesolr-1819312-11-strip_invalid_characters.patch | 1.71 KB | Nikolay Shapovalov |
#7 | apachesolr-1819312-strip_invalid_characters.patch | 647 bytes | manarth |
#5 | 1819312-hexeditor-output.png | 2.03 KB | manarth |
Comments
Comment #1
Nick_vh
I'm not sure I follow. What exactly did you edit? The issue you link to says you should start Jetty with a UTF-8 parameter. The same applies to Tomcat; see #1603122: Tomcat and utf-8.
If we do have to check for more invalid characters, I suppose we should handle this in apachesolr_clean_text()? Any suggestions for what we could add to this function so that it cleans more thoroughly?
Comment #2
drumm
http://drupal.org/node/1212596/revisions/view/1562604/2408728 is the specific edit to remove our bad characters. I didn't spend much time actually investigating them, just deleted and moved on with indexing.
I'll have to check with nnewton or someone else on our server configuration.
Comment #3
drumm
Our Jetty 6 is correctly configured for UTF-8.
Comment #4
Nick_vh
I haven't seen anyone else report this. Please reopen if you are still suffering from this problem. It could very well be some custom code that does not clean its text thoroughly enough.
Comment #5
manarth
I'm experiencing this issue with the current 7.x release (7.x-1.8).
The error message I get is:
The issue is that the character sequence exists in the content of a node. Within the content (the body field), there is this text (this is the XML-encoded version sent to Solr):
After the text "_BLANK is what appears at first glance to be 4 spaces. Inspecting this with a hex editor shows the sequence 42 4C 41 4E 4B (i.e. "BLANK"), followed by EF BF BF (repeated 4 times).
The sequence EF BF BF is the UTF-8 encoding of the Unicode character U+FFFF (0xffff), which is not a valid XML character.
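The byte-to-character mapping described here is easy to check. Below is a minimal sketch in Python (used only for illustration; the module itself is PHP). The sample bytes are a hypothetical reconstruction of the hex-editor output, not the actual node content.

```python
# Hypothetical sample: "BLANK" followed by EF BF BF repeated four times,
# mirroring what the hex editor showed.
raw = b"BLANK" + b"\xef\xbf\xbf" * 4

# The bytes are well-formed UTF-8, so decoding succeeds without error.
text = raw.decode("utf-8")

# Everything after the first five characters ("BLANK") is the suspect run.
code_points = [f"U+{ord(ch):04X}" for ch in text[5:]]
print(code_points)

# Round trip: U+FFFF encodes back to exactly the three bytes EF BF BF.
round_trip = "\uffff".encode("utf-8")
print(round_trip.hex(" "))
```

Each three-byte group decodes to a single U+FFFF, which is why the text looks like four blank characters rather than twelve.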
There is already a function to strip certain characters that are valid UTF-8 but invalid XML: ApacheSolrDocument::stripCtrlChars(). This could probably be refactored to handle this invalid character too.
I've experienced this issue before (with the same control character), and previously I've just edited the content to remove the invalid characters. They're certainly not an intentional part of the content (how they got entered is another question - I'm guessing some sort of combination of copy-paste and a rich-text editor), but it will be better if they're stripped automatically.
Comment #6
manarth
I've hacked ApacheSolrDocument::stripCtrlChars() a little, and this seems to resolve it, but it could probably be cleaned up, and I'm not sure whether that 3-byte sequence could end up inadvertently matching a valid sequence.
I added this code:
To make:
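On the worry that the 3-byte sequence might match part of some other valid text: in well-formed UTF-8, continuation bytes always fall in the range 0x80 to 0xBF, so 0xEF can only ever be the first byte of a 3-byte character. The sketch below is Python used purely for illustration. It is not the code added to stripCtrlChars(), and all names in it are hypothetical. It brute-forces every encodable code point to confirm this.

```python
import sys

TARGET = b"\xef\xbf\xbf"  # the byte sequence being stripped

ef_as_non_lead = []     # code points where 0xEF appears after the first byte
encodes_to_target = []  # code points whose whole encoding is EF BF BF

for cp in range(sys.maxunicode + 1):
    if 0xD800 <= cp <= 0xDFFF:
        continue  # surrogates cannot be encoded as UTF-8
    encoded = chr(cp).encode("utf-8")
    if 0xEF in encoded[1:]:
        ef_as_non_lead.append(cp)
    if encoded == TARGET:
        encodes_to_target.append(cp)

# Neighbouring characters U+FFFD and U+FFFE also start with EF BF,
# but placing them side by side never produces EF BF BF.
neighbours = "\ufffd\ufffe\ufffd".encode("utf-8")
print(TARGET in neighbours)
print(ef_as_non_lead, [hex(cp) for cp in encodes_to_target])
```

If the input is valid UTF-8, a byte-level match on EF BF BF therefore always corresponds to exactly one U+FFFF character. That guarantee does not hold for input that is not valid UTF-8 to begin with.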
Comment #7
manarth
Comment #8
varshith
I had a similar issue, but the patch above didn't seem to work in my case.
On looking into it, the invalid character in my case was @([\xef][\xbf][\xbe]), while the one in the patch is @([\xef][\xbf][\xbf]).
Some info about the char here (U+FFFE)
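To illustrate why a pattern for one sequence misses the other, here is a small Python sketch (illustration only; the module is PHP). The two patterns are reconstructions based on the byte fragments quoted in this comment, not the actual patch, and the sample text is made up.

```python
import re

# Matches only U+FFFF (EF BF BF), as the fragment attributed to patch #7 does.
ONLY_FFFF = re.compile(rb"\xef\xbf\xbf")

# Hypothetical widened pattern: the last byte may be BE (U+FFFE) or BF (U+FFFF).
FFFE_OR_FFFF = re.compile(rb"\xef\xbf[\xbe\xbf]")

# Made-up sample containing one of each noncharacter.
sample = "a\ufffeb\uffffc".encode("utf-8")

after_narrow = ONLY_FFFF.sub(b"", sample)    # U+FFFE survives
after_wide = FFFE_OR_FFFF.sub(b"", sample)   # both are removed

print(after_narrow)
print(after_wide)
```

The narrow pattern leaves the EF BF BE bytes in place, which matches the behaviour reported here. The widened pattern handles both, but still only covers these two code points.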
Comment #9
sandra.ramil
The patch in comment #7 works great for me!
Comment #10
Nikolay Shapovalov
Patch #7 applies and should work, but as far as I can see it does not cover all cases.
It only removes the U+FFFF character (0xEF 0xBF 0xBF), so, as mentioned in #8, the U+FFFE character (0xEF 0xBF 0xBE) will not be stripped.
As you can see at https://www.w3.org/TR/xml/#charsets, there are even more characters that should not appear in XML. That section defines the supported characters.
One solution that may help: https://stackoverflow.com/a/66250135
But I'm still not sure whether this code is 100% valid. I'd like to find a solution that is also covered by tests.
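For reference, the XML 1.0 specification linked above defines a legal character as #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. A general filter can remove everything outside that set instead of listing bad characters one by one. The following is a minimal Python sketch with a few assertions as a stand-in for tests. It is not the Stack Overflow answer's code and not the search_api_solr patch, and the function name is hypothetical.

```python
import re

# Complement of the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Remove every character that XML 1.0 does not allow (hypothetical helper)."""
    return _INVALID_XML10.sub("", text)


# Both noncharacters from this thread are removed, along with NUL and
# vertical tab; tab, newline and ordinary text are kept.
cleaned = strip_invalid_xml_chars("BLANK\uffff\ufffe\x00\x0b ok\t\n")
print(repr(cleaned))

# Legitimate non-ASCII text, including characters above U+FFFF, is untouched.
untouched = strip_invalid_xml_chars("caf\u00e9 \U0001F600")
print(untouched == "caf\u00e9 \U0001F600")
```

Working on decoded characters rather than raw bytes keeps the ranges identical to the ones in the specification, at the cost of requiring the input to be valid UTF-8 first.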
Comment #11
Nikolay Shapovalov
I found a solution for search_api_solr. I haven't tested it with the apachesolr module, but it should work.
Comment #12
Nikolay Shapovalov
Comment #13
Nikolay Shapovalov
I have already made some changes to patch #11. I will not re-upload the patch until the changes are committed in search_api_solr.
You can find the latest version of the patch in this issue: https://www.drupal.org/project/search_api_solr/issues/3256878