Indexing http://drupal.org/node/1212596/revisions/1562604/view caused our solr server to throw "Error 400 Invalid UTF-8 character 0xffff." Indexing retries after the failure, getting stuck on that document.

For Drupal.org, I just edited #1212596: schema.xml contains a funny character that causes xml parsing error when viewing the file with Solr admin UI to remove the offending character. Note that the node body is no longer truncated.

Maybe part of indexing should be stripping out bad characters.


Comments

Nick_vh’s picture

I'm not sure if I follow - what exactly did you edit? The issue that you are linking to mentions you should start Jetty with a UTF parameter. The same is valid for Tomcat: #1603122: Tomcat and utf-8

If we do have to check for more invalid characters, I suppose we should handle this in apachesolr_clean_text()? Any suggestions of what we can add to this function to make it cleaner?

/**
 * Strip html tags and also control characters that cause Jetty/Solr to fail.
 */
function apachesolr_clean_text($text) {
  // Remove invisible content.
  $text = preg_replace('@<(applet|audio|canvas|command|embed|iframe|map|menu|noembed|noframes|noscript|script|style|svg|video)[^>]*>.*</\1>@siU', ' ', $text);
  // Add spaces before stripping tags to avoid running words together.
  $text = filter_xss(str_replace(array('<', '>'), array(' <', '> '), $text), array());
  // Decode entities and then make safe any < or > characters.
  $text = htmlspecialchars(html_entity_decode($text, ENT_QUOTES, 'UTF-8'), ENT_QUOTES, 'UTF-8');
  // Remove extra spaces.
  $text = preg_replace('/\s+/s', ' ', $text);
  return $text;
}
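One possible addition (an untested sketch of my own; the function name is hypothetical and not part of the module) would be a byte-level replacement covering the UTF-8 encodings of the XML non-characters U+FFFE/U+FFFF alongside the control characters the module already strips:

```php
<?php

/**
 * Sketch only: strip bytes that are valid UTF-8 but invalid in XML 1.0.
 *
 * Hypothetical helper, not part of the apachesolr module.
 */
function apachesolr_strip_invalid_xml($text) {
  // Control characters other than tab (0x09), LF (0x0A) and CR (0x0D),
  // plus the non-characters U+FFFE/U+FFFF (UTF-8: EF BF BE / EF BF BF),
  // are replaced with a space.
  return preg_replace('@[\x00-\x08\x0B\x0C\x0E-\x1F]|\xEF\xBF[\xBE\xBF]@', ' ', $text);
}

// The byte sequence from this report: U+FFFF repeated four times.
echo apachesolr_strip_invalid_xml("_BLANK\u{FFFF}\u{FFFF}\u{FFFF}\u{FFFF}");
// Prints "_BLANK" followed by four spaces.
```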
drumm’s picture

http://drupal.org/node/1212596/revisions/view/1562604/2408728 is the specific edit to remove our bad characters. I didn't spend much time actually investigating them, just deleted and moved on with indexing.

I'll have to check with nnewton or someone else on our server configuration.

drumm’s picture

Our Jetty 6 is correctly configured for UTF8.

Nick_vh’s picture

Status: Active » Closed (cannot reproduce)

Haven't seen anyone else report this. Please reopen if you are still suffering from this problem. It could very well be some custom code that does not clean the text enough.

manarth’s picture

Version: 6.x-3.x-dev » 7.x-1.8
Issue summary: View changes
Status: Closed (cannot reproduce) » Needs work
FileSize
2.03 KB

I'm experiencing this issue with the current 7.x release (7.x-1.8).

The error message I get is:

WD Apache Solr: Indexing failed on one of the following entity ids: node/52732 HTTP 400; ParseError at [row,col]:[45,148] Message: An invalid XML character (Unicode: 0xffff) was found in the element content of the
document.: ParseError at [row,col]:[45,148] Message: An invalid XML character (Unicode: 0xffff) was found in the
element content of the document.

The issue is that the character sequence exists in the content of a node. Within the content (the body field), there is this text (this is the XML-encoded version sent to Solr):

Follow us on&amp;nbsp;&lt;/strong&gt;&lt;a href="https://www.facebook.com/foo" target="_BLANK    "&gt;facebook&lt;/a&gt;&amp;nbsp;and know more

After the text "_BLANK" is what appears at first glance to be 4 spaces. Inspecting this with a hex editor shows the sequence 42 4C 41 4E 4B (i.e. "BLANK"), followed by EF BF BF (repeated 4 times).

Screenshot showing hexeditor output highlighting invalid sequence EF BF BF, repeated 4 times.

The sequence EF BF BF is the UTF-8 encoding of the Unicode character 0xffff, which is not a valid XML character.
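This is easy to confirm (my own check, using the PHP 7+ `"\u{...}"` codepoint escape, not something from the module):

```php
<?php
// The codepoint U+FFFF encodes to exactly the bytes EF BF BF in UTF-8.
echo bin2hex("\u{FFFF}"); // efbfbf
```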

There is already a function to strip certain characters that are valid UTF-8 but invalid XML: ApacheSolrDocument::stripCtrlChars(). This could probably be refactored to handle this invalid character too.

I've experienced this issue before (with the same control character), and previously I've just edited the content to remove the invalid characters. They're certainly not an intentional part of the content (how they got entered is another question - I'm guessing some sort of combination of copy-paste and a rich-text editor), but it will be better if they're stripped automatically.

manarth’s picture

I've hacked ApacheSolrDocument::stripCtrlChars() a little, and this seems to resolve it, but it could probably be cleaned up a little, and I'm not sure whether that 3-byte sequence could end up inadvertently matching a valid sequence.

I added this code:

    $string = preg_replace('@([\xef\xbf\xbf])@', ' ', $string);

To make:

  /**
   * Replace control (non-printable) characters from string that are invalid to Solr's XML parser with a space.
   *
   * @param string $string
   * @return string
   */
  public static function stripCtrlChars($string) {
    // See:  http://w3.org/International/questions/qa-forms-utf-8.html
    // Printable utf-8 does not include any of these chars below x7F
    $string = preg_replace('@([\xef\xbf\xbf])@', ' ', $string);
    return preg_replace('@[\x00-\x08\x0B\x0C\x0E-\x1F]@', ' ', $string);
  }
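One caution about the pattern in the snippet above (reviewer note, not part of the patch): inside a character class each byte is an independent alternative, so `[\xef\xbf\xbf]` replaces every lone 0xEF or 0xBF byte - including continuation bytes of unrelated multi-byte characters such as U+FB00 (bytes EF AC 80). Matching the bytes as a sequence instead targets only U+FFFF itself:

```php
<?php
// Match the three-byte sequence EF BF BF, not a class of single bytes.
$fixed = preg_replace('@\xEF\xBF\xBF@', ' ', "a\u{FFFF}b\u{FB00}c");
echo $fixed; // "a b" followed by the intact U+FB00 ligature and "c"
```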
manarth’s picture

Version: 7.x-1.8 » 7.x-1.x-dev
Status: Needs work » Needs review
FileSize
647 bytes
varshith’s picture

I had a similar issue, but the above didn't seem to work in my case.
On looking into it, the invalid character in my case was
@([\xef][\xbf][\xbe]) while the one in the patch is @([\xef][\xbf][\xbf])
Some info about the char here (U+FFFE)

sandra.ramil’s picture

The patch in comment #7 works great for me!

Nikolay Shapovalov’s picture

Status: Needs review » Needs work

Patch #7 can be applied, and it should work, but as far as I can see it does not support all cases.
It just removes the U+FFFF (0xEF 0xBF 0xBF (efbfbf)) character, but as mentioned in #8, the character U+FFFE (0xEF 0xBF 0xBE (efbfbe)) will not be excluded.

As you can see here, there are even more characters that should not appear in XML:
https://www.w3.org/TR/xml/#charsets
The supported characters are:

Char	   ::=   	#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]	/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

One solution that can help: https://stackoverflow.com/a/66250135

$str = preg_replace(
    '/[\x00-\x08\x0B\x0C\x0E-\x1F]|\xED[\xA0-\xBF].|\xEF\xBF[\xBE\xBF]/',
    "\xEF\xBF\xBD",
    $str
);

This doesn't use the u Unicode regex modifier but works directly on UTF-8 encoded bytes for extra performance. The parts of the pattern are:

Invalid control chars: [\x00-\x08\x0B\x0C\x0E-\x1F]
UTF-16 surrogates: \xED[\xA0-\xBF].
Non-characters U+FFFE and U+FFFF: \xEF\xBF[\xBE\xBF]

But I am still not sure whether this code is 100% valid. I want to find a solution that is also covered by tests.
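As a quick sanity check of the regex above (my own wrapper and assertions, assuming PHP with UTF-8 strings; the function name is hypothetical):

```php
<?php

/**
 * Hypothetical wrapper around the regex from the Stack Overflow answer:
 * replaces XML-invalid control characters, UTF-16 surrogate byte sequences,
 * and the non-characters U+FFFE/U+FFFF with U+FFFD (REPLACEMENT CHARACTER).
 */
function strip_invalid_xml_chars($str) {
  return preg_replace(
    '/[\x00-\x08\x0B\x0C\x0E-\x1F]|\xED[\xA0-\xBF].|\xEF\xBF[\xBE\xBF]/',
    "\xEF\xBF\xBD", // U+FFFD REPLACEMENT CHARACTER
    $str
  );
}

// Both non-characters are replaced; tab and newline are kept, as XML allows them.
echo bin2hex(strip_invalid_xml_chars("\u{FFFE}\u{FFFF}\t\n"));
```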

Nikolay Shapovalov’s picture

I found a solution for search_api_solr. I didn't test it with the apachesolr module, but it should work.

Nikolay Shapovalov’s picture

Status: Needs work » Needs review
Nikolay Shapovalov’s picture

I already made some changes to patch #11. I will not re-upload the patch until the changes are committed in search_api_solr.
You can find the latest version of the patch in this issue: https://www.drupal.org/project/search_api_solr/issues/3256878