Indexing http://drupal.org/node/1212596/revisions/1562604/view caused our solr server to throw "Error 400 Invalid UTF-8 character 0xffff." Indexing retries after the failure, getting stuck on that document.

For Drupal.org, I just edited #1212596: schema.xml contains a funny character that causes xml parsing error when viewing the file with Solr admin UI to remove the offending character. Note that the node body is no longer truncated.

Maybe part of indexing should be stripping out bad characters.


Comments

Nick_vh’s picture

I'm not sure if I follow - what exactly did you edit? The issue that you are linking to mentions you should start Jetty with a UTF parameter. The same is valid for Tomcat: #1603122: Tomcat and utf-8

If we do have to check for more invalid characters, I suppose we should handle this in apachesolr_clean_text()? Any suggestions of what we can add to this function to make it cleaner?

/**
 * Strip html tags and also control characters that cause Jetty/Solr to fail.
 */
function apachesolr_clean_text($text) {
  // Remove invisible content.
  $text = preg_replace('@<(applet|audio|canvas|command|embed|iframe|map|menu|noembed|noframes|noscript|script|style|svg|video)[^>]*>.*</\1>@siU', ' ', $text);
  // Add spaces before stripping tags to avoid running words together.
  $text = filter_xss(str_replace(array('<', '>'), array(' <', '> '), $text), array());
  // Decode entities and then make safe any < or > characters.
  $text = htmlspecialchars(html_entity_decode($text, ENT_QUOTES, 'UTF-8'), ENT_QUOTES, 'UTF-8');
  // Remove extra spaces.
  $text = preg_replace('/\s+/s', ' ', $text);
  return $text;
}
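One possible addition (an untested sketch of my own; the function name is hypothetical and not part of the module) would be a byte-level replacement covering the UTF-8 encodings of the XML non-characters U+FFFE/U+FFFF alongside the control characters the module already strips:

```php
<?php

/**
 * Sketch only: strip bytes that are valid UTF-8 but invalid in XML 1.0.
 *
 * Hypothetical helper, not part of the apachesolr module.
 */
function apachesolr_strip_invalid_xml($text) {
  // Control characters other than tab (0x09), LF (0x0A) and CR (0x0D),
  // plus the non-characters U+FFFE/U+FFFF (UTF-8: EF BF BE / EF BF BF),
  // are replaced with a space.
  return preg_replace('@[\x00-\x08\x0B\x0C\x0E-\x1F]|\xEF\xBF[\xBE\xBF]@', ' ', $text);
}

// The byte sequence from this report: U+FFFF repeated four times.
echo apachesolr_strip_invalid_xml("_BLANK\u{FFFF}\u{FFFF}\u{FFFF}\u{FFFF}");
// Prints "_BLANK" followed by four spaces.
```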
drumm’s picture

http://drupal.org/node/1212596/revisions/view/1562604/2408728 is the specific edit to remove our bad characters. I didn't spend much time actually investigating them, just deleted and moved on with indexing.

I'll have to check with nnewton or someone else on our server configuration.

drumm’s picture

Our Jetty 6 is correctly configured for UTF8.

Nick_vh’s picture

Status: Active » Closed (cannot reproduce)

Haven't seen anyone else report this. Please reopen if you are still suffering from this problem. It could very well be some custom code that does not clean the text enough.

manarth’s picture

Version: 6.x-3.x-dev » 7.x-1.8
Issue summary: View changes
Status: Closed (cannot reproduce) » Needs work
FileSize
2.03 KB

I'm experiencing this issue with the current 7.x release (7.x-1.8).

The error message I get is:

WD Apache Solr: Indexing failed on one of the following entity ids: node/52732 HTTP 400; ParseError at [row,col]:[45,148] Message: An invalid XML character (Unicode: 0xffff) was found in the element content of the
document.: ParseError at [row,col]:[45,148] Message: An invalid XML character (Unicode: 0xffff) was found in the
element content of the document.

The issue is that the character sequence exists in the content of a node. Within the content (the body field), there is this text (this is the XML-encoded version sent to Solr):

Follow us on&amp;nbsp;&lt;/strong&gt;&lt;a href="https://www.facebook.com/foo" target="_BLANK    "&gt;facebook&lt;/a&gt;&amp;nbsp;and know more

After the text "_BLANK" is what appears at first glance to be 4 spaces. Inspecting this with a hex editor shows the sequence 42 4C 41 4E 4B (i.e. "BLANK"), followed by EF BF BF (repeated 4 times).

Screenshot showing hexeditor output highlighting invalid sequence EF BF BF, repeated 4 times.

The sequence EF BF BF is the UTF-8 encoding of the Unicode character 0xffff, which is not a valid XML character.
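This is easy to confirm (my own check, using the PHP 7+ `"\u{...}"` codepoint escape, not something from the module):

```php
<?php
// The codepoint U+FFFF encodes to exactly the bytes EF BF BF in UTF-8.
echo bin2hex("\u{FFFF}"); // efbfbf
```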

There is already a function to strip certain characters that are valid UTF-8 but invalid XML: ApacheSolrDocument::stripCtrlChars(). This could probably be refactored to handle this invalid character too.

I've experienced this issue before (with the same control character), and previously I've just edited the content to remove the invalid characters. They're certainly not an intentional part of the content (how they got entered is another question - I'm guessing some sort of combination of copy-paste and a rich-text editor), but it will be better if they're stripped automatically.

manarth’s picture

I've hacked ApacheSolrDocument::stripCtrlChars() a little, and this seems to resolve it, but it could probably be cleaned up a little, and I'm not sure whether that 3-byte sequence could end up inadvertently matching a valid sequence.

I added this code:

    $string = preg_replace('@([\xef\xbf\xbf])@', ' ', $string);

To make:

  /**
   * Replace control (non-printable) characters from string that are invalid to Solr's XML parser with a space.
   *
   * @param string $string
   * @return string
   */
  public static function stripCtrlChars($string) {
    // See:  http://w3.org/International/questions/qa-forms-utf-8.html
    // Printable utf-8 does not include any of these chars below x7F
    $string = preg_replace('@([\xef\xbf\xbf])@', ' ', $string);
    return preg_replace('@[\x00-\x08\x0B\x0C\x0E-\x1F]@', ' ', $string);
  }
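One caution about the pattern in the snippet above (reviewer note, not part of the patch): inside a character class each byte is an independent alternative, so `[\xef\xbf\xbf]` replaces every lone 0xEF or 0xBF byte - including continuation bytes of unrelated multi-byte characters such as U+FB00 (bytes EF AC 80). Matching the bytes as a sequence instead targets only U+FFFF itself:

```php
<?php
// Match the three-byte sequence EF BF BF, not a class of single bytes.
$fixed = preg_replace('@\xEF\xBF\xBF@', ' ', "a\u{FFFF}b\u{FB00}c");
echo $fixed; // "a b" followed by the intact U+FB00 ligature and "c"
```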
manarth’s picture

Version: 7.x-1.8 » 7.x-1.x-dev
Status: Needs work » Needs review
FileSize
647 bytes
varshith’s picture

I had a similar issue, but the above didn't seem to work in my case.
On looking into it, the invalid character in my case was
@([\xef][\xbf][\xbe]) while the one in the patch is @([\xef][\xbf][\xbf])
Some info about the char here (U+FFFE)

sandra.ramil’s picture

The patch in comment #7 works great for me!

Nikolay Shapovalov’s picture

Status: Needs review » Needs work

Patch #7 can be applied, and it should work, but as far as I can see it does not support all cases.
It just removes the U+FFFF (0xEF 0xBF 0xBF (efbfbf)) character, but as mentioned in #8, the character U+FFFE (0xEF 0xBF 0xBE (efbfbe)) will not be excluded.

As you can see here, there are even more characters that should not appear in XML:
https://www.w3.org/TR/xml/#charsets
The supported characters are:

Char	   ::=   	#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]	/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

One solution that can help: https://stackoverflow.com/a/66250135

$str = preg_replace(
    '/[\x00-\x08\x0B\x0C\x0E-\x1F]|\xED[\xA0-\xBF].|\xEF\xBF[\xBE\xBF]/',
    "\xEF\xBF\xBD",
    $str
);

This doesn't use the u Unicode regex modifier but works directly on UTF-8 encoded bytes for extra performance. The parts of the pattern are:

Invalid control chars: [\x00-\x08\x0B\x0C\x0E-\x1F]
UTF-16 surrogates: \xED[\xA0-\xBF].
Non-characters U+FFFE and U+FFFF: \xEF\xBF[\xBE\xBF]

But I am still not sure whether this code is 100% valid. I want to find a solution that is also covered by tests.
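As a quick sanity check of the regex above (my own wrapper and assertions, assuming PHP with UTF-8 strings; the function name is hypothetical):

```php
<?php

/**
 * Hypothetical wrapper around the regex from the Stack Overflow answer:
 * replaces XML-invalid control characters, UTF-16 surrogate byte sequences,
 * and the non-characters U+FFFE/U+FFFF with U+FFFD (REPLACEMENT CHARACTER).
 */
function strip_invalid_xml_chars($str) {
  return preg_replace(
    '/[\x00-\x08\x0B\x0C\x0E-\x1F]|\xED[\xA0-\xBF].|\xEF\xBF[\xBE\xBF]/',
    "\xEF\xBF\xBD", // U+FFFD REPLACEMENT CHARACTER
    $str
  );
}

// Both non-characters are replaced; tab and newline are kept, as XML allows them.
echo bin2hex(strip_invalid_xml_chars("\u{FFFE}\u{FFFF}\t\n"));
```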

Nikolay Shapovalov’s picture

I found a solution for search_api_solr. I didn't test it with the apachesolr module, but it should work.

Nikolay Shapovalov’s picture

Status: Needs work » Needs review
Nikolay Shapovalov’s picture

I already made some changes to patch #11. I will not re-upload the patch until the changes are committed in search_api_solr.
You can find the latest version of the patch in this issue: https://www.drupal.org/project/search_api_solr/issues/3256878