I have been trying to get the Search API / SOLR to use TIKA for attachments.

I am finding the following error in my watchdog logs:

Notice: iconv(): Detected an illegal character in input string in SearchApiAttachmentsAlterSettings->extract_simple() (line 146 of /ssd/www/drupal/sites/all/modules/search_api_attachments/includes/callback_attachments_settings.inc).

I am running the latest of:
Drupal 7.26
Search API
Search API SOLR
Search API Attachments with Fields module
TIKA 1.5

I have a FILE field (multiple) on a bundle.

I have tried both the remote and local implementation of Search API / SOLR / TIKA.

If I create a VERY SMALL 3-line text file, it works. However, if I upload a PDF or some other "document", it fails.

Hopefully someone else has already solved this problem. Thank you,

Respectfully,

Patrick O'Leary.

Comments

izus’s picture

izus’s picture

Status: Active » Closed (cannot reproduce)

hi,
I tested it after the issue mentioned in #1 was merged, but can't reproduce it.
Please feel free to reopen it if there are more details on how to reproduce it with the latest code base.

spadxiii’s picture

Version: 7.x-1.3 » 7.x-1.6
Priority: Major » Normal
Status: Closed (cannot reproduce) » Active

I just ran into this issue myself as well. If I upload a txt file which is not encoded in UTF-8, the notice is thrown.

This is because the method 'extract_simple' tries to convert the txt file from UTF-8 to UTF-8//IGNORE. But the file is not UTF-8 to start with. In my case it was ISO-8859-14.

protected function extract_simple($file) {
  // ...
  $text = mb_convert_encoding($text, "UTF-8");
  $text = iconv("UTF-8", "UTF-8//IGNORE", $text);
  // ...
}
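
To illustrate why such a file is rejected as UTF-8, here is a minimal sketch. It assumes the mbstring extension is available, and the sample strings are hypothetical stand-ins for the file contents, not taken from the actual upload. In ISO-8859-14, "é" is the single byte 0xE9, which is not a complete UTF-8 sequence on its own.

```php
<?php
// Hypothetical sample: "café" as stored in a single-byte ISO-8859 encoding
// (0xE9 is "é" in ISO-8859-14 as well as ISO-8859-1).
$latin = "caf\xE9";

// The same word encoded as UTF-8, where "é" is the two bytes 0xC3 0xA9.
$utf8 = "caf\xC3\xA9";

// mb_check_encoding() reports whether the bytes are valid in the given encoding.
var_dump(mb_check_encoding($latin, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));  // bool(true)
```

Passing bytes like `$latin` to iconv() with "UTF-8" as the source encoding is the kind of input that can produce the "illegal character" notice.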

izus’s picture

so to handle different types we probably need to detect file encoding and decide what to do with it

mb_check_encoding

Can you please provide a patch if you fixed this locally, or can we discuss another solution for this?

thanks

spadxiii’s picture

I haven't fixed it locally yet. The uploaded files were too large to parse anyway. :)

The little bit of code I put in my previous comment was how I temporarily 'fixed' the error. Changing the encoding of the text twice doesn't seem like a good idea though. It might be better to check the encoding and only switch once.
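
A minimal sketch of the "check first, convert once" idea follows. The function name `example_text_to_utf8` and its `$fallback_encoding` parameter are hypothetical and not part of search_api_attachments; the sketch assumes the mbstring extension and that the source encoding of non-UTF-8 files is known or configured, since automatic detection between single-byte encodings is unreliable.

```php
<?php
/**
 * Hypothetical helper: returns $text as UTF-8, converting at most once.
 *
 * $fallback_encoding is an assumption: the encoding to convert from
 * when the input is not already valid UTF-8.
 */
function example_text_to_utf8($text, $fallback_encoding = 'ISO-8859-1') {
  // Already valid UTF-8: leave the bytes untouched.
  if (mb_check_encoding($text, 'UTF-8')) {
    return $text;
  }
  // Otherwise convert exactly once, with an explicit source encoding.
  return mb_convert_encoding($text, 'UTF-8', $fallback_encoding);
}

// Usage: "café" in ISO-8859-14 becomes valid UTF-8.
$converted = example_text_to_utf8("caf\xE9", 'ISO-8859-14');
var_dump(mb_check_encoding($converted, 'UTF-8')); // bool(true)
```

Text that is already UTF-8 passes through unchanged, so nothing is converted twice.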

grimreaper’s picture

Status: Active » Postponed (maintainer needs more info)

Hello,

Could you upload a lightweight version of your encoded file please?

I used the following command to build a file encoded like yours, and I didn't get the notice.

iconv -f UTF-8 -t ISO-8859-14 README.txt > README_ISO.txt
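
One possible reason a converted README does not trigger the notice, assuming it contains only ASCII characters (the thread does not confirm this): ASCII text is byte-for-byte identical in UTF-8 and ISO-8859-14, so the converted file is still valid UTF-8. A small sketch with a hypothetical sample string, assuming the mbstring extension:

```php
<?php
// Hypothetical README content containing only ASCII characters.
$ascii = "Plain ASCII readme text\n";

// Equivalent of the iconv command above, done in PHP.
$as_iso = mb_convert_encoding($ascii, 'ISO-8859-14', 'UTF-8');

// The bytes do not change, and they are still valid UTF-8.
var_dump($as_iso === $ascii);                  // bool(true)
var_dump(mb_check_encoding($as_iso, 'UTF-8')); // bool(true)
```

A test file would need at least one non-ASCII character, such as an accented letter, for the two encodings to differ.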

izus’s picture

Status: Postponed (maintainer needs more info) » Closed (cannot reproduce)

hi,
I couldn't reproduce the issue here.
Also, what was said in #3 is not correct, as we are not assuming the original string to be UTF-8:

  $text = mb_convert_encoding($text, "UTF-8");

converts the text to UTF-8.

http://php.net/manual/fr/function.mb-convert-encoding.php

closing the issue for the moment as nobody could reproduce it, and no one uploaded a test file that is causing the issue.

Please feel free to reopen it if the issue persists for you. We will try to look at it again.

spadxiii’s picture

@izus : the method extract_simple does not use the mb_convert_encoding-call.

protected function extract_simple($file) {
  $text = file_get_contents($this->get_realpath($file));
  $text = iconv("UTF-8", "UTF-8//IGNORE", $text);
  // ...
}

I'll have to dig some to get the file that was throwing errors here, but when I find it, I'll re-open this issue if it's still not working then.