I have been trying to get the Search API / SOLR to use TIKA for attachments.
I am finding the following error in my watchdog logs:
Notice: iconv(): Detected an illegal character in input string in SearchApiAttachmentsAlterSettings->extract_simple() (line 146 of /ssd/www/drupal/sites/all/modules/search_api_attachments/includes/callback_attachments_settings.inc).
I am running the latest of:
Drupal 7.26
Search API
Search API SOLR
Search API Attachments with Fields module
TIKA 1.5
I have a FILE field (multiple) on a bundle.
I have tried both the remote and local implementation of Search API / SOLR / TIKA.
If I create a VERY SMALL 3-line text file, it works; however, if I upload a PDF or some other "document", it fails.
Hopefully someone else has already solved this problem. Thank you,
Respectfully,
Patrick O'Leary.
Comments
Comment #1
izus commented
Comment #2
izus commented
hi,
tested it after the issue mentioned in #1 was merged but can't reproduce it.
Please feel free to reopen it if there are more details on how to reproduce it with the latest code base
Comment #3
spadxiii commented
I just ran into this issue myself as well. If I upload a txt file which is not encoded in UTF-8, the notice is thrown.
This is because the method 'extract_simple' tries to convert the txt file from UTF-8 to UTF8//IGNORE. But the file is not UTF-8 to start with. In my case it was ISO-8859-14.
Comment #4
izus commented
so to handle different types we probably need to detect file encoding and decide what to do with it
can you please provide a patch if you fixed this locally or can we discuss another solution for this ?
thanks
Comment #5
spadxiii commented
I haven't fixed it locally yet. The uploaded files were too large to parse anyway. :)
The little bit of code I put in my previous comment was how I temporarily 'fixed' the error. Changing the encoding of the text twice doesn't seem like a good idea though. It might be better to check the encoding and only switch once.
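The "check the encoding and only switch once" idea can be sketched on the command line with iconv. This is only an illustration, not the module's code: to_utf8 is a hypothetical helper name, and the fallback encoding (ISO-8859-14 here, as in #3) is an assumption, since the real source encoding of an arbitrary file cannot be detected reliably.

```shell
# Hypothetical helper: print FILE as UTF-8, converting at most once.
# The second argument is an ASSUMED source encoding for non-UTF-8 input.
to_utf8() {
  file=$1
  fallback=${2:-ISO-8859-14}
  # A UTF-8 to UTF-8 pass fails if the input is not valid UTF-8.
  if iconv -f UTF-8 -t UTF-8 "$file" >/dev/null 2>&1; then
    # Already valid UTF-8: pass it through untouched.
    cat "$file"
  else
    # Not UTF-8: convert a single time from the assumed encoding.
    iconv -f "$fallback" -t UTF-8 "$file"
  fi
}
```

Usage would be something like: to_utf8 README_ISO.txt > README_UTF8.txt. A plain ASCII file is already valid UTF-8, so it takes the first branch and is passed through unchanged.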
Comment #6
grimreaper commented
Hello,
Could you upload a lightweight version of your encoded file, please?
I used the following command to build a file encoded like yours, and I didn't get the notice.
iconv -f UTF-8 -t ISO-8859-14 README.txt > README_ISO.txt
Comment #7
izus commented
hi,
i couldn't reproduce the issue here.
Also what was said in #3 is not correct as we are not assuming original string to be UTF8
is converting to UTF8
http://php.net/manual/fr/function.mb-convert-encoding.php
closing the issue for the moment as nobody could reproduce it, and no one uploaded a test file that is causing the issue.
Please feel free to reopen it if the issue persists for you. We will try to look at it again.
Comment #8
spadxiii commented
@izus: the method extract_simple does not use the mb_convert_encoding-call.
I'll have to dig some to get the file that was throwing errors here, but when I find it, I'll re-open this issue if it's still not working then.