I have been processing web pages from RSS sources without problems.
Usually, those pages have set the charset in a meta tag like this:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
But on other pages there is no meta tag declaring the charset; instead, the charset is declared in the HTTP response header:
I downloaded an article from one of those sites to check the problem.
Then I moved it to localhost and I imported it using the XPath HTML parser.
The word Oñati is not shown correctly:
Then I modified the HTML to add the meta tag with the charset; after importing, the word was shown correctly:
Does Drupal ignore the HTTP header charset? Internet browsers don't.
Thank you all.
| Comment | File | Size | Author |
|---|---|---|---|
| #15 | extensible_parser_lang.png | 10.51 KB | Gotz84 |
| #15 | previous_parser_lang.png | 10.39 KB | Gotz84 |
| #13 | feed_modules_list.png | 34.64 KB | Gotz84 |
| #12 | 2832484-html-malformed.png | 140.66 KB | MegaChriz |
| #12 | 2832484-preview.png | 77.35 KB | MegaChriz |
Comments
Comment #2
MegaChriz CreditAttribution: MegaChriz as a volunteer commented

Parsers by default expect the source to be UTF-8. If the charset is anything else, it will probably not be parsed correctly. There is already an issue open about this: #1220606: Add support for encoding conversions for any parser, so I'm closing this issue as a duplicate of that one.
I think the reason why setting the metatag helps here is that the XPath HTML parser uses Tidy to clean up the HTML.
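The UTF-8 expectation can be illustrated outside Drupal. A minimal Python sketch (the behaviour is byte-level, so the language doesn't matter) of what happens when UTF-8 bytes are read under a single-byte charset such as Latin-1:

```python
# "Oñati" in UTF-8 uses two bytes (0xC3 0xB1) for the "ñ".
raw = "Oñati".encode("utf-8")

# A parser that assumes a single-byte charset reads those two bytes
# as two separate characters, producing the classic mojibake.
garbled = raw.decode("latin-1")
print(garbled)  # → OÃ±ati

# Decoding with the right charset keeps the word intact.
print(raw.decode("utf-8"))  # → Oñati
```

The same bytes, two different declared charsets, two different results: this is why a missing or wrong charset declaration alone is enough to break an import.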
Comment #3
Gotz84 CreditAttribution: Gotz84 commented

But as I said in my question, the HTML is supposedly in UTF-8.
I got in touch with the webmaster of the site and he told me they send the charset in the headers, not in the page as a meta tag.
So, does Drupal ignore the header data? (See the first image in my question.)
Thank you @MegaChriz
Comment #4
MegaChriz CreditAttribution: MegaChriz as a volunteer commented

Okay, that sounds like a different issue than #1220606: Add support for encoding conversions for any parser, so I'm reopening this one.
I've glanced through the code of both http_request_get() and drupal_http_request(). The only encoding-specific thing I saw is the following line in http_request_get():

According to http://php.net/manual/en/function.curl-setopt.php, this should mean that it accepts any encoding. I don't know enough about cURL to know if it tries to convert the encoding.
You say you downloaded the file and moved it to localhost before importing it. Could it be that the character encoding is converted by your editor or your localhost server? Maybe it helps if you specify the encoding explicitly in your editor? Some editors allow you to set the encoding of a file.
I'm just guessing here what the cause of the issue could be. Content in UTF-8 should get imported just fine.
Comment #5
MegaChriz CreditAttribution: MegaChriz as a volunteer commented

To come back to your question: I don't know if Drupal ignores the header in which the character set is specified, but I didn't find anything specific in the code that checks that header. It may be that cURL or PHP does something with that header, but I don't know anything about that.
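For reference, the lookup order the question is asking about (HTTP Content-Type header first, then a meta tag, then a default) could be sketched like this. This is a hypothetical helper in Python, not code from Feeds or Drupal core; the function name and the 2048-byte sniff window are my own choices:

```python
import re

def detect_charset(content_type_header, body_bytes, default="utf-8"):
    """Pick a charset: HTTP Content-Type header first, then a <meta> tag."""
    if content_type_header:
        match = re.search(r"charset=([\w-]+)", content_type_header, re.I)
        if match:
            return match.group(1).lower()
    # No charset in the header: sniff a meta declaration near the top of the body.
    head = body_bytes[:2048].decode("ascii", errors="ignore")
    match = re.search(r"""charset=["']?([\w-]+)""", head, re.I)
    if match:
        return match.group(1).lower()
    return default
```

With the original article, the header case would win (text/html; charset=utf-8). With the copy moved to localhost, neither the header nor a meta tag declares anything, so a consumer falls back to its own default, which for many browsers and parsers is not UTF-8, hence the garbling.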
Comment #6
Gotz84 CreditAttribution: Gotz84 commented

Hello @MegaChriz
Sorry for the delay, I have been away for a week.
I downloaded the example news article using wget and moved it to localhost. Downloading with curl gives the same result.
Running the file -bi command in the terminal says that the file is UTF-8:
text/html; charset=utf-8
Then, if I open the article on localhost with Firefox, the server doesn't send any encoding info in the header:
Content-Type: text/html
and there are no meta tags declaring UTF-8, so the article is shown with badly encoded characters.
On the original site, however, the header shown in Firefox includes the encoding info:
Content-Type: text/html; charset=utf-8
The example article is here.
Thank you
Comment #7
MegaChriz CreditAttribution: MegaChriz as a volunteer commented

I have been able to reproduce the issue! Well, at first I didn't, but after fiddling with the parser settings, I finally got the same result.
How I reproduced the issue
I could reproduce the issue using the following steps:
//body
//h1
//span[contains(concat(' ', @class, ' '), ' herria ')]
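As an aside, the third query uses the standard XPath 1.0 idiom for matching one whole class token: contains(concat(' ', @class, ' '), ' herria ') pads the class attribute with spaces so that herria matches but a longer token like herriak does not. The same check, mirrored as a hypothetical Python helper (the function name is mine, not from the module):

```python
def has_class_token(class_attr, name):
    # Mirrors the XPath idiom: contains(concat(' ', @class, ' '), ' name ')
    # Padding both strings with spaces makes the match token-exact.
    return f" {name} " in f" {class_attr} "

print(has_class_token("herria nagusia", "herria"))  # → True
print(has_class_token("herriak", "herria"))         # → False (a plain contains() would match)
```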
How to fix
To fix the issue, set the 'Tidy encoding' option to 'UTF8'. See also attached image.
Consider using the Feeds extensible parsers module in the future; it is the successor of the Feeds XPath Parser module and also has a nicer UI.
Feel free to reopen this issue if you still encounter the same problem after having set the tidy encoding option to 'UTF8'.
Comment #8
Gotz84 CreditAttribution: Gotz84 commented

Hello @MegaChriz,
In my case, it doesn't fix the problem.
Is the title correctly encoded in the debug output when you import the feed?
I made a simple test importing only this article, and it didn't change.
I checked Tidy and set UTF8 as in your picture (this is the default value).
I enabled debugging for the title and it showed as:
Herriko bi puntutan trafiko neurketak egin ditu Oñatiko Udalak
I don't know why this fixes the problem for you, but not for me.
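The debug value in #8 looks like the mojibake has been applied more than once (each faulty round garbles the text further). A single round, at least, is mechanically reversible; a Python sketch of the repair, assuming one UTF-8-read-as-Latin-1 round:

```python
# One round of damage: UTF-8 bytes were decoded as Latin-1.
garbled = "OÃ±ati"

# Reverse it: re-encode with the wrong charset to recover the original
# byte sequence, then decode those bytes as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # → Oñati
```

Each additional round needs another pass, and rounds through Windows-1252 can be lossy (a few byte values have no cp1252 mapping), so repairing after the fact is a last resort compared to decoding correctly on import.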
Comment #9
Gotz84 CreditAttribution: Gotz84 commented

Comment #10
MegaChriz CreditAttribution: MegaChriz as a volunteer commented

Can you provide an export of your importer? Maybe there is another configuration error in there. Or maybe it is a bug in the XPath XML parser.
Comment #11
Gotz84 CreditAttribution: Gotz84 commented

I made this simple test getting only the title.
Comment #12
MegaChriz CreditAttribution: MegaChriz commented

Thanks for the importer export. Alas, I'm still not able to reproduce your issue, not even on a clean install. See the attached image:
When searching for similar issues on the web, I came across this (the XPath parser uses the DOMXPath class for executing xpath queries):
http://stackoverflow.com/questions/8993747/php-domxpath-encoding
So it sounds like there are differences between our environments that cause the issue to appear in your case and not in mine. You could try the XPath HTML parser from Feeds extensible parsers to see if that improves anything.
The HTML page in question is a bit malformed, however:
Anyway, this seems more like a Feeds XPath Parser issue, so I'm moving this issue to that queue.
Comment #13
Gotz84 CreditAttribution: Gotz84 commented

Hello @MegaChriz,
I tried to use Feeds Extensible Parsers without luck.
I selected the HTML XPath parser as the parser, but then in the mappings, xpathparser is missing from the source list.
First I made a backup of the database, then I tried several things:
- Cleared caches
- Deleted all previous feed importers
- Deactivated Feeds XPath Parser
- Deactivated all Feeds-related modules and then reactivated them
...
But it is still missing.
These are the modules we use and their versions (maybe there are conflicts):
Comment #14
MegaChriz CreditAttribution: MegaChriz commented

When going from the Feeds XPath Parser module to the Feeds Extensible Parsers module, you have to redefine your mappings, as the sources are named differently. Instead of xpathparser:0, xpathparser:1, etc., you define the source names yourself in the parser settings (where you configure the XPaths).
Comment #15
Gotz84 CreditAttribution: Gotz84 commented

Thank you @MegaChriz,
I thought the process was the same: first choose a bundle type, then the mapping, and finally configure the XPath.
Now I see that I have to configure the XPath first and make the mapping last.
I tried to run a test, but I'm having problems trying to get the lang attribute of the html element with the HTML XPath parser.
With the previous parser I had no problems, but now the result is empty.
The XPath value is ancestor::html/@lang, but I also tried setting the context to html and then using only @lang, etc., without success.
Am I doing something wrong or is it a bug?
(BTW, I am testing on another site with the same problem, because Goiena now sets the meta tag with the charset attribute.)
Comment #16
MegaChriz CreditAttribution: MegaChriz commented

I'm not sure how to select an attribute on the root element either. Maybe this is a bug. If I save the DOMNode object that is passed to FeedsExXpathDomXpath::evaluate() using $document->saveXML($context_node); (sorry for the technical terms), then the html tag no longer has attributes.
Comment #17
Gotz84 CreditAttribution: Gotz84 commented

OK, so I'll have to wait for a solution to one of the two problems!
Thank you for everything, @MegaChriz.
Comment #18
Gotz84 CreditAttribution: Gotz84 commented

@MegaChriz,
Do I have to open a new issue about the bug with the html tag attributes in Feeds Extensible Parsers, or can this issue be used for both problems?
Thank you
Comment #19
MegaChriz CreditAttribution: MegaChriz commented

@Gotz84
I think it should be a new issue; this one should focus on fixing the encoding problem. For this problem, I do think http://stackoverflow.com/questions/8993747/php-domxpath-encoding could be the same issue and may therefore help in coming to a solution.
Perhaps this issue could be fixed with a change in Feeds (namely by creating a solution for #1220606: Add support for encoding conversions for any parser), but I'm not sure about that yet.
Comment #20
Gotz84 CreditAttribution: Gotz84 commented

Hello @MegaChriz,
Would you like to check whether you can get the title of this article correctly?
I think the document is in UTF-8; however, the charset is declared as "Windows-1252".
Although it is declared as Windows-1252, browsers show the text correctly encoded.
I tried with "HTML Xpath parser" and "XPath HTML parser", both with and without Tidy, and with source encoding utf-8, windows-1252...
I have had no luck.
The title I always get is:
La 37ª edición de la Regata Ingenieros Deusto ya tiene su cartel anunciador
instead of:
La 37ª edición de la Regata Ingenieros Deusto ya tiene su cartel anunciador
Can I do something? I wrote to them to ask whether they made a mistake by setting the charset to "Windows-1252".
Thank you
I forgot the link to the article, sorry:
http://www.deusto.es/cs/Satellite/deusto/es/universidad-deusto/vive-deus...(Universidad+de+Deusto-Notas+de+Prensa+de+Deusto)
Comment #21
leolandotan CreditAttribution: leolandotan as a volunteer and at Promet Source commented

Hi guys,
I'm also having this issue, exactly like @Gotz84 is experiencing. I'm fetching from a Spanish RSS feed.
So far I have found no solution.