Hi,
I am trying - once again - to integrate HTMLTidy in my Drupal 5.
I tried different approaches - and the last one seemed simple enough (even though it is not the best one as far as performance is concerned).
In my theme folder for the web site, in node.tpl.php, I have used this:
if (function_exists("tidy_repair_string"))
{
  $xhtml = tidy_repair_string(trim($content), $GLOBALS["conf"]["tidy_config"]);
  if (!empty($xhtml)) {
    // $content = $xhtml . "<!-- (X)HTML sanitized -->";
    $content = $xhtml;
  }
}
print $content;
and in my settings.php file I have the following declaration:
$conf["tidy_config"] = array(
  "alt-text" => "",
  "break-before-br" => false,
  "drop-proprietary-attributes" => true,
  "indent-spaces" => "2",
  "hide-endtags" => true,
  "indent" => "auto",
  "output-xhtml" => true,
  "show-body-only" => true,
  "tidy-mark" => false,
  "wrap" => false,
  "numeric-entities" => false,
  "word-2000" => false,
  "quote-nbsp" => true,
  "input-encoding" => "raw",
  "output-encoding" => "utf8"
);
The problem is that it only partly works: there are a lot of problems both with entities and with non-ASCII characters.
If you have a look, for example, at this page: http://baravalle.it/citazioni/Stefano%20Benni you can see what I mean. It's full of squares instead of "—" signs, and just commenting out the HTMLTidy call gives me back my "—" signs.
Any suggestions? I have been looking at it for a few hours without success.
Changing input-encoding to utf8 doesn't help - already tried that,
Andres
Comments
=-=
Are you using the HTMLTidy module? http://drupal.org/project/htmltidy
_____________________________________________________________________
My posts & comments are usually dripping with sarcasm.
If you ask nicely I'll give you a towel : )
nope
I used the module some time ago, when I was still using a previous version of Drupal. I had updated and tweaked the module (when I tried it, it was using a binary call to the tidy executable) - but it wasn't working well enough (that's why I didn't submit anything back).
Today, after looking at some posts in the forums, I thought about giving it a second go, but in a simpler way, with the code I posted. And it doesn't work yet...
Andres
Looking at your html output,
Looking at your html output, I notice that the em dash has been replaced by a pair of characters, hex 14-20. (20 is a space of course).
The Unicode hex for em dash is 20-14 (U+2014). Not sure if this helps.
Could it be a little-endian/big-endian issue? It shouldn't be, since UTF-8 defines sequences of single bytes, but that's what it looks like.
Hex editor?
Hi,
what tool/procedure did you use to see it? From Firefox, I cannot get the hex code (or I don't know how to do it).
Thanks for the help,
Andres
I downloaded the html output
I downloaded the html output as text and opened it with a text editor which could show it in hex (Ultraedit). But you can easily find an open source hex editor or viewer with google (for example http://www.tech-faq.com/hex-editor.shtml).
Then I compared what I saw with this: http://www.fileformat.info/info/unicode/char/2014/index.htm
----- Update
Looking at that page again, I noticed that the em dash is hex 20 14 only in UTF-16, which in a little-endian representation becomes 14 20, as expected. But in UTF-8 it is different. It is e2 80 94 (you can see it by putting that sequence in the text with a hex editor). So, the output is not really UTF-8.
Interesting
I followed the procedure you suggested - with some interesting results.
Another page in the site, for example (http://baravalle.it/ecommerceland) is full of �s. It's not a 1 to 1 match - different accented letters are replaced with �s.
�, in the hex editor, appears as ef bf bd, which I understand is the UTF-8 encoding of Unicode U+FFFD, the replacement character.
Which, I think, is saying that something went wrong...
But still not sure what, or why, and how to correct it.
Andres
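The byte sequences mentioned in the last few comments can be checked without a hex editor. The following is only a minimal sketch, assuming a PHP build with the mbstring extension loaded; the variable names are illustrative and not taken from the site's code.

```php
<?php
// Em dash (U+2014), written out explicitly as its UTF-8 bytes.
$emDashUtf8 = "\xE2\x80\x94";

// Hex dump of the same character in the encodings discussed above.
$hexUtf8    = bin2hex($emDashUtf8);
$hexUtf16Le = bin2hex(mb_convert_encoding($emDashUtf8, 'UTF-16LE', 'UTF-8'));
$hexUtf16Be = bin2hex(mb_convert_encoding($emDashUtf8, 'UTF-16BE', 'UTF-8'));

// U+FFFD (replacement character): start from UTF-16BE bytes ff fd, convert to UTF-8.
$hexReplacement = bin2hex(mb_convert_encoding("\xFF\xFD", 'UTF-8', 'UTF-16BE'));

echo "em dash UTF-8:    $hexUtf8\n";        // e28094
echo "em dash UTF-16LE: $hexUtf16Le\n";     // 1420
echo "em dash UTF-16BE: $hexUtf16Be\n";     // 2014
echo "U+FFFD  UTF-8:    $hexReplacement\n"; // efbfbd
```

Seeing 14 20 in a page that is served as UTF-8 therefore suggests that the text passed through a UTF-16 (or otherwise non-UTF-8) representation somewhere, while ef bf bd means that some step could not interpret its input bytes and substituted the replacement character.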
Were none of these problems
Were none of these problems present without tidy?
nope...
If I comment out the tidy lines, everything goes back to normal.
Andres
Maybe some php function
Maybe some php function unaware of Unicode?
Does the third argument ("utf8") like in this example make any difference?
http://php.net/manual/en/function.tidy-repair-string.php#66066
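For reference, this is roughly what passing the encoding as the third argument looks like. It is a sketch rather than a tested fix: the helper name `repair_fragment_utf8` is hypothetical, and removing the encoding keys from the config array is an assumption made here so that they cannot conflict with the third argument.

```php
<?php
/**
 * Hypothetical helper: repair an HTML fragment with tidy and declare the
 * text as UTF-8 through the third argument of tidy_repair_string().
 */
function repair_fragment_utf8($html, array $config)
{
    if (!function_exists('tidy_repair_string')) {
        // tidy extension not loaded: leave the input untouched.
        return $html;
    }

    // Drop encoding options from the config so only the third argument applies.
    unset(
        $config['input-encoding'],
        $config['output-encoding'],
        $config['char-encoding']
    );

    $repaired = tidy_repair_string(trim($html), $config, 'utf8');

    // Fall back to the original text if tidy fails or returns nothing.
    return ($repaired === false || $repaired === '') ? $html : $repaired;
}

// Usage: "caf\xC3\xA9" is "café" written as UTF-8 bytes.
echo repair_fragment_utf8("<p>caf\xC3\xA9</p>", array('show-body-only' => true)), "\n";
```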
nope...
But I think I might have some idea.
If I test this code:
The first one, using entities, appears incorrect (both the è and the ). The second one appears correctly.
Looks like, for some reason, my entities are not translated correctly.
Andres
done!
Not sure if it's the most intelligent approach. Well, it almost certainly isn't - but it works.
I added html_entity_decode and now it works.
Not exactly sure why - I would have thought that tidy would be able to deal with it,
Andres
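The post above does not show where html_entity_decode was added, so the following is only a guess at the shape of the workaround: decode the entities to real UTF-8 characters first, then hand the text to tidy. The helper name `decode_entities_then_tidy` and the sample string are hypothetical.

```php
<?php
/**
 * Hypothetical helper: turn named and numeric entities into UTF-8
 * characters, then pass the text to tidy if the extension is available.
 */
function decode_entities_then_tidy($content, array $tidyConfig = array())
{
    $decoded = html_entity_decode($content, ENT_QUOTES, 'UTF-8');

    if (!function_exists('tidy_repair_string')) {
        // tidy extension not loaded: return the decoded text as is.
        return $decoded;
    }

    $xhtml = tidy_repair_string(trim($decoded), $tidyConfig, 'utf8');

    // Keep the decoded text if tidy fails or returns nothing.
    return empty($xhtml) ? $decoded : $xhtml;
}

// Entities alone, without tidy: &egrave; becomes c3 a8, &mdash; becomes e2 80 94.
echo bin2hex(html_entity_decode('&egrave;&mdash;', ENT_QUOTES, 'UTF-8')), "\n"; // c3a8e28094
```

One caveat: html_entity_decode also turns &lt;, &gt; and &amp; into literal characters, so text that relies on those entities to display markup as text may change meaning after decoding.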
HTMLTidy alternative htmLawed
You might be interested in looking at the htmLawed module. It enables the use of the htmLawed filter, a simple stand-alone alternative to the HTMLTidy application.