I have HTML Tidy installed and enabled via PHP 5.1.x. I tested the install with:

<?php
    $tidy = tidy_parse_string("<B>Hello</I> How are <U> you?</B>");
    tidy_clean_repair($tidy);
    echo $tidy;
?>

That works: the HTML comes back repaired in the output. My PHP install has no value set for open_basedir, so I'm not sure whether that is the problem.

I'm also not sure what to set the "Path to HTMLTidy executable" to - the default value of /usr/bin/tidy seems to throw errors:

* warning: exec() has been disabled for security reasons in /nfs/disk/data/www/virtual/webdev.ous.edu/cws/modules/import_html/coders_php_library/install-htmltidy.inc on line 87.
* HTMLTidy executable is not available. Found 'tidy' binary, but it didn't run right. /usr/bin/tidy -v failed to respond correctly

Any ideas?

Comments

harriska2:

Status: Active » Closed (fixed)

Never mind, this particular site does not allow executables. We will try to use it in command-line mode, or try to set up a separate site for this.

Another site had the open_basedir restriction. What a pain. I ended up putting the compiled tidy binary under one of my allowed directories and pointing import_html at it. I then had to put all the files in that one directory, without subdirectories (how nice). But it did work like a charm.
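For anyone hitting the same two restrictions, here is a minimal sketch of how they can be checked from PHP. The function name tidy_env_report() is hypothetical and not part of import_html; it only inspects the disable_functions and open_basedir settings mentioned above.

```php
<?php
/**
 * Hypothetical helper (not part of import_html): report whether exec()
 * is usable and which directories open_basedir allows.
 */
function tidy_env_report($disable_functions, $open_basedir) {
  // disable_functions is a comma-separated list in php.ini.
  $disabled = array_filter(array_map('trim', explode(',', $disable_functions)), 'strlen');
  // open_basedir entries are separated by PATH_SEPARATOR
  // (':' on Unix, ';' on Windows); an empty value means no restriction.
  $allowed = ($open_basedir === '') ? array() : explode(PATH_SEPARATOR, $open_basedir);
  return array(
    'exec_allowed' => function_exists('exec') && !in_array('exec', $disabled),
    'allowed_dirs' => $allowed,
  );
}

// Report on the current server's settings.
$report = tidy_env_report(
  (string) ini_get('disable_functions'),
  (string) ini_get('open_basedir')
);
print_r($report);
```

If exec_allowed is false, no path to a tidy binary will work. If allowed_dirs is not empty, the binary has to sit inside one of the listed directories.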

shiva7663:

Am I correct in assuming that if you can convince your hosting admin to install the PHP Tidy extension, the module won't use exec() at all, thus successfully bypassing the issue?

dman:

Um, yes. The php5 tidy extension is used by choice, the commandline version is a fallback.
If you can turn on the extension in your php.ini, it SHOULD be detected and used.

... because different server setups may or may not support different things. I cannot predict all possible configurations, but so far I think I've covered about 80% of cases. I have not tested open_basedir problems, but they should be avoidable if you upload your source tree inside your files dir.
My use case is to build on a local test box (where I have control) before mirroring the result up to a commercial host.
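The preference order described above (extension first, command-line binary as the fallback) could be sketched roughly like this. choose_tidy_backend() is a hypothetical name, and this is not the module's actual detection code; the default path is the one from the original report.

```php
<?php
/**
 * Hypothetical sketch of the preference order: PHP tidy extension first,
 * command-line binary second. Not the module's actual detection code.
 */
function choose_tidy_backend($tidy_path = '/usr/bin/tidy') {
  // Preferred: the PHP5 tidy extension, which needs no exec() at all.
  if (extension_loaded('tidy') && class_exists('tidy')) {
    return 'extension';
  }
  // Fallback: the binary, usable only if exec() is permitted and the
  // file can actually be executed (open_basedir can block this check).
  $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));
  if (function_exists('exec') && !in_array('exec', $disabled) && @is_executable($tidy_path)) {
    return 'exec';
  }
  return 'none';
}

echo choose_tidy_backend(), "\n";
```

On a host like the one in the original report (exec() disabled, no extension), this would report 'none'.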

shiva7663:

I'm trying to use the PHP4 Tidy extension, and it's still falling back to try to use a Tidy executable (which fails because exec() is disallowed). Does this module only support the PHP5 Tidy extension?

dman:

Status: Closed (fixed) » Postponed (maintainer needs more info)

I don't recall seeing a working tidy extension in the official PHP4 binary distributions I tried (Win32, Ubuntu, Debian).
However, if you have a compiled version, there's a chance that it uses a different syntax which I've not been able to read about. ... Oh yeah, there was a version partially distributed under PECL... I never got that to work on my machine...

In tidy_functions I test

  if ( extension_loaded('tidy' ) && function_exists('tidy') ) {
    debug('Using tidy Extension', 3);

... If that is failing ... then the fallback is tried.

Looking at http://nz.php.net/tidy, I think that means the PHP4 tidy was function-based (although I can't see the specific docs), whereas the PHP5 one is OO.
The invocation may be just as different as the XML rewrite between PHP4 and PHP5. Bloody boring. Totally different code could be needed. I'm not very interested in continuing to support deprecated PHP4, but you may find a quick work-around.

Can you confirm extension_loaded('tidy' ) and that new tidy works for you?

shiva7663:

Yep, looks like tidy for php4 is significantly different than for php5. Here's my test code:

<?php

ob_start();

echo '<pre>';
print_r(get_loaded_extensions());
print_r(get_extension_funcs('tidy'));
echo '</pre>';

$out1 = ob_get_contents();

ob_end_clean();

$html = '<HTML><HEAD></HEAD><BODY>' . $out1 . '</BODY></HTML>';

$config = array('indent'=> TRUE,
                'output-xhtml' => TRUE,
                'wrap' => 80);

tidy_set_encoding('UTF8');

foreach ($config as $key => $value) {
   tidy_setopt($key,$value);
}

tidy_parse_string($html);
tidy_clean_repair();
echo tidy_get_output();

?>

which gave me usable xhtml:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<pre>
Array
(
    [0] =&gt; standard
    [1] =&gt; openssl
    [2] =&gt; apache
    [3] =&gt; bz2
    [4] =&gt; ctype
    [5] =&gt; curl
    [6] =&gt; ftp
    [7] =&gt; gd
    [8] =&gt; mcrypt
    [9] =&gt; mhash
    [10] =&gt; mysql
    [11] =&gt; overload
    [12] =&gt; pcre
    [13] =&gt; pdf
    [14] =&gt; posix
    [15] =&gt; session
    [16] =&gt; tokenizer
    [17] =&gt; xml
    [18] =&gt; xslt
    [19] =&gt; zlib
    [20] =&gt; tidy
)
Array
(
    [0] =&gt; tidy_setopt
    [1] =&gt; tidy_getopt
    [2] =&gt; tidy_parse_string
    [3] =&gt; tidy_parse_file
    [4] =&gt; tidy_get_output
    [5] =&gt; tidy_get_error_buffer
    [6] =&gt; tidy_clean_repair
    [7] =&gt; tidy_repair_string
    [8] =&gt; tidy_repair_file
    [9] =&gt; tidy_diagnose
    [10] =&gt; tidy_get_release
    [11] =&gt; tidy_get_config
    [12] =&gt; tidy_get_status
    [13] =&gt; tidy_get_html_ver
    [14] =&gt; tidy_is_xhtml
    [15] =&gt; tidy_is_xml
    [16] =&gt; tidy_error_count
    [17] =&gt; tidy_warning_count
    [18] =&gt; tidy_access_count
    [19] =&gt; tidy_config_count
    [20] =&gt; tidy_load_config
    [21] =&gt; tidy_load_config_enc
    [22] =&gt; tidy_set_encoding
    [23] =&gt; tidy_save_config
)

</pre>
</body>
</html>

So I suppose this means that I'm out of luck with this module until my ISP is forced to upgrade to PHP5 sometime in 2008, when official support runs dry (not sure they'll do it even then, I just don't know). Maybe by then I'll be experienced enough to backfill PHP4 tidy support myself. heh.

dman:

Good debugging!

Well, the meat of the functions needed is all in the first half of the xml_tidy_string() function in tidy-functions.inc, and you can see it looks almost identical to your test code. It shouldn't be too hard to test for a version there and try your own way.
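A rough sketch of what such a version test might look like follows. The name my_tidy_string() is hypothetical and this is not the module's xml_tidy_string(); the PHP4 branch simply reuses the calls from the test script earlier in this thread and is untested against that extension.

```php
<?php
/**
 * Hypothetical wrapper: repair markup with whichever tidy API is loaded.
 * Returns the repaired string, or FALSE when no tidy extension is usable.
 */
function my_tidy_string($html, $config) {
  if (class_exists('tidy')) {
    // PHP5-style extension: config and encoding are passed per call.
    return tidy_repair_string($html, $config, 'utf8');
  }
  if (function_exists('tidy_setopt')) {
    // PHP4-style extension: options are set one by one on global state,
    // as in the test script posted above (assumed, not verified here).
    tidy_set_encoding('UTF8');
    foreach ($config as $key => $value) {
      tidy_setopt($key, $value);
    }
    tidy_parse_string($html);
    tidy_clean_repair();
    return tidy_get_output();
  }
  // Neither API is available.
  return FALSE;
}

$out = my_tidy_string('<p>hi', array('output-xml' => TRUE, 'wrap' => 200));
echo ($out === FALSE) ? "no tidy extension found\n" : $out;
```

Testing for the class and the function, rather than for a PHP version number, keeps the check tied to what is actually loaded.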

Or, if you are more comfortable with it, you can hack up a way to run tidy on all your files some other way before attempting the import. For paranoia reasons, and also so that we can import ANY XML dialect (like RDF or RecipeML), import_html only tries to run tidy if XML parsing fails at the first attempt. If you can feed it pure, already-valid code from source, tidy is never invoked.

The code is a bit redundant (it runs the parser in a try-catch block, then tidies and parses again), but it's done that way so you can skip the tidy dependency altogether ... if you have good input.
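The parse-first, tidy-only-on-failure flow can be sketched as below. This is an illustration, not the module's code: parse_with_tidy_fallback() is a hypothetical name, and it uses libxml_use_internal_errors() with DOMDocument instead of the try-catch block the module is said to use. The tidy options are a subset of the ones posted in this thread.

```php
<?php
/**
 * Hypothetical sketch: try to parse the source as XML first and run tidy
 * only if that fails. Returns a DOMDocument, or NULL if nothing worked.
 */
function parse_with_tidy_fallback($source) {
  if ($source === '') {
    return NULL;
  }
  libxml_use_internal_errors(TRUE);  // collect parse errors quietly
  $doc = new DOMDocument();
  if ($doc->loadXML($source)) {
    return $doc;                     // valid input: tidy is never invoked
  }
  libxml_clear_errors();
  if (!function_exists('tidy_repair_string')) {
    return NULL;                     // no tidy extension to fall back on
  }
  $config = array(
    'output-xml'       => TRUE,
    'numeric-entities' => TRUE,
    'add-xml-decl'     => FALSE,
    'doctype'          => 'omit',
    'wrap'             => 200,
  );
  $clean = tidy_repair_string($source, $config, 'utf8');
  if (!is_string($clean) || $clean === '') {
    return NULL;
  }
  // Second attempt, this time on the tidied markup.
  $doc = new DOMDocument();
  return $doc->loadXML($clean) ? $doc : NULL;
}

$doc = parse_with_tidy_fallback('<recipe><title>Soup</title></recipe>');
echo $doc->documentElement->nodeName, "\n";
```

Because the first parse succeeds on well-formed input, a site without any tidy support can still import files that were cleaned beforehand.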

Just make sure that it's output-xml not html and ... well, see the configs I use.

    $config = array(
       'indent'           => true,
       'output-xml'       => true,
       'numeric-entities' => true,
       'add-xml-decl'     => false,
       'doctype'          => 'omit',
       'char-encoding'    => 'utf8',
       'wrap'             => 200,
       'repeated-attributes' => 'keep-last',
    );

lejon:

If you're having trouble getting HTML Tidy to work because your web host prevents running executables, then see this post:

http://drupal.org/node/181652

UPDATE: This is currently not a solution, just another way into the problem...

dman:

Status: Postponed (maintainer needs more info) » Closed (fixed)

Cleaning up issue queue by closing stuff from the Drupal-5 branch and over a year old.