I've got a fairly long HTML document with a long list of internal anchors. I *believe* what's happening is that after print_pdf_generate_path converts the HREF elements to absolute links (by calling print.pages.inc's _print_rewrite_urls), it tries to restore anchor links to relative links again using a preg_replace call that is somehow failing.

I'm using the wkhtmltopdf-amd64 static binary, version 0.11.0 rc1 with the following options: -d 300 -s Letter --page-size Letter --outline --enable-internal-links --footer-font-size 7 --footer-right '[page]'. If I try to generate the page by calling the binary from the command line, my internal anchors are respected.

Here's the page I'm trying to render as PDF with internal links intact: http://www.oasis-open.org/policies-guidelines/tc-process

Thanks,

Jose


Comments

jcnventura’s picture

No, what's happening is that the print version successfully retains the ability to navigate the links (as you can see in the print link at the bottom of the page you indicated).

However, this is clearly not working correctly for the PDF case (using at least wkhtmltopdf). The annoying part (for me) is that everything is based on the output of the print version, so I'll have to see where that's being done and disable it in this case.

hackwater’s picture

I'm focusing pretty tightly on print_pdf.pages.inc and the print_pdf_generate_path function therein. Rather than disabling the calls to _print_rewrite_urls, I think I have a solution. It works for this particular PDF, anyway; let's see if we can validate it for a more general case.

I noted a couple of things after adding a debugging function to the code. It dumps the contents of $html to a file so I could do visual diffs at two points: after we convert the anchor elements, and after we attempt to make internal anchors relative again:

  // Convert the a href elements, to make sure no relative links remain
  $pattern = '!<(a\s[^>]*?)>!is';
  $html = preg_replace_callback($pattern, '_print_rewrite_urls', $html);

  // And make anchor links relative again, to permit in-PDF navigation
  $html = preg_replace("!${base_url}/". PRINTPDF_PATH ."/.*?#%2523!", '#', $html);

After dumping the $html into a file after each preg_replace, I found that the second preg_replace was not doing anything, at least in the case of my test HTML file: the two files were byte-for-byte identical. This means that there isn't a double-encoded "#" in the initial $html; dropping the %2523 so that the pattern ends at the literal # fixes this conversion. But it still wasn't working. More interestingly, opening the HTML output in a browser and clicking on the internal links ALSO wasn't working properly. I traced this to the <base> tag the Print conversion adds to the file; at least in the case of wkhtmltopdf, which respects the base tag, getting rid of the base tag solves the internal anchor problem:

  // And make anchor links relative again, to permit in-PDF navigation
  $html = preg_replace("!${base_url}/". PRINTPDF_PATH ."/.*?#!", '#', $html);
  $html = preg_replace("!<base[^>]+>!i", '', $html);
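
To make the difference between the two patterns concrete, here is a minimal standalone sketch. The base URL, the PRINTPDF_PATH value, and the sample link are all assumptions chosen for illustration; the module derives the real values at runtime, and the actual rewritten markup may look different.

```php
<?php
// Hypothetical values for illustration only.
$base_url = 'http://example.com';
define('PRINTPDF_PATH', 'printpdf');

// Assumed shape of an internal link after _print_rewrite_urls has made it absolute.
$html = '<a href="http://example.com/printpdf/node/1#section-2">Section 2</a>';

// Original pattern: requires the literal sequence "#%2523" in the markup.
$old = preg_replace("!{$base_url}/" . PRINTPDF_PATH . "/.*?#%2523!", '#', $html);

// Revised pattern: stops at the first literal "#" after the PDF path.
$new = preg_replace("!{$base_url}/" . PRINTPDF_PATH . "/.*?#!", '#', $html);

var_dump($old === $html); // the original pattern leaves the sample untouched
echo $new, "\n";          // the revised pattern yields a relative anchor link
```

With this sample input, the original pattern finds nothing to replace, while the revised one reduces the href to "#section-2".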

Patches rolled against 7.x-1.x and 6.x-1.x

hackwater’s picture

Status: Active » Needs review
jcnventura’s picture

Status: Needs review » Fixed

The base tag is absolutely necessary because of images and other media resources.

However, changing the %2523 does indeed make it work on wkhtmltopdf, but I don't think it's possible to make it work properly in tcpdf or dompdf.

I've committed the patch to git.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.