It appears that you can link to a specific page of a PDF document with a #page=nn URL fragment: if the browser is using the Acrobat Reader plugin, the PDF opens at the specified page (see http://helpx.adobe.com/acrobat/kb/link-html-pdf-page-acrobat.html, for example).

However, this breaks linkchecker: although the link works in the browser, linkchecker returns a 404 "URL fragment identifier not found in content" error.

wget --spider http://unfccc.int/resource/docs/2011/cop17/eng/09a02.pdf#page=16 says "200 OK" and "Remote file exists." but linkchecker says "404" and "URL fragment identifier not found in content".

This is technically correct, if we assume the URL points to an HTML document, but in this case it's a PDF and the fragment page=16 will never appear in the content. So the additional checking, and overriding of the 200 response with a 404 by linkchecker (added by #1875602: Check URL fragment identifiers in content), isn't wanted here.
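
To illustrate why such a check can never succeed for a PDF, here is a minimal sketch. It assumes the fragment check looks for a matching id or name attribute in the response body; the helper name example_fragment_in_html() is hypothetical and this is not linkchecker's actual code.

```php
<?php

/**
 * Hypothetical helper: TRUE if $fragment appears as an id or name attribute.
 *
 * Assumption: fragments are resolved against id="..." or name="..." in HTML.
 */
function example_fragment_in_html($body, $fragment) {
  $quoted = preg_quote($fragment, '/');
  // Match id="fragment" or name="fragment", with single or double quotes.
  return (bool) preg_match('/\b(?:id|name)\s*=\s*["\']' . $quoted . '["\']/i', $body);
}

$html = '<html><body><h2 id="intro">Intro</h2></body></html>';
// The first bytes of a PDF file: there is no markup to search.
$pdf = "%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n";

var_dump(example_fragment_in_html($html, 'intro'));
var_dump(example_fragment_in_html($pdf, 'page=16'));
```

The HTML lookup succeeds, while the PDF lookup fails because "page=16" is an instruction to the PDF viewer rather than an anchor inside the document.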

Perhaps the checking added by #1875602: Check URL fragment identifiers in content should only be used if the returned document is of type text/html?

Comments

fonant’s picture

FileSize
1.45 KB

A quick fix is to add a content-type check on the $response and check that we have a suitable text response (as requested in the Accept request header) before we check for the fragment being present.

Simple patch attached.
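
For readers without the patch to hand, a content-type check of this kind might look roughly like the sketch below. It is not the attached patch: the helper name example_should_check_fragment() and the exact list of accepted types are assumptions, and $headers is assumed to be keyed by lower-case header name, as in a Drupal 7 drupal_http_request() response.

```php
<?php

/**
 * Hypothetical helper: should the fragment be looked up in the response body?
 *
 * Assumption: $headers is an array keyed by lower-case header name.
 */
function example_should_check_fragment(array $headers) {
  if (empty($headers['content-type'])) {
    // Without a content type we cannot tell, so skip the fragment check.
    return FALSE;
  }
  // Drop parameters such as "; charset=utf-8" and normalise case.
  $parts = explode(';', $headers['content-type']);
  $type = strtolower(trim($parts[0]));
  // Assumed list of types in which a fragment can name an element.
  return in_array($type, array('text/html', 'application/xhtml+xml'), TRUE);
}

var_dump(example_should_check_fragment(array('content-type' => 'text/html; charset=utf-8')));
var_dump(example_should_check_fragment(array('content-type' => 'application/pdf')));
```

With a gate like this, a PDF response keeps its original 200 status instead of being overridden with a 404.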

hass’s picture

Version: 7.x-1.1 » 7.x-1.x-dev
Status: Active » Needs work

Please provide a git patch.

hass’s picture

I'm wondering why this Accept has no effect:

    // URL contains a fragment.
    if (in_array($link->method, array('HEAD', 'GET')) && !empty($uri['fragment'])) {
      // We need the full content and not only the HEAD.
      $link->method = 'GET';
      // Request text content only (like Firefox/Chrome).
      $headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
    }

hass’s picture

Status: Needs work » Needs review
FileSize
1.35 KB

Attaching Git patch

hass’s picture

Version: 7.x-1.x-dev » 6.x-2.x-dev
Status: Fixed » Patch (to be ported)

hass’s picture

Status: Patch (to be ported) » Fixed

fonant’s picture

That Accept does have an effect, but it says:

"Prefer text/html or application/xhtml+xml resources (with a preference of 100%), if not then application/xml resources (with a preference of 90%), but if they aren't available I'm happy with any resource type that matches the request URL (with a preference of 80%)."

It's a little easier to read if you insert spaces. Commas separate the options; each semicolon attaches a parameter to the option immediately before it, and is used here to specify that option's relative weighting: 'text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8'
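
The reading above can be checked with a small sketch that splits the header on commas first and then on semicolons. The helper name example_parse_accept() is hypothetical, and the parser is deliberately simplified: it only understands the q parameter.

```php
<?php

/**
 * Hypothetical helper: split an Accept header into media range => weight.
 */
function example_parse_accept($accept) {
  $weights = array();
  // Commas separate the options.
  foreach (explode(',', $accept) as $option) {
    // Semicolons separate a media range from its own parameters.
    $parts = array_map('trim', explode(';', $option));
    $range = array_shift($parts);
    if ($range === '') {
      continue;
    }
    // The weight defaults to 1 when no q parameter is given.
    $q = 1.0;
    foreach ($parts as $param) {
      if (preg_match('/^q\s*=\s*([0-9.]+)$/i', $param, $m)) {
        $q = (float) $m[1];
      }
    }
    $weights[$range] = $q;
  }
  return $weights;
}

print_r(example_parse_accept('text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'));
```

The result gives text/html and application/xhtml+xml a weight of 1, application/xml a weight of 0.9, and */* a weight of 0.8, so the header states a preference rather than a restriction.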

This is good, because we _do_ want to be able to test for the existence of non-text resources, such as images and PDF files.

The problem is merely in assuming that the response is HTML if there is a "#" in the URL, which is mostly the case, but not always: Acrobat allows the use of # to indicate a page number in a PDF.

hass’s picture

Yeah, I know, but I expected that the server might throw an error if I request foo and it can only deliver bar... however, I never tested it :-). This hopefully works now.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.