PDF link with #page= results in "URL fragment identifier not found in content" [#2088461]

It appears that you can link to PDF documents with a #page=nn URL fragment, which, if the browser is using the Acrobat Reader plugin, causes the PDF to open at the specified page (see http://helpx.adobe.com/acrobat/kb/link-html-pdf-page-acrobat.html, for example).

However, this breaks linkchecker: although the link works in the browser, linkchecker reports a 404 "URL fragment identifier not found in content" error.

wget --spider http://unfccc.int/resource/docs/2011/cop17/eng/09a02.pdf#page=16 says "200 OK" and "Remote file exists." but linkchecker says "404" and "URL fragment identifier not found in content".

This is technically correct if we assume the URL points to an HTML document, but in this case it's a PDF, and the fragment page=16 will never appear in the content. So the additional checking, and linkchecker's overriding of the 200 response with a 404 (added by #1875602: Check URL fragment identifiers in content), isn't wanted here.

Perhaps the checking added by #1875602: Check URL fragment identifiers in content should only be used if the returned document is of type text/html?
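To make the mismatch concrete, here is a minimal sketch (not linkchecker code) of how the example URL splits. The variable names are illustrative assumptions; only parse_url() and explode() are standard PHP. The fragment is handled by the client and is not sent to the server, which is why wget sees a plain 200 for the PDF.

```php
<?php
// Illustration only: the URL is the one from the report above.
$url = 'http://unfccc.int/resource/docs/2011/cop17/eng/09a02.pdf#page=16';

// parse_url() separates the fragment from the rest of the URL.
$uri = parse_url($url);
$fragment = isset($uri['fragment']) ? $uri['fragment'] : '';

// Everything before the first "#" is what actually gets requested.
list($request_url) = explode('#', $url, 2);

print $request_url . "\n";
print $fragment . "\n";
```

Here $fragment is "page=16", an instruction to the PDF viewer rather than an anchor that could be found in the response body.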

Comment	File	Size	Author
#4	Issue-2088461-by-fonant-hass-PDF-link-with-page-resu.patch	1.35 KB	hass
#1	linkchecker-2088461.patch	1.45 KB	fonant

Comments

Comment #1

fonant commented 13 September 2013 at 14:45

File	Size
linkchecker-2088461.patch	1.45 KB

A quick fix is to add a content-type check on the $response, so that we only look for the fragment when we have received a suitable text response (as requested in the Accept request header).

Simple patch attached.
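For readers without the patch to hand, the idea can be sketched roughly as follows. This is not the attached patch: the function name example_should_check_fragment() and the list of accepted MIME types are assumptions made for illustration, and the headers array is assumed to be keyed by header name.

```php
<?php
/**
 * Hypothetical helper: decide whether a fragment check makes sense.
 *
 * @param array $headers
 *   Response headers keyed by header name (assumed shape).
 *
 * @return bool
 *   TRUE if the response looks like (X)HTML, FALSE otherwise.
 */
function example_should_check_fragment(array $headers) {
  // Header names are case-insensitive, so normalise the keys first.
  $headers = array_change_key_case($headers, CASE_LOWER);
  if (empty($headers['content-type'])) {
    return FALSE;
  }
  // Drop parameters such as "; charset=utf-8".
  $parts = explode(';', $headers['content-type']);
  $mime = strtolower(trim($parts[0]));
  // Assumed list of types in which an anchor could be found.
  return in_array($mime, array('text/html', 'application/xhtml+xml'), TRUE);
}
```

With this sketch, a response carrying "Content-Type: text/html; charset=utf-8" would still be searched for the fragment, while one carrying "Content-Type: application/pdf" would keep its original 200 status.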

Comment #2

hass commented 13 September 2013 at 21:20

Version:	7.x-1.1	» 7.x-1.x-dev
Status:	Active	» Needs work

Please provide a git patch.

Comment #3

hass commented 13 October 2013 at 11:17

I'm wondering why this Accept has no effect:

    // URL contains a fragment.
    if (in_array($link->method, array('HEAD', 'GET')) && !empty($uri['fragment'])) {
      // We need the full content and not only the HEAD.
      $link->method = 'GET';
      // Request text content only (like Firefox/Chrome).
      $headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
    }

Comment #4

hass commented 13 October 2013 at 11:21

Status:	Needs work	» Needs review

File	Size
Issue-2088461-by-fonant-hass-PDF-link-with-page-resu.patch	1.35 KB

Attaching Git patch.

Comment #5

hass commented 13 October 2013 at 11:35

Status:	Needs review	» Fixed

http://drupalcode.org/project/linkchecker.git/commit/82c810a

Comment #6

hass commented 13 October 2013 at 11:35

Version:	7.x-1.x-dev	» 6.x-2.x-dev
Status:	Fixed	» Patch (to be ported)

Comment #7

hass commented 13 October 2013 at 11:37

Status:	Patch (to be ported)	» Fixed

http://drupalcode.org/project/linkchecker.git/commit/0eaf68b

Comment #8

fonant commented 17 October 2013 at 10:35

That Accept does have an effect, but it says:

"Prefer text/html or application/xhtml+xml resources (with a preference of 100%), if not then application/xml resources (with a preference of 90%), but if they aren't available I'm happy with any resource type that matches the request URL (with a preference of 80%)."

It's a little easier to read if you insert spaces, as commas separate the options. The semicolons bind more tightly: each one attaches a relative weighting (q value) to the single option it follows, and an option without one defaults to a weighting of 1: 'text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8'
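As a rough illustration of that reading, the header can be split into options and weights like this. The function name example_parse_accept() is hypothetical, and the sketch ignores media-type parameters other than q; it is not how linkchecker or any HTTP server processes the header.

```php
<?php
/**
 * Hypothetical helper: map each media range in an Accept header to its weight.
 */
function example_parse_accept($accept) {
  $weights = array();
  // Commas separate the options.
  foreach (explode(',', $accept) as $range) {
    // Semicolons separate a media range from its parameters.
    $params = array_map('trim', explode(';', $range));
    $type = array_shift($params);
    if ($type === '') {
      continue;
    }
    // A media range without an explicit q value has a weight of 1.
    $q = 1.0;
    foreach ($params as $param) {
      if (stripos($param, 'q=') === 0) {
        $q = (float) substr($param, 2);
      }
    }
    $weights[$type] = $q;
  }
  return $weights;
}

$weights = example_parse_accept('text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8');
print_r($weights);
```

The result maps text/html and application/xhtml+xml to 1, application/xml to 0.9 and */* to 0.8, so a PDF is still an acceptable answer to the request, just a less preferred one.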

This is good, because we _do_ want to be able to test for the existence of non-text resources, such as images and PDF files.

The problem is merely in assuming that the response is HTML whenever there is a "#" in the URL. That is mostly the case, but not always: Acrobat allows the use of # to indicate a page number in a PDF.

Comment #9

hass commented 17 October 2013 at 13:54

Yeah, I know, but I expected that the server might throw an error if I request foo and it can only deliver bar... however, I never tested it :-). This hopefully works now.