Hi,

I noticed that some characters are urlencoded when pathologic is enabled

for example:
cart/add/p30_q1?destination=cart

are transform to this:
cart/add/p30_q1%3Fdestination%3Dcart

which breaks the url redirection...

If I disable pathologic the url is not encoded and everything works as usual...any ideas?

Regards,
Sebastian

Comments

Garrett Albright’s picture

Title: url encoded characters in links » Query fragment being encoded

Hi… Just want to acknowledge that I've seen your message, but I can't quite work on it just yet. Thanks for trying Pathologic, though.

cocoliso’s picture

Hey Garret thank for the reply and for making this module.
No rush :-)

Garrett Albright’s picture

Hi there. Still trying to use Pathologic and having this problem?

When the next dev release (*not* the beta release… the dev releases on the project page have a red background) is available, please download it and give it a try. It will hopefully fix this problem.

jrosen’s picture

Hi Garrett, I am having this issue with the following URL:

/civicrm/profile/edit&gid=1&reset=1

becoming:
/civicrm/profile/edit%2526amp%3Bgid%3D1%2526amp%3Breset%3D1

I downloaded the latest dev release, but it doesn't seem to fix it. Can you post the new dev release you referred to or at least post a patch or which lines of code to change to fix this.

Thanks.

Garrett Albright’s picture

Yeargh, okay. That's not how I was expecting query strings to look, but I guess it's legitimate. Thanks for noting it.

Garrett Albright’s picture

All right, I've committed what may be a fix. The packaging robot should package together a new dev release soon; please give it a try.

Garrett Albright’s picture

Status: Active » Fixed

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

meecect’s picture

Status: Closed (fixed) » Active

I'm not sure this is fixed. I am using drupal 6, wysiwyg, tinymce, civicrm and the dev branch of pathologic.

Please read this comment fully. It is long, but the research is solid.

I have a link like this:

http://ilraiseyourhand.org/civicrm/contribute/transact/?reset=1&id=4

This link is inserted via tinyMCE (via wysiwyg). Pathologic seems to mangle the link into this:

http://ilraiseyourhand.org/civicrm/contribute/transact/?reset=1&amp%3Bid=4

After a judicious use of the dpm() function, I have tracked the problem down to this section of code, near line 120:

  if (isset($parts['query'])) {
    parse_str($parts['query'], $qparts);
    if (isset($qparts['q'])) {
      $parts['path'] = $qparts['q'];
      unset($qparts['q']);
    }
  }
  else {
    $qparts = NULL;
  }

the problem arises with the parse_str function. This function yields this in $qparts:

... (Array, 2 elements)
reset (String, 1 characters ) 1
amp;id (String, 1 characters ) 4

You can see that it does not handle an encoded query parameter string; it merely splits on the '&'. This is actually mentioned in a comment on the php manual for parse_str:

http://php.net/manual/en/function.parse-str.php :

parse_str() is confused by ampersands (&) being encoded as HTML entities (&). This is relevant if you're extracting your query string from an HTML page (scraping). The solution is to run the string through html_entity_decode() before running it through parse_str().

(Editors: my original comment was a caution whose solution is obvious, but it has resulted in three replies ("so what?" "as intended" and "this is how to fix it"). Please remove the previous four posts dealing with this (69529, 70234, 72745, 74818) and leave just the above summary. This issue is too trivial to warrant the number of comments it has received.)

Now, some think that the URL string should not have encoded entities in it. Perusing this thread at the tinyMCE forums shows that the tinyMCE developers disagree:

http://www.tinymce.com/forum/viewtopic.php?pid=34821 :

What I (and others) are saying is that that assumption is incorrect. It is a perfectly valid for the HREF attribute within the HTML itself. In fact, if you go to validate your page, you'll actually get warning if your ampersands in HREF attributes are NOT properly converted to HTML entities.

I did not take their word for it though, since it seems odd to encode a URL like that, so I went to the specs.

http://www.w3.org/TR/xhtml1/guidelines.html#C_12 :

In both SGML and XML, the ampersand character ("&") declares the beginning of an entity reference (e.g., ® for the registered trademark symbol "®"). Unfortunately, many HTML user agents have silently ignored incorrect usage of the ampersand character in HTML documents - treating ampersands that do not look like entity references as literal ampersands. XML-based user agents will not tolerate this incorrect usage, and any document that uses an ampersand incorrectly will not be "valid", and consequently will not conform to this specification. In order to ensure that documents are compatible with historical HTML user agents and XML-based user agents, ampersands used in a document that are to be treated as literal characters must be expressed themselves as an entity reference (e.g. "&"). For example, when the href attribute of the a element refers to a CGI script that takes parameters, it must be expressed as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user rather than as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user.

The recommended solution is to pass the query string through html_entity_decode() first. On a similar issue in the queue (http://drupal.org/node/1303782), a similar patch was contributed, but at a different place in the code. My solution is to do this:

  if (isset($parts['query'])) {
    parse_str(html_entity_decode($parts['query']), $qparts);
    if (isset($qparts['q'])) {
      $parts['path'] = $qparts['q'];
      unset($qparts['q']);
    }
  }
  else {
    $qparts = NULL;
  }

, as recommended in several forums as a fix for parse_str.

While this works, I'm not sure it is perfect, as it appears (but I'm not sure) that it has the side-effect of essentially undoing the encoding that tinyMCE had done in the first place. I would need to investigate what url() does in more detail to see if it puts the encoding back.

It's a one line change, but I can roll a patch for it if necessary.

Garrett Albright’s picture

Issue summary: View changes
Status: Active » Closed (won't fix)

Unfortunately, I'm no longer maintaining the 6.x branch of Pathologic, so any bugs in there are going to stay in there. Sorry.