I don't know if this is a bug or a feature request. I think it is a bug. ;)

When I use path_auto module it creates url from the node title. If I use spec character in the url and I copy it to node body it will be ugly. Perhaps, it will be an url encoded string. But my browser shows it nice because the browser decodes that. I suggest filter module place decoded url in the caption variable and title property, but not in the href property.

If I paste the following:

http://drupal6/contact/Sales/ez%20egy%20pr%C3%B3ba%20%C3%A1rv%C3%ADzt%C5%B1r%C5%91%20t%C3%BCk%C3%B6rf%C3%BAr%C3%B3g%C3%A9p

I would like to see:

http://drupal6/contact/Sales/ez egy próba árvíztűrő tükörfúrógép

pp

Comments

pp’s picture

Status: Active » Needs review
StatusFileSize
new1.47 KB
gábor hojtsy’s picture

I am suspicious that this might get us to security problems. If we just decode what we get, people can enter encoded HTML which we will decode and display, getting a nice XSS.

pp’s picture

Title: Decode spec characters in the caption and title of the url » Nice text in the caption and title in the pasted url
StatusFileSize
new31.29 KB
new1.5 KB

I talked with Hojtsy Gábor and I think this issue isn't clear enough. I probe once more.

When I post a node I like put the link which contains special characters. Eg. Wikipedia page of "Árvíztűrő tükörfúrógép". It's a special words which contains all special characters of Hungarian language.

When i watch the browser url line I see the following:
http://hu.wikipedia.org/wiki/Árvíztűrő_tükörfúrógép

When I copy and paste this url the node body textarea I see the following:
http://hu.wikipedia.org/wiki/%C3%81rv%C3%ADzt%C5%B1r%C5%91_t%C3%BCk%C3%B...

When I watch the preview I see the following:
http://hu.wikipedia.org/wiki/%C3%81rv%C3%ADzt%C5%B1r%C5%91_t%C3%BCk%C3%B...

I like to see which I saw int the browser url line when I watch the preview and after the posted the node.

Please see the image(You see in the picture what do I like to see) and review my patch.

I know the patch is not perfect. It just work with my example and I hope it is safe for XSS attack, but not sure.
It doesn't work when url contains a brackets, because url filter doesn't recognize it.

pp’s picture

Hojtsy Gábor:
My new path safe for this bug. I probed it.

gábor hojtsy’s picture

Version: 6.x-dev » 7.x-dev

Well, looks better :) Since even bugfixes get to Drupal 7 first (to ensure that no old bug appears again in that version), we need this committed there first.

Status: Needs review » Needs work

The last submitted patch failed testing.

pp’s picture

Status: Needs work » Needs review
StatusFileSize
new1.5 KB

Here it is.

Status: Needs review » Needs work

The last submitted patch failed testing.

pp’s picture

Assigned: Unassigned » pp
Status: Needs work » Needs review
StatusFileSize
new1.49 KB

Here it is.

cburschka’s picture

Status: Needs review » Needs work

That inline comment needs to be a full sentence, with the first letter capitalized and ending in a period. I'm guessing that "spec" is supposed to mean special, and if so it must be written out. :)

You also have two blank lines after the decode command; the maximum is one blank line to separate logical chunks.

I see that the result of this function is run through check_plain(), so there are indeed no output security issues (I mention this because I had to look twice myself).

cburschka’s picture

Addendum: Disregard the previous, I see a problem after all. The result of _filter_url_trim() is run through check_plain() when generating the caption, but the title attribute gets un-encoded data.

-  return $match[1] . '<a href="' . $match[2] . '" title="' . $match[2] . '">' . $caption . '</a>' . $match[5];
+  return $match[1] . '<a href="' . $match[2] . '" title="' . urldecode($match[2]) . '">' . $caption . '</a>' . $match[5];

This allows a user to introduce URLs containing "><evilscript><a href="something. You seriously need to check_plain() that stuff.

pp’s picture

Status: Needs work » Needs review
StatusFileSize
new1.52 KB

Thanks for review!

The check_plain resolve the security problem. I tested it.

My English is not to good, please correct me if necessary!(and if you understand me:) ) Thanks!

pp

nevergone’s picture

Status: Needs review » Reviewed & tested by the community

reviewed and tested: all right

snufkin’s picture

Status: Reviewed & tested by the community » Needs work
  $match[2] = decode_entities($match[2]);
  $caption = check_plain(_filter_url_trim($match[2]));
  $match[2] = check_url($match[2]);
  return $match[1] . '<a href="' . $match[2] . '" title="' . check_plain(urldecode($match[2])) . '">' . $caption . '</a>' . $match[5];

For reference:
- decode_entities turns entities into utf-8 character.
- check_plain is essentially an htmlspecialchars()
- check_url does a check_plain through filter_xss_bad_protocol

So the return string has
- href with $match[2] which has 1) entities turned into utf-8, 2) check_plained.
- title with $match[2] 1) entity decoded, 2) check_plained, 3) urldecoded, 4) check_plained
- caption which is 1) entity decoded, 2) urldecoded, 3) check_plained

Why do we need a check_plain(urldecode($match[2]) on the title? By that stage this variable has been check_plained in check_url.

pp’s picture

Status: Needs work » Needs review
StatusFileSize
new1.52 KB

snufkin:
Please read #3

I must use encoded character (%C3%81 the Á ) in the url. If I show it to a user I must decode it. (please click the link in the #3 and see the browsers address/link/url bar. You don't see %C3%81 you see Á... if you not use IE6 or older browser perhaps )

$caption and $title for user -> use ulrdecode function which convert eg. %C3%81 to Á
$match[2] for the computer -> not decode it.

I created a new patch, please review it.

pp

snufkin’s picture

Status: Needs review » Needs work
StatusFileSize
new84.43 KB

Firefox (flock, but its basically firefox 3):
- with patch: link converts apparently fine (looks okay both in taskbar and address bar, but without the %20), but copying (and then pasting it somewhere) makes it look fugly.
- without patch: exact same thing

Safari:
- with patch: link converts into accents, but space is represented with %20 (both when i click and it appears in address bar and in the taskbar on the bottom of the window).
- without patch: exact same thing.

Text looks fine with and without the patch when i move mouse over the link and the link url appears...
It behaves like this in preview modes as well. I checked the image in #3, I get the exact same thing but without the patch as well (attachment).

Apart from the spaces the accents are handled fine already in Safari (and so i heard in other webkit based browsers such as konqueror), without the patch (e.g. I can already paste nice urls from drupal.hu which has accented urls without problem).

And a sidenote: the filter will not pick up accented urls automatically, and will not convert them into urls, with or without the patch (tested with flock (essentially firefox) and safari 4).

I tried all this with a beautiful fresh checkout of d7.

snufkin’s picture

Status: Needs work » Reviewed & tested by the community

It seems that during my tests I have not altered the body of the node thus the cache was not flushed between tests. The patch does make a pasted link appear with proper accents.

I would say its CNR.

So again as a summary:

The url filter takes the url match and formats it into a link. The old style was to apply check_plain to the string, this resulted in decoding the characters on display. The patch fixes this by first encoding them, then decoding via check_plain.

I was trying to check for XSS by closing the href both with encoded characters and without and then inserting a script, but the check_plain captures it every time. Also tried to rearrange the filters, but still it works fine.

Status: Reviewed & tested by the community » Needs work

The last submitted patch failed testing.

lilou’s picture

Status: Needs work » Reviewed & tested by the community

Setting to RTBC - testbot was broken.

Status: Reviewed & tested by the community » Needs work

The last submitted patch failed testing.

pp’s picture

Status: Needs work » Reviewed & tested by the community
StatusFileSize
new1.52 KB

It work's fine, but I saw two "Hunk message"s.

I rerolled the patch.

pp

webchick’s picture

Status: Reviewed & tested by the community » Needs work
Issue tags: +Needs tests

We need an automated test, please. :) Should put into code the test you're doing to ensure it works properly, as well as a couple that you've tried that it blocks successfully. If you need help with SimpleTest, drop by #drupal on IRC.

I would also feel better about committing this if we have final +1 from someone on the security team like Gábor, Heine, or chx.

webchick’s picture

Also, while we're at it, let's add some comments to explain the logic so someone well-meaning doesn't try and "optimize" this later and remove the feature. Tests will help with that, of course, but never hurts to explain tweaky logic that if we get wrong results in security holes.

pp’s picture

hmm... hmm... maybe it is a little bit bigger than this issue. I saw the filter.test. Just linebreak filter and html filter(see bellow :D) are presented. The url filter test isn't present.

  /**
   * Test the HTML filter
   */
  function testHtmlFilter() {

  }

I think i should review some issues before i do anything ( http://drupal.org/project/issues/drupal?text=test&status=Open&priorities... )

pp

pp’s picture

Status: Needs work » Postponed
pp’s picture

Status: Postponed » Needs review
Issue tags: +Security
StatusFileSize
new3 KB

Webchick,

First of all, thank you for the review.
I did what you asked of me. I hope it's right.

We will wait for someone in the security team to approve it.

chx’s picture

It's not the security team that should asses whether there is a danger to this, much rather there should be a test checking for it, that's what tests are for.

pp’s picture

StatusFileSize
new3.2 KB

I created the test. Please review.

My idea was following:
If my code put special character to a wrong place, the filter_xss function change my code result.

    // Check special characters "'<> 
    $f = _filter_url('http://www.example.com/%22%27%3C%3E', 'f');
    $this->assertEqual($f, filter_xss($f), t('Special charachters are in wrong place'));

Is it a god/correct idea?

pp

c960657’s picture

I think the title attribute should be omitted, as long as it only contains the URL of the link (even if it is in a slightly different format). I think it goes against the spirit of the HTML specification to simply duplicate the URL in the title attribute:

12.1.4 Link titles
The title attribute may be set for both A and LINK to add information about the nature of a link. This information may be spoken by a user agent, rendered as a tool tip, cause a change in cursor image, etc.

Status: Needs review » Needs work

The last submitted patch failed testing.

pp’s picture

Status: Needs work » Needs review
StatusFileSize
new3.21 KB

c960657

I think the title attribute should be omitted,

The core filter module include the url in the title, not my patch. Please see the patch:

- return $match[1] . '<a href="' . $match[2] . '" title="' . $match[2] . '">' . $caption . '</a>' . $match[5];

I think it is totally different issue, please create a new issue. Thank you.

Reroll the patch.

Please review.

pp

sun’s picture

sun’s picture

+++ modules/filter/filter.module	30 Aug 2009 00:52:33 -0000
@@ -911,9 +913,11 @@ function _filter_url_parse_full_links($m
+  // International and special characters are decoded in the title and the caption for humans.
+  $title = check_plain(urldecode($match[2]));
+  $caption = check_plain(_filter_url_trim(urldecode($match[2])));

Can we please elaborate a bit more on the decoding and re-encoding here? I basically mean to transfer knowledge from this issue into inline comments.

Additionally, I agree with c95234876346 that we want to remove the 'title' attribute altogether, which effectively means that we do not have to describe the re-encoding for it -- thus, this change can be done within this issue.

Powered by Dreditor.

sun’s picture

Title: Nice text in the caption and title in the pasted url » URL filter does not urldecode() the URL for link text

Better title.

cleaver’s picture

Status: Needs review » Reviewed & tested by the community

Reviewed the test - looks good and it passes locally

sun’s picture

Status: Reviewed & tested by the community » Needs work

um, no, patch contains trailing white-space, and #33 still needs to be addressed.

hingo’s picture

Hi

I'm hingo from #161217: URL filter breaks generated href tags. These bugs are not related, I just want to comment on this having become familiar URL filter myself. (I should also state that I'm Finnish, so I'm well aware of problems with special letters in non-English languages, although our dear sister language hungarian has more of them :-)

This post starts with an explanation of how Firefox works:

When i watch the browser url line I see the following:
http://hu.wikipedia.org/wiki/Árvíztűrő_tükörfúrógép

When I copy and paste this url the node body textarea I see the following:
http://hu.wikipedia.org/wiki/%C3%81rv%C3%ADzt%C5%B1r%C5%91_t%C3%BCk%C3%B...

...then proposes that the encoded "ugly url" should be decoded to "human readable url".

I strongly oppose the whole premise of this bug report! In my opinion, the job of Drupal's URL filter is to convert URL's into clickable links, without altering the text content. If I insert the "ugly url" into a Drupal post, then I wan't to see the ugly url as the caption. If I insert a human readable url, then I want to see the human readable url.

I'm not sure why Firefox insists on providing us with the "ugly url" when I'm actually copying a "human readable url". In my opinion this could be filed as a bug against Firefox. (And apparently Safari too?) In any case, Drupal should not include this patch just because Firefox is doing something weird!

It could however be argued that URL filter should do the opposite of what is proposed here. If I post a "human readable url", then the href of the generated link must be urlencoded. However, all current browsers seem to be just fine supporting human readable url's in links. If this weren't the case, someone would have reported this as a bug long ago.

Even if I'm proposing to close this as "by design", a small code review comment before I go: Why isn't the same change done for email links, just other URL's? Now email links would work incosistently with other types of URL's.

Finally, reading this thread I have collected the following interesting observations, that are useful even if this patch is (hopefully) not committed:
1) title attribute is superfluous. Yes, agree.

2) URL filter does: $match[2] = decode_entities($match[2]); then later $match[2] = check_plain($match[2]);
It is not clear to me why this is done back and forth. Note that the regexp that catches the URL's is written such that no XSS or other bad thing should happen anyway, if we just let the URL pass as it was written. For instance, a quotation mark will not be considered part of the URL, nor will a "less than" character. This would need testing, but it seems the code can be simplified here by removing unnecessary function calls.

Both of the above are also valid comments for my patch from #161217: URL filter breaks generated href tags, since it does not address this area at all and the lines are left functionally unchanged.

To re-iterate: I propose that this bug report is "by design".

pp’s picture

Assigned: pp » Unassigned
Status: Needs work » Closed (works as designed)

hingo,

You are right. It isn't general problem. I tried this only Firefox. I thought it is general. It is my mistake.

The Safari copy the "human readable url" on the clipboard.
When I write "human readable url" to the Galeon it convert it to "ugly url" immediately. (and copy the ugly url to the clipboard, perhaps)
I don't know the IE how works.

pp

sun’s picture

Status: Closed (works as designed) » Needs work

I don't think so.

http://hu.wikipedia.org/wiki/%C3%81rv%C3%ADzt%C5%B1r%C5%91_t%C3%BCk%C3%B...

Yes, this ^^ is the URL as copied from the browser's address bar. But it makes little sense to keep it encoded for link text. The current patch ensured that the link href stays the same, but the (visible) link text gets decoded, so the actual result would be:

<a href="http://hu.wikipedia.org/wiki/%C3%81rv%C3%ADzt%C5%B1r%C5[...]">http://hu.wikipedia.org/wiki/Árvíztűrő tükörfúrógép</a>
hingo’s picture

Thanks PP for agreeing. I also didn't test with other browsers, but if this indeed is a Firefox-only problem, I think it supports my argument even more. Sorry to see so much work on this go to waste, but it is always important to first discuss where the problem really is, so we don't fix it in the wrong place.

Sun: If you don't want to have such a string of text in your post, you should not copy it into your post. The fact that you apparently cannot easily copy the "human readable url" from Firefox is a problem with Firefox. For Drupal, we should present the text as the user told us to present it, not try to "fix" things automatically. Such fixes will always lead to problems. (Compare with the AutoText features in MS Word and OpenOffice. They are a little bit convenient for some, very annoying for some others.)

Sun: Now, if I want to insert an encoded string into my article, and have it shown in encoded form, what should I do when your patch is in? Encode it twice? What is the logic behind that?

hingo’s picture

Status: Needs work » Closed (works as designed)

Sun: To express myself more concisely, by way of example:

1) I want to write a Drupal post to express what happened in this thread.

2) I need to include the string...
http://hu.wikipedia.org/wiki/%C3%81rv%C3%ADzt%C5%B1r%C5%91_t%C3%BCk%C3%B...
...and I also need to include the string...
http://hu.wikipedia.org/wiki/Árvíztűrő_tükörfúrógép
...to explain the problem.

3) I use URL filter so both of them will be converted to clickable links. This is ok.

4) An obvious requirement is that the links work.

Supposing I'm using a Drupal with the proposed patch included, how is it even possible to do the above? I cannot think of a way, please let me know if there is one.

(PS: If you're wondering why the second URL above is not clickable, this is a bug in URL filter. It is being fixed #161217: URL filter breaks generated href tags)

sun’s picture

Also applies to IE7 and IE8, which both seem to support to input UTF-8 in URLs, but always display encoded URLs in the address bar (retained encoded on copy and paste).

Chrome, Safari, and Opera seem to get it right.

The given example of a blog post about this issue can be ignored, because that's not what regular Internet users do or want. Keep in mind that URL filter applies to everyone, everywhere; not just "Drupal users".

I think there's simply no point in displaying an encoded URL as link text.

Regardless of that; moved the title attribute issue over to #784790: URL filter displays URL as link title