I don't know if this is a bug or a feature request. I think it is a bug. ;)
When I use path_auto module it creates url from the node title. If I use spec character in the url and I copy it to node body it will be ugly. Perhaps, it will be an url encoded string. But my browser shows it nice because the browser decodes that. I suggest filter module place decoded url in the caption variable and title property, but not in the href property.
If I paste the following:
http://drupal6/contact/Sales/ez%20egy%20pr%C3%B3ba%20%C3%A1rv%C3%ADzt%C5%B1r%C5%91%20t%C3%BCk%C3%B6rf%C3%BAr%C3%B3g%C3%A9p
I would like to see:
http://drupal6/contact/Sales/ez egy próba árvíztűrő tükörfúrógép
pp
| Comment | File | Size | Author |
|---|---|---|---|
| #31 | 340776-10.patch | 3.21 KB | pp |
| #28 | 340776-9.patch | 3.2 KB | pp |
| #26 | 340776-8.patch | 3 KB | pp |
| #21 | 340776-7.patch | 1.52 KB | pp |
| #16 | Picture 3.png | 84.43 KB | snufkin |
Comments
Comment #1
pp commentedComment #2
gábor hojtsyI am suspicious that this might get us to security problems. If we just decode what we get, people can enter encoded HTML which we will decode and display, getting a nice XSS.
Comment #3
pp commentedI talked with Hojtsy Gábor and I think this issue isn't clear enough. I probe once more.
When I post a node I like put the link which contains special characters. Eg. Wikipedia page of "Árvíztűrő tükörfúrógép". It's a special words which contains all special characters of Hungarian language.
When i watch the browser url line I see the following:
http://hu.wikipedia.org/wiki/Árvíztűrő_tükörfúrógép
When I copy and paste this url the node body textarea I see the following:
http://hu.wikipedia.org/wiki/%C3%81rv%C3%ADzt%C5%B1r%C5%91_t%C3%BCk%C3%B...
When I watch the preview I see the following:
http://hu.wikipedia.org/wiki/%C3%81rv%C3%ADzt%C5%B1r%C5%91_t%C3%BCk%C3%B...
I like to see which I saw int the browser url line when I watch the preview and after the posted the node.
Please see the image(You see in the picture what do I like to see) and review my patch.
I know the patch is not perfect. It just work with my example and I hope it is safe for XSS attack, but not sure.
It doesn't work when url contains a brackets, because url filter doesn't recognize it.
Comment #4
pp commentedHojtsy Gábor:
My new path safe for this bug. I probed it.
Comment #5
gábor hojtsyWell, looks better :) Since even bugfixes get to Drupal 7 first (to ensure that no old bug appears again in that version), we need this committed there first.
Comment #7
pp commentedHere it is.
Comment #9
pp commentedHere it is.
Comment #10
cburschkaThat inline comment needs to be a full sentence, with the first letter capitalized and ending in a period. I'm guessing that "spec" is supposed to mean special, and if so it must be written out. :)
You also have two blank lines after the decode command; the maximum is one blank line to separate logical chunks.
I see that the result of this function is run through check_plain(), so there are indeed no output security issues (I mention this because I had to look twice myself).
Comment #11
cburschkaAddendum: Disregard the previous, I see a problem after all. The result of _filter_url_trim() is run through check_plain() when generating the caption, but the title attribute gets un-encoded data.
This allows a user to introduce URLs containing
"><evilscript><a href="something. You seriously need to check_plain() that stuff.Comment #12
pp commentedThanks for review!
The check_plain resolve the security problem. I tested it.
My English is not to good, please correct me if necessary!(and if you understand me:) ) Thanks!
pp
Comment #13
nevergonereviewed and tested: all right
Comment #14
snufkin commentedFor reference:
- decode_entities turns entities into utf-8 character.
- check_plain is essentially an htmlspecialchars()
- check_url does a check_plain through filter_xss_bad_protocol
So the return string has
- href with
$match[2]which has 1) entities turned into utf-8, 2) check_plained.- title with
$match[2]1) entity decoded, 2) check_plained, 3) urldecoded, 4) check_plained- caption which is 1) entity decoded, 2) urldecoded, 3) check_plained
Why do we need a check_plain(urldecode($match[2]) on the title? By that stage this variable has been check_plained in check_url.
Comment #15
pp commentedsnufkin:
Please read #3
I must use encoded character (%C3%81 the Á ) in the url. If I show it to a user I must decode it. (please click the link in the #3 and see the browsers address/link/url bar. You don't see %C3%81 you see Á... if you not use IE6 or older browser perhaps )
$caption and $title for user -> use ulrdecode function which convert eg. %C3%81 to Á
$match[2] for the computer -> not decode it.
I created a new patch, please review it.
pp
Comment #16
snufkin commentedFirefox (flock, but its basically firefox 3):
- with patch: link converts apparently fine (looks okay both in taskbar and address bar, but without the %20), but copying (and then pasting it somewhere) makes it look fugly.
- without patch: exact same thing
Safari:
- with patch: link converts into accents, but space is represented with %20 (both when i click and it appears in address bar and in the taskbar on the bottom of the window).
- without patch: exact same thing.
Text looks fine with and without the patch when i move mouse over the link and the link url appears...
It behaves like this in preview modes as well. I checked the image in #3, I get the exact same thing but without the patch as well (attachment).
Apart from the spaces the accents are handled fine already in Safari (and so i heard in other webkit based browsers such as konqueror), without the patch (e.g. I can already paste nice urls from drupal.hu which has accented urls without problem).
And a sidenote: the filter will not pick up accented urls automatically, and will not convert them into urls, with or without the patch (tested with flock (essentially firefox) and safari 4).
I tried all this with a beautiful fresh checkout of d7.
Comment #17
snufkin commentedIt seems that during my tests I have not altered the body of the node thus the cache was not flushed between tests. The patch does make a pasted link appear with proper accents.
I would say its CNR.
So again as a summary:
The url filter takes the url match and formats it into a link. The old style was to apply check_plain to the string, this resulted in decoding the characters on display. The patch fixes this by first encoding them, then decoding via check_plain.
I was trying to check for XSS by closing the href both with encoded characters and without and then inserting a script, but the check_plain captures it every time. Also tried to rearrange the filters, but still it works fine.
Comment #19
lilou commentedSetting to RTBC - testbot was broken.
Comment #21
pp commentedIt work's fine, but I saw two "Hunk message"s.
I rerolled the patch.
pp
Comment #22
webchickWe need an automated test, please. :) Should put into code the test you're doing to ensure it works properly, as well as a couple that you've tried that it blocks successfully. If you need help with SimpleTest, drop by #drupal on IRC.
I would also feel better about committing this if we have final +1 from someone on the security team like Gábor, Heine, or chx.
Comment #23
webchickAlso, while we're at it, let's add some comments to explain the logic so someone well-meaning doesn't try and "optimize" this later and remove the feature. Tests will help with that, of course, but never hurts to explain tweaky logic that if we get wrong results in security holes.
Comment #24
pp commentedhmm... hmm... maybe it is a little bit bigger than this issue. I saw the filter.test. Just linebreak filter and html filter(see bellow :D) are presented. The url filter test isn't present.
I think i should review some issues before i do anything ( http://drupal.org/project/issues/drupal?text=test&status=Open&priorities... )
pp
Comment #25
pp commentedwaiting for #276597: TestingParty08: filter.module
pp
Comment #26
pp commentedWebchick,
First of all, thank you for the review.
I did what you asked of me. I hope it's right.
We will wait for someone in the security team to approve it.
Comment #27
chx commentedIt's not the security team that should asses whether there is a danger to this, much rather there should be a test checking for it, that's what tests are for.
Comment #28
pp commentedI created the test. Please review.
My idea was following:
If my code put special character to a wrong place, the filter_xss function change my code result.
Is it a god/correct idea?
pp
Comment #29
c960657 commentedI think the title attribute should be omitted, as long as it only contains the URL of the link (even if it is in a slightly different format). I think it goes against the spirit of the HTML specification to simply duplicate the URL in the title attribute:
Comment #31
pp commentedc960657
The core filter module include the url in the title, not my patch. Please see the patch:
I think it is totally different issue, please create a new issue. Thank you.
Reroll the patch.
Please review.
pp
Comment #32
sunsubscribing
Comment #33
sunCan we please elaborate a bit more on the decoding and re-encoding here? I basically mean to transfer knowledge from this issue into inline comments.
Additionally, I agree with c95234876346 that we want to remove the 'title' attribute altogether, which effectively means that we do not have to describe the re-encoding for it -- thus, this change can be done within this issue.
Powered by Dreditor.
Comment #34
sunBetter title.
Comment #35
cleaver commentedReviewed the test - looks good and it passes locally
Comment #36
sunum, no, patch contains trailing white-space, and #33 still needs to be addressed.
Comment #37
hingo commentedHi
I'm hingo from #161217: URL filter breaks generated href tags. These bugs are not related, I just want to comment on this having become familiar URL filter myself. (I should also state that I'm Finnish, so I'm well aware of problems with special letters in non-English languages, although our dear sister language hungarian has more of them :-)
This post starts with an explanation of how Firefox works:
...then proposes that the encoded "ugly url" should be decoded to "human readable url".
I strongly oppose the whole premise of this bug report! In my opinion, the job of Drupal's URL filter is to convert URL's into clickable links, without altering the text content. If I insert the "ugly url" into a Drupal post, then I wan't to see the ugly url as the caption. If I insert a human readable url, then I want to see the human readable url.
I'm not sure why Firefox insists on providing us with the "ugly url" when I'm actually copying a "human readable url". In my opinion this could be filed as a bug against Firefox. (And apparently Safari too?) In any case, Drupal should not include this patch just because Firefox is doing something weird!
It could however be argued that URL filter should do the opposite of what is proposed here. If I post a "human readable url", then the href of the generated link must be urlencoded. However, all current browsers seem to be just fine supporting human readable url's in links. If this weren't the case, someone would have reported this as a bug long ago.
Even if I'm proposing to close this as "by design", a small code review comment before I go: Why isn't the same change done for email links, just other URL's? Now email links would work incosistently with other types of URL's.
Finally, reading this thread I have collected the following interesting observations, that are useful even if this patch is (hopefully) not committed:
1) title attribute is superfluous. Yes, agree.
2) URL filter does:
$match[2] = decode_entities($match[2]);then later$match[2] = check_plain($match[2]);It is not clear to me why this is done back and forth. Note that the regexp that catches the URL's is written such that no XSS or other bad thing should happen anyway, if we just let the URL pass as it was written. For instance, a quotation mark will not be considered part of the URL, nor will a "less than" character. This would need testing, but it seems the code can be simplified here by removing unnecessary function calls.
Both of the above are also valid comments for my patch from #161217: URL filter breaks generated href tags, since it does not address this area at all and the lines are left functionally unchanged.
To re-iterate: I propose that this bug report is "by design".
Comment #38
pp commentedhingo,
You are right. It isn't general problem. I tried this only Firefox. I thought it is general. It is my mistake.
The Safari copy the "human readable url" on the clipboard.
When I write "human readable url" to the Galeon it convert it to "ugly url" immediately. (and copy the ugly url to the clipboard, perhaps)
I don't know the IE how works.
pp
Comment #39
sunI don't think so.
http://hu.wikipedia.org/wiki/%C3%81rv%C3%ADzt%C5%B1r%C5%91_t%C3%BCk%C3%B...
Yes, this ^^ is the URL as copied from the browser's address bar. But it makes little sense to keep it encoded for link text. The current patch ensured that the link href stays the same, but the (visible) link text gets decoded, so the actual result would be:
Comment #40
hingo commentedThanks PP for agreeing. I also didn't test with other browsers, but if this indeed is a Firefox-only problem, I think it supports my argument even more. Sorry to see so much work on this go to waste, but it is always important to first discuss where the problem really is, so we don't fix it in the wrong place.
Sun: If you don't want to have such a string of text in your post, you should not copy it into your post. The fact that you apparently cannot easily copy the "human readable url" from Firefox is a problem with Firefox. For Drupal, we should present the text as the user told us to present it, not try to "fix" things automatically. Such fixes will always lead to problems. (Compare with the AutoText features in MS Word and OpenOffice. They are a little bit convenient for some, very annoying for some others.)
Sun: Now, if I want to insert an encoded string into my article, and have it shown in encoded form, what should I do when your patch is in? Encode it twice? What is the logic behind that?
Comment #41
hingo commentedSun: To express myself more concisely, by way of example:
1) I want to write a Drupal post to express what happened in this thread.
2) I need to include the string...
http://hu.wikipedia.org/wiki/%C3%81rv%C3%ADzt%C5%B1r%C5%91_t%C3%BCk%C3%B...
...and I also need to include the string...
http://hu.wikipedia.org/wiki/Árvíztűrő_tükörfúrógép
...to explain the problem.
3) I use URL filter so both of them will be converted to clickable links. This is ok.
4) An obvious requirement is that the links work.
Supposing I'm using a Drupal with the proposed patch included, how is it even possible to do the above? I cannot think of a way, please let me know if there is one.
(PS: If you're wondering why the second URL above is not clickable, this is a bug in URL filter. It is being fixed #161217: URL filter breaks generated href tags)
Comment #42
sunAlso applies to IE7 and IE8, which both seem to support to input UTF-8 in URLs, but always display encoded URLs in the address bar (retained encoded on copy and paste).
Chrome, Safari, and Opera seem to get it right.
The given example of a blog post about this issue can be ignored, because that's not what regular Internet users do or want. Keep in mind that URL filter applies to everyone, everywhere; not just "Drupal users".
I think there's simply no point in displaying an encoded URL as link text.
Regardless of that; moved the title attribute issue over to #784790: URL filter displays URL as link title