Early Bird Registration for DrupalCon Portland 2024 is open! Register by 23:59 PST on 31 March 2024, to get $100 off your ticket.
For example, a URL that should be:
http://www.iflry.org/modules.php?op=modload&name=News&file=article&sid=97
Is being shown as:
http://www.iflry.org/modules.php?op=modload&name=News&file=article&sid=97
--//--
Another problem is when someone doesn't put "http://" in their feed address. What should be:
www.eldr.org/modules.php?name=News&file=article&sid=731
Is being shown as:
http://www.liberal-social.org/www.eldr.org/modules.php?name=News&file=ar...
Comment | File | Size | Author |
---|---|---|---|
#17 | aggregator.module_12.patch | 3.22 KB | mfb |
#14 | aggregator.module_11.patch | 3.54 KB | mfb |
#10 | aggregator.module_10.patch | 3.5 KB | mfb |
#9 | aggregator.module_9.patch | 3.5 KB | mfb |
#8 | aggregator.module_8.patch | 941 bytes | mfb |
Comments
Comment #1
drubeedoo CreditAttribution: drubeedoo commentedNoticed problem still exists with 4.7.0-beta2 (as reported on 4.6.5 above).
Comment #2
drubeedoo CreditAttribution: drubeedoo commentedmy bad, sorry... slightly different problem on 4.7.0-beta2? (Can't tell due to HTML filtering here.) This issue may still be valid with 4.6.5.
4.7.0-beta2 aggregator breaks the following link:
to broken link:
Comment #3
mikhailian CreditAttribution: mikhailian commentedThat & appearing in the urls is dues to the call stack of
check_url -> filter_xss_bad_protocol -> check_plain -> htmlspecialchars
where htmlspecialchars replaces the ampersand by its entity.I just removed
check_url
as I trust my feed providers and do not see it as a major security threat.Comment #4
gregglesThis seems like two enhancement requests:
1. If a feed has an & in the URL then don't replace it
Workaround: turn off check_url
2. If a feed reports it's address without an http:// or https:// on the front, then add one.
Workaround: get the feed to fix their url
Item two seems like it is probably a variance from the standard and therefore something we might legitimately /not/ want to do. The link provided is no longer available so I can't really confirm. The questions are: is this a common variance? and does the fix cause problems that make it less desirable than the problem?
Are there any drawbacks to these proposals?
Comment #5
Zen CreditAttribution: Zen commentedDowngrading priority. I recall reading that this was fixed. Can someone please confirm this with the latest beta or HEAD? I am not familiar with the aggregator module myself.
-K
Comment #6
mfbIn current CVS head, an item link syndicated in the RSS file as literally
http://www.indybay.org/archives/archive_by_id.php?id=4181&category_id=12
is rendered by drupal as a link tohttp://www.indybay.org/archives/archive_by_id.php?id=4181&category_id=12
.URLs in RSS and other XML formats are already HTML-encoded, because valid XML requires
&
be encoded as&
. All content of XML feeds, including item links, must be HTML-decoded before being re-HTML-encoded by drupal.Comment #7
mfbMarking as critical as it should be fixed before 4.7 release.
Comment #8
mfbThis patch stores the item link URLs in HTML-decoded format, just as feed URLs are stored. This *should* be ok as links are run thru check_url() before being rendered in HTML format.
Comment #9
mfbOK Here is a more proper patch.. we remove filtering from the xml parsing, so unfiltered content is input into the database.
filtering is moved into the theme_ functions which render content. Also added some filtering in a couple places where it didn't already exist.
Comment #10
mfbsame patch, updated for HEAD
Comment #11
Dries CreditAttribution: Dries commentedI like the approach taken by this patch; i.e. moving the filtering to 'on output'. If Steven could double check the implementation of aggregator_filter_xss(), that would be superb.
Comment #12
Dries CreditAttribution: Dries commentedRelated patch: http://drupal.org/node/49673.
I like this approach better though so I'm inclined to mark that other patch as 'duplicate'.
Rather than using filter_xss() you might want to use check_markup(). Double-check with the documentation.
Comment #13
Steven CreditAttribution: Steven commentedcheck_markup() does not make sense, because there is no guarantee the admin has set up an HTML-format for use. They might well be using Textile or BBCode or some other custom filter.
However, the patch is bad. filter_xss() tags an array of HTML tag names (i.e. "a", "strong", "div") and not HTML tags (i.e. "", "", ...). It is also a bad idea to require spaces between each tag, as they are easily forgotten by the user. See filter.module for a better way to parse the tag list.
Finally, we need to look at the performance. Moving filtering to output is in principle good (because it allows changes in filtering settings to affect already aggregated content), but it means the aggregator output pages become a lot more complex.
Comment #14
mfbOK I fixed the patch with regards to aggregator_filter_xss();
Comment #15
Dries CreditAttribution: Dries commentedI leave this for Steven to review (and approve/commit).
Comment #16
Steven CreditAttribution: Steven commentedHold the phone... the calls to decode_entities() are unnecessary as far as I can tell. The XML parser in PHP already decodes entities when it returns character data and attributes from XML.
For example, point your browser at an xmlparse.php file in your Drupal dir, containing:
@ & <';
$xml_parser = drupal_xml_parser_create($data);
xml_set_element_handler($xml_parser, 'dump', 'dump');
xml_set_character_data_handler($xml_parser, 'dump');
function dump() {
print_r(func_get_args());
}
if (!xml_parse($xml_parser, $data, 1)) {
print 'error';
}
?>
Do 'view source' on the result: you will see a literal at-sign, ampersand and angle bracket, without entities. Calling decode_entities() on that again would mean that double escaped entities are decoded as well. This could mess up any aggregated posts that talk about XML or HTML.
Comment #17
mfbI removed decode_entities(), tested and works on php4 and php5. I was fooled by aggregator.module already using the function!
I did a few rounds of unscientific testing,
ab -kc 10 -t 30 http://127.0.0.1/drupalhead/aggregator
and filtering caused somewhere between 0% and 1% slowdown.
Comment #18
Dries CreditAttribution: Dries commentedCommitted to HEAD. Thanks.
Comment #19
(not verified) CreditAttribution: commented