Support from Acquia helps fund testing for Drupal Acquia logo

Comments

drubeedoo’s picture

Version: 4.6.5 » 4.7.0-beta2

Noticed problem still exists with 4.7.0-beta2 (as reported on 4.6.5 above).

drubeedoo’s picture

Version: 4.7.0-beta2 » 4.6.5

my bad, sorry... slightly different problem on 4.7.0-beta2? (Can't tell due to HTML filtering here.) This issue may still be valid with 4.6.5.

4.7.0-beta2 aggregator breaks the following link:

http://example.com/?some-article-name&id=115710

to broken link:

http://example.com/?some-article-name&id=115710
mikhailian’s picture

That & appearing in the urls is dues to the call stack of

check_url -> filter_xss_bad_protocol -> check_plain -> htmlspecialchars where htmlspecialchars replaces the ampersand by its entity.

I just removed check_url as I trust my feed providers and do not see it as a major security threat.

greggles’s picture

This seems like two enhancement requests:

1. If a feed has an & in the URL then don't replace it
Workaround: turn off check_url

2. If a feed reports it's address without an http:// or https:// on the front, then add one.
Workaround: get the feed to fix their url

Item two seems like it is probably a variance from the standard and therefore something we might legitimately /not/ want to do. The link provided is no longer available so I can't really confirm. The questions are: is this a common variance? and does the fix cause problems that make it less desirable than the problem?

Are there any drawbacks to these proposals?

Zen’s picture

Priority: Critical » Normal

Downgrading priority. I recall reading that this was fixed. Can someone please confirm this with the latest beta or HEAD? I am not familiar with the aggregator module myself.

-K

mfb’s picture

Version: 4.6.5 » 4.7.0-beta4

In current CVS head, an item link syndicated in the RSS file as literally http://www.indybay.org/archives/archive_by_id.php?id=4181&category_id=12 is rendered by drupal as a link to http://www.indybay.org/archives/archive_by_id.php?id=4181&category_id=12.

URLs in RSS and other XML formats are already HTML-encoded, because valid XML requires & be encoded as &. All content of XML feeds, including item links, must be HTML-decoded before being re-HTML-encoded by drupal.

mfb’s picture

Priority: Normal » Critical

Marking as critical as it should be fixed before 4.7 release.

mfb’s picture

Status: Active » Needs review
FileSize
941 bytes

This patch stores the item link URLs in HTML-decoded format, just as feed URLs are stored. This *should* be ok as links are run thru check_url() before being rendered in HTML format.

mfb’s picture

FileSize
3.5 KB

OK Here is a more proper patch.. we remove filtering from the xml parsing, so unfiltered content is input into the database.
filtering is moved into the theme_ functions which render content. Also added some filtering in a couple places where it didn't already exist.

mfb’s picture

Assigned: Unassigned » mfb
FileSize
3.5 KB

same patch, updated for HEAD

Dries’s picture

I like the approach taken by this patch; i.e. moving the filtering to 'on output'. If Steven could double check the implementation of aggregator_filter_xss(), that would be superb.

Dries’s picture

Related patch: http://drupal.org/node/49673.

I like this approach better though so I'm inclined to mark that other patch as 'duplicate'.

Rather than using filter_xss() you might want to use check_markup(). Double-check with the documentation.

Steven’s picture

Status: Needs review » Needs work

check_markup() does not make sense, because there is no guarantee the admin has set up an HTML-format for use. They might well be using Textile or BBCode or some other custom filter.

However, the patch is bad. filter_xss() tags an array of HTML tag names (i.e. "a", "strong", "div") and not HTML tags (i.e. "", "", ...). It is also a bad idea to require spaces between each tag, as they are easily forgotten by the user. See filter.module for a better way to parse the tag list.

Finally, we need to look at the performance. Moving filtering to output is in principle good (because it allows changes in filtering settings to affect already aggregated content), but it means the aggregator output pages become a lot more complex.

mfb’s picture

Status: Needs work » Needs review
FileSize
3.54 KB

OK I fixed the patch with regards to aggregator_filter_xss();

Dries’s picture

I leave this for Steven to review (and approve/commit).

Steven’s picture

Status: Needs review » Needs work

Hold the phone... the calls to decode_entities() are unnecessary as far as I can tell. The XML parser in PHP already decodes entities when it returns character data and attributes from XML.

For example, point your browser at an xmlparse.php file in your Drupal dir, containing:


include './includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

$data = '<?xml version="1.0" 

@ & <';
$xml_parser = drupal_xml_parser_create($data);
xml_set_element_handler($xml_parser, 'dump', 'dump');
xml_set_character_data_handler($xml_parser, 'dump');

function dump() {
print_r(func_get_args());
}

if (!xml_parse($xml_parser, $data, 1)) {
print 'error';
}

?>

Do 'view source' on the result: you will see a literal at-sign, ampersand and angle bracket, without entities. Calling decode_entities() on that again would mean that double escaped entities are decoded as well. This could mess up any aggregated posts that talk about XML or HTML.

mfb’s picture

Status: Needs work » Needs review
FileSize
3.22 KB

I removed decode_entities(), tested and works on php4 and php5. I was fooled by aggregator.module already using the function!

I did a few rounds of unscientific testing,
ab -kc 10 -t 30 http://127.0.0.1/drupalhead/aggregator
and filtering caused somewhere between 0% and 1% slowdown.

Dries’s picture

Status: Needs review » Fixed

Committed to HEAD. Thanks.

Anonymous’s picture

Status: Fixed » Closed (fixed)