In cases where the element of a feed is empty, Aggregator will try to generate a title using a portion of the first 40 characters:

$item['TITLE'] = preg_replace('/^(.*)[^\w;&].*?$/', "\\1", truncate_utf8($item['DESCRIPTION']), 40);

Unfortunately, the above often allows html tags and other garbage into the feed item's title.

This patch replaces the above with a simpler method that tested fine for me.

1. Run the description through strip_tags() to remove all html tags from it.
2. Run the tag-free description through truncate_utf8 with $wordsafe true, to encourage the truncation to be at a word boundary if possible.

$stripped = strip_tags(truncate_utf8($item['DESCRIPTION'], 500));
      $item['TITLE'] = truncate_utf8($stripped, 40, true);

I originally tried running strip_tags() *after* truncating the data to 40 characters for efficiency, but this failed in instances where the initial tag was a long image tag or similar where it already took more than 40 characters, and therefore made up the entire truncated text.

However, stripping the entire description tag is impractical when the feed item is extremely long - strip tags is essentially a regex. So I have tried to compromise by truncating to 500 characters before stripping tags, then re-truncating down to 40 characters for the title.

This should generate useful titles in most instances except when the first 500 characters are entirely HTML tags - quite possible, for instance, when a user posts a bunch of pictures in a blog post prior to writing anything. In that case, $item['TITLE'] would end up blank. Perhaps it may be necessary to increase the initial truncation limit to 1000 characters.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

brianV’s picture

Status: Active » Needs review
brianV’s picture

Just to clarify,

In terms of getting good results, stripping the entire description prior to truncation gives the maximum probability that there will be text left over to use as a title. ie., we don't have to worry about the first X many characters being completely composed of HTML tags.

In terms of sheer performance, pre-truncating the data prior to running strip tags yields the best performance. The smaller the data set we run strip_tags over, the better the performance.

I think in most cases, stripping the entire description prior to truncation is acceptable for most users; however, there are boundary cases where that just wouldn't be acceptable.

Boundary cases

  1. User is using wordpress as a photoblog of sorts. He adds 20-50 pictures with full alt and title attributes before writing any text. The first several thousand characters are all parts of tags. In this case, we would want to strip the entire article prior to truncating to have a good chance of getting a usable result.
  2. User's feed has 1000 items on it, and each one is written like an encyclopedia article with lots of emphasis and strong tags sprinkled through it. A good title could be generated from the first 100 characters, and running strip tags over 1000 feed items like this would be a very inefficient use of resources.

This is a tradeoff between performance and increasing the likelihood of good results. I would like to see some discussion from other users on this...

brianV’s picture

Here is a good feed to use to test this functionality:

http://alexnoriegasketchblog.blogspot.com/rss.xml

  • There are many empty title tags in the fields, so the titles are generated from the description for a lot of the feed items.
  • Many of the descriptions start with html tags.

The old handling results in a lot of the feed item titles containing parts of various html tags. With this patch, titles are very much improved.

alex_b’s picture

Status: Needs review » Needs work

Nice work.

I think the boundary cases are fine. I'd love to see the clarification of #2 in short form as a comment in the code - otherwise the double truncation looks confusing in the first place.

brianV’s picture

Status: Needs work » Needs review
FileSize
1.55 KB

New revision added with text 'Description is truncated to 500 characters prior to stripping tags to ensure performance with extremely long feed items.' included in comments for that code section.

alex_b’s picture

Status: Needs review » Reviewed & tested by the community

Great... RTBC from my point of view.

alex_b’s picture

For the record, there has been an early issue for this bug #19573: Aggregator: Corrupted title based on item description with HTML

webchick’s picture

Status: Reviewed & tested by the community » Needs work

Hm. Sorry, but I think we need to add a test or two to ensure this stays working as expected.

brianV’s picture

I've started putting together a test case, although I don't have it working right yet.

If someone else wants to finish it off - go right ahead. I don't think I'll be able to finish it off for a while.

New patch is previous revision, plus the test-case as it stands now. Leaving at needs work since the testcase fails.

brianV’s picture

Issue tags: +Needs tests

Just tagging as needs tests. The feature is RTBC quality, it just needs a test, and I am more or less incompetent with tests.

lambic’s picture

I have a working test for this, will upload when I find out why my uploading is broken.

lambic’s picture

Status: Needs work » Needs review
FileSize
4.54 KB

working test attached

Status: Needs review » Needs work

The last submitted patch, 404356.patch, failed testing.

lambic’s picture

FileSize
4.54 KB

ok, let's try that again

lambic’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work

The last submitted patch, 404356.patch, failed testing.

lambic’s picture

Status: Needs work » Needs review
FileSize
5.27 KB

helps if I upload the right patch file

lambic’s picture

Issue tags: +D7csmtl

tagging for Montreal code sprint

brianV’s picture

lambic, that test looks perfect to me. However, since I am fairly involved in this issue, I'll wait for someone else to RTBC it.

jhodgdon’s picture

Status: Needs review » Needs work

Check
#200185: truncate_utf8() is used as a substring function
http://drupal.org/node/269911#comment-2795478

Basically, you cannot assume that spaces are word boundaries for CJK languages, and truncate_utf8() should not be used to make the substring.

Also, just as a coding style thing, use TRUE not true.

lambic’s picture

I guess someone else will have to come up with a test for that case, I don't know enough about CJK languages to feel confident about it.

jhodgdon’s picture

jhodgdon’s picture

Bump.

#768040 is on the verge of being committed (I hope), and in any case, is addressing the question of how to split both Latin and non-Latin languages. So it is fine to use trucate_utf8() to do word-safe truncation.

However, this patch is not asking for word-safe truncation, so it could split in the middle of an HTML tag. That's not good. It can probably be rewritten to do the right thing, but I don't think it should be committed as is.

brianV’s picture

To address jhodgon's concerns about wordsafe-ness, this is handled by use trimming to 500 characters, then stripping tags, then trimming to approximately 40 characters.

This is done like this to strike a balance between edge cases where the first few hundred characters are parts of long HTML tags (in which case we want to strip tags from as large a sample as possible in order to get usable text), and performance issues (the longer a chunk we send to strip_tags(), the worse the performance is).

500 characters is a good tradeoff between these two factors. While there will still be the very occasional edge case where the first 500 characters are all within HTML tags, I think at that length they will be rare enough that it's not worth the extra processing power to send a large chunk to strip_tags().

jhodgdon’s picture

Really? Why risk the occasional problem? Isn't this just happening occasionally during cron, and once per item that Aggregator is aggregating? How much more time are we talking about?

Also, just to point out, if it hasn't been before, the code needs a standards check. For instance, we use TRUE, not true.

Jody Lynn’s picture

Version: 7.x-dev » 8.x-dev
jhedstrom’s picture

Status: Needs work » Postponed (maintainer needs more info)
Issue tags: +Needs issue summary update

Is this still relevant with the move to Zend for most of the feeds heavy lifting?

Version: 8.0.x-dev » 8.1.x-dev

Drupal 8.0.6 was released on April 6 and is the final bugfix release for the Drupal 8.0.x series. Drupal 8.0.x will not receive any further development aside from security fixes. Drupal 8.1.0-rc1 is now available and sites should prepare to update to 8.1.0.

Bug reports should be targeted against the 8.1.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.9 was released on September 7 and is the final bugfix release for the Drupal 8.1.x series. Drupal 8.1.x will not receive any further development aside from security fixes. Drupal 8.2.0-rc1 is now available and sites should prepare to upgrade to 8.2.0.

Bug reports should be targeted against the 8.2.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.6 was released on February 1, 2017 and is the final full bugfix release for the Drupal 8.2.x series. Drupal 8.2.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.3.0 on April 5, 2017. (Drupal 8.3.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.3.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.6 was released on August 2, 2017 and is the final full bugfix release for the Drupal 8.3.x series. Drupal 8.3.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.4.0 on October 4, 2017. (Drupal 8.4.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.4.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.4 was released on January 3, 2018 and is the final full bugfix release for the Drupal 8.4.x series. Drupal 8.4.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.5.0 on March 7, 2018. (Drupal 8.5.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.5.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.6 was released on August 1, 2018 and is the final bugfix release for the Drupal 8.5.x series. Drupal 8.5.x will not receive any further development aside from security fixes. Sites should prepare to update to 8.6.0 on September 5, 2018. (Drupal 8.6.0-rc1 is available for testing.)

Bug reports should be targeted against the 8.6.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.6.x-dev » 8.8.x-dev

Drupal 8.6.x will not receive any further development aside from security fixes. Bug reports should be targeted against the 8.8.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.9.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.7 was released on June 3, 2020 and is the final full bugfix release for the Drupal 8.8.x series. Drupal 8.8.x will not receive any further development aside from security fixes. Sites should prepare to update to Drupal 8.9.0 or Drupal 9.0.0 for ongoing support.

Bug reports should be targeted against the 8.9.x-dev branch from now on, and new development or disruptive changes should be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

quietone’s picture

Status: Postponed (maintainer needs more info) » Closed (outdated)
Issue tags: +Bug Smash Initiative

I think this is outdated. Drupal switched to using third party parsing in #1839468: [Followup] Replace aggregator rss parsing with Zend Feed and in now using Laminas, #3104015: Replace ZendFramework/* dependencies with their Laminas equivalents.

Therefore, closing as outdated.

Thanks!