Aggregator should strip tags before using beginning of description tag as title [#404356]

In cases where the element of a feed is empty, Aggregator will try to generate a title using a portion of the first 40 characters:

$item['TITLE'] = preg_replace('/^(.*)[^\w;&].*?$/', "\\1", truncate_utf8($item['DESCRIPTION']), 40);

Unfortunately, the above often allows html tags and other garbage into the feed item's title.

This patch replaces the above with a simpler method that tested fine for me.

1. Run the description through strip_tags() to remove all html tags from it.
2. Run the tag-free description through truncate_utf8 with $wordsafe true, to encourage the truncation to be at a word boundary if possible.

$stripped = strip_tags(truncate_utf8($item['DESCRIPTION'], 500));
      $item['TITLE'] = truncate_utf8($stripped, 40, true);

I originally tried running strip_tags() *after* truncating the data to 40 characters for efficiency, but this failed in instances where the initial tag was a long image tag or similar where it already took more than 40 characters, and therefore made up the entire truncated text.

However, stripping the entire description tag is impractical when the feed item is extremely long - strip tags is essentially a regex. So I have tried to compromise by truncating to 500 characters before stripping tags, then re-truncating down to 40 characters for the title.

This should generate useful titles in most instances except when the first 500 characters are entirely HTML tags - quite possible, for instance, when a user posts a bunch of pictures in a blog post prior to writing anything. In that case, $item['TITLE'] would end up blank. Perhaps it may be necessary to increase the initial truncation limit to 1000 characters.

Comment	File	Size	Author
#17	404356.patch	5.27 KB	lambic
#14	404356.patch	4.54 KB	lambic
#12	404356.patch	4.54 KB	lambic
#9	aggregator-strip-tags-from-title-if-description-rev3.patch	4.69 KB	brianV
#5	aggregator-strip-tags-from-title-if-description-rev2.patch	1.55 KB	brianV
	aggregator-strip-tags-from-title-if-description.patch	1.1 KB	brianV

Comments

Comment #1

brianV commented 17 March 2009 at 04:40

Status:

Active

» Needs review

Comment #2

brianV commented 17 March 2009 at 05:02

Just to clarify,

In terms of getting good results, stripping the entire description prior to truncation gives the maximum probability that there will be text left over to use as a title. ie., we don't have to worry about the first X many characters being completely composed of HTML tags.

In terms of sheer performance, pre-truncating the data prior to running strip tags yields the best performance. The smaller the data set we run strip_tags over, the better the performance.

I think in most cases, stripping the entire description prior to truncation is acceptable for most users; however, there are boundary cases where that just wouldn't be acceptable.

Boundary cases

User is using wordpress as a photoblog of sorts. He adds 20-50 pictures with full alt and title attributes before writing any text. The first several thousand characters are all parts of tags. In this case, we would want to strip the entire article prior to truncating to have a good chance of getting a usable result.
User's feed has 1000 items on it, and each one is written like an encyclopedia article with lots of emphasis and strong tags sprinkled through it. A good title could be generated from the first 100 characters, and running strip tags over 1000 feed items like this would be a very inefficient use of resources.

This is a tradeoff between performance and increasing the likelihood of good results. I would like to see some discussion from other users on this...

Comment #3

brianV commented 17 March 2009 at 05:15

Here is a good feed to use to test this functionality:

http://alexnoriegasketchblog.blogspot.com/rss.xml

There are many empty title tags in the fields, so the titles are generated from the description for a lot of the feed items.
Many of the descriptions start with html tags.

The old handling results in a lot of the feed item titles containing parts of various html tags. With this patch, titles are very much improved.

Comment #4

alex_b commented 17 March 2009 at 11:25

Status:

Needs review

» Needs work

Nice work.

I think the boundary cases are fine. I'd love to see the clarification of #2 in short form as a comment in the code - otherwise the double truncation looks confusing in the first place.

Comment #5

brianV commented 17 March 2009 at 12:29

Status:

Needs work

» Needs review

Status	File	Size
new	aggregator-strip-tags-from-title-if-description-rev2.patch	1.55 KB

New revision added with text 'Description is truncated to 500 characters prior to stripping tags to ensure performance with extremely long feed items.' included in comments for that code section.

Comment #6

alex_b commented 17 March 2009 at 20:22

Status:

Needs review

» Reviewed & tested by the community

Great... RTBC from my point of view.

Comment #7

alex_b commented 26 March 2009 at 04:01

For the record, there has been an early issue for this bug #19573: Aggregator: Corrupted title based on item description with HTML

Comment #8

webchick

she/they

English

Vancouver 🇨🇦

commented 30 March 2009 at 04:59

Status:

Reviewed & tested by the community

» Needs work

Hm. Sorry, but I think we need to add a test or two to ensure this stays working as expected.

Comment #9

brianV commented 22 April 2009 at 16:38

Status	File	Size
new	aggregator-strip-tags-from-title-if-description-rev3.patch	4.69 KB

I've started putting together a test case, although I don't have it working right yet.

If someone else wants to finish it off - go right ahead. I don't think I'll be able to finish it off for a while.

New patch is previous revision, plus the test-case as it stands now. Leaving at needs work since the testcase fails.

Comment #10

brianV commented 16 September 2009 at 14:26

Issue tags:

+Needs tests

Just tagging as needs tests. The feature is RTBC quality, it just needs a test, and I am more or less incompetent with tests.

Comment #11

lambic commented 19 February 2010 at 21:40

I have a working test for this, will upload when I find out why my uploading is broken.

Comment #12

lambic commented 21 February 2010 at 03:23

Status:

Needs work

» Needs review

Status	File	Size
new	404356.patch	4.54 KB

working test attached

Comment #13

21 February 2010 at 03:40

Status:

Needs review

» Needs work

The last submitted patch, 404356.patch, failed testing.

Comment #14

lambic commented 21 February 2010 at 04:02

Status	File	Size
new	404356.patch	4.54 KB

ok, let's try that again

Comment #15

lambic commented 21 February 2010 at 04:02

Status:

Needs work

» Needs review

Comment #16

21 February 2010 at 04:30

Status:

Needs review

» Needs work

The last submitted patch, 404356.patch, failed testing.

Comment #17

lambic commented 21 February 2010 at 19:21

Status:

Needs work

» Needs review

Status	File	Size
new	404356.patch	5.27 KB

helps if I upload the right patch file

Comment #18

lambic commented 21 February 2010 at 20:51

Issue tags:

+D7csmtl

tagging for Montreal code sprint

Comment #19

brianV commented 24 February 2010 at 14:55

lambic, that test looks perfect to me. However, since I am fairly involved in this issue, I'll wait for someone else to RTBC it.

Comment #20

jhodgdon

she/her

English

commented 2 April 2010 at 16:33

Status:

Needs review

» Needs work

Check
#200185: truncate_utf8() is used as a substring function
http://drupal.org/node/269911#comment-2795478

Basically, you cannot assume that spaces are word boundaries for CJK languages, and truncate_utf8() should not be used to make the substring.

Also, just as a coding style thing, use TRUE not true.

Comment #21

lambic commented 2 April 2010 at 19:38

I guess someone else will have to come up with a test for that case, I don't know enough about CJK languages to feel confident about it.

Comment #22

jhodgdon

she/her

English

commented 11 April 2010 at 14:21

Comment #23

jhodgdon

she/her

English

commented 10 June 2010 at 15:26

Bump.

#768040 is on the verge of being committed (I hope), and in any case, is addressing the question of how to split both Latin and non-Latin languages. So it is fine to use trucate_utf8() to do word-safe truncation.

However, this patch is not asking for word-safe truncation, so it could split in the middle of an HTML tag. That's not good. It can probably be rewritten to do the right thing, but I don't think it should be committed as is.

Comment #24

brianV commented 23 June 2010 at 00:10

To address jhodgon's concerns about wordsafe-ness, this is handled by use trimming to 500 characters, then stripping tags, then trimming to approximately 40 characters.

This is done like this to strike a balance between edge cases where the first few hundred characters are parts of long HTML tags (in which case we want to strip tags from as large a sample as possible in order to get usable text), and performance issues (the longer a chunk we send to strip_tags(), the worse the performance is).

500 characters is a good tradeoff between these two factors. While there will still be the very occasional edge case where the first 500 characters are all within HTML tags, I think at that length they will be rare enough that it's not worth the extra processing power to send a large chunk to strip_tags().

Comment #25

jhodgdon

she/her

English

commented 23 June 2010 at 15:46

Really? Why risk the occasional problem? Isn't this just happening occasionally during cron, and once per item that Aggregator is aggregating? How much more time are we talking about?

Also, just to point out, if it hasn't been before, the code needs a standards check. For instance, we use TRUE, not true.

Comment #26

jody lynn

she/her

English

commented 5 January 2011 at 21:20

Version:

7.x-dev

» 8.x-dev

Comment #27

jhedstrom

English

Portland, OR

commented 7 March 2016 at 16:57

Status:	Needs work	» Postponed (maintainer needs more info)
Issue tags:		+Needs issue summary update

Is this still relevant with the move to Zend for most of the feeds heavy lifting?

Comment #28

7 March 2016 at 16:57

Version:

8.0.x-dev

» 8.1.x-dev

Drupal 8.0.6 was released on April 6 and is the final bugfix release for the Drupal 8.0.x series. Drupal 8.0.x will not receive any further development aside from security fixes. Drupal 8.1.0-rc1 is now available and sites should prepare to update to 8.1.0.

Bug reports should be targeted against the 8.1.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #29

7 March 2016 at 16:57

Version:

8.1.x-dev

» 8.2.x-dev

Drupal 8.1.9 was released on September 7 and is the final bugfix release for the Drupal 8.1.x series. Drupal 8.1.x will not receive any further development aside from security fixes. Drupal 8.2.0-rc1 is now available and sites should prepare to upgrade to 8.2.0.

Bug reports should be targeted against the 8.2.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #30

7 March 2016 at 16:57

Version:

8.2.x-dev

» 8.3.x-dev

Drupal 8.2.6 was released on February 1, 2017 and is the final full bugfix release for the Drupal 8.2.x series. Drupal 8.2.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.3.0 on April 5, 2017. (Drupal 8.3.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.3.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #31

7 March 2016 at 16:57

Version:

8.3.x-dev

» 8.4.x-dev

Drupal 8.3.6 was released on August 2, 2017 and is the final full bugfix release for the Drupal 8.3.x series. Drupal 8.3.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.4.0 on October 4, 2017. (Drupal 8.4.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.4.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #32

7 March 2016 at 16:57

Version:

8.4.x-dev

» 8.5.x-dev

Drupal 8.4.4 was released on January 3, 2018 and is the final full bugfix release for the Drupal 8.4.x series. Drupal 8.4.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.5.0 on March 7, 2018. (Drupal 8.5.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.5.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #33

7 March 2016 at 16:57

Version:

8.5.x-dev

» 8.6.x-dev

Drupal 8.5.6 was released on August 1, 2018 and is the final bugfix release for the Drupal 8.5.x series. Drupal 8.5.x will not receive any further development aside from security fixes. Sites should prepare to update to 8.6.0 on September 5, 2018. (Drupal 8.6.0-rc1 is available for testing.)

Bug reports should be targeted against the 8.6.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #34

7 March 2016 at 16:57

Version:

8.6.x-dev

» 8.8.x-dev

Drupal 8.6.x will not receive any further development aside from security fixes. Bug reports should be targeted against the 8.8.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.9.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Comment #35

7 March 2016 at 16:57

Version:

8.8.x-dev

» 8.9.x-dev

Drupal 8.8.7 was released on June 3, 2020 and is the final full bugfix release for the Drupal 8.8.x series. Drupal 8.8.x will not receive any further development aside from security fixes. Sites should prepare to update to Drupal 8.9.0 or Drupal 9.0.0 for ongoing support.

Bug reports should be targeted against the 8.9.x-dev branch from now on, and new development or disruptive changes should be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Comment #36

quietone commented 1 August 2021 at 00:46

Status:	Postponed (maintainer needs more info)	» Closed (outdated)
Issue tags:		+Bug Smash Initiative

I think this is outdated. Drupal switched to using third party parsing in #1839468: [Followup] Replace aggregator rss parsing with Zend Feed and in now using Laminas, #3104015: Replace ZendFramework/* dependencies with their Laminas equivalents.

Therefore, closing as outdated.

Thanks!