In cases where the element of a feed is empty, Aggregator will try to generate a title using a portion of the first 40 characters:
$item['TITLE'] = preg_replace('/^(.*)[^\w;&].*?$/', "\\1", truncate_utf8($item['DESCRIPTION']), 40);
Unfortunately, the above often allows html tags and other garbage into the feed item's title.
This patch replaces the above with a simpler method that tested fine for me.
1. Run the description through strip_tags() to remove all html tags from it.
2. Run the tag-free description through truncate_utf8 with $wordsafe true, to encourage the truncation to be at a word boundary if possible.
$stripped = strip_tags(truncate_utf8($item['DESCRIPTION'], 500));
$item['TITLE'] = truncate_utf8($stripped, 40, true);
I originally tried running strip_tags() *after* truncating the data to 40 characters for efficiency, but this failed in instances where the initial tag was a long image tag or similar where it already took more than 40 characters, and therefore made up the entire truncated text.
However, stripping the entire description tag is impractical when the feed item is extremely long - strip tags is essentially a regex. So I have tried to compromise by truncating to 500 characters before stripping tags, then re-truncating down to 40 characters for the title.
This should generate useful titles in most instances except when the first 500 characters are entirely HTML tags - quite possible, for instance, when a user posts a bunch of pictures in a blog post prior to writing anything. In that case, $item['TITLE'] would end up blank. Perhaps it may be necessary to increase the initial truncation limit to 1000 characters.
Comment | File | Size | Author |
---|---|---|---|
#17 | 404356.patch | 5.27 KB | lambic |
#14 | 404356.patch | 4.54 KB | lambic |
#12 | 404356.patch | 4.54 KB | lambic |
#9 | aggregator-strip-tags-from-title-if-description-rev3.patch | 4.69 KB | brianV |
#5 | aggregator-strip-tags-from-title-if-description-rev2.patch | 1.55 KB | brianV |
Comments
Comment #1
brianV CreditAttribution: brianV commentedComment #2
brianV CreditAttribution: brianV commentedJust to clarify,
In terms of getting good results, stripping the entire description prior to truncation gives the maximum probability that there will be text left over to use as a title. ie., we don't have to worry about the first X many characters being completely composed of HTML tags.
In terms of sheer performance, pre-truncating the data prior to running strip tags yields the best performance. The smaller the data set we run strip_tags over, the better the performance.
I think in most cases, stripping the entire description prior to truncation is acceptable for most users; however, there are boundary cases where that just wouldn't be acceptable.
Boundary cases
This is a tradeoff between performance and increasing the likelihood of good results. I would like to see some discussion from other users on this...
Comment #3
brianV CreditAttribution: brianV commentedHere is a good feed to use to test this functionality:
http://alexnoriegasketchblog.blogspot.com/rss.xml
The old handling results in a lot of the feed item titles containing parts of various html tags. With this patch, titles are very much improved.
Comment #4
alex_b CreditAttribution: alex_b commentedNice work.
I think the boundary cases are fine. I'd love to see the clarification of #2 in short form as a comment in the code - otherwise the double truncation looks confusing in the first place.
Comment #5
brianV CreditAttribution: brianV commentedNew revision added with text
'Description is truncated to 500 characters prior to stripping tags to ensure performance with extremely long feed items.'
included in comments for that code section.Comment #6
alex_b CreditAttribution: alex_b commentedGreat... RTBC from my point of view.
Comment #7
alex_b CreditAttribution: alex_b commentedFor the record, there has been an early issue for this bug #19573: Aggregator: Corrupted title based on item description with HTML
Comment #8
webchickHm. Sorry, but I think we need to add a test or two to ensure this stays working as expected.
Comment #9
brianV CreditAttribution: brianV commentedI've started putting together a test case, although I don't have it working right yet.
If someone else wants to finish it off - go right ahead. I don't think I'll be able to finish it off for a while.
New patch is previous revision, plus the test-case as it stands now. Leaving at needs work since the testcase fails.
Comment #10
brianV CreditAttribution: brianV commentedJust tagging as needs tests. The feature is RTBC quality, it just needs a test, and I am more or less incompetent with tests.
Comment #11
lambic CreditAttribution: lambic commentedI have a working test for this, will upload when I find out why my uploading is broken.
Comment #12
lambic CreditAttribution: lambic commentedworking test attached
Comment #14
lambic CreditAttribution: lambic commentedok, let's try that again
Comment #15
lambic CreditAttribution: lambic commentedComment #17
lambic CreditAttribution: lambic commentedhelps if I upload the right patch file
Comment #18
lambic CreditAttribution: lambic commentedtagging for Montreal code sprint
Comment #19
brianV CreditAttribution: brianV commentedlambic, that test looks perfect to me. However, since I am fairly involved in this issue, I'll wait for someone else to RTBC it.
Comment #20
jhodgdonCheck
#200185: truncate_utf8() is used as a substring function
http://drupal.org/node/269911#comment-2795478
Basically, you cannot assume that spaces are word boundaries for CJK languages, and truncate_utf8() should not be used to make the substring.
Also, just as a coding style thing, use TRUE not true.
Comment #21
lambic CreditAttribution: lambic commentedI guess someone else will have to come up with a test for that case, I don't know enough about CJK languages to feel confident about it.
Comment #22
jhodgdonSee also
#768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug)
Comment #23
jhodgdonBump.
#768040 is on the verge of being committed (I hope), and in any case, is addressing the question of how to split both Latin and non-Latin languages. So it is fine to use trucate_utf8() to do word-safe truncation.
However, this patch is not asking for word-safe truncation, so it could split in the middle of an HTML tag. That's not good. It can probably be rewritten to do the right thing, but I don't think it should be committed as is.
Comment #24
brianV CreditAttribution: brianV commentedTo address jhodgon's concerns about wordsafe-ness, this is handled by use trimming to 500 characters, then stripping tags, then trimming to approximately 40 characters.
This is done like this to strike a balance between edge cases where the first few hundred characters are parts of long HTML tags (in which case we want to strip tags from as large a sample as possible in order to get usable text), and performance issues (the longer a chunk we send to strip_tags(), the worse the performance is).
500 characters is a good tradeoff between these two factors. While there will still be the very occasional edge case where the first 500 characters are all within HTML tags, I think at that length they will be rare enough that it's not worth the extra processing power to send a large chunk to strip_tags().
Comment #25
jhodgdonReally? Why risk the occasional problem? Isn't this just happening occasionally during cron, and once per item that Aggregator is aggregating? How much more time are we talking about?
Also, just to point out, if it hasn't been before, the code needs a standards check. For instance, we use TRUE, not true.
Comment #26
Jody LynnComment #27
jhedstromIs this still relevant with the move to Zend for most of the feeds heavy lifting?
Comment #36
quietone CreditAttribution: quietone as a volunteer commentedI think this is outdated. Drupal switched to using third party parsing in #1839468: [Followup] Replace aggregator rss parsing with Zend Feed and in now using Laminas, #3104015: Replace ZendFramework/* dependencies with their Laminas equivalents.
Therefore, closing as outdated.
Thanks!