When an RSS item has no title node, the aggregator module uses the description to provide one. It does this by taking the first 40 characters of the description and truncating that to a whole-word boundary.

BUG:
However, if the description is shorter than 40 characters, the code always strips the last word from the title! This causes feeds like the audioscrobbler feed to display incomplete titles.

FIX:
The code responsible is in aggregator.module, in the function aggregator_parse_feed. The line

      $title = preg_replace('/^(.*)[^\w;&].*?$/', "\\1", truncate_utf8($item['DESCRIPTION'], 40));

should be replaced with

      $title = preg_replace('/^(.*)[^\w;&].*?$/', "\\1", truncate_utf8($item['DESCRIPTION'] . ' ', 40));
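To see the effect outside Drupal, here is a minimal sketch in Python rather than PHP. It reuses the same regular expression; the helper name title_from_description and its pad flag are made up for illustration, and slicing with [:limit] only stands in for truncate_utf8() on plain ASCII input.

```python
import re

# Same pattern as the PHP line; "\\1" in PHP becomes r"\1" here.
LAST_WORD = re.compile(r'^(.*)[^\w;&].*?$')

def title_from_description(description, limit=40, pad=False):
    """Hypothetical helper mirroring the aggregator line.

    text[:limit] is an ASCII-only stand-in for truncate_utf8().
    """
    text = (description + ' ') if pad else description
    # Drop everything from the last non-word character to the end.
    return LAST_WORD.sub(r'\1', text[:limit])

short = "one three five seven nine eleven"      # 32 characters
print(title_from_description(short))            # one three five seven nine
print(title_from_description(short, pad=True))  # one three five seven nine eleven
```

Without the trailing space the regex treats "eleven" as a fragment and drops it; with the space, the only thing removed is the padding.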

Comment   File                        Size        Author
#3        aggregator_truncate.patch   719 bytes   cburschka

Comments

magico

Version: 4.6.3 » 4.6.9
magico

Title: aggregator: Title truncation from description always removes last word » Title truncation from description always removes last word
Version: 4.6.9 » x.y.z
Status: Needs review » Active

Let's suppose that the description is "one three five seven nine eleven thirteen fifteen". Truncating it to 40 characters would give "one three five seven nine eleven thirtee", but the title we actually get is "one three five seven nine eleven".

I think that this is the expected behaviour, to avoid truncated titles with partial words.

I'm just confirming this.

cburschka

Version: x.y.z » 5.x-dev
Status   File                        Size
new      aggregator_truncate.patch   719 bytes

That's intended, but it's not what the reporter was complaining about as far as I understand it.

The issue was about shorter descriptions also losing their last word, even though they're already under 40 characters. This is because the regex simply assumes that the last word is a fragment.

The proposed change appends a space to the description. If the description is longer than 40 characters, the space is cut off again and makes no difference; if it is shorter, the regex removes only the appended space instead of the last word.

The proposed change still misses one case: a description that is exactly 40 characters long (or one where the 40th character is the last letter of a complete word) still loses its last word. Increasing the arbitrary truncation length to 41 sounds like it only shifts the problem by one character, but it actually fixes the behavior of the function so that it works as documented. If it is documented as "using the first 40 characters including only the last whole word", then it needs to truncate to 41 (not 40) and then remove the last word in the string.

Example:

0000000001111111111222222222233333333334 (10x)
1234567890123456789012345678901234567890 ( 1x)

This input gives the same result with and without the change:
one three five seven nine eleven thirteen fifteen (input)
one three five seven nine eleven thirtee (characters)
one three five seven nine eleven (words)

This is shorter than 40 and truncated needlessly unless a space is appended:
one three five seven nine eleven (input)
one three five seven nine eleven (characters, incorrect)
one three five seven nine (words, incorrect)
one three five seven nine eleven_ (characters, correct, _ represents space)
one three five seven nine eleven (words, correct)

This is exactly 40 and truncated needlessly unless space is appended AND string is cut off at 41:
one three five seven nine eleven bananas (input)
one three five seven nine eleven bananas (characters, incorrect)
one three five seven nine eleven (words, incorrect)
one three five seven nine eleven bananas_ (characters, correct)
one three five seven nine eleven bananas (words, correct)
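
The three examples above can be checked with a small Python sketch. This is an illustration of the "append a space and cut at 41" idea, not the attached patch itself; word_safe_title is a hypothetical name, and string slicing stands in for truncate_utf8(), which is only equivalent for plain ASCII input.

```python
import re

def word_safe_title(description, limit=40):
    """Hypothetical variant: pad with a space, cut at limit + 1, drop the tail."""
    # ASCII-only stand-in for truncate_utf8($description . ' ', 41).
    padded = (description + ' ')[:limit + 1]
    # Same regex as the aggregator line: remove the last non-word
    # character and everything after it.
    return re.sub(r'^(.*)[^\w;&].*?$', r'\1', padded)

cases = [
    "one three five seven nine eleven thirteen fifteen",  # 49 characters
    "one three five seven nine eleven",                   # 32 characters
    "one three five seven nine eleven bananas",           # exactly 40 characters
]
for text in cases:
    title = word_safe_title(text)
    print(len(title), title)
```

The first input is still cut back to "eleven" because "thirteen" would need a 41st character, while the 32- and 40-character inputs come through unchanged.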

Patch is rolled for 5.x-dev, but applies to HEAD as well.

cburschka

Assigned: Unassigned » cburschka
Status: Active » Needs review

[...]

cburschka

Version: 5.x-dev » 7.x-dev
Status: Needs review » Needs work

Wow, I own the oldest patch in the queue! Let's see if this still happens in D7.

jhodgdon

Just as a note, CJK languages do not use spaces as word boundaries. Also, this should probably use drupal_substr() instead of truncate_utf8(). See
#200185: truncate_utf8() is used as a substring function

jhodgdon

Status: Needs work » Closed (duplicate)

The fix for #768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug) should also fix this issue without the above patch, so I am marking this issue as a duplicate (feel free to reopen if I am incorrect). The problem was the same bug in drupal_substr() that was identified in that issue.

Also, the comment in #6 is now wrong. truncate_utf8() is the correct function to use.