When an RSS item has no title node, the aggregator module uses the description to provide one. It does this by taking the first 40 characters of the description and truncating that at a whole-word boundary.
BUG:
However, if the description is shorter than 40 characters, the code will always strip the last word from the title! This causes feeds like the audioscrobbler feed to display incomplete titles.
FIX:
The code responsible is in aggregator.module, in the function aggregator_parse_feed. The line
$title = preg_replace('/^(.*)[^\w;&].*?$/', "\\1", truncate_utf8($item['DESCRIPTION'], 40));
should be replaced with
$title = preg_replace('/^(.*)[^\w;&].*?$/', "\\1", truncate_utf8($item['DESCRIPTION'] . ' ', 40));
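To make the difference visible outside Drupal, here is a minimal standalone sketch. It assumes plain ASCII input, and `demo_truncate()` and `demo_title()` are hypothetical stand-ins written for this illustration (the real code calls `truncate_utf8()`). The sample description is made up.

```php
<?php
// Hypothetical stand-in for truncate_utf8(): byte-based, so ASCII only.
function demo_truncate($string, $len) {
  return strlen($string) <= $len ? $string : substr($string, 0, $len);
}

// Applies the title regex quoted above: it drops everything from the last
// non-word character (other than ';' and '&') to the end of the string.
function demo_title($description, $append_space) {
  if ($append_space) {
    $description .= ' ';
  }
  return preg_replace('/^(.*)[^\w;&].*?$/', "\\1", demo_truncate($description, 40));
}

// A made-up description that is well under 40 characters.
$short = 'Radiohead - Karma Police';

echo demo_title($short, FALSE), "\n"; // Radiohead - Karma
echo demo_title($short, TRUE), "\n";  // Radiohead - Karma Police
```

Without the appended space the regex treats "Police" as a fragment and removes it. With the space, only the trailing space is removed.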
| Comment | File | Size | Author |
|---|---|---|---|
| #3 | aggregator_truncate.patch | 719 bytes | cburschka |
Comments
Comment #1
magico commented
Comment #2
magico commented
Let's suppose that the description is
"one three five seven nine eleven thirteen fifteen"
When we truncate to 40, the result would be
"one three five seven nine eleven thirtee"
but the title actually produced is
"one three five seven nine eleven"
I think that this is the expected behaviour, to avoid truncated titles with partial words.
I'm just confirming this.
Comment #3
cburschka commented
That's intended, but it's not what the reporter was complaining about as far as I understand it.
The issue was about shorter descriptions also losing their last word, even though they're already under 40 characters. This is because the regex simply assumes that the last word is a fragment.
The proposed change appends a space to the description. If the description is longer than 40, this makes no difference; if it is shorter it will cause the regex to avoid removing the last word.
The proposed change still misses one case: a description that is exactly 40 characters long (or one where the 40th character is the last letter of a complete word) still loses its last word. Increasing the arbitrary truncation length to 41 sounds like it only shifts the problem by one character, but it actually makes the function behave as documented: if it is documented as "using the first 40 characters including only the last whole word", then it needs to truncate to 41 (not 40) and then remove the last word in the string.
Example:
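A minimal standalone sketch of the 40 versus 41 difference, assuming ASCII input and assuming the space is still appended before truncating (that is one reading of the suggestion above, not necessarily what the attached patch does). `sketch_truncate()` and `sketch_title()` are hypothetical helpers written for this illustration.

```php
<?php
// Hypothetical stand-in for truncate_utf8(): byte-based, so ASCII only.
function sketch_truncate($string, $len) {
  return strlen($string) <= $len ? $string : substr($string, 0, $len);
}

// Appends a space, truncates to $limit, then applies the title regex.
function sketch_title($description, $limit) {
  return preg_replace('/^(.*)[^\w;&].*?$/', "\\1", sketch_truncate($description . ' ', $limit));
}

// Exactly 40 characters, ending on a complete word.
$exact = 'one three five seven nine eleven fifteen';

echo sketch_title($exact, 40), "\n"; // one three five seven nine eleven
echo sketch_title($exact, 41), "\n"; // one three five seven nine eleven fifteen

// A longer description whose word at the cut would need 41 characters.
$long = 'one three five seven nine eleven thirteen fifteen';

echo sketch_title($long, 41), "\n";  // one three five seven nine eleven
```

With a limit of 40 the appended space is cut off again, so the complete last word is still removed. With 41 the space survives and only the space is stripped, while a word that would push the title past 40 characters is still dropped.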
Patch is rolled for 5.x-dev, but succeeds on head as well.
Comment #4
cburschka commented
[...]
Comment #5
cburschka commented
Wow, I own the oldest patch in the queue! Let's see if this still happens in D7.
Comment #6
jhodgdon commented
Just as a note, CJK languages do not use spaces as word boundaries. Also, this should probably use drupal_substr() instead of truncate_utf8(). See
#200185: truncate_utf8() is used as a substring function
Comment #7
jhodgdon commented
See also
#768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug)
Comment #8
jhodgdon commented
The fix to #768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug) should fix this issue without the above patch, so I am marking this issue as a duplicate (feel free to reopen if I am incorrect). The problem was the same bug in drupal_substr() that was identified in that issue.
Also, the comment in #6 is now wrong. truncate_utf8() is the correct function to use.