Title truncation from description always removes last word [#34171]

When an RSS item has no title node, the aggregator module uses the description to provide one. It does this by taking the first 40 characters of the description and truncating that to a whole word boundary.

BUG:
However, if the description is shorter than 40 characters, the code still strips the last word from the title. This causes feeds such as the audioscrobbler feed to display incomplete titles.

FIX:
The code responsible is in aggregator.module, in the function aggregator_parse_feed. The line

      $title = preg_replace('/^(.*)[^\w;&].*?$/', "\\1", truncate_utf8($item['DESCRIPTION'], 40));

should be replaced with

      $title = preg_replace('/^(.*)[^\w;&].*?$/', "\\1", truncate_utf8($item['DESCRIPTION'] . ' ', 40));
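
To see the difference without a Drupal install, here is a minimal sketch. It assumes plain ASCII input, so substr() is used as a hypothetical stand-in for truncate_utf8(); the helper names and the sample description are made up for illustration.

```php
<?php
// Hypothetical stand-in for Drupal's truncate_utf8(); plain substr() is
// only equivalent for ASCII input.
function truncate_stub($string, $len) {
  return substr($string, 0, $len);
}

// Applies the word-boundary regex from aggregator_parse_feed().
// $append_space switches between the original and the proposed line.
function title_from_description($description, $append_space) {
  if ($append_space) {
    $description .= ' ';
  }
  return preg_replace('/^(.*)[^\w;&].*?$/', "\\1", truncate_stub($description, 40));
}

// Invented sample description, well under 40 characters.
$short = 'hello big world';
$before = title_from_description($short, FALSE);  // original line
$after = title_from_description($short, TRUE);    // proposed line

echo $before, "\n";  // hello big
echo $after, "\n";   // hello big world
```

With the original line the regex treats "world" as a cut-off fragment and removes it; with the appended space only the space itself is removed.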

Comment	File	Size	Author
#3	aggregator_truncate.patch	719 bytes	cburschka

Comments

Comment #1

magico commented 19 August 2006 at 15:13

Version:	4.6.3	» 4.6.9

Comment #2

magico commented 24 August 2006 at 15:41

Title:	aggregator: Title truncation from description always removes last word	» Title truncation from description always removes last word
Version:	4.6.9	» x.y.z
Status:	Needs review	» Active

Let's suppose that the description is "one three five seven nine eleven thirteen fifteen". Truncating it to 40 characters gives "one three five seven nine eleven thirtee", but the resulting title is "one three five seven nine eleven".

I think that this is the expected behaviour, to avoid truncated titles with partial words.

I'm just confirming this.

Comment #3

cburschka

they

commented 1 July 2007 at 12:21

Version:	x.y.z	» 5.x-dev

Status	File	Size
new	aggregator_truncate.patch	719 bytes

That's intended, but it's not what the reporter was complaining about as far as I understand it.

The issue was about shorter descriptions also losing their last word, even though they're already under 40 characters. This is because the regex simply assumes that the last word is a fragment.

The proposed change appends a space to the description. If the description is longer than 40 characters, this makes no difference; if it is shorter, the regex removes only the appended space and keeps the last word.

The proposed change still misses one case: a description that is exactly 40 characters long (or one whose 40th character is the last letter of a complete word) still loses its last word. Increasing the arbitrary truncation length to 41 sounds like it only shifts the problem by one character, but it actually makes the function work as documented: if it is documented as "using the first 40 characters including only the last whole word", then it needs to truncate to 41 (not 40) and then remove the last word in the string.

Example:

0000000001111111111222222222233333333334 (10x)
1234567890123456789012345678901234567890 ( 1x)

This input is longer than 40 and gives the same result with or without the change:
one three five seven nine eleven thirteen fifteen (input)
one three five seven nine eleven thirtee (characters)
one three five seven nine eleven (words)

This is shorter than 40 and truncated needlessly unless a space is appended:
one three five seven nine eleven (input)
one three five seven nine eleven (characters, incorrect)
one three five seven nine (words, incorrect)
one three five seven nine eleven_ (characters, correct, _ represents space)
one three five seven nine eleven (words, correct)

This is exactly 40 and truncated needlessly unless space is appended AND string is cut off at 41:
one three five seven nine eleven bananas (input)
one three five seven nine eleven bananas (characters, incorrect)
one three five seven nine eleven (words, incorrect)
one three five seven nine eleven bananas_ (characters, correct)
one three five seven nine eleven bananas (words, correct)
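
The three cases above can be checked with a short standalone sketch. It is not the patch itself: substr() is a hypothetical stand-in for truncate_utf8() (equivalent only for ASCII input), and the helper names and the $pad / $limit parameters are invented for illustration.

```php
<?php
// Hypothetical stand-in for truncate_utf8(); substr() matches it only for ASCII.
function truncate_ascii($string, $len) {
  return substr($string, 0, $len);
}

// Optionally pads with a space, cuts to $limit, then drops the trailing fragment
// using the regex from aggregator_parse_feed().
function make_title($description, $pad, $limit) {
  $text = $pad ? $description . ' ' : $description;
  return preg_replace('/^(.*)[^\w;&].*?$/', "\\1", truncate_ascii($text, $limit));
}

$inputs = array(
  'long' => 'one three five seven nine eleven thirteen fifteen',
  'short' => 'one three five seven nine eleven',
  'exact' => 'one three five seven nine eleven bananas',
);

$results = array();
foreach ($inputs as $name => $description) {
  $results[$name] = array(
    'original' => make_title($description, FALSE, 40),  // current code
    'space_40' => make_title($description, TRUE, 40),   // space appended only
    'space_41' => make_title($description, TRUE, 41),   // space appended, cut at 41
  );
}
print_r($results);
```

In this sketch the long input gives "one three five seven nine eleven" in all three variants, the short input keeps "eleven" as soon as the space is appended, and the exactly-40 input keeps "bananas" only in the space_41 variant.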

Patch is rolled for 5.x-dev, but succeeds on head as well.

Comment #4

cburschka

they

commented 1 July 2007 at 12:22

Assigned:	Unassigned	» cburschka
Status:	Active	» Needs review

[...]

Comment #5

cburschka

they

commented 16 November 2008 at 20:57

Version:	5.x-dev	» 7.x-dev
Status:	Needs review	» Needs work

Wow, I own the oldest patch in the queue! Let's see if this still happens in D7.

Comment #6

jhodgdon

she/her

English

commented 2 April 2010 at 16:27

Just as a note, CJK languages do not use spaces as word boundaries. Also, this should probably use drupal_substr() instead of truncate_utf8(). See
#200185: truncate_utf8() is used as a substring function

Comment #7

jhodgdon

she/her

English

commented 11 April 2010 at 14:22

Comment #8

jhodgdon

she/her

English

commented 10 June 2010 at 15:46

Status:	Needs work	» Closed (duplicate)

The fix for #768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug) should fix this issue without the above patch, so I am marking this issue as a duplicate (feel free to reopen if I am incorrect). The problem was the same bug in drupal_substr() that was identified in that issue.

Also, the comment in #6 is now wrong. truncate_utf8() is the correct function to use.
