node_teaser() not breaking properly [#73910]

Unless I'm misunderstanding the intention, I don't think this code is doing what the comments say ...

  // In some cases, no delimiter has been specified (e.g. when posting using
  // the Blogger API). In this case, we try to split at paragraph boundaries.
  // When even the first paragraph is too long, we try to split at the end of
  // the next sentence.
  $breakpoints = array('</p>' => 4, '<br />' => 0, '<br>' => 0, "\n" => 0, '. ' => 1, '! ' => 1, '? ' => 1, '„ÄÇ' => 3, 'ÿü ' => 1);
  foreach ($breakpoints as $point => $charnum) {
    if ($length = strpos($body, $point, $size)) {
      return substr($body, 0, $length + $charnum);
    }
  }

The comments lead me to believe that if the end of the first paragraph occurs after $size chars, that it will attempt to break before the end of that paragraph, correct?

I have a case where there is a <p> </p> surrounding the entire body (with none in between anywhere). In this case, it returns the entire body, no matter how long it is. Am I missing something?

Comment	File	Size	Author
#16	node_teaser_47.patch	2.1 KB	JohnAlbin
#16
#13	node_teaser_2.patch	2.18 KB	JohnAlbin
#13
#10	node_teaser_1.patch	2.47 KB	gerd riesselmann
#10
#8	node_teaser_0.patch	2.38 KB	JohnAlbin
#8
#6	node.module.4-7.node_teaser.patch	1.01 KB	gerd riesselmann
#6
#4	node_teaser.patch	976 bytes	gerd riesselmann
#4

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Comment #1

dman CreditAttribution: dman commented 16 July 2006 at 02:56

According to that block of code...
It will scan for punctuation if an end para is not found.
So your text should contain a '. ' (with trailing space) or a br or a newline etc, to break intelligently.

... however, yeah, looking closer, it seems that it starts scanning after the desired endpoint ... and yes, the first thing it sees is the very last [/p]

Probably not what you wanted, no.

It's making a guess that because you deliberatly wrapped the block in a single P, you never want that paragraph chopped up. This seems like sensible behaviour. More sensible than having all your input in one paragraph, anyway.
If you are clear enough to be using paras at all, you probably should be using them semantically.

.dan.

Comment #2

RayZ CreditAttribution: RayZ commented 17 July 2006 at 12:34

My point was that the comments say ...

  // When even the first paragraph is too long, we try to split at the end of
  // the next sentence.

... but that's not what the code does. The code will never split a long first paragraph.

Btw, I'm trying to use node_teaser() to generate a teaser for a CCK text field and I want to do it after the filtering has been applied so I'm passing field_mytextfield[0]['view'] to node_teaser(), and it appears that field_mytextfield[0]['view'] puts P tags around it, since field_mytextfield[0]['value'] does not have those P tags.

I'm interested in hearing if there is a better approach to my specific case, but I still believe there is a bug in node_teaser(), either in the code or in the comments.

Comment #3

magico CreditAttribution: magico commented 20 September 2006 at 22:59

Priority:

Normal

» Minor

Comment #4

gerd riesselmann CreditAttribution: gerd riesselmann commented 6 December 2006 at 16:12

Version:	x.y.z	» 4.7.4
Priority:	Minor	» Critical
Status:	Active	» Needs review

File	Size
node_teaser.patch	976 bytes

The OP is right: This code creates a teaser that has at least a length of teaser_length chars. But teaser_length is defined as the maximum chars the teaser is allowed to have, so this code is plain wrong.

I marked it as beeing critical, since a correct teaser length is crucial on portals, where space on the front page is limited.

The reason the code fails is in this line:

if ($length = strpos($body, $point, $size)) {

$size = teaser_length, so if teaser_length was 600, this code starts searching for a breakpoint after the 600th char. Instead, of course, we must search from the 600th chars backwards to find what we are looking for:

if ($length = strrpos(truncate_utf8($body, $size), $point)) {

An optimized version (calculating truncate_utf8 only once) is attached as a patch.

Comment #5

gerd riesselmann CreditAttribution: gerd riesselmann commented 6 December 2006 at 19:07

BTW: Does anybody knows what '„ÄÇ' and 'ÿü' mean?

May it be these once had a meaning but got crumbled by some character set conversion? On my system for example they read 'ă€‚' and 'Řź ' - and this was taken fresh out of CVS

Comment #6

gerd riesselmann CreditAttribution: gerd riesselmann commented 6 December 2006 at 22:37

File	Size
node.module.4-7.node_teaser.patch	1.01 KB

Patch above was against HEAD, this patch is for 4.7 branch.

Comment #7

JohnAlbin

he/they/他

English

Taipei, Taiwan

CreditAttribution: JohnAlbin commented 6 December 2006 at 23:55

Version:	4.7.4	» 5.x-dev
Status:	Needs review	» Needs work

Gerd, those are UTF-8 characters: 。؟ And I assume they are end-of-sentence markers in some language.

strrpos($haystack, $needle) can only use single characters as a needle in PHP4. (See http://www.php.net/manual/en/function.strrpos.php). So we'll need a work around for PHP4.

Comment #8

JohnAlbin

he/they/他

English

Taipei, Taiwan

CreditAttribution: JohnAlbin commented 7 December 2006 at 06:00

Status:

Needs work

» Needs review

File	Size
node_teaser_0.patch	2.38 KB

Here's an updated patch for 5.x-dev. It works around the PHP4 strrpos() issue. I believe I accounted for all edge cases, but please double check.

And it also gives all end-of-sentence characters equal standing when the teaser is shorter than the first paragraph; the old code would chop at a period even when chopping at an exclamation mark was better.

(This patch applies successfully to the 4.7.x-dev version as well. "Hunk #1 succeeded at 180 with fuzz 1 (offset 7 lines)." I can supply a 4.7-specific patch, if needed.)

Comment #9

ChrisKennedy CreditAttribution: ChrisKennedy commented 7 December 2006 at 06:33

Priority:

Critical

» Normal

This is a good bug to fix but it isn't critical. See http://drupal.org/node/45111 for more info.

Comment #10

gerd riesselmann CreditAttribution: gerd riesselmann commented 7 December 2006 at 11:14

File	Size
node_teaser_1.patch	2.47 KB

I looked up both the UTF-8-chars: The first one is the ideographic full stop, the second the arabic question mark. However, PHP-Eclipse with UTF-8 encoding only shows "???" for the first and "?? " for the second (that's why I was wondering). On CVS diff, the last one reverts the writing direction (looks funny).

I therefore would suggest to replace them with their hexadcimal representation 0xEF 0xBD 0xA1 and 0xD8 0x9F. This avoids any kind of unwanted interference with local charset. I modified John's patch accordingly (against 4.7 branch).

Additionally: What do you think about implementing all international sentence dividers like specified by Unicode STerm: http://www.sql-und-xml.de/unicode-database/sterm.html

Comment #11

JohnAlbin

he/they/他

English

Taipei, Taiwan

CreditAttribution: JohnAlbin commented 7 December 2006 at 16:25

Gerd, please see http://drupal.org/node/12864 (anyone else have a better link to the UTF-8 in source code standard?). UTF-8 characters are included in much of Drupal's source code; it's the standard. Also, I don't think your patch would work since PHP doesn't do hexadecimal escaping inside single quotes.

Please continue to use my previous patch.

Comment #12

ChrisKennedy CreditAttribution: ChrisKennedy commented 8 December 2006 at 08:54

Here's an earlier discussion of this issue: Teaser can be longer than maximum allowed..

Comment #13

JohnAlbin

he/they/他

English

Taipei, Taiwan

CreditAttribution: JohnAlbin commented 8 December 2006 at 21:07

File	Size
node_teaser_2.patch	2.18 KB

Chris, thanks for pointing out the previous discussion.

The only major issue I see in the previous discussion is that people are worried about invalid html caused by chopping off closing tags (such as </div> or </p>.) However, since that is a problem with the current code as well as with my patch, I think "invalid html in teaser" is a separate issue.

Looking at the code presented by Adam (http://drupal.org/node/28211#comment-52638), I decided to do timing tests on his idea (search for each breakpoint using strpos() and compare the found positions) versus my idea (generate a regex using all breakpoints and use preg_match()).

using strpos() inside a foreach loop: 0.14s
using preg_match() with no loops: 0.24s

So, I rerolled my patch using strpos()

Comment #14

Steven CreditAttribution: Steven commented 12 December 2006 at 08:34

Status:

Needs review

» Fixed

Tested and verified. We should indeed do as the description says.

I did clean up the comments and code a bit. It could be simplified. Committed to HEAD.

Comment #15

RayZ CreditAttribution: RayZ commented 12 December 2006 at 12:29

Version:	5.x-dev	» 4.7.x-dev
Status:	Fixed	» Patch (to be ported)

Could use a backport to 4.7.x

Comment #16

JohnAlbin

he/they/他

English

Taipei, Taiwan

CreditAttribution: JohnAlbin commented 12 December 2006 at 17:12

Status:

Patch (to be ported)

» Needs review

File	Size
node_teaser_47.patch	2.1 KB

Here is Steven's commit patch against 4.7.x-dev.

Steven, one extremely minor nitpick: In your comments, you say We use strpos on the reversed needle and haystack for speed. But it's really because strrpos() is useless/broken in PHP4.

Comment #17

killes@www.drop.org CreditAttribution: killes@www.drop.org commented 4 January 2007 at 19:50

Status:

Needs review

» Fixed

applied

Comment #18

bjaspan CreditAttribution: bjaspan commented 9 January 2007 at 17:26

IMHO, this is a substantial, possibly site-breaking, change. I agree that the previous method for computing teasers was not necessarily what was meant or expected, but the new method will result in incorrect behavior for sites that have optimized themselves to the old behavior.

For example, one client of mine wants one-sentence teasers, so we set the teaser length to 0. The old code found the first sentence/paragraph/whatever break after position 0, as desired. The new code (as I understand it) will result in empty teasers.

This deserves documentation.