Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
Unless I'm misunderstanding the intention, I don't think this code is doing what the comments say ...
// In some cases, no delimiter has been specified (e.g. when posting using
// the Blogger API). In this case, we try to split at paragraph boundaries.
// When even the first paragraph is too long, we try to split at the end of
// the next sentence.
$breakpoints = array('</p>' => 4, '<br />' => 0, '<br>' => 0, "\n" => 0, '. ' => 1, '! ' => 1, '? ' => 1, '„ÄÇ' => 3, 'ÿü ' => 1);
foreach ($breakpoints as $point => $charnum) {
if ($length = strpos($body, $point, $size)) {
return substr($body, 0, $length + $charnum);
}
}
The comments lead me to believe that if the end of the first paragraph occurs after $size
chars, that it will attempt to break before the end of that paragraph, correct?
I have a case where there is a <p> </p> surrounding the entire body (with none in between anywhere). In this case, it returns the entire body, no matter how long it is. Am I missing something?
Comment | File | Size | Author |
---|---|---|---|
#16 | node_teaser_47.patch | 2.1 KB | JohnAlbin |
#13 | node_teaser_2.patch | 2.18 KB | JohnAlbin |
#10 | node_teaser_1.patch | 2.47 KB | gerd riesselmann |
#8 | node_teaser_0.patch | 2.38 KB | JohnAlbin |
#6 | node.module.4-7.node_teaser.patch | 1.01 KB | gerd riesselmann |
Comments
Comment #1
dman CreditAttribution: dman commentedAccording to that block of code...
It will scan for punctuation if an end para is not found.
So your text should contain a '. ' (with trailing space) or a br or a newline etc, to break intelligently.
... however, yeah, looking closer, it seems that it starts scanning after the desired endpoint ... and yes, the first thing it sees is the very last [/p]
Probably not what you wanted, no.
It's making a guess that because you deliberatly wrapped the block in a single P, you never want that paragraph chopped up. This seems like sensible behaviour. More sensible than having all your input in one paragraph, anyway.
If you are clear enough to be using paras at all, you probably should be using them semantically.
.dan.
Comment #2
RayZ CreditAttribution: RayZ commentedMy point was that the comments say ...
... but that's not what the code does. The code will never split a long first paragraph.
Btw, I'm trying to use node_teaser() to generate a teaser for a CCK text field and I want to do it after the filtering has been applied so I'm passing field_mytextfield[0]['view'] to node_teaser(), and it appears that field_mytextfield[0]['view'] puts P tags around it, since field_mytextfield[0]['value'] does not have those P tags.
I'm interested in hearing if there is a better approach to my specific case, but I still believe there is a bug in node_teaser(), either in the code or in the comments.
Comment #3
magico CreditAttribution: magico commentedComment #4
gerd riesselmann CreditAttribution: gerd riesselmann commentedThe OP is right: This code creates a teaser that has at least a length of teaser_length chars. But teaser_length is defined as the maximum chars the teaser is allowed to have, so this code is plain wrong.
I marked it as beeing critical, since a correct teaser length is crucial on portals, where space on the front page is limited.
The reason the code fails is in this line:
$size = teaser_length, so if teaser_length was 600, this code starts searching for a breakpoint after the 600th char. Instead, of course, we must search from the 600th chars backwards to find what we are looking for:
An optimized version (calculating truncate_utf8 only once) is attached as a patch.
Comment #5
gerd riesselmann CreditAttribution: gerd riesselmann commentedBTW: Does anybody knows what '„ÄÇ' and 'ÿü' mean?
May it be these once had a meaning but got crumbled by some character set conversion? On my system for example they read '。' and '؟ ' - and this was taken fresh out of CVS
Comment #6
gerd riesselmann CreditAttribution: gerd riesselmann commentedPatch above was against HEAD, this patch is for 4.7 branch.
Comment #7
JohnAlbinGerd, those are UTF-8 characters: 。؟ And I assume they are end-of-sentence markers in some language.
strrpos($haystack, $needle)
can only use single characters as a needle in PHP4. (See http://www.php.net/manual/en/function.strrpos.php). So we'll need a work around for PHP4.Comment #8
JohnAlbinHere's an updated patch for 5.x-dev. It works around the PHP4
strrpos()
issue. I believe I accounted for all edge cases, but please double check.And it also gives all end-of-sentence characters equal standing when the teaser is shorter than the first paragraph; the old code would chop at a period even when chopping at an exclamation mark was better.
(This patch applies successfully to the 4.7.x-dev version as well. "Hunk #1 succeeded at 180 with fuzz 1 (offset 7 lines)." I can supply a 4.7-specific patch, if needed.)
Comment #9
ChrisKennedy CreditAttribution: ChrisKennedy commentedThis is a good bug to fix but it isn't critical. See http://drupal.org/node/45111 for more info.
Comment #10
gerd riesselmann CreditAttribution: gerd riesselmann commentedI looked up both the UTF-8-chars: The first one is the ideographic full stop, the second the arabic question mark. However, PHP-Eclipse with UTF-8 encoding only shows "???" for the first and "?? " for the second (that's why I was wondering). On CVS diff, the last one reverts the writing direction (looks funny).
I therefore would suggest to replace them with their hexadcimal representation 0xEF 0xBD 0xA1 and 0xD8 0x9F. This avoids any kind of unwanted interference with local charset. I modified John's patch accordingly (against 4.7 branch).
Additionally: What do you think about implementing all international sentence dividers like specified by Unicode STerm: http://www.sql-und-xml.de/unicode-database/sterm.html
Comment #11
JohnAlbinGerd, please see http://drupal.org/node/12864 (anyone else have a better link to the UTF-8 in source code standard?). UTF-8 characters are included in much of Drupal's source code; it's the standard. Also, I don't think your patch would work since PHP doesn't do hexadecimal escaping inside single quotes.
Please continue to use my previous patch.
Comment #12
ChrisKennedy CreditAttribution: ChrisKennedy commentedHere's an earlier discussion of this issue: Teaser can be longer than maximum allowed..
Comment #13
JohnAlbinChris, thanks for pointing out the previous discussion.
The only major issue I see in the previous discussion is that people are worried about invalid html caused by chopping off closing tags (such as
</div>
or</p>
.) However, since that is a problem with the current code as well as with my patch, I think "invalid html in teaser" is a separate issue.Looking at the code presented by Adam (http://drupal.org/node/28211#comment-52638), I decided to do timing tests on his idea (search for each breakpoint using
strpos()
and compare the found positions) versus my idea (generate a regex using all breakpoints and usepreg_match()
).strpos()
inside aforeach
loop: 0.14spreg_match()
with no loops: 0.24sSo, I rerolled my patch using
strpos()
Comment #14
Steven CreditAttribution: Steven commentedTested and verified. We should indeed do as the description says.
I did clean up the comments and code a bit. It could be simplified. Committed to HEAD.
Comment #15
RayZ CreditAttribution: RayZ commentedCould use a backport to 4.7.x
Comment #16
JohnAlbinHere is Steven's commit patch against 4.7.x-dev.
Steven, one extremely minor nitpick: In your comments, you say
But it's really becausestrrpos()
is useless/broken in PHP4.Comment #17
killes@www.drop.org CreditAttribution: killes@www.drop.org commentedapplied
Comment #18
bjaspan CreditAttribution: bjaspan commentedIMHO, this is a substantial, possibly site-breaking, change. I agree that the previous method for computing teasers was not necessarily what was meant or expected, but the new method will result in incorrect behavior for sites that have optimized themselves to the old behavior.
For example, one client of mine wants one-sentence teasers, so we set the teaser length to 0. The old code found the first sentence/paragraph/whatever break after position 0, as desired. The new code (as I understand it) will result in empty teasers.
This deserves documentation.
Comment #19
(not verified) CreditAttribution: commentedComment #20
Bevan CreditAttribution: Bevan commentedrelated; http://drupal.org/node/180425