This patch fixes the issues in http://drupal.org/node/180425, and adds some enhancements. This is arguably a feature, where 180425 fixes the bug only. However this is a better fix for the bug and should replace 180425 as the fix for it if the two minor enhancements introduced are minor enough to get into d6.

Improving on 180425, I

  • reduced the amount of code, by iterating through the repeated algorithm that looks for break points
  • added one extra repeat of that algorithm to check for break points at word boundaries if no sentence or paragraph boundaries are found.
  • added three more sentence boundary cases to the array of break points to handle sentences that end at the end of a pair of parenthesis, like "(This is a sentence.) (These are good break points.)"
  • refining comments further

Some text editors I am using on OSX have trouble displaying this line '؟ ' => 1, I think they display that line or part of that line RTL instead of LTR, and as part of that, they invert the >. Weird. Can someone please confirm that the patch applies and the php parses correctly before committing.

I have tested this extensively in my sandbox.

Some testing in other languages would be ideal. There maybe suitable characters in other languages that define word boundaries or sentence boundaries in parenthesis or around other syntactical punctuation for example.

CommentFileSizeAuthor
node_teaser.patch3.58 KBBevan

Comments

Bevan’s picture

Title: node_teaser doesn't break on word boundaries if no sentence or paragraph break » node_teaser doesn't break on word boundaries nor sentences in parenthesis if no sentence or paragraph break
Version: 6.x-dev » 7.x-dev
Category: bug » feature

I've ported this without the features to d6 in http://drupal.org/node/180425#comment-624547. Pushing this patch off to d7 -- it will need to be re rolled with just the features (two lines).

gpk’s picture

Need to be careful this doesn't mangle the HTML - theoretially this could end the teaser inside a tag e.g. if the HTML includes <img src="filename" />

Bevan’s picture

Status: Needs review » Needs work

Ah yes. Good point. This is also a possibility with the current node_teaser, but these changes make that even more likely. htmlcorrector is in core now, right? Does it handle xhtml that's cut off midway?

Bevan’s picture

Since php4 is no longer supported in d7, this should use simpleXML or similar to count characters EXCLUDING markup.

gpk’s picture

>htmlcorrector is in core now, right? Does it handle xhtml that's cut off midway?
It will close any open tags but obviously that won't always make the HTML valid. But closing open tags and fixing nesting errors might be considered "good enough". http://drupal.org/project/htmlcorrector

As you suggest, simpleXML might be a better solution.

However I don't think either HTML corrector or simpleXML will handle custom tags e.g. the [img_assist] tag and similar tags used by other inline filters, which can include spaces within the tag.

Bevan’s picture

Do the filters do their substitution before or after node_teaser runs? I think after, but I'm not sure. Visualize backtrace module will make it easy to find out; http://drupal.org/project/visualize_backtrace.

Assuming it's after it wouldn't be too hard to include a bit of regex to make sure that we're not splicing inside any open parenthesis. This could also be used as a smarter way to slice bodys with the php filter, but ensuring the slice occurs outside the php tags.

gpk’s picture

>Do the filters do their substitution before or after node_teaser runs?
After. The teaser is not usually/currently calculated on the fly but stored in the database on node_save().

>include a bit of regex
Indeed :)

Jaza’s picture

Status: Needs work » Closed (fixed)

The changes in this patch appear to be fully implemented in the text_summary() function in text.module, which is basically the D7 version of the old node_teaser() function (AFAICT).

Closing.