Original report by gpk

Steps to reproduce

As requested here: http://drupal.org/node/220783#comment-728258.

Set trimmed length to 200.

Remove the HTML corrector from the Full HTML input format (or create a new input format without HTML input corrector). (See http://drupal.org/node/221252.)

Create a new page, Full HTML input format (or whichever you created), with this content:

The maximum number of characters used in the trimmed version of a post. <!-- <p>Drupal will use this setting to determine at which offset long posts should be trimmed.</p> The trimmed version of a post is typically used as a teaser when displaying the post on the main page, in XML feeds, etc. To disable teasers, set to 'Unlimited'. --> Note that this setting will only affect new or updated content and will not affect existing teasers.

Preview the page, look at the result and the HTML source ...

Also see #263

Remaining tasks

  1. Add handling of body content to remove markup before counting characters to test against selected trimmed length.
  2. Review and test.
  3. Add / modify tests to coordinate with this change.

User interface changes

None.

API changes

None.

Data model changes

Unknown.

Release notes snippet

Trimmed text value handling fixed to not count markup characters as part of the text length.

CommentFileSizeAuthor
#263 Screenshot from 2024-03-02 15-13-50.png16.27 KBleeksoup
#263 Screenshot from 2024-03-02 15-05-54.png80.76 KBleeksoup
#227 TextSummary.txt2.02 KBbburg
#221 text_summary-221257-221.patch3.49 KBbburg
#212 text_summary-221257-212.patch3.48 KBtarekdj
#191 text_summary-221257-191.patch21.46 KBpillarsdotnet
#187 text_summary-221257-187.patch22.68 KBpillarsdotnet
#169 text_summary-221257-169.patch21.16 KBpillarsdotnet
#169 text_summary-221257-169-d7.patch21.16 KBpillarsdotnet
#169 text_summary-221257-169-d6.patch13.34 KBpillarsdotnet
#167 text_summary-221257-167-d6.patch13.28 KBpillarsdotnet
#166 text_summary-221257-166.patch21.16 KBpillarsdotnet
#166 text_summary-221257-166-d7.patch21.16 KBpillarsdotnet
#166 text_summary-221257-166-d6.patch13.24 KBpillarsdotnet
#141 text_summary-221257-141.patch19.75 KBpillarsdotnet
#137 text_summary-221257-137.patch17.77 KBpillarsdotnet
#140 text_summary-221257-140.patch17.79 KBpillarsdotnet
#133 text_summary-221257-133.patch17.45 KBpillarsdotnet
#132 text_summary-221257-132.patch15.95 KBpillarsdotnet
#139 text_summary-221257-139.patch17.78 KBpillarsdotnet
#138 text_summary-221257-138.patch17.78 KBpillarsdotnet
#136 text_summary-221257-136.patch15.5 KBpillarsdotnet
#135 text_summary-221257-135.patch17.48 KBpillarsdotnet
#127 text_summary-221257-127.patch16.02 KBpillarsdotnet
#123 text_summary-221257-123.patch15.56 KBpillarsdotnet
#122 text_summary-221257-122.patch15.54 KBpillarsdotnet
#121 text_summary-221257-121.patch15.54 KBpillarsdotnet
#119 text_summary-221257-119.patch15.27 KBpillarsdotnet
#118 text_summary-221257-118.patch15.13 KBpillarsdotnet
#117 text_summary-221257-117.patch18.93 KBpillarsdotnet
#111 text_summary-221257-111.patch15.15 KBpillarsdotnet
#76 text_summary-221257-76.patch15.03 KBpillarsdotnet
#75 text_summary-221257-75.patch14.8 KBpillarsdotnet
#71 text_summary-221257-71-donotignore.patch8.44 KBpillarsdotnet
#70 text_summary-221257-70.patch8.44 KBpillarsdotnet
#63 text_summary-221257-63.patch8.37 KBpillarsdotnet
#49 node-6.x-html_teaser3.patch12.7 KBAlexisWilke
#48 node-6.x-html_teaser3.patch12.09 KBAlexisWilke
#45 221257-node-6.x-html_teaser3.patch11.95 KBgreg.harvey
#40 node-6.x-html_teaser3.patch12.03 KBAlexisWilke
#5 node-6.x-html_teaser.patch6.71 KBAlexisWilke
#17 node-6.x-html_teaser2.patch11.12 KBAlexisWilke
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

gpk’s picture

OK I'm assuming this now gets fixed first in 7.x-dev?

AlexisWilke’s picture

Issue tags: -teaser

node_teaser() is still very badly "broken" (written?)

It truncates tags and comments anywhere and it counts characters in UTF-8 and in bytes. It's all wrong.

Not only that, it won't close tags properly, so if you do not use the HTML corrector on ALL filters, teasers will break your themes.

Then on Edit, it computes the $include variable like this:

  // Check if we need to restore the teaser at the beginning of the body.
  $include = !isset($node->teaser) || ($node->teaser == substr($node->body, 0, strlen($node->teaser)));

Here we can see that the teaser is expected to be an exact copy of the body. If not, this test breaks. Guess what, when you properly close tags at the end of the teaser, that $include computation breaks big time.

By the way, the variable should be named "$included" with a D at the end. Personal point of view though. 8-)

There is my version of node_teaser() that works great, every time!

/**
 * Generate a teaser for a node body.
 *
 * If the end of the teaser is not indicated using the <!--break--> delimiter
 * then we generate the teaser automatically, trying to end it at a sensible
 * place such as the end of a paragraph, a line break, or the end of a
 * sentence (in that order of preference).
 *
 * @param $body
 *   The content for which a teaser will be generated.
 * @param $format
 *   The format of the content. If the content contains PHP code, we do not
 *   split it up to prevent parse errors. If the line break filter is present
 *   then we treat newlines embedded in $body as line breaks.
 * @param $size
 *   The desired character length of the teaser. If omitted, the default
 *   value will be used. Ignored if the special delimiter is present
 *   in $body.
 * @return
 *   The generated teaser.
 */
function node_teaser($body, $format = NULL, $size = NULL) {

  if (!isset($size)) {
    $size = variable_get('teaser_length', 600);
  }

  // Find where the delimiter is in the body
  $delimiter = strpos($body, '<!--break-->');

  // If the size is zero, and there is no delimiter, the entire body is the teaser.
  if ($size == 0 && $delimiter === FALSE) {
    return $body;
  }

  // If a valid delimiter has been specified, use it to chop off the teaser.
  if ($delimiter !== FALSE) {
    return substr($body, 0, $delimiter);
  }

  // We check for the presence of the PHP evaluator filter in the current
  // format. If the body contains PHP code, we do not split it up to prevent
  // parse errors.
  if (isset($format)) {
    $filters = filter_list_format($format);
    if (isset($filters['php/0']) && strpos($body, '<?') !== FALSE) {
      return $body;
    }
  }

  // this is a teaser computation that works every time!

  // If we have a short body, the entire body is the teaser.
  $len = drupal_strlen($body);
  if ($len <= $size) {
    return $body;
  }

  $filter_newline = isset($filters['filter/1']);
  $len = strlen($body);

  $p = 0;
  $l = 0;
  $s = array(); // stack
  while ($p < $len && $l < $size) {
    $last_tag = FALSE;
    $o = strpos($body, '<', $p);
    if ($o === FALSE) {
      // no more tags till the end
      $a = drupal_strlen(substr($body, $p, $len - $p)); // UTF-8 length
      $n = $len;
    }
    else {
      // count characters between previous position and
      // beginning of tag
      $a = drupal_strlen(substr($body, $p, $o - $p)); // UTF-8 length

      ++$o; // skip the '<'
      $n = strpos($body, '>', $o);

      if ($body[$o] == '/') {
        // closing tag, pop the opening tag too
        array_pop($s);
      }
      elseif ($body[$n - 1] != '/') { // skip empty tags
        // opening tag, save its name on the stack so we can close it later
        $end_name = strpos($body, ' ', $o);
        if ($end_name === FALSE || $end_name > $n) {
          $end_name = $n;
        }
        $tag_name = substr($body, $o, $end_name - $o);
        switch ($tag_name) { // ignore empty tags that were not properly closed
        case 'br':
        case 'hr':
        case 'img':
        case 'input':
          break;

        default:
          $s[] = $tag_name;
          $last_tag = TRUE;
          break;

        }
      }

      // skip the tag now (we assume properly opening/closing tag boundaries!)
      if ($n === FALSE) {
        // last tag not closed or it wasn't a tag?!
        $n = $len;
      }
      else {
        ++$n;  // skip the '>' character
      }
    }

    // any characters to add the to result?
    if ($a) {
      if ($l + $a >= $size) {
        // the last tag did not make it in
        if ($last_tag) {
          array_pop($s);
        }
        // we've got more than we want to, search for a break point
        $o = $p + $size - $l;
        if ($body[$o] != ' ') while ($o > $p) {
          switch ($body[$o - 1]) {
          case "\xD8": // "\xD8\x9F" == arabic '?' (right to left)
            if (!isset($body[$o]) || $body[$o] != "\x9F") {
              // no the right sequence
              break;
            }
            if ($o + 1 == $len || $body[$o + 1] == ' ') {
              // found a break-point
              break 2;
            }
            if ($body[$o + 1] == '"') {
              $o += 2;
              break 2;
            }
            break;

          case '.':
          case '!':
          case '?':
            if ($o == $len || $body[$o] == ' ') {
              // found a break-point
              break 2;
            }
            if ($body[$o] == '"') {
              ++$o;
              break 2;
            }
            break;

          case "\n":
            if (!$filter_newline) {
              break;
            }
          case ' ':
            // found and remove the space (not that we ignore no-break spaces since we're not supposed to break there)
            --$o;
            break 2;

          //case ... add support for other UTF-8 spaces?

          case "\xE3":
            // found the CJK ideographic full stop?
            if (isset($body[$o + 1]) && $body[$o] == "\x80" && $body[$o + 1] == "\x82") {
              // keep this character in full
              $o += 2;
              break 2;
            }
            break;

          }
          --$o;
        }
        $p = $o;
        break;
      }
      $l += $a;
    }

    $p = $n;
  }

  $result = substr($body, 0, $p);
  while (!empty($s)) {
    $result .= '</' . array_pop($s) . '>';
  }

  return $result;
}

Here is my sample code that breaks the default code (as of 6.19):

<p>The Internet is littered with thousands of directories and yet there still people who create new ones... like me here on <a href="/" target="_blank" title="Snap! Websites"><strong>Snap! Websites</strong></a>. Well... I like the idea of a central repository of directories that I could use and thought that it would be nice to share them with you and even better, I offer your to <a href="http://snapwebsites.info/node/add/directory?destination=thank-you-submit..." target="_blank" title="Click now and add your directory to my list on Snap! Websites.">submit your own directories</a> to my list so you too can use the list to check out your favorite directories.</p>

The core version will truncate even before looping through the content and that truncates inside one of the anchors tag. That means you do not have a correct tag at all (it cuts in the middle of an attribute value.)

WARNING: My version properly counts the number of characters VISIBLE (to the very end users, well... it counts spaces that we don't really see per se. 8-) ). When you say 600, it won't be 600 bytes as your version. We can do it either way, it just makes a lot more sense to me to count visible characters. We could also have two limits if necessary. But the current version is not working for people who dare enter any kind of HTML in their posts.

I'll be back later with a patch that fixes the $include variable too.

Thank you.
Alexis Wilke

AlexisWilke’s picture

Issue tags: +teaser

tagging

AlexisWilke’s picture

Priority: Major » Normal
Status: Needs work » Needs review
FileSize
6.71 KB

Okay, the fix for the teaser test in the node.pages.inc was easy enough. 8-)

  $include = TRUE;
  if (isset($node->teaser)) {
    // remove all the tags at the end because if they are closing tags
    // then they were likely added by node_teaser().
    $teaser = preg_replace('/(<[^>]+>)+$/', '', $node->teaser);
    $include = $teaser == substr($node->body, 0, strlen($teaser));
  }

I certainly hope you'll consider fixing your teaser code in D6... If not in D6, at least in D7. It is clear that your first call:

  // The teaser may not be longer than maximum length specified. Initial slice.
  $teaser = truncate_utf8($body, $size);

is already wrong because that can cut a comment or tag in half and thus you cannot know whether the last < character is part of a tag or a little mistake from the user (since the corresponding > is no present! so if you find "</p" you won't know that it is the end of the P tag... and you may actually keep that in there!)

Thank you.
Alexis Wilke

AlexisWilke’s picture

Assigned: Unassigned » AlexisWilke
Status: Active » Needs review
Issue tags: +teaser

Ah! I see another fix to the truncate_utf8() function...

I'm not wondering if the truncate should not be in that truncate_utf8() function and not in node_teaser()!

#768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug)

That would mean all the code I write to fix the node_teaser() function should be moved inside the truncate_utf8() which should probably take a flag $html to know whether the string may contain HTML or not.

Thank you.
Alexis Wilke

NancyDru’s picture

Marked #248842: Teaser breaks are picking up parts of html code as duplicate.

Alexis, does this remove the tendency to let <p> dominate the breakpoint selection as the current node_teaser has?

+1 for back-porting to D6. Definitely for D7. So shouldn't this be tagged? And maybe made "critical?"

dddave’s picture

Priority: Normal » Major

Ah, the joys of teaser creation. ;) A constant source for support requests and headscratchers. I think this warrants an upgrade to major.

The attached patch is for D6, right? Has this any chance of inclusion without being solved for D7 first? If not and a D7 patch is built first I totally vote for back.porting to D6.

AlexisWilke’s picture

@dddave,

D6 is definitively broken in that regard... so I suspect D7 is too.

The patch I provided is for D6, indeed (hence ...-6.x-... in the name).

Now, as I mentioned in #6 we need to know/understand whether the truncation should occur in the truncate_utf8() function or in the node_teaser. It seems to me that the truncate function would make a lot more sense. Also, to be noted, the "..." does not appear in my version (nor in any teaser that was shrunk further by the "old" node_teaser().)

In D7, the "..." should actually be defined in the Post settings because someone may want to put something else there. But that's a different topic. 8-)

@NancyDru,

Hey! Nancy! I've noticed you've been working on D7 quite a bit these days...

Breaks... My code does not consider <p> as a specific break point. Actually, the node_teaser() was not really working that way (only seemingly.) Instead, I parse all the characters from the start and skip tags at once (anything between < and > is left totally untouched.) And just like the HTML filter used to fix HTML, I manage a stack with all the tags currently opened (taking in account empty tags such as IMG, BR and HR—HR the forgotten!).

When the parser finds our that enough text characters were found, it truncates around the last space that makes the result shorter or equal in length as defined in the Post settings (i.e. 600 by default.)

There are interesting side effects to that method (to me). For instance, the ALT, TITLE, HREF, etc. attributes do not affect the number of characters that the user will see. Nor the IMG, EMBED of such tags (although that might be a problem in some circumstances like an RSS feed?)

Now, if you think that we should cut at a paragraph end and if the 1st paragraph is too long then revert to this algorithm, it would be easy enough to implement. I don't see a problem in cutting any one paragraph in half as long as it is done correctly HTML wise. 8-)

Thank you.
Alexis

NancyDru’s picture

I don't have a problem cutting paragraph at then end of a sentence, but not in the middle of one. However, I DO have a problem cutting a list item up; I want the whole thing or none of it. I have modified the node_teaser() function to do that. But I still have the tag-breaking bug.

the ALT, TITLE, HREF, etc. attributes do not affect the number of characters that the user will see.

I (and my customer) consider this an absolute positive. I do want to count the visible part of a link (<a ...>visible</a>) I don't allow IMG or EMBED in my feeds, so that's still positive.

My customer's request (demand): "We want teasers to be relatively consistent - about 15 lines." Yes, I know there are a lot of variables about what constitutes a "line."

It does sound like you could easily do a better job of correction than the HTML Corrector filter too.

-

Now if you could fix multiple blanks in a row... And the evil <p>&nbsp;</p>

AlexisWilke’s picture

@NancyDru,

Hmmm... not cutting a list sounds like a good idea, but a page that starts with a list would "suffer"...

This page, for instance, starts with a table of contents list...

http://linux.m2osw.com/booting

Of course, in a way, it's a bad example since the corresponding teaser would not include the table, but many people write blogs starting with a list of say... the 10 reasons why I don't like teasers in Drupal.

And we could tell people, on those pages, add a "manual" teaser. The problem is when you hit save you generally do not see how the teaser looks like.

Also, no one commented about #6...

Thank you.
Alexis Wilke

P.S. "evil" <p>&nbsp;</p> is something I took care of with my own filter a long time ago 8-) See here http://www.m2osw.com/mo_trimmer_filter

NancyDru’s picture

Personally I think #6 should be a "clean up" issue after this gets fixed.

AlexisWilke’s picture

Nancy,

Hmmm... The current node_teaser() does this (more or less):

1. Any content? No return ''
2. Any <!--break--> in the content? got the teaser
3. Is content smaller in total # of chars to the max. length allowed in the teaser? Yes return body as is
4. Truncate content using truncate_utf8()
5. [try to] Fix the truncated content to fix the max. # of chars (remove the possible '...' in the process)

Point 4. is gone in my version... If we want to have a truncate function that can truncate HTML (because many modules do a similar work as the Core teaser, for instance the Views may be used to "generate a teaser"!)

So may be we should have a truncate_html(), in which case I guess we can first fix this function and later extract the code and create that separate truncate_html().

Is that your idea?

Thank you.
Alexis Wilke

NancyDru’s picture

BTW, there has always been a flaw with 2. Any <--break--> in the content? got the teaser in that <!--break--> is removed. This results in a "read more" when the break is placed at the end of the node to force a full node to appear (generally seen on shorter nodes that are slightly longer than a normal teaser). There is no need to remove the break because it is a comment to browsers.

AlexisWilke’s picture

We need to have a separate function that compares the teaser and the body properly. I fixed the $include definition but that test should be in a function instead and extended to support more capabilities. See sample below.

I'm also wondering about a break <!--break--> at the very beginning because then the teaser is empty and thus the node_teaser() is called each time. Since that should be really rare, we should be fine. But it could be that an empty teaser means don't show the teaser? I guess the user can turn off the corresponding flag if he/she does not want that to appear.

Otherwise I just noticed that trim() is being used all over the place. Maybe we should have a drupal_trim() that removes the <p> </p> too. That way it would be fixed in many situations!

/**
 * Compare the teaser and the body together and return one of the
 * following values:
 *
 * \li NODE_TEASER_EQUAL -- the teaser and body are the same (avoid read-more)
 * \li NODE_TEASER_START -- the teaser is equal to the beginning of the body
 * \li NODE_TEASER_DIFF -- the teaser is different from the body
 *
 * \return One of the NODE_TEASER_... values
 */
function node_compare_teaser_and_body($node) {
  if (!isset($node->teaser)) {
    // would we need a NODE_TEASER_OFF for this case?
    return NODE_TEASER_DIFF;
  }

  // remove all the tags at the end because if they are closing tags
  // then they were likely added by node_teaser().
  $teaser = preg_replace('/(<[^>]+>)+$/', '', $node->teaser);
  $len = strlen($teaser);
  $included = $teaser == substr($node->body, 0, $len);

  if ($included) {
    // remove the <!--break--> and empty lines at the end
    $remainder = substr($node->body, $len);
    $remainder = str_replace('<!--break-->', '', $remainder);
    $remainder = preg_replace('%(<p>&nbsp;</p>[\n\r]*)+$%', '', $remainder);
    return trim($body) ? NODE_TEASER_START : NODE_TEASER_EQUAL;
  }

  return NODE_TEASER_DIFF;
}
NancyDru’s picture

I wouldn't call it drupal_trim if it does more than what PHP's trim will do (but I guess trim could do tags if you list them). Having a drupal_ form of a PHP function implies that it be a multibyte safe version (e.g. drupal_substr, drupal_strtolower).

AlexisWilke’s picture

FileSize
11.12 KB

Okay, for now there is a patch that takes the more... in account and also adds a "..."

This fixes the comparison between teaser and body. But it renders the include flag computation quite complex. I'm wondering if the node_teaser() should not actually return the teaser AND the body. We could pass the include flag (i.e. isset($node->teaser_include) && !$node->teaser_include) and determine within the function what needs to be returned instead of recomputing how to cut the body outside.

My concern here is that would change the return value of the node_teaser() function which some other module might be using...

Thank you.
Alexis

P.S. The case of the list is not yet taken in account.

NancyDru’s picture

Nope, that would change the API and guarantee it would not be accepted.

AlexisWilke’s picture

Nancy,

I know. Although we may want to do that in 7.x...

Have you tested my patch yet? 8-)

Thank you.
Alexis

Evan Leeson’s picture

Thank you all for working on this issue. I spent some time today struggling with this because the writers have started putting a lot of links in their first sentences and this was causing this issue to manifest. Using the teaser break solves it but it will be much better when they don't have to.

AlexisWilke’s picture

lectric,

Did you test #17? 8-)

Thank you.
Alexis

NancyDru’s picture

I have asked for permission to test it (on the customer's clock). Even if they say no, I may test replacing my version on my own time (of which I have very little).

One possible way to get this in (probably 8.x) would be to open an issue to "Make node_teaser() pluggable," which seems to be a popular way to say, "I have a better idea, but you probably won't like it."

greg.harvey’s picture

Marked #363889: Strange behaviour when node teaser and first part of node body don't match perfectly as duplicate - not actually certain this is the case, but the issue is very old and no way of testing any more, so no sense leaving it open.

brisath’s picture

Subscribe, same problem with teasers.

AlexisWilke’s picture

@NancyDru (and others),

Btw, in regard to lists... How would you handle the case where a post starts with a list and that list is too long to fit?

1. Return an empty teaser

2. Cut the list...

3. Return the whole list (which could be REALLY big)

In cases 1 and 3 we could also emit a warning, but since any function could call the node_teaser() function, the warning may not automatically be appropriate.

Thank you.
Alexis

NancyDru’s picture

On my site, my code cuts the list. And my customer is happy with that.

NancyDru’s picture

All my work is waiting for better specs, so I'm going to put this into my code and test it. I am rereading in detail.

#10/11/25 - I didn't say not cutting a list - that I already do. I don't want a list ITEM cut. I tried that and it got real ugly real fast. So the only time it might be a problem is when the first list item is too big. I haven't encountered that (yet).

#13 Truncate_html is probably a good idea but will guarantee waiting until D8. It is interesting that even though the API for drupal_validate_utf8 states "All functions designed to filter input should use drupal_validate_utf8 to ensure they operate on valid UTF-8 strings...", only one core function actually calls it.

#15 Given that trim() appears to not be multibyte aware, and some users use something other than utf8 for their databases, it seems that there is indeed a need for drupal_trim(), that would validate and drupal_convert_to_utf8(). However that is a separate issue because it would be forced into D8 even though I would call it a bug in both D6 and D7.

NancyDru’s picture

Why are you using substr() rather than drupal_substr()?

The addition to

    if (isset($filters['php/0']) && strpos($body, '<?') !== FALSE) {
      if (isset($teaser_len)) {
        $teaser_len = strlen($body);
      }
      return $body;
    }

seems superfluous with an immediate return.

If Dries ever looks at this, he will want more comments and properly formatted (start with capital, end with punctuation).

NancyDru’s picture

Looks pretty good. We can definitely decrease our teaser length setting. This version produced an even longer one than my version because of links. It did, however end at the beginning of a list item, so I get 9. ...

The ending "..." will definitely classify this as a changed API. As much as I kind of like it, you might want to remove that.

NancyDru’s picture

Status: Needs work » Needs review

I don't know if it's this change or a change I made to the editor, but I am now getting duplicated content when I edit a node.

I am also now getting:

You specified that the summary should not be shown when this post is displayed in full view. This setting has been ignored since you have not defined a summary for the post. (To define a summary, insert the delimiter "<!--break-->" (without the quotes) in the Body of the post to indicate the end of the summary and the start of the main content.)

UPDATE: #934628: You specified that the summary should not be shown seems to be caused by this code. The generated teaser (complete with ...) is stuck on the front of the body in the edit textarea.

NancyDru’s picture

Status: Needs review » Needs work
AlexisWilke’s picture

Status: Needs review » Needs work

Nancy,

Thank you for taking the time (you could have taken a break instead! 8-) )

#27 - #10/11/25 - okay, that makes more sense. Still, not cutting a certain tag makes it a special case! Now, we can use the stack to know where we're at. If we find an <li> tag, then we would need to remove everything from it on. I guess that requires a position linked with the stack so we know where it's located.

#27 - #13 - truncate_html() would make sense, I think. And the truncate_utf8() is being worked on for other reasons... but that does not take HTML in account.

#27 - #15 - trim() probably does not need to be UTF-8 compatible unless we want to trim special spaces in which case yes, we'd need a UTF-8 equivalent. Otherwise the characters removed are controls (i.e. "\n") and the space (" "). Those are safe in UTF-8. We could verify that it does not remove the &nbsp; character though since that one uses code 0xA0 which is a UTF-8 encoding character. From the PHP documentation, &nbsp; is not removed.

#28 - drupal_substr() - I have to look into it, I do not think it makes any difference because I have < and > as delimiters which cannot be confounded with UTF-8 characters. But I guess that does not matter much.

#28 - PHP handling - if you are asking me to fix old code, I'm fine with it, but I thought that patches were supposed to only change what needed to be changed... I know that my patch changes so many things that it's hard to see what was there before and what's new. This PHP test I kept as it was.

#29 - ... - from what I understand, the truncate_utf8() was expected to add the '...', but then the node_teaser() was likely to remove it as it would have to shorten the data some more. So it was never there! See #30 too.

#30 - duplicate teaser - this is one of the problems I've been running into. I may not have posted my latest fix that removes the '...' in case it was added because that does not match the body (which may be another reason why the '...' was not included with truncate_utf8()!)

#31 - I'm on it. 8-)

NancyDru’s picture

#28: It's not the check for php, it's setting the $teaser_len that I wonder about, because you are immediately going to return, so what is the point? IIRC this is done again a few lines down.

#30: Please read the WYSIWYG maintainers comments in #934628: You specified that the summary should not be shown. Even altered HTML (like closing an unclosed tag) could trigger the same problem. Apparently, there is a exact match comparison that is causing the problem.

AlexisWilke’s picture

Nancy,

#28 because $teaser_len is an [out] variable. Which is why this whole thing is so ugly. The auto-teaser function should either not save the teaser or save a flag marking the teaser as an auto-teaser and not try to be smart and compare potentially really large strings together.

Yes. For #30 I have a function now. That one single function does a "smart" compare, but I did not take the '...' in account in the last version I proposed here. So I don't have an issue with it on my end anymore but here that's a problem. (and this issue is the whole point of fixing the teaser so it all works including that sort of a problem!)

Thank you.
Alexis

NancyDru’s picture

Having to call a new function means "API change" which guarantees that this will never make it into 6.x or 7.x. That's a shame because this algorithm produces much better teasers.

rkodrupal’s picture

Category: bug » support

The code in #3 works great! I even tried a very long

within a node insert [node:999 cck=body;] and it truncated it just fine. Good job.

Quick question ... why should I care about the change to node.pages.inc in #5 and where precisely does it get inserted (sorry, i never got patch.exe working on windows xp so i do all patches manually)?

Again, great job.

rkodrupal’s picture

sorry that double quote in the prior post was originally the word

rkodrupal’s picture

...aargh ... bl0kqu0te

dddave’s picture

Category: support » bug

Please don't change the category.

AlexisWilke’s picture

Priority: Normal » Major
FileSize
12.03 KB

Okay, I finally got a new patch. I still include the ellipsis because I'm 99% sure it was intended to be there just not done because it would break everything with the old code.

I added some documentation and comments to make sure people understand why this and that. Maybe a little more would be required... Especially, in regard to the drupal_strlen(), drupal_substr() calls.

1) All the mb_str...() functions would need a corresponding drupal_str...() call, for instance using strpos() and then drupal_substr() is not going to work well.

2) In my case, I check the string with $body[$o] which is a byte access which totally ignores UTF-8.

2.1) It is okay because I use drupal_strlen() to compute the length of the visible text and compares that against the $size argument

2.2) It is required unless you offer me drupal_strpos() and a way to check a string character by character (and I don't recall seeing that in PHP anyway, although there is mb_split() ... good luck on bodies that are 100Kb in size!)

3) It is now documented, and I already had a comment about the UTF-8 checks when there are required in the algorithm!

Final word:

Personally, whether this patch is accepted or not is not a concern for me. My teasers work. 8-)

If someone from Core thinks it would be a good idea to share... then apply the patch! (after we had a few RTBC, of course) And since it can actually destroy the content of your node, I would strong advise that it be applied. But again, not my problem. I'm safe on that one now!

@rkodrupal, note that you can click on Edit to fix your posts instead of posting 2 or 3 fixes one after another.

Status: Needs review » Needs work
Issue tags: -teaser

The last submitted patch, node-6.x-html_teaser3.patch, failed testing.

AlexisWilke’s picture

Status: Needs work » Needs review

#40: node-6.x-html_teaser3.patch queued for re-testing.

Status: Needs review » Needs work
Issue tags: +teaser

The last submitted patch, node-6.x-html_teaser3.patch, failed testing.

AlexisWilke’s picture

"Ensure the patch applies to the lastest checkotu of the code-base."

Why is that failing all the time?! I've tested against the latest and it works perfectly for me... How can it be bogus on that test server? One command line???

greg.harvey’s picture

Status: Needs work » Needs review
FileSize
11.95 KB

I haven't tried to apply this yet, but maybe the testbot is unhappy with node.module.orig? This is a stab in the dark, but perhaps changing those first two lines will work? Another attempt attached, nothing changed except the initial diff lines.

Status: Needs review » Needs work

The last submitted patch, 221257-node-6.x-html_teaser3.patch, failed testing.

rkodrupal’s picture

i used the patch in #45 and it applied successfully to node.module. The patch in #5 previously worked too, and it patched both node.module and node.pages.inc.

Was the intention for the patch in #45 to only change node.module?

Again, thanks for this great work. it was a huge timesaver for me.

... oh and I have a very small screen and the 'edit' button in this new version of drupal.org is off the page ... but i see it now.

AlexisWilke’s picture

Status: Needs work » Needs review
FileSize
12.09 KB

Thank you greg for looking into it. I checked another thread where a patch worked, checking with an extra line at the very beginning... I don't think that matters, but we'll see.

There is the exact error from the tests goes like this:

[22:21:11] Command [patch -p0 -i /var/lib/drupaltestbot/sites/default/files/review/221257-node-6.x-html_teaser3.patch 2>&1] failed with status [1] and output:
can't find file to patch at input line 3
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|--- modules/node/node.module
|+++ modules/node/node.module
--------------------------
File to patch:
Skip this patch? [y]
Skipping patch.
7 out of 7 hunks ignored.
AlexisWilke’s picture

FileSize
12.7 KB

Good catch rkodrupal, there is still a one line fix in the node.pages.inc file.

There is yet another patch. This time there is a line before and no .orig extension. But I don't think that's the problem... 8-}

Status: Needs review » Needs work

The last submitted patch, node-6.x-html_teaser3.patch, failed testing.

AlexisWilke’s picture

rkodrupal’s picture

patch at #17 doesn't seem to work with blockquote ... i found it impossible to successfully break the following for a teaser ...

Domino Causality sets the guiding principles in the search for innovative drugs. We only can improve productivity by constant attention to the entire line of causality, of the domino chain. Merely optimizing or improving any single link is not helpful.

"<"blockquote">"Dr. Moriarity, a metaphysician, sets up fifty numbered dominoes standing in a straight line with their dots facing the same way on a table in a room, but puts a blind in front of the dominoes so that only numbers one and fifty are visible. You enter the room, and observe that domino number one and domino number fifty are lying flat with their tops pointing in the same direction… Does this mean that either domino caused the other to fall? Not necessarily ... Dr. Moriarty could have pushed over only dominoes number one and fifty, or bumped the table in a way that only these two dominoes fell, or that all the dominoes fell at once. It is essential to remove the blind and look at the intervening dominoes... Do their positions suggest they fell in sequence rather than being bumped or shaken? Did any reliable observers hear the telltale sound of dominoes slapping one another in sequence? From the positions of all the dominoes, can we eliminate rival causal mechanisms, such as earthquakes and wind, as well as human intervention? If the dominoes fell in sequence, can we tell by looking at whether the dots are face up whether the direction of the sequence was from number one to number fifty or the reverse? – "<"a href="/node/879"">"Bennett, George (1997). Process Tracing in Case Study Research"<"/a">""<"/blockquote">"

(note: doesn't do too well with "<"a">" passages either).

AlexisWilke’s picture

rkodrupal,

I tried on one of my sites that includes #49 and it worked fine. You can see the teaser here:

http://gallery.m2osw.com/node

You may want to try with the latest patch?

Thank you.
Alexis

rkodrupal’s picture

confirmed. stupid typo on my part. sorry for the false alarm.

AlexisWilke’s picture

Status: Needs work » Needs review
Issue tags: -teaser

#49: node-6.x-html_teaser3.patch queued for re-testing.

Status: Needs review » Needs work
Issue tags: +teaser

The last submitted patch, node-6.x-html_teaser3.patch, failed testing.

NancyDru’s picture

A) Have you tested with the WYSIWYG module? Anything that alters the teaser from the first xxx bytes of the body is going to cause problems and actual node modifications.

B) Is there any hope that this would be in D7, let alone D6?

rkodrupal’s picture

for what it's worth i've been using it extensively (100+ text-laden nodes with lots of uncommon tags) and it hasn't failed yet. i use wysiwyg.

that being said, the patch does cause intermittent problems when the teaser has already been set by the earlier node.module and you go back and change a node ... typically when this happens there's some goofy message about the changing ... a tip-off to carefully check the results.

Jukebox’s picture

Status: Needs work » Needs review
Issue tags: -teaser

#49: node-6.x-html_teaser3.patch queued for re-testing.

Status: Needs review » Needs work
Issue tags: +teaser

The last submitted patch, node-6.x-html_teaser3.patch, failed testing.

Jukebox’s picture

This is sort of off topic, but may be helpful to some.

I'm using views to display node titles and teasers. One node had an italicised paragraph at the beginning. Since the teaser is limited to a handful of words, it didn't seem to close the italics tag correctly. Also, random italic tags were inserted in the html. To fix this, I just went over to the teaser field and checked the "Strip HTML Tags" option and everything was back to normal.

AlexisWilke’s picture

Jukebox,

The Views are using a copy of the Core function to extract the Teaser. They do not have the problem of comparing teaser and body as the Core does, but they do the same really bad job at not closing the tags... (this is the case for any truncate function in the Views.)

What you are seeing (many <i> tags) is probably due to your browser "cleaning up" after the Views output.

You are more than welcome to report this problem to the Views people and have them close tags properly of any HTML truncated data buffer.

Thank you.
Alexis

pillarsdotnet’s picture

Version: 6.x-dev » 7.x-dev
Component: node.module » field system
Status: Needs work » Needs review
FileSize
8.37 KB

Forward-ported to D7, cleaned up comments, and changed variable names to be (hopefully) more descriptive.

pillarsdotnet’s picture

Title: node_teaser() can truncate teaser to the middle of an HTML comment » text_summary() should be made HTML-aware.

Title update

pillarsdotnet’s picture

Jamie Holly’s picture

I actually just noticed this yesterday on a D7 site. It cut off a YouTube embed code in the middle. That's pretty bad, as it affects a lot of block-level elements (including forms).

Status: Needs review » Needs work
Issue tags: -teaser

The last submitted patch, text_summary-221257-63.patch, failed testing.

pillarsdotnet’s picture

Status: Needs work » Needs review

#63: text_summary-221257-63.patch queued for re-testing.

Status: Needs review » Needs work
Issue tags: +teaser

The last submitted patch, text_summary-221257-63.patch, failed testing.

pillarsdotnet’s picture

Trying again, with typo-corrections:

pillarsdotnet’s picture

Status: Needs review » Needs work
FileSize
8.44 KB

Patch named incorrectly?

pillarsdotnet’s picture

Status: Needs work » Needs review

Oh. duh. silly me again.

The last submitted patch, text_summary-221257-71-donotignore.patch, failed testing.

AlexisWilke’s picture

Pillarsdotnet,

Thank you for working on this one. 8-)

As a temporary solution people can forcibly use the <!--break--> comment... but go explain that to end users!

Anyway, your point about broken YouTube embedded video is probably not currently correctly handled by my patch, unless it's a single tag. (Is it just one <embed> tag or do they use the <object> tag?) It is however a good point because such an object, or a form, should never be cut. It either gets included or it remains excluded, but no cut! I could imagine a form cut just before the "Send Info" button. Too bad. 8-)

Thank you.
Alexis Wilke

pillarsdotnet’s picture

Status: Needs work » Needs review
FileSize
14.8 KB

Hacked text.test to make all tests pass.

Probably should add an embedded Youtube Video test...

pillarsdotnet’s picture

FileSize
15.03 KB

Fixed an "Uninitialized string offset" warning.

Just for drill, I did test an embedded video -- looks like the entire video counts as a single character, using this algorithm. I'm not sure whether this is desirable or not. Maybe all <object>...</object> sequences should be stripped from a text_summary?

NancyDru’s picture

Pillars and Alexis, there are always going to be exceptions that are not well handled by coding and the end-user is going to have to be told to use a teaser break technique. Considering how long an embed/object tag sequence can be, this may be one of those cases. No matter how good your code is, there's someone out there who will find a way to break it.

AlexisWilke’s picture

Nancy,

I don't disagree that we'll find others. However, by not fixing anything, it's not going to be better. I've not seen any such discussion on other systems, especially WordPress. So some people know how to make it work right, in code.

Thank you.
Alexis

NancyDru’s picture

I hope that Fields challenges WYSIWYG to stop doing what they were currently doing that broke this technique when I had a chance to test it. Perhaps you can "borrow" from WP if that is a much better algorithm. It is abundantly clear that this issue needs attention. I tried my hand at fixing it within the D6 framework and only partially succeeded. Your code produced a much better teaser than even mine did - core's was a very distant 3rd. Unfortunately, WYSIWYG does something that I think is wrong and that broke your code as it was at that time.

pillarsdotnet’s picture

Jamie Holly’s picture

@NancyDru - actually a lot of the fixed in WP comes from TinyMCE. For example - if you put a more break within an object/embed code block, then switch back to the WYSIWYG view, TinyMCE will actually strip it out. That's due to the way TinyMCE parses out embed codes and replaces it with their own markup for the visual display, then again replaces it on clean-up when the final HTML is generated.

When it comes to actually balancing out tags on teaser/more views, I find WP much worse. I can't even begin to count the times I had to fix a broken front page on a client's WP site simply because someone put the more break in a weird spot.

NancyDru’s picture

Good, that read more link has been bothering me since I started with Drupal and was told it wasn't worth fixing, even though it was simple to do so. I despise all those wysiwyg editors, but some people seem to think they have to have them.

pillarsdotnet’s picture

Thinking about this some more, there could be all kinds of additional conditions on summary generation.

  • Maximum display length (not including markup)
  • Maximum storage length (including markup)
  • Maximum line width / word-wrap
  • Maximum number of lines
  • Allowed tags/styles/fonts/etc.

Probably it would be best to allow some sort of hook allowing modules or themes to add conditions of their own.

So the (pseudo-)code would look something like:

// Parse into an array containing the following keys:
// - #tag: Opening/closing tag name, if any.
// - #attributes: Array of tag attributes.
// - #empty: If TRUE, tag is empty like <br /> or <hr /> and both #elements and #markup are ignored.
// - #elements: Array of child elements.
// - #markup: Rendered child elements, or static content if #elements is empty.
$elements = text_summary_explode($text);

// Break at all allowable break points, for applicable languages.
// Each element with an empty #elements and a non-empty $markup is a candidate.
// Default would include implementation to break at space characters.
$elements = text_summary_invoke_all('break', $elements, $language);

// Strip out content not wanted in summary, such as embedded video objects, styles, etc.
$elements = text_summary_invoke_all('strip', $elements, $language);

// Enforce applicable limits, such as line length, total number of lines, displayed characters, etc.
// Default would include implementation to limit storage length, with maximum possible weight.
$elements = text_summary_invoke_all('limit', $elements, $language);

// Collapse to text.
$summary = text_summary_implode($elements);
pillarsdotnet’s picture

Version: 7.x-dev » 8.x-dev

Now that we have a real 8.x branch, I should code a patch.

NancyDru’s picture

Version: 8.x-dev » 7.x-dev

Line width and number of lines depends on CSS and the displaying device. Unless you're much smarter than I, such details are pretty close to impossible to determine at teaser time. For example 5 lines on a full width HD monitor for a PC may very well overflow the entire screen on a smart phone.

pillarsdotnet’s picture

Jamie Holly’s picture

One thing that should be considered is that text_summary is always run in teaser view on D7, unlike in D6, where node_teaser was only run on node saves and then that value stored in the node_revisions table. That means text_summary can have a serious impact on performance if it gets to heavy processing wise.

Having said that, I got to give a big +1 to implementing a hookable system to allow modules to alter what happens in text_summary, such as deciding what block level elements should/shouldn't be cut. We just need to keep performance in mind when coming up with something.

NancyDru’s picture

Is that also true to cached anonymous users? And does it mean that filters are also always called?

Jamie Holly’s picture

Not for anonymous users, since the caching there is on the entire page.

NancyDru’s picture

Would it be possible to add a teaser (or summary) cache to mitigate a potential performance issue? Perhaps something as simple as doing an MD5 on the text and saving the resulting teaser? This would be similar to the way filters cache data. Actually, if the teaser/summary is a field, why not turn this into a filter?

Jamie Holly’s picture

I really think that should be in there. It would be nice to get a core maintainer to give us some feedback on the pros/cons of implementing that. Putting that caching in should be relatively simple.

yched’s picture

Long thread - could someone post an issue summary (as well as an explanation of why this lies in the 'field system' queue) ?

pillarsdotnet’s picture

Status: Reviewed & tested by the community » Needs review

Summary requested by ysched:

Jamie Holly’s picture

@yched - @pillarsdotnet pretty much summed it up very well there. The only thing I would add is that since text_summary is ran on every field that requires it on each page view, we really should keep an eye at performance on this, or (even better) work out a way to cache the teaser/summary text.

yched’s picture

@pillarsdotnet : Thanks for the summary. The former node_teaser() indeed has been moved more or less as is within text.module, but I can hardly claim any expertise on this area of code, and I'm afraid I'm not really relevant to review this :-/. Not sure who would - maybe a 'git blame' on the D6 branch would give a couple hints on the people who contributed to the original node_teaser() function ?

pillarsdotnet’s picture

@ysched --

Perhaps it would be more accurate to say that the patch in #76 is in "RTBC" status, as I believe it has been reviewed and found to work correctly by intoxination, AlexisWilke, NancyDru, rkodrupal, and myself. There is a related issue with WYSIWYG, but I at least can attest that the patch in #76 does not render WYSIWYG unusable.

Certainly it is an improvement over the current code.

In cases like this, where we all agree an an incremental improvement, but are still discussing further improvements, should the extended discussion move to a separate issue?

pillarsdotnet’s picture

Status: Needs review » Reviewed & tested by the community

Gonna take that giant leap. Somebody slap me if I'm wrong.

greg.harvey’s picture

Status: Needs review » Reviewed & tested by the community

Personally, I can't fault the logic in #96. Don't let the great get in the way of the good, etc. etc. =)

yched’s picture

Status: Reviewed & tested by the community » Needs review

Sorry, but this didn't get any real code review, and would also really need an approval from some core devs (no offense).

An explanation of what the patch actually does, where and why it modifies the existing code, might also help :-)

bryancasler’s picture

subscribe

AlexisWilke’s picture

I'm not using D7 so I don't know whether the latest patch works. 8-)

However, I wanted to make a note about the <object> tag. I agree that you have to either put it in full or remove it altogether. If you only include a couple of <param> tags inside the object, it won't serve any purpose.

Not only that, this applies to <textarea> and <form>.

There are possibly a few others. This makes me think of a module I had a problem with... Yes! Footnotes.

The other tag to handle properly is <script>, although with WYSIWYG you probably won't have much problems with that one.

Footnotes also handles <style>, but that tag should never appear anywhere else than in the header so I'm much less worried about it.

Thank you.
Alexis

NancyDru’s picture

Yes, even with WYSIWYG, users find ways to stick scripts in. At my last customer, they kept trying to stuff their own GA code into the nodes. And I kept writing better filters to remove them. ;-) Oh, and don't forget that <script> is often accompanied by <noscript>, which can have anything you can imagine in it.

bryancasler’s picture

Is there anything that would be holding up #76 from being committed?

pillarsdotnet’s picture

Lars Toomre’s picture

Version: 7.x-dev » 8.x-dev
Issue tags: +Needs backport to D6, +Needs backport to D7

I really like the pluggable approach suggested in #83 with the suggestion that the middle hook be renamed 'filter' instead of 'strip'. I use quite a bit of embedded HTML in my node bodies and have had considerable problems in D6 with broken HTML tags from the default split. That has required me to manually place <!--break--> in many places and led me to this issue discussion this morning.

I suggest the use of 'filter' since what Alexis started this thread with, what NancyDru has done for her previous client and what I want to do is filter out some portion of the characters in a node body are tied to linking or display rather than the content that is presented to the user. I also really like the idea of the limit hook since quite likely one might like to display a different 'teaser' length based on the type of device being served to (desktop/mobile/XML/JSON) or type of feed being produced for Web Service/Twitter/Facebook/RSS.

Before the current patch is applied, I would hope that we could address the following issues in the default filtering of this patch code:

  • From #76 - I would expect the filter of the <object>...</object> text would have zero length rather than one character. Perhaps I misunderstood what was reported in this item??
  • From #101 - It seems that we should have a list of HTML tags to filter out by default if the full result does not fit within the size limit (like <object>, <textarea> and <form>).
  • From #102 - Perhaps there should be another list of always filter HTML tags (like <SCRIPT> and <NOSCRIPT>)?
  • Looking at the function text_summary() in the patch from #76, I am unsure why the calling parameters were changed since it appears that $summary_length is never referenced in the subroutine. Was I missing something there?

I am not sure whether minor versions of D7 will allow the introduction of new hooks. Certainly D6 will not at this point. Hence, whatever we design here for D8 will need to be condensed into one function that can be "swapped" into D6 and probably D7. My hope is that a fix for this behavior can be back-ported to both D7 and D6 soon.

My two cents ...

NancyDru’s picture

According to Dries, if no existing API is modified, a new API may be allowed into current versions. We would have to fight hard to convince him though. And I agree with both parts of that policy. I doubt this will be introduced to D6 ever.

AlexisWilke’s picture

My fix does not change the API. It enhanced it only. Meaning that any module using the old API is compatible with my change.

And my fix is in my D6 install. I don't really care much whether it gets in Core on Drupal.org... 8-)

Thank you.
Alexis

pillarsdotnet’s picture

I'm about 30% of the way through implementing the approach I suggested in #83 as a contrib module. When I'm done, I should have d6/d7/d8 versions available.

pillarsdotnet’s picture

You can help this issue by posting a review:

  • Decide whether all code and comments comply with Drupal coding standards.

  • Decide whether any included tests are necessary and sufficient.

  • Decide whether the patch actually solves the problem. There should be a reproducible test where the current code fails and the patched code succeeds.

A review that affirms success in all of the above should also:

  • Change the issue status from "Needs Review" to "Reviewed and Tested By the Community".

Otherwise, the review should:

  • Explain what is wrong with the proposed patch and/or supply a corrections to it.

  • Change the issue status from "Needs Review" to "Needs Work".

njardim72’s picture

hi,

I'm in core D7. is this issue fixed? if so could you please point the right direction?

I need to trim the feeds to 200 chars

thanks,
njardim

pillarsdotnet’s picture

Gratuitous re-roll of #76 above.

AlexisWilke’s picture

Assigned: AlexisWilke » Unassigned

njardim72,

It's not fixed in any version.

pillarsdotnet is working on a D8 version and I offered a patch for D6. I don't think anyone posted a patch for D7 yet.

Good luck.
Alexis

pillarsdotnet’s picture

The d8 patch should apply cleanly to d7. There isn't much difference between d7 and d8 yet.

sun’s picture

Status: Needs review » Needs work

Unfortunately, this patch is far from being RTBC.

+++ b/modules/field/modules/text/text.module
@@ -327,6 +327,10 @@ function _text_sanitize($instance, $langcode, $item, $column) {
+ * @note
+ *   This function uses strlen(), strpos(), etc. rather than their multibyte
+ *   equivalents where doing so increases speed without affecting output.

1) All information in phpDoc are notes. The @note directive is therefore useless and shouldn't be used. Use regular phpDoc paragraphs instead.

2) Everything that goes through text_summary() is supposed to be cached by the caller, so speed does not matter. Additionally, there's only a marginal difference between Drupal's PHP wrappers and directly calling the native functions. We _always_ use Drupal's PHP wrappers, unless there's a valid technical reason for ignoring UTF-8 byte sequences. If there is such a reason, it needs to be clearly stated and explained in an inline comment.

+++ b/modules/field/modules/text/text.module
@@ -344,7 +348,7 @@ function _text_sanitize($instance, $langcode, $item, $column) {
-function text_summary($text, $format = NULL, $size = NULL) {
+function text_summary($text, $format = NULL, $size = NULL, &$summary_length = NULL) {

Missing phpDoc for the new function argument. It doesn't seem to be used anywhere?

+++ b/modules/field/modules/text/text.module
@@ -380,67 +385,133 @@ function text_summary($text, $format = NULL, $size = NULL) {
+  $text_length = strlen($text);
...
-  // Store the actual length of the UTF8 string -- which might not be the same
-  // as $size.
-  $max_rpos = strlen($summary);

The new code could use some more inline comments about what it is doing and more importantly, why.

+++ b/modules/field/modules/text/text.module
@@ -380,67 +385,133 @@ function text_summary($text, $format = NULL, $size = NULL) {
+  $position = 0;
+  $length = 0;
+  $stack = array();
+  while ($position < $text_length && $length < $size) {
+    $last_tag = FALSE;
+    $offset = strpos($text, '<', $position);

Overall, it looks like using preg_split() or a real DOMDocument would be more appropriate for the purpose of this function. If this has been tried already or purposively wasn't done for some reason, then it should be documented in an inline comment.

+++ b/modules/field/modules/text/text.module
@@ -380,67 +385,133 @@ function text_summary($text, $format = NULL, $size = NULL) {
+      // There are no more tags, so find the UTF-8 string length.
+      $additional = drupal_strlen(substr($text, $position, $text_length - $position));

I don't understand why substr() is not UTF-8 safe but the strlen() is. This will have unexpected consequences.

Note that the old code used an entirely different approach, for which it may have been valid to ignore UTF-8 byte sequences. The old code might perfectly be buggy with regard to not using PHP wrappers - we don't know, since the existing tests do not cover UTF-8 at all.

+++ b/modules/field/modules/text/text.module
@@ -380,67 +385,133 @@ function text_summary($text, $format = NULL, $size = NULL) {
+        switch ($tag_name) { // Ignore empty tags that were not properly closed.
+        case 'br':
+        case 'hr':
+        case 'img':
+        case 'input':
+          break;
+        default:
+          $stack[] = $tag_name;
+          $last_tag = TRUE;
+          break;
+        }

1) Wrong indentation.

2) The code assumes that HTML tags are lowercase, which may not be the case.

+++ b/modules/field/modules/text/text.module
@@ -380,67 +385,133 @@ function text_summary($text, $format = NULL, $size = NULL) {
+                break;
+              case '.':

There should be a blank line between switch cases, unless a case "falls through" to another one (which should be explained in an inline comment).

+++ b/modules/field/modules/text/text.module
@@ -380,67 +385,133 @@ function text_summary($text, $format = NULL, $size = NULL) {
+              if ($text[$offset] == '"') {
+                ++$offset;
+                break 2;
+              }
+              break;
+              case "\n":

Wrong indentation.

+++ b/modules/field/modules/text/text.module
@@ -380,67 +385,133 @@ function text_summary($text, $format = NULL, $size = NULL) {
-  // If the htmlcorrector filter is present, apply it to the generated summary.
-  if (isset($filters['filter_htmlcorrector'])) {
-    $summary = _filter_htmlcorrector($summary);
+  $summary = substr($text, 0, $position);
+  if (!empty($stack)) {
+    // If closing tags are missing, we add an ellipsis and closing tags.
+    $summary .= t('...');
+    do {
+      $summary .= '</' . array_pop($stack) . '>';
+    } while (!empty($stack));
   }

It looks like you're unconditionally tying to do something along the lines of the HTML Corrector filter. In that case, it would make much more sense to call _filter_htmlcorrector() instead, as it's much more reliable (and also covered by tests).

Additionally, _filter_htmlcorrector() uses a DOMDocument for parsing, which in turn makes us circle back into the question whether text_summary() shouldn't use DOMDocument, too.

+++ b/modules/field/modules/text/text.test
@@ -287,87 +287,87 @@ class TextSummaryTestCase extends DrupalWebTestCase {
-    $expected = array(
...
+    $expected = array (

Coding style error.

+++ b/modules/field/modules/text/text.test
@@ -287,87 +287,87 @@ class TextSummaryTestCase extends DrupalWebTestCase {
     // And using a text format WITH the line-break and htmlcorrector filters.
...
-      "",
-      "<p></p>",
-      "<p></p>",
-      "<p></p>",
-      "<p></p>",
-      "<p></p>",
-      "<p>\nHi</p>",
...
+      1 => "<p>",
+      2 => "<p>",
+      3 => "<p>",
+      4 => "<p>\nHi",
+      5 => "<p>\nHi\n</p>",
+      6 => "<p>\nHi\n</p>\n<p>...</p>",

The changed expectations are wrong. With enabled HTML corrector, the expectations 1-4 are invalid, since tags are not properly closed.

Furthermore, our expectation for 6+ is that the text summary ends on a paragraph.

In turn, it looks like we've lost the essential idea and functionality of text_summary(). This needs to be retained.

Lastly, no tests have been added that actually verify and prove the desired behavior regarding HTML.

I'd also like to know why a simpler approach like in #220783: HTML markup is counted in for the length of trimmed posts isn't sufficient.

Powered by Dreditor.

BrightBold’s picture

Subscribing, since the other issue was closed as dupe.

AlexisWilke’s picture

sun,

The @note is from Doxygen... 8-) It can make a note appear differently.

One reason for using the strpos(), strlen(), etc. directly is speed. Even if you have caches, speed is always an issue, especially if you have hundreds of users accessing your website.

Now, there is another reason for using them: the Drupal API does not offer all the functions used here. So unless you'd accept to also add all the missing functions, the safest was to use the "simple" functions.

Finally, it doesn't make any difference in this case because what we skip with the strpos() and strlen() and substr() calls are tags which cannot be UTF-8 (since we ignore the attributes.) Where you absolutely need to use UTF-8 is to count the number of characters between tags.

This should answer this worry:

I don't understand why substr() is not UTF-8 safe but the strlen() is. This will have unexpected consequences.

If you ask, I'll give you detailed explanation about UTF-8. I can assure you that what I wrote is not only correct, it makes things work a lot better.

+        switch ($tag_name) { // Ignore empty tags that were not properly closed.

Good catch! We should have a strtolower() here. Again, you don't need UTF-8 since tag names are ASCII.

The stack restoration, that you say is equivalent to HTML Corrector, is A LOT FASTER than calling that filter after we're done (at least under D6). Plus, we already have all the information we need to have. Why should we call another function instead of unwinding our own stack?

The main problem of the existing Core version is that the current version does this:

  $teaser = substr($body, 0, $max_teaser_length);

And the result in $teaser is completely whatever, including a tag cut in half.

This being said, in D8, using DOM could be a good solution. After all, you can then go through the DOM and cut everything after the max. size for your teaser was reached. It's probably much slower on very large pages, but that's not the majority of the pages out there.

Thank you.
Alexis

P.S. I offered the D6 version, the D7 patch is not from me, but some of the code is a copy from my own corrections. 8-)

pillarsdotnet’s picture

Status: Needs work » Needs review
FileSize
18.93 KB

Okay, this is completely new code, relying on the PHP DOM extension.

pillarsdotnet’s picture

Bah. Didn't mean to include my custom changes to the standard profile...

pillarsdotnet’s picture

Couple of minor changes, trying to anticipate objections from @sun...

sun’s picture

Status: Needs review » Needs work

Yay, this looks much better now :)

+++ b/modules/field/modules/text/text.module
@@ -380,68 +369,79 @@ function text_summary($text, $format = NULL, $size = NULL) {
+  foreach ($body_node->childNodes as $node) {
+    if ($size <= 0) {
       break;
     }
+    _text_summarize($node, $size, $summary_node, $summary_doc);
   }

I think it would increase readability if the _text_summarize() helper function would return the new $size instead of changing it by reference.

+++ b/modules/field/modules/text/text.module
@@ -380,68 +369,79 @@ function text_summary($text, $format = NULL, $size = NULL) {
+  foreach ($summary_doc->getElementsByTagName('body')->item(0)->childNodes as $node) {
+    $output .= $summary_doc->saveXml($node);
   }

Any reason why we can't simply use $summary_node->saveXml() instead? If so, please document inline, not on issue.

+++ b/modules/field/modules/text/text.module
@@ -380,68 +369,79 @@ function text_summary($text, $format = NULL, $size = NULL) {
+  // If the original format was plaintext, the summary should be, too.
+  if ($text === strip_tags($text)) {
+    $output = strip_tags($output);
+  }

This doesn't make sense to me. I'd at least like to know the reasons for introducing it (as inline comment), but actually, it looks like a major bug if you pass no HTML into something and suddenly deal with HTML along the way.

Might be caused by the unconditional invocation of check_markup(), or alternatively, because the DOM documents are loading text nodes into the body without a wrapping block-level element (DIV/P), so DOM might automatically inject one.

+++ b/modules/field/modules/text/text.test
@@ -286,95 +287,52 @@ class TextSummaryTestCase extends DrupalWebTestCase {
-      $this->callTextSummary($text, $expected[$i],    NULL, $i);
...
+      $this->callTextSummary($text, $expected[$i], 'filtered_html', $i);

I think we want and need to retain the text_summary() functionality for no format. It is useful for text [field] contents that does not have a text format associated, but still needs to be potentially rendered in "teaser" use-cases. The complex phrase detection logic in text_summary() is actually the only way in Drupal core to extract proper summaries from text strings.

Also note that the old text_summary() contained some more international punctuation characters, which should be retained.

+++ b/modules/field/modules/text/text.test
@@ -383,7 +341,14 @@ class TextSummaryTestCase extends DrupalWebTestCase {
+    $this->assertIdentical(
+      $summary, $expected, t('Generated summary "@summary" matches expected "@expected".',
+        array(

Odd wrapping/indentation here; all except of the array keys should be on one line.

Powered by Dreditor.

pillarsdotnet’s picture

Status: Needs work » Needs review
FileSize
15.54 KB

I think it would increase readability if the _text_summarize() helper function would return the new $size instead of changing it by reference.

Done. I also added a doxygen header for the helper function.

+++ b/modules/field/modules/text/text.module
@@ -380,68 +369,79 @@ function text_summary($text, $format = NULL, $size = NULL) {
+  foreach ($summary_doc->getElementsByTagName('body')->item(0)->childNodes as $node) {
+    $output .= $summary_doc->saveXml($node);
   }

Any reason why we can't simply use $summary_node->saveXml() instead? If so, please document inline, not on issue.

Added inline comment noting that there is no DOMNode::saveXml() function.

+++ b/modules/field/modules/text/text.module
@@ -380,68 +369,79 @@ function text_summary($text, $format = NULL, $size = NULL) {
+  // If the original format was plaintext, the summary should be, too.
+  if ($text === strip_tags($text)) {
+    $output = strip_tags($output);
+  }

This doesn't make sense to me. I'd at least like to know the reasons for introducing it (as inline comment), but actually, it looks like a major bug if you pass no HTML into something and suddenly deal with HTML along the way.

Might be caused by the unconditional invocation of check_markup(), or alternatively, because the DOM documents are loading text nodes into the body without a wrapping block-level element (DIV/P), so DOM might automatically inject one.

As you suspected, DOM is automatically injecting <p> tags. Revised the inline comment to make this clearer.

+++ b/modules/field/modules/text/text.test
@@ -286,95 +287,52 @@ class TextSummaryTestCase extends DrupalWebTestCase {
-      $this->callTextSummary($text, $expected[$i],    NULL, $i);
...
+      $this->callTextSummary($text, $expected[$i], 'filtered_html', $i);

I think we want and need to retain the text_summary() functionality for no format. It is useful for text [field] contents that does not have a text format associated, but still needs to be potentially rendered in "teaser" use-cases. The complex phrase detection logic in text_summary() is actually the only way in Drupal core to extract proper summaries from text strings.

The revised function works for plaintext, as tested by testLongSentence().

However, the example text in testLength() isn't very useful for the default plaintext format.

Calling check_markup("<p>\nHi\n</p>\n<p>\nfolks\n<br />\n!\n</p>", NULL)
yields "&lt;p&gt;<br />\nHi<br />\n&lt;/p&gt;<br />\n&lt;p&gt;<br />\nfolks<br />\n&lt;br /&gt;<br />\n!<br />\n&lt;/p&gt;".

If you really think it's useful to summarize that mess, I'll put it back.

Also note that the old text_summary() contained some more international punctuation characters, which should be retained.

Added the missing character, but I don't know what it is. Do you?

+++ b/modules/field/modules/text/text.test
@@ -383,7 +341,14 @@ class TextSummaryTestCase extends DrupalWebTestCase {
+    $this->assertIdentical(
+      $summary, $expected, t('Generated summary "@summary" matches expected "@expected".',
+        array(

Odd wrapping/indentation here; all except of the array keys should be on one line.

Revised, but we may have to agree to disagree here. I refuse to write code that extends past column 80 unless it is absolutely required by Drupal coding standards.

pillarsdotnet’s picture

Grrr... Ignore the previous patch.

pillarsdotnet’s picture

... And that one, too. I *really* need to get some sleep...

pillarsdotnet’s picture

I see one minor problem:

+    // Using plain text format:
+    $expected = array (
+      0 => "<p>Hi</p>\n<p>folks</p>\n<p>!</p>",

That should read, // Using filtered_html format: but I'll wait for @sun's review.

If anyone from the International community would like to propose tests of non-English sentence parsing, I for one would welcome the assistance.

AlexisWilke’s picture

I think we want and need to retain the text_summary() functionality for no format.

Note that this is something I mentioned. For a fix in D6, I don't think that function should be changed, but it is the right place to generate the teaser. Only problem, it either needs to understand HTML or have another function that does and calls text_summary() to handle the actual text to be cut for teaser purposes.

pillarsdotnet’s picture

@AlexisWilke, @sun:

Of course the patch in #123 cannot be backported to D6 because D6 does not require the PHP DOM extension. For D6, an approach such as the one taken by #111 is required.

In D7 and D8, when you set the $format argument to NULL, that's the same as specifying the filter_fallback_format() filter, which is 'plain_text'.

The patch in #123 correctly handles such cases by calling check_markup() before looking for sentence and line breaks.

You seem to be requiring that text_summary() should treat a NULL $format argument as if 'full_html' had been specified, instead.

While that may be more consistent with the original code, I believe that it is a bug and should be treated as such.

A further advantage of this "unconditional invocation" of check_markup() is that text_summary() will correctly handle cases where the format includes such filters as Markdown, Image Assist, or Bbcode.

How can text_summary() be made HTML-aware without first invoking the required format filters to transform its input into HTML?

pillarsdotnet’s picture

Corrected problem noted in #124 and also explicitly added the filter module to dependencies.

AlexisWilke’s picture

@pillarsdotnet,

I did not see the code sun was talking about. However, something such as:

¡Hello!

would require "special" handling in that the first exclamation mark should not be viewed to end the previous sentence. It actually starts the next sentence. That's one special case I know of.

Another thing is the dash. In general, it isn't much of a problem in English. However, in French, cutting a word with a dash can be a problem. Although I guess the code was supposed to not cut words but on spaces. So that would not be a concern.

As for the text_summary(), I was referencing the problem in D6. The teaser function in there would call the text_summary() as is, whether the body had HTML or not. Anyway, the main reason for my current solution is to change a minimal number of things and possible all in the same file (which it is right now.)

Thank you.
Alexis Wilke

pillarsdotnet’s picture

However, something such as:

¡Hello!

would require "special" handling in that the first exclamation mark should not be viewed to end the previous sentence. It actually starts the next sentence. That's one special case I know of.

The '¡' character is not an exclamation point. And even if it were, my code wouldn't use it as a break point. The regex I'm using breaks after an end-of-sentence character only if it is preceded by a word character and followed by a space character. However, it would break after common abbreviations such as 'Dr.' and 'Mr.', so I'm open to suggestions for a smarter algorithm, if you have any. For instance, how would you write code that splits at the end of a sentence, but does not split in the middle of the following sentence?

'They were like Dr. Jekyll and Mr. Hyde.'

As for the text_summary(), I was referencing the problem in D6.

The patch in question is not being submitted against D6. It is being submitted against D8. Are you suggesting that a patch against D8 be rejected on the grounds that it is unsuitable for D6? If so, this and most other D8 patches should simply be "Closed (won't fix)".

AlexisWilke’s picture

No, not exactly... text_summary() is properly labeled and does what it is supposed to do. It just cannot be used the way it is (at least in D6.)

We should have an html_summary() function that does the same thing as text_summary() but has knowledge of HTML. The fact is that views and CCK call text_summary() to create teasers, just the same as the normal core node_teaser() implementation does and... it breaks those teasers as badly as the node teasers.

Any module should be able to call a very simple function to generate a teaser from HTML text they have (i.e. the Taxonomy could do that with descriptions.) But if you put any specialization in the node_teaser() directly, it won't be available to other modules and that's a shame.

As for periods after Mr. Ms. Sr. ... I don't know anything about those.

Now, note that European are not unlikely to put spaces all over the place: ¡ Hola !   is probably common in Spain. Just saying. I understand that if you don't find a full stop, ! or ?, then you ignore anyway so this isn't a problem.

Thank you.
Alexis Wilke

pillarsdotnet’s picture

@AlexisWilke:

The patch in #127 causes all tests to pass and addresses all the objections raised so far. I would like for it to eventually reach "RTBC" status, however optimistic that may seem.

If you have an additional objection to the proposed patch against D8, please:

  • Clearly state your objection.
  • Suggest an additional test that illustrates your objection by failing in connection with the proposed patch.
  • Suggest a revised patch that overcomes the objection and enables the additional test to pass.

If, however, you are still talking about D6, or about D6 contrib modules, please hold your comments until this issue once again reaches D6 backport status. By that time the comment count will no doubt have reached three hundred or more, and I wouldn't want your insights to be lost in the middle.

pillarsdotnet’s picture

Another slight tweak. Since the splitting regex only contains assertions, there is no need for the additional parameters to preg_split().

Note that the approach used cannot be extended to include Unicode end-of-sentence marks unless we require PHP 5.1.0 or later.

See http://php.net/manual/regexp.reference.unicode.php

EDIT: Never mind; there isn't a Sentence_Terminal character class yet.

See http://unicode.org/review/pr-23.html

pillarsdotnet’s picture

Tried to add support for Unicode Sentence_Terminal characters by explicit list. Wonder if it will work?

Jamie Holly’s picture

D7 requires PHP 5.2.5+. D8 will probably be the same, or could be PHP 5.3+ - that's currently in discussion. I would go ahead and use the Unicode, and then if it makes it to a D6 back port status, a work around could be worked on.

However, I kind of wonder if we aren't look at this all wrong. I can think of other instances where an end-of-sentence character might not signify an actual end of sentence, for example in code blocks:

$foo = (! $bar);

(Yeah it's a sucky coding standard, but you get the idea. Add to that the other variations in different coding languages, and there is probably many more examples out there.)

Maybe instead of end-of-sentence, a better approach would be splitting at end-of-block-elements or even br's.

(Numerous times I've seen where teaser breaks have been put in automatically with only one sentence left in the article (I've seen this in Wordpress as well). It really leaves the impression that the site is just looking to drive-up page views. )

Going this route would seem a lot simpler and with much better performance, plus make automatically placed breaks seem much more logical, especially to the reader/visitor. On top of that, it would be a much more "fool proof" approach, though probably not 100% yet.

Of course this is a very significant change from how the function operated before, so getting it into even D7 might be near impossible.

pillarsdotnet’s picture

Perhaps this will better address @sun's objections to nested indentation.

EDIT: Sorry about the crosspost.

@#134, I mostly agree, despite the fact that your proposed example wouldn't force a break anyway. The open parenthesis character is not part of the [:word:] class.

pillarsdotnet’s picture

Okay, here's a patch that avoids breaking in the middle of a XML_TEXT_NODE, as suggested in #134. I expect some of the tests to fail.

Hmm... As an intermediate approach, we could turn off sentence-splitting if any of the ancestor tags are 'code'.

According to @sun, performance is a non-issue. This function can take as long as it wants to, since the results are cached.

;-)

pillarsdotnet’s picture

Same as #135, but avoids splitting text nodes within code blocks. A test is needed.

pillarsdotnet’s picture

Interesting. Converting UTF entities to characters is not supported by our testbots. What, I wonder, is the proper way to create Unicode characters? For now, I'll just quote them verbatim, but the result is unreadable in my text editor.

This one passes tests locally; hopefully the testbot will agree.

pillarsdotnet’s picture

Found and fixed a spelling error in a comment.

pillarsdotnet’s picture

Forgot to add the $parents parameter to the recursive call. Still need a test that would otherwise split in the middle of a code block.

pillarsdotnet’s picture

Added tests for code blocks and for Unicode Sentence_Terminal characters. Note that the test failures in #140 have nothing to do with this patch. Also included in #823380-111: Read More link is always visible on teaser.

pillarsdotnet’s picture

Status: Needs review » Closed (duplicate)

Marking this as a dup since it's rolled into #823380-111: Read More link is always visible on teaser.

AlexisWilke’s picture

pillarsdotnet,

Is that a way to close this issue on D7/D6 users?!

Or will that other issue be back ported instead?

Thank you.
Alexis

pillarsdotnet’s picture

Don't worry; I want this backported as much as you do.

sun’s picture

Status: Closed (duplicate) » Needs review

#823380: Read More link is always visible on teaser. is not related, does not conflict, and is not a duplicate of this issue.

pillarsdotnet’s picture

@sun -- It's not a dup, but the patches are related; they do overlap; they do break each other; they were merged at the suggestion of intoxination in #823380-90: Read More link is always visible on teaser., and that issue is much shorter and hence more likely to actually receive a review.

zilverdistel’s picture

Subscribing. And I also don't think this is a duplicate of #823380.

--EDIT--
I'd like +1 for the need to backport this to D7.

pillarsdotnet’s picture

pillarsdotnet’s picture

pillarsdotnet’s picture

sun’s picture

Category: bug » task

I like the new implementation. Need to do another in-depth review soon.

However, this issue existed in the past 2-3 Drupal core versions already, and the new code implements an entirely new approach to hopefully resolve the issue finally in a rock solid way, but this also means we're improving the previously existing code in Drupal core. A major annoyance, but not really a bug.

pillarsdotnet’s picture

#141: text_summary-221257-141.patch queued for re-testing.

fmosca’s picture

subscribing, +1 for the need to backport this to D7

pillarsdotnet’s picture

jct’s picture

jct’s picture

Attempted to add issue summary

jct’s picture

Issue summary: View changes

Adding issue summary

jct’s picture

Issue summary: View changes

Changing my summary to one provided in a comment

mangelp’s picture

Hi, I've tried this patch in a drupal 7 site and it seems to work fine, but I've found that if the text to summarize starts with a sentence longer than the maximum length the summary is returned empty because _text_summarize stop processing sentences when it finds a sentence with a length longer than the maximum allowable length left.

Is that a bug or is the expected behaviour?

pillarsdotnet’s picture

... _text_summarize stop processing sentences when it finds a sentence with a length longer than the maximum allowable length left.

Is that a bug or is the expected behaviour?

Possibly sub-optimal behavior, depending on whether you are willing to have a summary that stops in the middle of a sentence.

NancyDru’s picture

From the Merriam-Webster dictionary for "teaser:"

an advertising or promotional device intended to arouse interest or curiosity especially in something to follow

Stopping in the middle of a sentence is certainly going to arouse more interest than blanks.

Akaoni’s picture

Subscribing.
+1 for D7 backport.

zouz747’s picture

Version: 8.x-dev » 7.8
Category: task » feature
Priority: Major » Normal

I all, I'm new to Drupal and I see the utf8_truncate function being called in text_summary() handles the add_ellipsis parameter which BTW could be added in the UI configuration screen for teasers (Y/N) for setting. Assuming it's been set somewhere, shouldn't the parameter be passed to text_summary() from text_field_formatter_view()?

dddave’s picture

Version: 7.8 » 8.x-dev
Category: feature » task
Priority: Normal » Major

Being new is fine, fiddling wildly with the issue settings not. ;)

rkodrupal’s picture

sorry i haven't been keeping up with this thread since #49 but i recently discovered how to 'break' the code ... have a teaser that exceeds the length of an em string, then re-edit the article (shortening it) so it breaks in the middle of the em string.

again i'm still on drupal 6.0 and using the code mentioned in #49, but I thought I'd share this in case it's helpful for further code development.

here's the offending code that broke on re-editing:
Our in-house&nbsp;<em>judiciary</em><a id="/sites/default/files/snapshot/1499.jpg" class="imgtip" href="/node/1499"><img title="Click here for more information." alt="" src="/sites/all/themes/fordrupal/zincout/icons/internal_link_icon.gif"></a> gets to

pillarsdotnet’s picture

@rkodrupal:

since #49 but i recently discovered how ... it breaks in the middle of the em string.

The current code in #141 makes this impossible. If you think it is necessary, I will add your example to the tests to prove it.

@NancyDru:

Stopping in the middle of a sentence is certainly going to arouse more interest than blanks.

The Unicode line-break algorithm is ... complicated.

I'm not sure it's worth implementing. Are you?

rkodrupal’s picture

not a big deal on my part ... the breakage is quite evident on the screen and i can manually edit the teaser. just sharing this in case it was worthwhile. we do strive for perfection heh?

is the #141 code drupal 6 friendly? i'm losing track of the versions.

k

pillarsdotnet’s picture

Title: text_summary() should be made HTML-aware. » text_summary() should output valid HTML and Unicode text.
FileSize
13.24 KB
21.16 KB
21.16 KB

@rkodrupal:

is the #141 code drupal 6 friendly? i'm losing track of the versions.

No. I can roll a D6 patch, but I'm sure that the API changes will not be accepted. Please let me know how it works for you; I'm getting weird results in my tests.

@NancyDru:

Stopping in the middle of a sentence

Okay; implemented as a fallback only if no valid sentence break is found:

+    // If no suitable sentence break is found, we split before any Unicode
+    // separator character.
+    $word_splitter = '/(?=\p{Z})/u';
pillarsdotnet’s picture

@rkodrupal:

I'm getting weird results in my tests.

This version makes the weirdness go away, for d6. The problem is that d6 code assumes that:

  1. Javascript is enabled.
  2. The teaser is an exact substring of the body.

Status: Needs review » Needs work

The last submitted patch, text_summary-221257-166.patch, failed testing.

pillarsdotnet’s picture

Fixed failing test. (When did that line disappear from filter_dom_serialize()?)

rkodrupal’s picture

sorry for delay. i just updated to drupal 6.22 thinking it would be a good test of text_summary-221257-169-d6.patch.

if i create a new node it splits the teaser fine. if i re-edit that same node it inserts an annoying manual break ... duplicating everything within the teaser above and below the break.

so i was unable to execute the test from #163 because i never made it that far.

before you also included an update for node.pages.inc. is that still needed with this latest patch for d6?

thanks for all you great work on this.

k

pillarsdotnet’s picture

Status: Needs work » Needs review

if i create a new node it splits the teaser fine. if i re-edit that same node it inserts an annoying manual break ... duplicating everything within the teaser above and below the break.

Sorry, but that's just part of the broken-by-design d6 javascript-powered teaser-break weirdness, and a good example of why that featurebug was removed in d7.

I don't know how to fix it so that it works as expected, both:

  • When the teaser summary is included with the node body.

and:

  • When the teaser summary is not included with the node body.

before you also included an update for node.pages.inc. is that still needed with this latest patch for d6?

I'm sorry, but looking back through the last few versions, I don't see one that updated node.pages.inc. Can you provide a link?

Furthermore, this issue is currently targeting version 8.x. It's nice to get the 7.x and 6.x backports out of the way, but problems with them should not distract from getting the patch reviewed and into 8.x, which must happen first.

Therefore, setting back to needs review.

rkodrupal’s picture

I vote for making it work when the teaser summary is not included with the node body. There should be no reason for D6 to add in a manual teaser break on its own, and that was the main benefit for the fix done in http://drupal.org/node/221257#comment-3735536 (#49 above)

... the patch at #49 also includes the node.pages.inc change.

Thanks

pillarsdotnet’s picture

@#172 by rkodrupal:

the patch at #49 also includes the node.pages.inc change.

Okay, I'll try to remember that the next time I roll a backport. For now, I'm just trying to get this accepted into 8.x, which lacks a node_body_field() in its node.pages.inc file.

See: http://drupal.org/update/modules/6/7#body_teaser_fields
And: #372743: Body and teaser as fields

pillarsdotnet’s picture

Issue summary: View changes

Corrected pointer to current patch.

AlexisWilke’s picture

Actually, in D8 you probably want to change the scheme.

A much better idea is to have a flag to tell you whether the teaser was generated or not and completely forget about the comparison between the beginning of the body and the teaser (which is a crazy idea.)

Once you have that flag, you know when you edit whether the teaser is part of the body or not.

That would not work for D7 and D6, of course.

Good luck.
Alexis

pillarsdotnet’s picture

@#174 by AlexisWilke:

A much better idea is to have a flag to tell you whether the teaser was generated or not and completely forget about the comparison between the beginning of the body and the teaser (which is a crazy idea.)

Did you mean to comment on #823380: Read More link is always visible on teaser.?

AlexisWilke’s picture

I'm the one who wrote the patch in #49. You don't need to have this function:

node_compare_teaser_and_body()

if you have a flag to know that the teaser was auto-generated.

Unfortunately, in D6, that was the easiest way to fix the problem, but in a new version that should not be done that way... The Read More link has nothing to do with that, although a flag to mark that the teaser == body is a good idea too. 8-)

Thank you.
Alexis Wilke

pillarsdotnet’s picture

@#176 by AlexisWilke on October 6, 2011 at 2:58am:

I'm the one who wrote the patch in #49.

Okay; I apologize for not recognizing you as an early contributor to this issue.

You don't need to have this function:

node_compare_teaser_and_body()

The patch from #169 does not contain such a function, nor can I find one anywhere in core.

if you have a flag to know that the teaser was auto-generated.

Such a flag, if it were to exist, would having nothing to do with this patch. The text_summary() function does not deal with node or entity objects. It only deals with plain text.

Unfortunately, in D6, that was the easiest way to fix the problem, but in a new version that should not be done that way...

And the new version is not done that way, which is why I am having difficulty understanding the relevance of your commentary. Have you actually read the code? If not, please do. The only thing holding up this issue is the lack of a thorough code review by someone other than myself.

The Read More link has nothing to do with that, although a flag to mark that the teaser == body is a good idea too. 8-)

Again, this is why I thought you were referring to the other issue, where such commentary would be more relevant.

rkodrupal’s picture

for other folks reading this thread i'm stickin' with the patch in #49 for drupal 6.22 ... i did need to split the patch between the .module and the .inc ... the patch for .module works well in drupal 6.22, but not the patch for .inc ... the .inc patch is only 1 line so just manually make the change.

of course this doesn't fix the minor issue i raised in #163.

unfortunately delays in updates to the contrib modules are keeping me from going to drupal 7, so any worries about drupal 8 are far off into my future.

k

David_Rothstein’s picture

Status: Needs review » Needs work

A couple comments after reading through this issue:

  1.  function text_summary($text, $format = NULL, $size = NULL) {
    ...
    +    return trim(check_markup(substr($text, 0, $delimiter), $format));
    

    The check_markup() function has already been called on the text before it was ever passed to text_summary(). Therefore, this code is going to result in check_markup() being called twice. (See also: #1347920: text_summary() checks for specific filters, but it should be agnostic to filters.)

  2. There were comments above stating that we don't have to worry about the performance of text_summary() because it's cached.... Unfortunately, though, it's actually not cached (although it probably should be). See #1347910: The result of text_summary() should be cached rather than calculated every time a teaser is viewed. So unless that's fixed, we do have to think a bit about performance here.
effulgentsia’s picture

Thanks for everyone's work on this. In addition to the feedback in #179,

+++ b/modules/field/modules/text/text.module
@@ -380,68 +369,147 @@ function text_summary($text, $format = NULL, $size = NULL) {
+  if ($body->hasChildNodes()) {
+    $node = $summary->appendChild($doc->createElement($body->tagName));
+    $parents[] = $body->tagName;
+    foreach ($body->childNodes as $child) {
+      if ($size > 0) {
+        $size = _text_summarize($child, $size, $node, $doc, $parents);
+      }
+      else {
+        break;
+      }
+    }
   }

This fails to include self-closing tags (like images) in the summary. It also fails to include tag attributes (like href).

For anyone interested, I added a Improved Text Trim sandbox project that alters the text field trimming formatters to use a trimming function modeled on #169 with #179 and the above incorporated.

@pillarsdotnet: #169 needs to be rerolled for the D8 "core" folder introduction. Do you want to do that and incorporate the fixes from my sandbox that you approve of, or would you like me to post a patch here with the fixes?

effulgentsia’s picture

Title: text_summary() should output valid HTML and Unicode text. » text_summary() should output valid HTML and Unicode text, and not count markup characters as part of the text length
andrey-z’s picture

Status: Needs work » Needs review
Issue tags: -Needs backport to D6, -teaser, -Needs backport to D7

#169: text_summary-221257-169.patch queued for re-testing.

Status: Needs review » Needs work
Issue tags: +Needs backport to D6, +teaser, +Needs backport to D7

The last submitted patch, text_summary-221257-169.patch, failed testing.

pillarsdotnet’s picture

@effulgentsia -- Sorry; I've been offline for a long time and not likely to get back to this soon. Please post whatever patches you feel appropriate.

NROTC_Webmaster’s picture

sub

pillarsdotnet’s picture

@#179 Posted by David_Rothstein on November 21, 2011 at 5:07am:

The check_markup() function has already been called on the text before it was ever passed to text_summary(). ... See #1347920: text_summary() checks for specific filters, but it should be agnostic to filters

That has not been the case in my tests. Has something changed in core? (checking...) Wow. I guess it has. Re-rolling accordingly.

See #1347910: The result of text_summary() should be cached rather than calculated every time a teaser is viewed. So unless that's fixed, we do have to think a bit about performance here.

I'm glad there is another issue to worry about performance and caching so that I don't have to worry about it in this issue. (grin!)

pillarsdotnet’s picture

@#180 Posted by effulgentsia on December 21, 2011 at 2:14am

Thanks.

Re-rolled using your sandbox as a guide.

pillarsdotnet’s picture

Status: Needs work » Needs review

Grr....

pillarsdotnet’s picture

Also needs test-cases for the shortcomings pointed out by Effulgentsia in #180:

This fails to include self-closing tags (like images) in the summary. It also fails to include tag attributes (like href).

Status: Needs review » Needs work

The last submitted patch, text_summary-221257-187.patch, failed testing.

pillarsdotnet’s picture

No clue why the fatal error occurred. Re-rolling with some trivial changes and re-uploading.

pillarsdotnet’s picture

Status: Needs work » Needs review

Go, testbot!

Status: Needs review » Needs work
Issue tags: -Needs backport to D6, -teaser, -Needs backport to D7

The last submitted patch, text_summary-221257-191.patch, failed testing.

pillarsdotnet’s picture

Status: Needs work » Needs review

#191: text_summary-221257-191.patch queued for re-testing.

Status: Needs review » Needs work
Issue tags: +Needs backport to D6, +teaser, +Needs backport to D7

The last submitted patch, text_summary-221257-191.patch, failed testing.

Mark_L6n’s picture

Determining sentence boundaries is a complex issue, even if only English is considered. Periods are frequently used for purposes other than sentence boundaries, particularly abbreviations. The difficulty can be seen by the number of academic papers on it...take a quick look at what Google Scholar returns for "sentence boundaries". People have used rule-based, statistical, neural-network, machine-learning approaches, etc. to determine whether a stop is a sentence boundary or not.
You could do a quick-and-dirty processor (that wouldn't always be accurate) by searching for a stop followed by white space followed by a capital letter, disallowing Mr., Mrs., Dr. etc.
However, that's only for English and still incomplete.
So of course using the nearest markup boundary is much better. Perhaps a basic sentence boundary detector could be implemented, but allowed only as an option, with a note explaining it won't always be accurate.

pillarsdotnet’s picture

@Mark123 -- Patches are always welcome.

dozymoe’s picture

Maybe you can think of text_summary as another text-formats filter, that is inserted right before the last filter, which is usually html corrector, or if user choose, htmLawed.

If it can be seen like that, then the call to _filter_htmlcorrector() is outside of text_summary() function's contract.

tim.plunkett’s picture

Category: task » feature

This is a feature request.

jenlampton’s picture

Patch at #76 still applies cleanly to 7.18. In case anyone else still needs a fix today :) And I'm not sure why this isn't a bug report.

David_Rothstein’s picture

Category: feature » bug
Priority: Major » Normal

Yeah, as written, this is definitely a bug report (regardless of whether the patch here also goes into feature request territory)...

However, I'm not convinced the main bug (faulty HTML being produced in teasers) exists anywhere except Drupal 6? That's because of #1235062: text_summary() ignores filter status - basically, in Drupal 7 and Drupal 8, the two bugs should cancel each other out and ensure the HTML corrector filter always runs.

I think we should split the difference and call it a non-major bug. Even in Drupal 6, I would think that enabling the HTML corrector filter gets around the main part of this bug.

lightsurge’s picture

I can't find any reference to text_summary() in 8.x, has it been removed and that was why this issue was put to feature request?

David_Rothstein’s picture

text_summary() is still in Drupal 8 (in core/modules/text/text.module).

On api.drupal.org it only appears for Drupal 7 right now; I'm not sure why but I believe that hasn't been rebuilt in a while and the module has moved around recently.

jenlampton’s picture

Priority: Normal » Major

Bumping back up to major because I'm still seeing this problem in D7.

I'm not sure which of the two things that are supposed to "cancel each other out" is failing, but without the patch (I'm using this one from #76) my site is still hosed.

Above mentioned patch still applies to 7.21, for others who are also currently hosed :)

dcam’s picture

http://drupal.org/node/1427826 contains instructions for updating the issue summary with the summary template.

The summary may need to be updated with information from comments.

ttkaminski’s picture

Not sure why no one brought this up, but it seems that if you have a post that has a div tag that spans the <!--break--> marker, then the teaser summary will contain invalid HTML. Example:

<div>
  <p>Hello</p>
  <!--break-->
  <p>there</o>
</div>

text_summary() will return:

<div>
  <p>Hello</p>

Which breaks the entire layout of the site because of the missing div close tag. My recommendation is to include scenario in the tests and also perform HTML correction after the text is trimmed via the <!--break-->

swentel’s picture

Component: field system » text.module

Moving to text module

Stalski’s picture

Just tried to read this huge issue. To go forward with this, can anyone give a status update summary?

Especially :
- What's the work that needs to be done?
- Which patch is the most probable to start with to move forward?
- As lots of issues change the testFirstSentence test case, I think it might be a good idea to have a separate test for this issue, no?

alexverb’s picture

I don't want to sidetrack this issue. But I've provided a contrib solution for D7 that might be of interest to this issue: https://drupal.org/node/2118995. It provides a way to trim text without taking HTML tags into account in about 20 lines of code:

/**
 * Returns trimmed HTML without counting HTML tags.
 */
function _field_html_trim_get_trimmed_html($safe_value, $trim_length, $ellipsis) {

  // Get text without any formatting.
  $text = strip_tags(str_replace('<', ' <', $safe_value));
  $text = preg_replace('/\n|\r|\t/m', ' ', $text);
  $text = str_replace('&nbsp;', ' ', $text);
  $text = str_replace("\xc2\xa0", ' ', $text);
  $text = trim(preg_replace('/\s\s+/', ' ', $text));

  if (strlen($text) > $trim_length) {
    // Shorten the text.
    $short_text = truncate_utf8($text, $trim_length, 1);
    // Break text up in words.
    $words = str_word_count($short_text, 1);
    // Build regular expression pattern.
    $pattern = implode("(.*?)", $words);
    $pattern = '/(.*?)' . $pattern . '/';
    // Find the word pattern in our HTML.
    if (preg_match($pattern, $safe_value, $result)) {
      // Close any unclosed HTML tags and append ellipsis.
      $safe_value = _filter_htmlcorrector($result[0] . $ellipsis);
    }
  }

  return $safe_value;
}

I have not read this entire issue. But I've looked at some patches and see that DOMDocument is used. Probably to get to sentence level for cutting off. My solution does not take that into account, but might be an interesting approach to the HTML problem.

alexverb’s picture

Issue summary: View changes

Updated info, links, and status.

jtbayly’s picture

After close to a year of struggling to figure out how our entire homepage layout gets destroyed every few months, I finally tracked it down to this bug.

Isn't this really (at least) two different bugs?

1. "Correct faulty and chopped off HTML" is completely ignored on Summary field, thereby breaking sites on a regular basis. (If you use a WYSIWYG tool of any sort and use a break to determine the summary, there are all kinds of ways to cause problems. Any tag that isn't closed prior to the break will destroy your site wherever that summary is displayed.) Don't we have a fix for this, and can't we just apply it? I know it will be painful for a bit, but really it's a catch 22. We can't wait forever.

2. The teaser isn't quite as long as it should be sometimes, because HTML code and is getting counted as words.

Would it help to split them up into 2 different issues?

-Joseph

tarekdj’s picture

Issue summary: View changes
Status: Needs work » Needs review
FileSize
3.48 KB

Patch based on #210.

Status: Needs review » Needs work

The last submitted patch, 212: text_summary-221257-212.patch, failed testing.

valderama’s picture

Until this gets fixed in core one can use smart_trim module with a patch from this issue applied: https://www.drupal.org/node/1916060

jenlampton’s picture

Since the most recent patch no longer applies cleanly to D7 also, still applying the patch in #76 to 7.32

@jtbayly I think you are exactly right with your explanation of the two problems, and we did have two (or more) issues, but since this one had a working patch, the others were closed and marked a dupe of this one.

jenlampton’s picture

Patch in #76 still applies clealy to 7.34. Tried the patch in #169 again also, but no luck there.

jenlampton’s picture

Issue summary: View changes

removing current issue status from issue

bburg’s picture

Issue summary: View changes

Updating broken link.

bburg’s picture

Confirming this is still issue in D8 head.

mdrayer’s picture

This is a triage at the Los Angeles Sprints. @bburg and I did the triage.

We reproduced the issue against Drupal 8 head with the following steps (similar to steps in summary):

  1. Set text field format to 'Summary or trimmed' with Trimmed limit set to 200 characters
  2. Create new content, insert the following text into the Source for the text field:
    <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce diam est, finibus a lobortis ac, mattis imperdiet sapien. Sed non libero ac augue aliquam volutpat. Praesent sollicitudin ipsum ac amet.</p>
  3. The trim will count the HTML markup in the character count; so instead of displaying "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce diam est, finibus a lobortis ac, mattis imperdiet sapien. Sed non libero ac augue aliquam volutpat. Praesent sollicitudin ipsum ac amet." (199 characters), the field outputs as "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce diam est, finibus a lobortis ac, mattis imperdiet sapien. Sed non libero ac augue aliquam volutpat."

Note that HTML entities and Unicode characters seemed to get decoded before the trim count went into effect, so those passed our testing. HTML markup seems to be the only problem here. The HTML corrector step in the summary is unnecessary and seems like a whole other issue.

bburg’s picture

Rerolled patch against current head. Resolved conflict generated from #2361745: Remove usage of truncate_utf8(). This will fail tests.

kgoel’s picture

Status: Needs work » Needs review
bburg’s picture

While several tests need to be rewritten to address the new behavior being added by this issue, at least one appears to be legitimately failing. namely testFirstSentenceQuestion(). This is because the new function added by this patch _html_trim_get_trimmed_html() strips out the ending sentence character.

Status: Needs review » Needs work

The last submitted patch, 221: text_summary-221257-221.patch, failed testing.

xjm’s picture

(Saving proposed issue credit for discussion and triage participants at LA.)

jtbayly’s picture

"The HTML corrector step in the summary is unnecessary and seems like a whole other issue."

I'm not sure exactly what you mean by that. The issue title starts out with "text_summary() should output valid HTML..." Are you saying that that should be a different issue? I would say it's the more important of the two bugs under discussion here. Teaser length being off by a little bit isn't nearly as big of a deal as outputting bad HTML onto the homepage.

Regardless, if you don't intend to address that part of it, let's change the title here and I'll create another issue. Let me know.

And thanks for working on this.

-Joseph

bburg’s picture

FileSize
2.02 KB

@jtbayly, To recreate the reported issue, one must turn off the HTML corrector filter. So the reported problem essentially boils down to: If we turn off the feature that was created to fix this bug in the first place, then the bug occurs. Which doesn't make a whole lot of sense as an issue to me. Now we can discuss whether this should just be default behavior for the summary text formatter in the first place? That's what we were thinking by "The HTML corrector step in the summary is unnecessary and seems like a whole other issue."

I did start work on a TextSummary class to create summaries while preserving markup integrity. Unfortunately, I haven't had much time since the week after DrupalCon to work on this. It's not remotely ready, hence why I haven't posted a patch, but I am only sharing this now to start a discussion on the direction I was thinking about heading in. Rather than relying on regular expressions to manipulate HTML, I was thinking of constructing the snippet as a DOM and traversing through the node tree recursively. Example attached (beware, I haven't looked at this in almost a month).

Wim Leers’s picture

Devil's advocate:

Shouldn't we just remove auto-truncating? It's a bad practice anyway. Have a separate teaser field instead.

bburg’s picture

@wim-leers, Good question. Because content editors are lazy? I've used auto-truncating as a backup on most of my projects. But I definitely get that as a general practice, it's better to curate your own summaries.

Great session at DrupalCon btw.

jtbayly’s picture

I'm pretty sure you've missed one part of this bug, which is the one that I happen to care about. Either that, or I'm majorly confused.

I don't use auto-truncation. I use a WYSIWYG tool, which allows me to set a breakpoint and auto-fills the summary field with the content found prior to the breakpoint. The problem is that we end up with broken HTML in the summary field, even when the HTML corrector is on. In other words you *don't* need to turn off the HTML corrector to reproduce this bug. You need to turn it on and put bad HTML in the summary field.

Here is *my* summary of the two parts to this issue, as I wrote in #211 above:

1. "Correct faulty and chopped off HTML" is completely ignored on Summary field, thereby breaking sites on a regular basis.
2. The teaser isn't quite as long as it should be sometimes, because HTML code and is getting counted as words.

And like I said before, #1 is a much bigger issue, because it completely DESTROYS every page where the summary field is displayed with broken HTML. On the other hand, it seems to me you are only working on #2?

That's why I want to see this broken into two issues. I see how they are related, but I really think that it's confusing to try to deal with two problems at once, especially when one of them is major and the other (in my opinion) should be marked as minor.

Can't somebody just write a patch that makes the "Correct faulty and chopped off HTML" actually work on the summary field? It seems so simple... (as a non-coder).

Warmly, and on my knees begging...
-Joseph

geerlingguy’s picture

Shouldn't we just remove auto-truncating? It's a bad practice anyway. Have a separate teaser field instead. (— #228)

For my own sites (where I'm the only one managing content), I typically use a summary field and use that. For almost any project where there's no OCD-level content administrator, summary fields become next to worthless, as admins rarely type in a good summary, and instead copy and paste a line of text (which makes no sense out of context), type in a couple words to fill the field, or do something similarly unhelpful.

I think the main thing we're aiming for is something like a google search summary result: https://www.google.com/?q=test (or even the kind of summary text you get through default Solr search results with Apachesolr or Search API Solr).

In the results, the summaries are all of similar length, but they are all relevant text (through the use of a slightly more intelligent filter). For my needs, something like Smart Trim usually works, and it would be nice to have that option in core for any fields where Wysiwyg and/or images/files are allowed, especially on larger multi-author or community sites.

Also, IIRC, Drupal 6 used to promote the 'summary/body' split a bit more prominently, and it was still with us in Drupal 7 ('show summary' for body fields), but was removed from Drupal 8 because the UX is always awkward, people rarely used summaries correctly, and content updates would often result in incorrect/outdated summaries anyways :(

lightsurge’s picture

I agree with @geerlingguy to get admins to use summary fields takes training and re-training, which isn't always practical. To keep things simple if a teaser can be extracted from the main body of text, in many cases it's more successful. Fine to have a separate summary field, but admins wouldn't necessarily want it accessible to all users, and would want it auto-magically filled in where it's empty.

Something like Google search result summaries would be great, the thing is that and solr sort of rely on having a search keyword/phrase etc. in order to pick out what they show? Seems like any clever summary is going to be the realm of contrib and even plugins to third party software. For example, it would be good to have some kind of wrapper module for Open Text Summarizer http://libots.sourceforge.net/ which despite the ancient last release date seems to still have good support in modern distros http://packages.ubuntu.com/wily/libots0 and there's existing php wrappers to look at https://github.com/znck/summarizer

Edit: I had a go with OTS at creating google-like summaries https://www.drupal.org/node/2652726 I quite like the result although obviously it doesn't really help this core issue.

Jamie Holly’s picture

Given the importance of a dedicated summary field, especially in terms of SEO, social media and design, I can't help but wonder if this 8 year old issue is that important anymore? Layout has changed a lot since this issue first emerged. Back then using teasers (summaries) was pretty much common place, but design trends over the past several years have gone more card or summary based.

Auto text summarization is something I've played with a lot, trying out numerous systems in various languages, as well as service based. I believe the problem with that is that it becomes a crutch. A computer isn't going to write an effective summary anymore than it will write an effective headline, and summaries are almost as important as headlines anymore.

Back to the issue at hand, I think we really need to re-evaluate. First off, a backport to D6 is basically not going to happen at this point, given we're less than a month away from D6's EOL. Looking forward, is having a Body+Summary field really the best path forward? Perhaps in future versions of Drupal, the better method would be a dedicated summary field and then body, leaving the Body+Summary in for legacy purposes for one version, then moving it to contrib after that.

Thinking that, the best route forward for this issue will be to fix the actual issue at hand, ie: getting valid HTML. I don't think spending time on reworking the actual functionality of this field is really a priority, given I can actually see a future Drupal without it.

geerlingguy’s picture

Issue tags: -Needs backport to D6

Yeah, definitely not heading back to D6 at this point.

alexverb’s picture

In my mind the easiest solution is to grab a hold of CakePHP's truncate function like I did for Field HTML Trim.

This function has several options.

  • To include or exclude HTML from trim count.
  • Adding an ellipsis.
  • Unicode support.
  • Cut of at character level or word level.
  • Properly closes unclosed HTML tags.

I've rigorously tested this function and compared it's performance against a DOMDocument approach. I think it's the best way to go. I'd be happy to write a patch and adjust the function to Core's needs if the maintainer is interested in this approach.

jenlampton’s picture

Patch in #76 still applies clealy to 7.43.

dqd’s picture

(EDIT) Firsts things first: there is an issue with D8.1 beta body field trimmed view (e.g. teaser, frontpage) which renders the formatted and trimmed text with a starting but not with a closing html <p> tag. I assume that this is part of this issue (?), if not, we need further testing for a new issue to set up.

Now back to topic:

Some thoughts on the summary yes/no paradigm discussion above: both main points are valid and do not neutralize each other. Automated teaser text IS and ever WILL BE important in bigger projects or web applications to prevent some data inconsistency in many editorial desk concepts. (" Dear editor, if you change the article, please don't forget to update the summary"). And of course it would mean a lot of loss to remove the summary option for those who are picky about would should be extracted for a preview. I think, to have both options at hand isn't obsolete for a data contructs building CMF like Drupal is.

Another thought on this: Despite of that, Views is in Core now for D8 and is the default frontpage provider (but with teaser view by default). We may should think about merging forces here since Views also supports field length trimming? And does this issue affects Views trimming text also? Maybe we should add a Views tag to get attention of Views maintainers here? Just thinking loud ...

@alexverb: interesting solution. Thanks for this contribution. Any performance concerns to consider for further testing under D8 ?

Version: 8.0.x-dev » 8.1.x-dev

Drupal 8.0.6 was released on April 6 and is the final bugfix release for the Drupal 8.0.x series. Drupal 8.0.x will not receive any further development aside from security fixes. Drupal 8.1.0-rc1 is now available and sites should prepare to update to 8.1.0.

Bug reports should be targeted against the 8.1.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

jenlampton’s picture

Patch in #76 still applies clealy to 7.50.

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.9 was released on September 7 and is the final bugfix release for the Drupal 8.1.x series. Drupal 8.1.x will not receive any further development aside from security fixes. Drupal 8.2.0-rc1 is now available and sites should prepare to upgrade to 8.2.0.

Bug reports should be targeted against the 8.2.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.6 was released on February 1, 2017 and is the final full bugfix release for the Drupal 8.2.x series. Drupal 8.2.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.3.0 on April 5, 2017. (Drupal 8.3.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.3.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.6 was released on August 2, 2017 and is the final full bugfix release for the Drupal 8.3.x series. Drupal 8.3.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.4.0 on October 4, 2017. (Drupal 8.4.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.4.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

jenlampton’s picture

Patch in #76 still applies clealy to 7.56.

dqd’s picture

Oh Dear ... another long holding one ... Lost it completely out of my focus. Maybe I should raise voice for it on https://www.drupal.org/node/2905741 ? But I think its too late already.

@jenlampton: heh thanks for frequently reporting that the patch still applies cleanly :)

Just to come back again to @Wim Leers on #228

Shouldn't we just remove auto-truncating? It's a bad practice anyway. Have a separate teaser field instead.

if this became a show stopper for anyone to make any decision or to push this issue forward, let me remember that this functionality not only applies for "teasing" any content but also for any listings with shortened text in different lengths and admin previews etc. I think this function needs to work properly, also for admin content use cases. And just to make sure that I do not mix things up here: text_summary() is doing the trimming in most of the core trimmed text previews, correct?

#62: The Views are using a copy of the Core function to extract the Teaser. They do not have the problem of comparing teaser and body as the Core does, but they do the same really bad job at not closing the tags... (this is the case for any truncate function in the Views.)

Is this still the case? I should take a short chat with @dawehner on this ...

(EDIT: Cutted last part to paste it for a new comment since this is a completely different discussion)

dqd’s picture

Somebody mentioned an example of when using an embed video is breaking the expected behavior. I wanted to quote it but I can't find it anymore in the rush. I found it:

#66: I actually just noticed this yesterday on a D7 site. It cut off a YouTube embed code in the middle. That's pretty bad, as it affects a lot of block-level elements (including forms).

I wanted to demonstrate that we maybe need a new clear road map about what should be counted and how? (Another issue) I think this was one of the main reasons for the function to work for some not like expected (not talking about the bug but about the all over behavior often stated in this issue here). The POV and expectations of each user on this is very different and shouldn't be underrated. It's not as clear as it looks like at the first glance: "Is a video part of a summary and if, how much should it count (items)? Do we need different counter types like a word counter, a character counter and HTML element counter? Can we mix them?" But this should be part of another issue.

But as #83 already stated (as I found out later, errm):

Thinking about this some more, there could be all kinds of additional conditions on summary generation.

Maximum display length (not including markup)
Maximum storage length (including markup)
Maximum line width / word-wrap
Maximum number of lines
Allowed tags/styles/fonts/etc.

Probably it would be best to allow some sort of hook allowing modules or themes to add conditions of their own.

But as I sad this is part of another issue if it doesn't still exist and it also should take contrib area into account. e.g. extending Smart_Trim?

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.4 was released on January 3, 2018 and is the final full bugfix release for the Drupal 8.4.x series. Drupal 8.4.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.5.0 on March 7, 2018. (Drupal 8.5.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.5.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

jenlampton’s picture

@diqidoq you're welcome! I sometimes wonder if it's too much 'noise' but I do appreciate being able to easily find the patch I have to apply to each core update, and I hope that helps others too. So...

Patch in #76 still applies (with offset) to 7.58.

jenlampton’s picture

Patch in #76 still applies (with offset) to 7.59.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.6 was released on August 1, 2018 and is the final bugfix release for the Drupal 8.5.x series. Drupal 8.5.x will not receive any further development aside from security fixes. Sites should prepare to update to 8.6.0 on September 5, 2018. (Drupal 8.6.0-rc1 is available for testing.)

Bug reports should be targeted against the 8.6.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

jenlampton’s picture

Patch in #76 still applies (with offset) to 7.63.

calebtr’s picture

I want to add that, while patch #76 does indeed apply, it is also buggy: when text is clipped in the middle of a closing tag so that </ are the last two characters, the tag is left open. Sure, it is an edge case, but it means that the original problem of the auto-summary producing invalid HTML still exists.

I like the DOMDocument approach in #141 better overall, but unfortunately that one is not applying for me. Echoing comment #231, for Drupal 7 this is handled nicely by contrib - and is perhaps the best way forward for Drupal 8+ as well.

Wim Leers’s picture

leymannx’s picture

Version: 8.6.x-dev » 8.8.x-dev

Drupal 8.6.x will not receive any further development aside from security fixes. Bug reports should be targeted against the 8.8.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.9.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.7 was released on June 3, 2020 and is the final full bugfix release for the Drupal 8.8.x series. Drupal 8.8.x will not receive any further development aside from security fixes. Sites should prepare to update to Drupal 8.9.0 or Drupal 9.0.0 for ongoing support.

Bug reports should be targeted against the 8.9.x-dev branch from now on, and new development or disruptive changes should be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.9.x-dev » 9.2.x-dev

Drupal 8 is end-of-life as of November 17, 2021. There will not be further changes made to Drupal 8. Bugfixes are now made to the 9.3.x and higher branches only. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.2.x-dev » 9.3.x-dev
larowlan’s picture

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.15 was released on June 1st, 2022 and is the final full bugfix release for the Drupal 9.3.x series. Drupal 9.3.x will not receive any further development aside from security fixes. Drupal 9 bug reports should be targeted for the 9.4.x-dev branch from now on, and new development or disruptive changes should be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.4.x-dev » 9.5.x-dev

Drupal 9.4.9 was released on December 7, 2022 and is the final full bugfix release for the Drupal 9.4.x series. Drupal 9.4.x will not receive any further development aside from security fixes. Drupal 9 bug reports should be targeted for the 9.5.x-dev branch from now on, and new development or disruptive changes should be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.5.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

leeksoup’s picture

@larowlan Re #259

In D 10.2.3, text_summary (trimmed) output is still much shorter than the selected number of characters in "Trimmed limit." It appears that it still counts markup within the character count.

E.g. I set a Trimmed limit to 1500 characters but get this for trimmed output:
trimmed output 900 chars
This is only 900 characters.

If the node contains something with more markup, such as bullet points, the trimmed value is even shorter. e.g. with the same filler text put into a numbered list, like this:

Meeting recording:
[video embed]

Your questions -- answered!

This is some filler text that I am using to test the trimmed text function.

  1. This is some filler text that I am using to test the trimmed text function. This is some filler text that I am using to test the trimmed text function.
  2. This is some filler text that I am using to test the trimmed text function. This is some filler text that I am using to test the trimmed text function.
  3. This is some filler text that I am using to test the trimmed text function. This is some filler text that I am using to test the trimmed text function. This is some filler text that I am using to test the trimmed text function.
  4. This is some filler text that I am using to test the trimmed text function.
  5. This is some filler text that I am using to test the trimmed text function.
  6. This is some filler text that I am using to test the trimmed text function.

And continues on with more paragraphs of text here.

Then the trimmed output looks like this:
only 139 characters this time
With the same text as in the first screenshot, this time the trimmed value only has 139 characters.

larowlan’s picture

Thanks @leeksoup - can you update the issue summary with remaining tasks etc?

leeksoup’s picture

Issue summary: View changes

@larowlan - Does that work?

leeksoup’s picture

@larowlan - Do the remaining items need to be split off into a new / separate issue?

larowlan’s picture

I think this issue is fine, thank you for updating the issue summary!