text_summary() should output valid HTML and Unicode text [#221257]

Comment	File	Size	Author
#263	Screenshot from 2024-03-02 15-13-50.png	16.27 KB	leeksoup
#263	Screenshot from 2024-03-02 15-05-54.png	80.76 KB	leeksoup
#227	TextSummary.txt	2.02 KB	bburg
#221	text_summary-221257-221.patch	3.49 KB	bburg
#212	text_summary-221257-212.patch	3.48 KB	tarekdj
#191	text_summary-221257-191.patch	21.46 KB	pillarsdotnet
#187	text_summary-221257-187.patch	22.68 KB	pillarsdotnet
#169	text_summary-221257-169.patch	21.16 KB	pillarsdotnet
#169	text_summary-221257-169-d7.patch	21.16 KB	pillarsdotnet
#169	text_summary-221257-169-d6.patch	13.34 KB	pillarsdotnet
#167	text_summary-221257-167-d6.patch	13.28 KB	pillarsdotnet
#166	text_summary-221257-166.patch	21.16 KB	pillarsdotnet
#166	text_summary-221257-166-d7.patch	21.16 KB	pillarsdotnet
#166	text_summary-221257-166-d6.patch	13.24 KB	pillarsdotnet
#141	text_summary-221257-141.patch	19.75 KB	pillarsdotnet
#137	text_summary-221257-137.patch	17.77 KB	pillarsdotnet
#140	text_summary-221257-140.patch	17.79 KB	pillarsdotnet
#133	text_summary-221257-133.patch	17.45 KB	pillarsdotnet
#132	text_summary-221257-132.patch	15.95 KB	pillarsdotnet
#139	text_summary-221257-139.patch	17.78 KB	pillarsdotnet
#138	text_summary-221257-138.patch	17.78 KB	pillarsdotnet
#136	text_summary-221257-136.patch	15.5 KB	pillarsdotnet
#135	text_summary-221257-135.patch	17.48 KB	pillarsdotnet
#127	text_summary-221257-127.patch	16.02 KB	pillarsdotnet
#123	text_summary-221257-123.patch	15.56 KB	pillarsdotnet
#122	text_summary-221257-122.patch	15.54 KB	pillarsdotnet
#121	text_summary-221257-121.patch	15.54 KB	pillarsdotnet
#119	text_summary-221257-119.patch	15.27 KB	pillarsdotnet
#118	text_summary-221257-118.patch	15.13 KB	pillarsdotnet
#117	text_summary-221257-117.patch	18.93 KB	pillarsdotnet
#111	text_summary-221257-111.patch	15.15 KB	pillarsdotnet
#76	text_summary-221257-76.patch	15.03 KB	pillarsdotnet
#75	text_summary-221257-75.patch	14.8 KB	pillarsdotnet
#71	text_summary-221257-71-donotignore.patch	8.44 KB	pillarsdotnet
#70	text_summary-221257-70.patch	8.44 KB	pillarsdotnet
#63	text_summary-221257-63.patch	8.37 KB	pillarsdotnet
#49	node-6.x-html_teaser3.patch	12.7 KB	AlexisWilke
#48	node-6.x-html_teaser3.patch	12.09 KB	AlexisWilke
#45	221257-node-6.x-html_teaser3.patch	11.95 KB	greg.harvey
#40	node-6.x-html_teaser3.patch	12.03 KB	AlexisWilke
#5	node-6.x-html_teaser.patch	6.71 KB	AlexisWilke
#17	node-6.x-html_teaser2.patch	11.12 KB	AlexisWilke

Comment #1

gpk commented 14 February 2008 at 01:49

OK I'm assuming this now gets fixed first in 7.x-dev?

Log in or register to post comments

Comment #3

AlexisWilke commented 19 September 2010 at 08:49

Issue tags:

-teaser

node_teaser() is still very badly "broken" (written?)

It truncates tags and comments anywhere and it counts characters in UTF-8 and in bytes. It's all wrong.

Not only that, it won't close tags properly, so if you do not use the HTML corrector on ALL filters, teasers will break your themes.

Then on Edit, it computes the $include variable like this:

  // Check if we need to restore the teaser at the beginning of the body.
  $include = !isset($node->teaser) || ($node->teaser == substr($node->body, 0, strlen($node->teaser)));

Here we can see that the teaser is expected to be an exact copy of the body. If not, this test breaks. Guess what, when you properly close tags at the end of the teaser, that $include computation breaks big time.

By the way, the variable should be named "$included" with a D at the end. Personal point of view though. 8-)

There is my version of node_teaser() that works great, every time!

/**
 * Generate a teaser for a node body.
 *
 * If the end of the teaser is not indicated using the <!--break--> delimiter
 * then we generate the teaser automatically, trying to end it at a sensible
 * place such as the end of a paragraph, a line break, or the end of a
 * sentence (in that order of preference).
 *
 * @param $body
 *   The content for which a teaser will be generated.
 * @param $format
 *   The format of the content. If the content contains PHP code, we do not
 *   split it up to prevent parse errors. If the line break filter is present
 *   then we treat newlines embedded in $body as line breaks.
 * @param $size
 *   The desired character length of the teaser. If omitted, the default
 *   value will be used. Ignored if the special delimiter is present
 *   in $body.
 * @return
 *   The generated teaser.
 */
function node_teaser($body, $format = NULL, $size = NULL) {

  if (!isset($size)) {
    $size = variable_get('teaser_length', 600);
  }

  // Find where the delimiter is in the body
  $delimiter = strpos($body, '<!--break-->');

  // If the size is zero, and there is no delimiter, the entire body is the teaser.
  if ($size == 0 && $delimiter === FALSE) {
    return $body;
  }

  // If a valid delimiter has been specified, use it to chop off the teaser.
  if ($delimiter !== FALSE) {
    return substr($body, 0, $delimiter);
  }

  // We check for the presence of the PHP evaluator filter in the current
  // format. If the body contains PHP code, we do not split it up to prevent
  // parse errors.
  if (isset($format)) {
    $filters = filter_list_format($format);
    if (isset($filters['php/0']) && strpos($body, '<?') !== FALSE) {
      return $body;
    }
  }

  // this is a teaser computation that works every time!

  // If we have a short body, the entire body is the teaser.
  $len = drupal_strlen($body);
  if ($len <= $size) {
    return $body;
  }

  $filter_newline = isset($filters['filter/1']);
  $len = strlen($body);

  $p = 0;
  $l = 0;
  $s = array(); // stack
  while ($p < $len && $l < $size) {
    $last_tag = FALSE;
    $o = strpos($body, '<', $p);
    if ($o === FALSE) {
      // no more tags till the end
      $a = drupal_strlen(substr($body, $p, $len - $p)); // UTF-8 length
      $n = $len;
    }
    else {
      // count characters between previous position and
      // beginning of tag
      $a = drupal_strlen(substr($body, $p, $o - $p)); // UTF-8 length

      ++$o; // skip the '<'
      $n = strpos($body, '>', $o);

      if ($body[$o] == '/') {
        // closing tag, pop the opening tag too
        array_pop($s);
      }
      elseif ($body[$n - 1] != '/') { // skip empty tags
        // opening tag, save its name on the stack so we can close it later
        $end_name = strpos($body, ' ', $o);
        if ($end_name === FALSE || $end_name > $n) {
          $end_name = $n;
        }
        $tag_name = substr($body, $o, $end_name - $o);
        switch ($tag_name) { // ignore empty tags that were not properly closed
        case 'br':
        case 'hr':
        case 'img':
        case 'input':
          break;

        default:
          $s[] = $tag_name;
          $last_tag = TRUE;
          break;

        }
      }

      // skip the tag now (we assume properly opening/closing tag boundaries!)
      if ($n === FALSE) {
        // last tag not closed or it wasn't a tag?!
        $n = $len;
      }
      else {
        ++$n;  // skip the '>' character
      }
    }

    // any characters to add the to result?
    if ($a) {
      if ($l + $a >= $size) {
        // the last tag did not make it in
        if ($last_tag) {
          array_pop($s);
        }
        // we've got more than we want to, search for a break point
        $o = $p + $size - $l;
        if ($body[$o] != ' ') while ($o > $p) {
          switch ($body[$o - 1]) {
          case "\xD8": // "\xD8\x9F" == arabic '?' (right to left)
            if (!isset($body[$o]) || $body[$o] != "\x9F") {
              // no the right sequence
              break;
            }
            if ($o + 1 == $len || $body[$o + 1] == ' ') {
              // found a break-point
              break 2;
            }
            if ($body[$o + 1] == '"') {
              $o += 2;
              break 2;
            }
            break;

          case '.':
          case '!':
          case '?':
            if ($o == $len || $body[$o] == ' ') {
              // found a break-point
              break 2;
            }
            if ($body[$o] == '"') {
              ++$o;
              break 2;
            }
            break;

          case "\n":
            if (!$filter_newline) {
              break;
            }
          case ' ':
            // found and remove the space (not that we ignore no-break spaces since we're not supposed to break there)
            --$o;
            break 2;

          //case ... add support for other UTF-8 spaces?

          case "\xE3":
            // found the CJK ideographic full stop?
            if (isset($body[$o + 1]) && $body[$o] == "\x80" && $body[$o + 1] == "\x82") {
              // keep this character in full
              $o += 2;
              break 2;
            }
            break;

          }
          --$o;
        }
        $p = $o;
        break;
      }
      $l += $a;
    }

    $p = $n;
  }

  $result = substr($body, 0, $p);
  while (!empty($s)) {
    $result .= '</' . array_pop($s) . '>';
  }

  return $result;
}

Here is my sample code that breaks the default code (as of 6.19):

<p>The Internet is littered with thousands of directories and yet there still people who create new ones... like me here on <a href="/" target="_blank" title="Snap! Websites"><strong>Snap! Websites</strong></a>. Well... I like the idea of a central repository of directories that I could use and thought that it would be nice to share them with you and even better, I offer your to <a href="http://snapwebsites.info/node/add/directory?destination=thank-you-submit..." target="_blank" title="Click now and add your directory to my list on Snap! Websites.">submit your own directories</a> to my list so you too can use the list to check out your favorite directories.</p>

The core version will truncate even before looping through the content and that truncates inside one of the anchors tag. That means you do not have a correct tag at all (it cuts in the middle of an attribute value.)

WARNING: My version properly counts the number of characters VISIBLE (to the very end users, well... it counts spaces that we don't really see per se. 8-) ). When you say 600, it won't be 600 bytes as your version. We can do it either way, it just makes a lot more sense to me to count visible characters. We could also have two limits if necessary. But the current version is not working for people who dare enter any kind of HTML in their posts.

I'll be back later with a patch that fixes the $include variable too.

Thank you.
Alexis Wilke

Log in or register to post comments

Comment #4

AlexisWilke commented 19 September 2010 at 04:04

Issue tags:

+teaser

tagging

Log in or register to post comments

Comment #5

AlexisWilke commented 19 November 2010 at 18:03

Priority:	Major	» Normal
Status:	Needs work	» Needs review

Status	File	Size
new	node-6.x-html_teaser.patch	6.71 KB

Okay, the fix for the teaser test in the node.pages.inc was easy enough. 8-)

  $include = TRUE;
  if (isset($node->teaser)) {
    // remove all the tags at the end because if they are closing tags
    // then they were likely added by node_teaser().
    $teaser = preg_replace('/(<[^>]+>)+$/', '', $node->teaser);
    $include = $teaser == substr($node->body, 0, strlen($teaser));
  }

I certainly hope you'll consider fixing your teaser code in D6... If not in D6, at least in D7. It is clear that your first call:

  // The teaser may not be longer than maximum length specified. Initial slice.
  $teaser = truncate_utf8($body, $size);

is already wrong because that can cut a comment or tag in half and thus you cannot know whether the last < character is part of a tag or a little mistake from the user (since the corresponding > is no present! so if you find "</p" you won't know that it is the end of the P tag... and you may actually keep that in there!)

Thank you.
Alexis Wilke

Log in or register to post comments

Comment #6

AlexisWilke commented 19 September 2010 at 10:21

Assigned:	Unassigned	» AlexisWilke
Status:	Active	» Needs review
Issue tags:		+teaser

Ah! I see another fix to the truncate_utf8() function...

I'm not wondering if the truncate should not be in that truncate_utf8() function and not in node_teaser()!

#768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug)

That would mean all the code I write to fix the node_teaser() function should be moved inside the truncate_utf8() which should probably take a flag $html to know whether the string may contain HTML or not.

Thank you.
Alexis Wilke

Log in or register to post comments

Comment #7

nancydru

she/her/hers

English

Boston

commented 19 September 2010 at 21:34

Marked #248842: Teaser breaks are picking up parts of html code as duplicate.

Alexis, does this remove the tendency to let <p> dominate the breakpoint selection as the current node_teaser has?

+1 for back-porting to D6. Definitely for D7. So shouldn't this be tagged? And maybe made "critical?"

Log in or register to post comments

Comment #8

dddave commented 20 September 2010 at 08:19

Priority:

Normal

» Major

Ah, the joys of teaser creation. ;) A constant source for support requests and headscratchers. I think this warrants an upgrade to major.

The attached patch is for D6, right? Has this any chance of inclusion without being solved for D7 first? If not and a D7 patch is built first I totally vote for back.porting to D6.

Log in or register to post comments

Comment #9

AlexisWilke commented 20 September 2010 at 09:49

@dddave,

D6 is definitively broken in that regard... so I suspect D7 is too.

The patch I provided is for D6, indeed (hence ...-6.x-... in the name).

Now, as I mentioned in #6 we need to know/understand whether the truncation should occur in the truncate_utf8() function or in the node_teaser. It seems to me that the truncate function would make a lot more sense. Also, to be noted, the "..." does not appear in my version (nor in any teaser that was shrunk further by the "old" node_teaser().)

In D7, the "..." should actually be defined in the Post settings because someone may want to put something else there. But that's a different topic. 8-)

@NancyDru,

Hey! Nancy! I've noticed you've been working on D7 quite a bit these days...

Breaks... My code does not consider <p> as a specific break point. Actually, the node_teaser() was not really working that way (only seemingly.) Instead, I parse all the characters from the start and skip tags at once (anything between < and > is left totally untouched.) And just like the HTML filter used to fix HTML, I manage a stack with all the tags currently opened (taking in account empty tags such as IMG, BR and HR—HR the forgotten!).

When the parser finds our that enough text characters were found, it truncates around the last space that makes the result shorter or equal in length as defined in the Post settings (i.e. 600 by default.)

There are interesting side effects to that method (to me). For instance, the ALT, TITLE, HREF, etc. attributes do not affect the number of characters that the user will see. Nor the IMG, EMBED of such tags (although that might be a problem in some circumstances like an RSS feed?)

Now, if you think that we should cut at a paragraph end and if the 1st paragraph is too long then revert to this algorithm, it would be easy enough to implement. I don't see a problem in cutting any one paragraph in half as long as it is done correctly HTML wise. 8-)

Thank you.
Alexis

Log in or register to post comments

Comment #10

nancydru

she/her/hers

English

Boston

commented 20 September 2010 at 16:20

I don't have a problem cutting paragraph at then end of a sentence, but not in the middle of one. However, I DO have a problem cutting a list item up; I want the whole thing or none of it. I have modified the node_teaser() function to do that. But I still have the tag-breaking bug.

the ALT, TITLE, HREF, etc. attributes do not affect the number of characters that the user will see.

I (and my customer) consider this an absolute positive. I do want to count the visible part of a link (<a ...>visible</a>) I don't allow IMG or EMBED in my feeds, so that's still positive.

My customer's request (demand): "We want teasers to be relatively consistent - about 15 lines." Yes, I know there are a lot of variables about what constitutes a "line."

It does sound like you could easily do a better job of correction than the HTML Corrector filter too.

-

Now if you could fix multiple blanks in a row... And the evil <p> </p>

Log in or register to post comments

Comment #11

AlexisWilke commented 20 September 2010 at 22:27

@NancyDru,

Hmmm... not cutting a list sounds like a good idea, but a page that starts with a list would "suffer"...

This page, for instance, starts with a table of contents list...

http://linux.m2osw.com/booting

Of course, in a way, it's a bad example since the corresponding teaser would not include the table, but many people write blogs starting with a list of say... the 10 reasons why I don't like teasers in Drupal.

And we could tell people, on those pages, add a "manual" teaser. The problem is when you hit save you generally do not see how the teaser looks like.

Also, no one commented about #6...

Thank you.
Alexis Wilke

P.S. "evil" <p> </p> is something I took care of with my own filter a long time ago 8-) See here http://www.m2osw.com/mo_trimmer_filter

Log in or register to post comments

Comment #12

nancydru

she/her/hers

English

Boston

commented 21 September 2010 at 12:59

Personally I think #6 should be a "clean up" issue after this gets fixed.

Log in or register to post comments

Comment #13

AlexisWilke commented 22 September 2010 at 19:13

Nancy,

Hmmm... The current node_teaser() does this (more or less):

1. Any content? No return ''
2. Any  in the content? got the teaser
3. Is content smaller in total # of chars to the max. length allowed in the teaser? Yes return body as is
4. Truncate content using truncate_utf8()
5. [try to] Fix the truncated content to fix the max. # of chars (remove the possible '...' in the process)

Point 4. is gone in my version... If we want to have a truncate function that can truncate HTML (because many modules do a similar work as the Core teaser, for instance the Views may be used to "generate a teaser"!)

So may be we should have a truncate_html(), in which case I guess we can first fix this function and later extract the code and create that separate truncate_html().

Is that your idea?

Thank you.
Alexis Wilke

Log in or register to post comments

Comment #14

nancydru

she/her/hers

English

Boston

commented 22 September 2010 at 14:55

BTW, there has always been a flaw with 2. Any <--break--> in the content? got the teaser in that  is removed. This results in a "read more" when the break is placed at the end of the node to force a full node to appear (generally seen on shorter nodes that are slightly longer than a normal teaser). There is no need to remove the break because it is a comment to browsers.

Log in or register to post comments

Comment #15

AlexisWilke commented 22 September 2010 at 19:09

We need to have a separate function that compares the teaser and the body properly. I fixed the $include definition but that test should be in a function instead and extended to support more capabilities. See sample below.

I'm also wondering about a break  at the very beginning because then the teaser is empty and thus the node_teaser() is called each time. Since that should be really rare, we should be fine. But it could be that an empty teaser means don't show the teaser? I guess the user can turn off the corresponding flag if he/she does not want that to appear.

Otherwise I just noticed that trim() is being used all over the place. Maybe we should have a drupal_trim() that removes the <p> </p> too. That way it would be fixed in many situations!

/**
 * Compare the teaser and the body together and return one of the
 * following values:
 *
 * \li NODE_TEASER_EQUAL -- the teaser and body are the same (avoid read-more)
 * \li NODE_TEASER_START -- the teaser is equal to the beginning of the body
 * \li NODE_TEASER_DIFF -- the teaser is different from the body
 *
 * \return One of the NODE_TEASER_... values
 */
function node_compare_teaser_and_body($node) {
  if (!isset($node->teaser)) {
    // would we need a NODE_TEASER_OFF for this case?
    return NODE_TEASER_DIFF;
  }

  // remove all the tags at the end because if they are closing tags
  // then they were likely added by node_teaser().
  $teaser = preg_replace('/(<[^>]+>)+$/', '', $node->teaser);
  $len = strlen($teaser);
  $included = $teaser == substr($node->body, 0, $len);

  if ($included) {
    // remove the <!--break--> and empty lines at the end
    $remainder = substr($node->body, $len);
    $remainder = str_replace('<!--break-->', '', $remainder);
    $remainder = preg_replace('%(<p>&nbsp;</p>[\n\r]*)+$%', '', $remainder);
    return trim($body) ? NODE_TEASER_START : NODE_TEASER_EQUAL;
  }

  return NODE_TEASER_DIFF;
}

Log in or register to post comments

Comment #16

nancydru

she/her/hers

English

Boston

commented 22 September 2010 at 19:54

I wouldn't call it drupal_trim if it does more than what PHP's trim will do (but I guess trim could do tags if you list them). Having a drupal_ form of a PHP function implies that it be a multibyte safe version (e.g. drupal_substr, drupal_strtolower).

Log in or register to post comments

Comment #17

AlexisWilke commented 22 September 2010 at 23:03

Status	File	Size
new	node-6.x-html_teaser2.patch	11.12 KB

Okay, for now there is a patch that takes the more... in account and also adds a "..."

This fixes the comparison between teaser and body. But it renders the include flag computation quite complex. I'm wondering if the node_teaser() should not actually return the teaser AND the body. We could pass the include flag (i.e. isset($node->teaser_include) && !$node->teaser_include) and determine within the function what needs to be returned instead of recomputing how to cut the body outside.

My concern here is that would change the return value of the node_teaser() function which some other module might be using...

Thank you.
Alexis

P.S. The case of the list is not yet taken in account.

Log in or register to post comments

Comment #18

nancydru

she/her/hers

English

Boston

commented 22 September 2010 at 23:21

Nope, that would change the API and guarantee it would not be accepted.

Log in or register to post comments

Comment #19

AlexisWilke commented 23 September 2010 at 02:48

Nancy,

I know. Although we may want to do that in 7.x...

Have you tested my patch yet? 8-)

Thank you.
Alexis

Log in or register to post comments

Comment #20

Evan Leeson commented 24 September 2010 at 00:13

Thank you all for working on this issue. I spent some time today struggling with this because the writers have started putting a lot of links in their first sentences and this was causing this issue to manifest. Using the teaser break solves it but it will be much better when they don't have to.

Log in or register to post comments

Comment #21

AlexisWilke commented 24 September 2010 at 06:08

lectric,

Did you test #17? 8-)

Thank you.
Alexis

Log in or register to post comments

Comment #22

nancydru

she/her/hers

English

Boston

commented 24 September 2010 at 12:38

I have asked for permission to test it (on the customer's clock). Even if they say no, I may test replacing my version on my own time (of which I have very little).

One possible way to get this in (probably 8.x) would be to open an issue to "Make node_teaser() pluggable," which seems to be a popular way to say, "I have a better idea, but you probably won't like it."

Log in or register to post comments

Comment #23

greg.harvey

English

Occitanie, France

commented 27 September 2010 at 10:16

Marked #363889: Strange behaviour when node teaser and first part of node body don't match perfectly as duplicate - not actually certain this is the case, but the issue is very old and no way of testing any more, so no sense leaving it open.

Log in or register to post comments

Comment #24

brisath commented 29 September 2010 at 02:06

Subscribe, same problem with teasers.

Log in or register to post comments

Comment #25

AlexisWilke commented 29 September 2010 at 07:19

@NancyDru (and others),

Btw, in regard to lists... How would you handle the case where a post starts with a list and that list is too long to fit?

1. Return an empty teaser

2. Cut the list...

3. Return the whole list (which could be REALLY big)

In cases 1 and 3 we could also emit a warning, but since any function could call the node_teaser() function, the warning may not automatically be appropriate.

Thank you.
Alexis

Log in or register to post comments

Comment #26

nancydru

she/her/hers

English

Boston

commented 29 September 2010 at 18:29

On my site, my code cuts the list. And my customer is happy with that.

Log in or register to post comments

Comment #27

nancydru

she/her/hers

English

Boston

commented 6 October 2010 at 18:46

All my work is waiting for better specs, so I'm going to put this into my code and test it. I am rereading in detail.

#10/11/25 - I didn't say not cutting a list - that I already do. I don't want a list ITEM cut. I tried that and it got real ugly real fast. So the only time it might be a problem is when the first list item is too big. I haven't encountered that (yet).

#13 Truncate_html is probably a good idea but will guarantee waiting until D8. It is interesting that even though the API for drupal_validate_utf8 states "All functions designed to filter input should use drupal_validate_utf8 to ensure they operate on valid UTF-8 strings...", only one core function actually calls it.

#15 Given that trim() appears to not be multibyte aware, and some users use something other than utf8 for their databases, it seems that there is indeed a need for drupal_trim(), that would validate and drupal_convert_to_utf8(). However that is a separate issue because it would be forced into D8 even though I would call it a bug in both D6 and D7.

Log in or register to post comments

Comment #28

nancydru

she/her/hers

English

Boston

commented 6 October 2010 at 19:36

Why are you using substr() rather than drupal_substr()?

The addition to

    if (isset($filters['php/0']) && strpos($body, '<?') !== FALSE) {
      if (isset($teaser_len)) {
        $teaser_len = strlen($body);
      }
      return $body;
    }

seems superfluous with an immediate return.

If Dries ever looks at this, he will want more comments and properly formatted (start with capital, end with punctuation).

Log in or register to post comments

Comment #29

nancydru

she/her/hers

English

Boston

commented 6 October 2010 at 19:48

Looks pretty good. We can definitely decrease our teaser length setting. This version produced an even longer one than my version because of links. It did, however end at the beginning of a list item, so I get 9. ...

The ending "..." will definitely classify this as a changed API. As much as I kind of like it, you might want to remove that.

Log in or register to post comments

Comment #30

nancydru

she/her/hers

English

Boston

commented 7 October 2010 at 19:05

Status:

Needs work

» Needs review

I don't know if it's this change or a change I made to the editor, but I am now getting duplicated content when I edit a node.

I am also now getting:

You specified that the summary should not be shown when this post is displayed in full view. This setting has been ignored since you have not defined a summary for the post. (To define a summary, insert the delimiter "" (without the quotes) in the Body of the post to indicate the end of the summary and the start of the main content.)

UPDATE: #934628: You specified that the summary should not be shown seems to be caused by this code. The generated teaser (complete with ...) is stuck on the front of the body in the edit textarea.

Log in or register to post comments

Comment #31

nancydru

she/her/hers

English

Boston

commented 7 October 2010 at 19:03

Status:

Needs review

» Needs work

Log in or register to post comments

Comment #32

AlexisWilke commented 7 October 2010 at 19:57

Status:

Needs review

» Needs work

Nancy,

Thank you for taking the time (you could have taken a break instead! 8-) )

#27 - #10/11/25 - okay, that makes more sense. Still, not cutting a certain tag makes it a special case! Now, we can use the stack to know where we're at. If we find an <li> tag, then we would need to remove everything from it on. I guess that requires a position linked with the stack so we know where it's located.

#27 - #13 - truncate_html() would make sense, I think. And the truncate_utf8() is being worked on for other reasons... but that does not take HTML in account.

#27 - #15 - trim() probably does not need to be UTF-8 compatible unless we want to trim special spaces in which case yes, we'd need a UTF-8 equivalent. Otherwise the characters removed are controls (i.e. "\n") and the space (" "). Those are safe in UTF-8. We could verify that it does not remove the   character though since that one uses code 0xA0 which is a UTF-8 encoding character. From the PHP documentation,   is not removed.

#28 - drupal_substr() - I have to look into it, I do not think it makes any difference because I have < and > as delimiters which cannot be confounded with UTF-8 characters. But I guess that does not matter much.

#28 - PHP handling - if you are asking me to fix old code, I'm fine with it, but I thought that patches were supposed to only change what needed to be changed... I know that my patch changes so many things that it's hard to see what was there before and what's new. This PHP test I kept as it was.

#29 - ... - from what I understand, the truncate_utf8() was expected to add the '...', but then the node_teaser() was likely to remove it as it would have to shorten the data some more. So it was never there! See #30 too.

#30 - duplicate teaser - this is one of the problems I've been running into. I may not have posted my latest fix that removes the '...' in case it was added because that does not match the body (which may be another reason why the '...' was not included with truncate_utf8()!)

#31 - I'm on it. 8-)

Log in or register to post comments

Comment #33

nancydru

she/her/hers

English

Boston

commented 8 October 2010 at 12:55

#28: It's not the check for php, it's setting the $teaser_len that I wonder about, because you are immediately going to return, so what is the point? IIRC this is done again a few lines down.

#30: Please read the WYSIWYG maintainers comments in #934628: You specified that the summary should not be shown. Even altered HTML (like closing an unclosed tag) could trigger the same problem. Apparently, there is a exact match comparison that is causing the problem.

Log in or register to post comments

Comment #34

AlexisWilke commented 8 October 2010 at 22:50

Nancy,

#28 because $teaser_len is an [out] variable. Which is why this whole thing is so ugly. The auto-teaser function should either not save the teaser or save a flag marking the teaser as an auto-teaser and not try to be smart and compare potentially really large strings together.

Yes. For #30 I have a function now. That one single function does a "smart" compare, but I did not take the '...' in account in the last version I proposed here. So I don't have an issue with it on my end anymore but here that's a problem. (and this issue is the whole point of fixing the teaser so it all works including that sort of a problem!)

Thank you.
Alexis

Log in or register to post comments

Comment #35

nancydru

she/her/hers

English

Boston

commented 12 October 2010 at 12:32

Having to call a new function means "API change" which guarantees that this will never make it into 6.x or 7.x. That's a shame because this algorithm produces much better teasers.

Log in or register to post comments

Comment #36

rkodrupal commented 18 November 2010 at 19:04

Category:

bug

» support

The code in #3 works great! I even tried a very long

within a node insert [node:999 cck=body;] and it truncated it just fine. Good job.
Quick question ... why should I care about the change to node.pages.inc in #5 and where precisely does it get inserted (sorry, i never got patch.exe working on windows xp so i do all patches manually)?

Again, great job.

Log in or register to post comments

Comment #37

rkodrupal commented 18 November 2010 at 19:07

sorry that double quote in the prior post was originally the word

Log in or register to post comments

Comment #38

rkodrupal commented 18 November 2010 at 19:08

...aargh ... bl0kqu0te

Log in or register to post comments

Comment #39

dddave commented 18 November 2010 at 20:43

Category:

support

» bug

Please don't change the category.

Log in or register to post comments

Comment #40

AlexisWilke commented 20 November 2010 at 08:59

Priority:

Normal

» Major

Status	File	Size
new	node-6.x-html_teaser3.patch	12.03 KB

Okay, I finally got a new patch. I still include the ellipsis because I'm 99% sure it was intended to be there just not done because it would break everything with the old code.

I added some documentation and comments to make sure people understand why this and that. Maybe a little more would be required... Especially, in regard to the drupal_strlen(), drupal_substr() calls.

1) All the mb_str...() functions would need a corresponding drupal_str...() call, for instance using strpos() and then drupal_substr() is not going to work well.

2) In my case, I check the string with $body[$o] which is a byte access which totally ignores UTF-8.

2.1) It is okay because I use drupal_strlen() to compute the length of the visible text and compares that against the $size argument

2.2) It is required unless you offer me drupal_strpos() and a way to check a string character by character (and I don't recall seeing that in PHP anyway, although there is mb_split() ... good luck on bodies that are 100Kb in size!)

3) It is now documented, and I already had a comment about the UTF-8 checks when there are required in the algorithm!

Final word:

Personally, whether this patch is accepted or not is not a concern for me. My teasers work. 8-)

If someone from Core thinks it would be a good idea to share... then apply the patch! (after we had a few RTBC, of course) And since it can actually destroy the content of your node, I would strong advise that it be applied. But again, not my problem. I'm safe on that one now!

@rkodrupal, note that you can click on Edit to fix your posts instead of posting 2 or 3 fixes one after another.

Log in or register to post comments

Comment #41

20 November 2010 at 09:10

Status:	Needs review	» Needs work
Issue tags:	-teaser

The last submitted patch, node-6.x-html_teaser3.patch, failed testing.

Log in or register to post comments

Comment #42

AlexisWilke commented 20 November 2010 at 18:58

Status:

Needs work

» Needs review

#40: node-6.x-html_teaser3.patch queued for re-testing.

Log in or register to post comments

Comment #43

20 November 2010 at 19:21

Status:	Needs review	» Needs work
Issue tags:		+teaser

The last submitted patch, node-6.x-html_teaser3.patch, failed testing.

Log in or register to post comments

Comment #44

AlexisWilke commented 20 November 2010 at 20:51

"Ensure the patch applies to the lastest checkotu of the code-base."

Why is that failing all the time?! I've tested against the latest and it works perfectly for me... How can it be bogus on that test server? One command line???

Log in or register to post comments

Comment #45

greg.harvey

English

Occitanie, France

commented 20 November 2010 at 22:15

Status:

Needs work

» Needs review

Status	File	Size
new	221257-node-6.x-html_teaser3.patch	11.95 KB

I haven't tried to apply this yet, but maybe the testbot is unhappy with node.module.orig? This is a stab in the dark, but perhaps changing those first two lines will work? Another attempt attached, nothing changed except the initial diff lines.

Log in or register to post comments

Comment #46

20 November 2010 at 22:30

Status:

Needs review

» Needs work

The last submitted patch, 221257-node-6.x-html_teaser3.patch, failed testing.

Log in or register to post comments

Comment #47

rkodrupal commented 21 November 2010 at 02:09

i used the patch in #45 and it applied successfully to node.module. The patch in #5 previously worked too, and it patched both node.module and node.pages.inc.

Was the intention for the patch in #45 to only change node.module?

Again, thanks for this great work. it was a huge timesaver for me.

... oh and I have a very small screen and the 'edit' button in this new version of drupal.org is off the page ... but i see it now.

Log in or register to post comments

Comment #48

AlexisWilke commented 21 November 2010 at 02:14

Status:

Needs work

» Needs review

Status	File	Size
new	node-6.x-html_teaser3.patch	12.09 KB

Thank you greg for looking into it. I checked another thread where a patch worked, checking with an extra line at the very beginning... I don't think that matters, but we'll see.

There is the exact error from the tests goes like this:

[22:21:11] Command [patch -p0 -i /var/lib/drupaltestbot/sites/default/files/review/221257-node-6.x-html_teaser3.patch 2>&1] failed with status [1] and output:
can't find file to patch at input line 3
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|--- modules/node/node.module
|+++ modules/node/node.module
--------------------------
File to patch:
Skip this patch? [y]
Skipping patch.
7 out of 7 hunks ignored.

Log in or register to post comments

Comment #49

AlexisWilke commented 21 November 2010 at 02:28

Status	File	Size
new	node-6.x-html_teaser3.patch	12.7 KB

Good catch rkodrupal, there is still a one line fix in the node.pages.inc file.

There is yet another patch. This time there is a line before and no .orig extension. But I don't think that's the problem... 8-}

Log in or register to post comments

Comment #50

21 November 2010 at 02:40

Status:

Needs review

» Needs work

The last submitted patch, node-6.x-html_teaser3.patch, failed testing.

Log in or register to post comments

Comment #51

AlexisWilke commented 21 November 2010 at 05:44

There we go...

http://drupal.org/node/961172#comment-3716348

#961172: All D6 Core patches are failing

Log in or register to post comments

Comment #52

rkodrupal commented 29 November 2010 at 14:25

patch at #17 doesn't seem to work with blockquote ... i found it impossible to successfully break the following for a teaser ...

Domino Causality sets the guiding principles in the search for innovative drugs. We only can improve productivity by constant attention to the entire line of causality, of the domino chain. Merely optimizing or improving any single link is not helpful.

"<"blockquote">"Dr. Moriarity, a metaphysician, sets up fifty numbered dominoes standing in a straight line with their dots facing the same way on a table in a room, but puts a blind in front of the dominoes so that only numbers one and fifty are visible. You enter the room, and observe that domino number one and domino number fifty are lying flat with their tops pointing in the same direction… Does this mean that either domino caused the other to fall? Not necessarily ... Dr. Moriarty could have pushed over only dominoes number one and fifty, or bumped the table in a way that only these two dominoes fell, or that all the dominoes fell at once. It is essential to remove the blind and look at the intervening dominoes... Do their positions suggest they fell in sequence rather than being bumped or shaken? Did any reliable observers hear the telltale sound of dominoes slapping one another in sequence? From the positions of all the dominoes, can we eliminate rival causal mechanisms, such as earthquakes and wind, as well as human intervention? If the dominoes fell in sequence, can we tell by looking at whether the dots are face up whether the direction of the sequence was from number one to number fifty or the reverse? – "<"a href="/node/879"">"Bennett, George (1997). Process Tracing in Case Study Research"<"/a">""<"/blockquote">"

(note: doesn't do too well with "<"a">" passages either).

Log in or register to post comments

Comment #53

AlexisWilke commented 29 November 2010 at 22:08

rkodrupal,

I tried on one of my sites that includes #49 and it worked fine. You can see the teaser here:

http://gallery.m2osw.com/node

You may want to try with the latest patch?

Thank you.
Alexis

Log in or register to post comments

Comment #54

rkodrupal commented 30 November 2010 at 01:26

confirmed. stupid typo on my part. sorry for the false alarm.

Log in or register to post comments

Comment #55

AlexisWilke commented 5 January 2011 at 12:47

Status:	Needs work	» Needs review
Issue tags:	-teaser

#49: node-6.x-html_teaser3.patch queued for re-testing.

Log in or register to post comments

Comment #56

5 January 2011 at 13:01

Status:	Needs review	» Needs work
Issue tags:		+teaser

The last submitted patch, node-6.x-html_teaser3.patch, failed testing.

Log in or register to post comments

Comment #57

nancydru

she/her/hers

English

Boston

commented 5 January 2011 at 14:36

A) Have you tested with the WYSIWYG module? Anything that alters the teaser from the first xxx bytes of the body is going to cause problems and actual node modifications.

B) Is there any hope that this would be in D7, let alone D6?

Log in or register to post comments

Comment #58

rkodrupal commented 5 January 2011 at 18:21

for what it's worth i've been using it extensively (100+ text-laden nodes with lots of uncommon tags) and it hasn't failed yet. i use wysiwyg.

that being said, the patch does cause intermittent problems when the teaser has already been set by the earlier node.module and you go back and change a node ... typically when this happens there's some goofy message about the changing ... a tip-off to carefully check the results.

Log in or register to post comments

Comment #59

Jukebox commented 7 January 2011 at 17:07

Status:	Needs work	» Needs review
Issue tags:	-teaser

#49: node-6.x-html_teaser3.patch queued for re-testing.

Log in or register to post comments

Comment #60

7 January 2011 at 17:21

Status:	Needs review	» Needs work
Issue tags:		+teaser

The last submitted patch, node-6.x-html_teaser3.patch, failed testing.

Log in or register to post comments

Comment #61

Jukebox commented 7 January 2011 at 17:23

This is sort of off topic, but may be helpful to some.

I'm using views to display node titles and teasers. One node had an italicised paragraph at the beginning. Since the teaser is limited to a handful of words, it didn't seem to close the italics tag correctly. Also, random italic tags were inserted in the html. To fix this, I just went over to the teaser field and checked the "Strip HTML Tags" option and everything was back to normal.

Log in or register to post comments

Comment #62

AlexisWilke commented 8 January 2011 at 02:09

Jukebox,

The Views are using a copy of the Core function to extract the Teaser. They do not have the problem of comparing teaser and body as the Core does, but they do the same really bad job at not closing the tags... (this is the case for any truncate function in the Views.)

What you are seeing (many <i> tags) is probably due to your browser "cleaning up" after the Views output.

You are more than welcome to report this problem to the Views people and have them close tags properly of any HTML truncated data buffer.

Thank you.
Alexis

Log in or register to post comments

Comment #63

pillarsdotnet commented 15 March 2011 at 10:30

Version:	6.x-dev	» 7.x-dev
Component:	node.module	» field system
Status:	Needs work	» Needs review

Status	File	Size
new	text_summary-221257-63.patch	8.37 KB

Forward-ported to D7, cleaned up comments, and changed variable names to be (hopefully) more descriptive.

Log in or register to post comments

Comment #64

pillarsdotnet commented 15 March 2011 at 10:46

Title:

node_teaser() can truncate teaser to the middle of an HTML comment

» text_summary() should be made HTML-aware.

Title update

Log in or register to post comments

Comment #65

pillarsdotnet commented 15 March 2011 at 10:49

Log in or register to post comments

Comment #66

Jamie Holly commented 15 March 2011 at 13:38

I actually just noticed this yesterday on a D7 site. It cut off a YouTube embed code in the middle. That's pretty bad, as it affects a lot of block-level elements (including forms).

Log in or register to post comments

Comment #67

15 March 2011 at 18:21

Status:	Needs review	» Needs work
Issue tags:	-teaser

The last submitted patch, text_summary-221257-63.patch, failed testing.

Log in or register to post comments

Comment #68

pillarsdotnet commented 15 March 2011 at 21:37

Status:

Needs work

» Needs review

#63: text_summary-221257-63.patch queued for re-testing.

Log in or register to post comments

Comment #69

15 March 2011 at 22:41

Status:	Needs review	» Needs work
Issue tags:		+teaser

The last submitted patch, text_summary-221257-63.patch, failed testing.

Log in or register to post comments

Comment #70

pillarsdotnet commented 16 March 2011 at 00:54

Status	File	Size
new	text_summary-221257-70.patch	8.44 KB

Trying again, with typo-corrections:

Log in or register to post comments

Comment #71

pillarsdotnet commented 16 March 2011 at 00:56

Status:

Needs review

» Needs work

Status	File	Size
new	text_summary-221257-71-donotignore.patch	8.44 KB

Patch named incorrectly?

Log in or register to post comments

Comment #72

pillarsdotnet commented 16 March 2011 at 00:56

Status:

Needs work

» Needs review

Oh. duh. silly me again.

Log in or register to post comments

Comment #73

16 March 2011 at 01:31

The last submitted patch, text_summary-221257-71-donotignore.patch, failed testing.

Log in or register to post comments

Comment #74

AlexisWilke commented 16 March 2011 at 02:39

Pillarsdotnet,

Thank you for working on this one. 8-)

As a temporary solution people can forcibly use the  comment... but go explain that to end users!

Anyway, your point about broken YouTube embedded video is probably not currently correctly handled by my patch, unless it's a single tag. (Is it just one <embed> tag or do they use the <object> tag?) It is however a good point because such an object, or a form, should never be cut. It either gets included or it remains excluded, but no cut! I could imagine a form cut just before the "Send Info" button. Too bad. 8-)

Thank you.
Alexis Wilke

Log in or register to post comments

Comment #75

pillarsdotnet commented 16 March 2011 at 09:36

Status:

Needs work

» Needs review

Status	File	Size
new	text_summary-221257-75.patch	14.8 KB

Hacked text.test to make all tests pass.

Probably should add an embedded Youtube Video test...

Log in or register to post comments

Comment #76

pillarsdotnet commented 16 March 2011 at 09:51

Status	File	Size
new	text_summary-221257-76.patch	15.03 KB

Fixed an "Uninitialized string offset" warning.

Just for drill, I did test an embedded video -- looks like the entire video counts as a single character, using this algorithm. I'm not sure whether this is desirable or not. Maybe all <object>...</object> sequences should be stripped from a text_summary?

Log in or register to post comments

Comment #77

nancydru

she/her/hers

English

Boston

commented 16 March 2011 at 13:57

Pillars and Alexis, there are always going to be exceptions that are not well handled by coding and the end-user is going to have to be told to use a teaser break technique. Considering how long an embed/object tag sequence can be, this may be one of those cases. No matter how good your code is, there's someone out there who will find a way to break it.

Log in or register to post comments

Comment #78

AlexisWilke commented 16 March 2011 at 19:25

Nancy,

I don't disagree that we'll find others. However, by not fixing anything, it's not going to be better. I've not seen any such discussion on other systems, especially WordPress. So some people know how to make it work right, in code.

Thank you.
Alexis

Log in or register to post comments

Comment #79

nancydru

she/her/hers

English

Boston

commented 21 March 2011 at 15:24

I hope that Fields challenges WYSIWYG to stop doing what they were currently doing that broke this technique when I had a chance to test it. Perhaps you can "borrow" from WP if that is a much better algorithm. It is abundantly clear that this issue needs attention. I tried my hand at fixing it within the D6 framework and only partially succeeded. Your code produced a much better teaser than even mine did - core's was a very distant 3rd. Unfortunately, WYSIWYG does something that I think is wrong and that broke your code as it was at that time.

Log in or register to post comments

Comment #80

pillarsdotnet commented 22 March 2011 at 01:51

I posted a combination of the patch in #76 above with the patch from #87 of #823380: Read More link is always visible on teaser., and intoxination proposed merging these two issues.

Log in or register to post comments

Comment #81

Jamie Holly commented 22 March 2011 at 12:32

@NancyDru - actually a lot of the fixed in WP comes from TinyMCE. For example - if you put a more break within an object/embed code block, then switch back to the WYSIWYG view, TinyMCE will actually strip it out. That's due to the way TinyMCE parses out embed codes and replaces it with their own markup for the visual display, then again replaces it on clean-up when the final HTML is generated.

When it comes to actually balancing out tags on teaser/more views, I find WP much worse. I can't even begin to count the times I had to fix a broken front page on a client's WP site simply because someone put the more break in a weird spot.

Log in or register to post comments

Comment #82

nancydru

she/her/hers

English

Boston

commented 22 March 2011 at 14:17

Good, that read more link has been bothering me since I started with Drupal and was told it wasn't worth fixing, even though it was simple to do so. I despise all those wysiwyg editors, but some people seem to think they have to have them.

Log in or register to post comments

Comment #83

pillarsdotnet commented 22 March 2011 at 22:01

Thinking about this some more, there could be all kinds of additional conditions on summary generation.

Maximum display length (not including markup)
Maximum storage length (including markup)
Maximum line width / word-wrap
Maximum number of lines
Allowed tags/styles/fonts/etc.

Probably it would be best to allow some sort of hook allowing modules or themes to add conditions of their own.

So the (pseudo-)code would look something like:

// Parse into an array containing the following keys:
// - #tag: Opening/closing tag name, if any.
// - #attributes: Array of tag attributes.
// - #empty: If TRUE, tag is empty like <br /> or <hr /> and both #elements and #markup are ignored.
// - #elements: Array of child elements.
// - #markup: Rendered child elements, or static content if #elements is empty.
$elements = text_summary_explode($text);

// Break at all allowable break points, for applicable languages.
// Each element with an empty #elements and a non-empty $markup is a candidate.
// Default would include implementation to break at space characters.
$elements = text_summary_invoke_all('break', $elements, $language);

// Strip out content not wanted in summary, such as embedded video objects, styles, etc.
$elements = text_summary_invoke_all('strip', $elements, $language);

// Enforce applicable limits, such as line length, total number of lines, displayed characters, etc.
// Default would include implementation to limit storage length, with maximum possible weight.
$elements = text_summary_invoke_all('limit', $elements, $language);

// Collapse to text.
$summary = text_summary_implode($elements);

Log in or register to post comments

Comment #84

pillarsdotnet commented 22 March 2011 at 20:21

Version:

7.x-dev

» 8.x-dev

Now that we have a real 8.x branch, I should code a patch.

Log in or register to post comments

Comment #85

nancydru

she/her/hers

English

Boston

commented 22 March 2011 at 20:27

Version:

8.x-dev

» 7.x-dev

Line width and number of lines depends on CSS and the displaying device. Unless you're much smarter than I, such details are pretty close to impossible to determine at teaser time. For example 5 lines on a full width HD monitor for a PC may very well overflow the entire screen on a smart phone.

Log in or register to post comments

Comment #86

pillarsdotnet commented 22 March 2011 at 21:59

See http://drupal.org/project/mobile_theme

Log in or register to post comments

Comment #87

Jamie Holly commented 23 March 2011 at 12:21

One thing that should be considered is that text_summary is always run in teaser view on D7, unlike in D6, where node_teaser was only run on node saves and then that value stored in the node_revisions table. That means text_summary can have a serious impact on performance if it gets to heavy processing wise.

Having said that, I got to give a big +1 to implementing a hookable system to allow modules to alter what happens in text_summary, such as deciding what block level elements should/shouldn't be cut. We just need to keep performance in mind when coming up with something.

Log in or register to post comments

Comment #88

nancydru

she/her/hers

English

Boston

commented 23 March 2011 at 13:54

Is that also true to cached anonymous users? And does it mean that filters are also always called?

Log in or register to post comments

Comment #89

Jamie Holly commented 23 March 2011 at 14:36

Not for anonymous users, since the caching there is on the entire page.

Log in or register to post comments

Comment #90

nancydru

she/her/hers

English

Boston

commented 23 March 2011 at 19:54

Would it be possible to add a teaser (or summary) cache to mitigate a potential performance issue? Perhaps something as simple as doing an MD5 on the text and saving the resulting teaser? This would be similar to the way filters cache data. Actually, if the teaser/summary is a field, why not turn this into a filter?

Log in or register to post comments

Comment #91

Jamie Holly commented 23 March 2011 at 20:00

I really think that should be in there. It would be nice to get a core maintainer to give us some feedback on the pros/cons of implementing that. Putting that caching in should be relatively simple.

Log in or register to post comments

Comment #92

yched commented 27 March 2011 at 12:53

Long thread - could someone post an issue summary (as well as an explanation of why this lies in the 'field system' queue) ?

Log in or register to post comments

Comment #93

pillarsdotnet commented 27 March 2011 at 19:35

Status:

Reviewed & tested by the community

» Needs review

Summary requested by ysched:

In D6, the function to auto-generate a shortened (summary or teaser) version of a node is called node_teaser(), and is found in node.module.
In D7, this function was renamed text_summary() and moved to text.module. There is no "text.module" queue, but there is a "field system" which seemed appropriate.
The current code generates invalid HTML, such as summaries that end in the middle of a closing tag. In fact, current tests require this mis-behavior, as does the WYSIWYG module. (See #741606: Teaser splitter / text fields with summary support and #934628: You specified that the summary should not be shown.)
The patch in #76 ensures that auto-generated summaries contain valid HTML, and alters the tests to match. It is currently in "needs review" status.
There is room for further improvement. Since different websites may have different needs and expectations regarding auto-generated summaries, perhaps a plug-in system would be best for all concerned.
This issue is closely related to #823380: Read More link is always visible on teaser.. Once a clearly preferable strategy is decided upon here, the two issues should probably be merged.

Log in or register to post comments

Comment #94

Jamie Holly commented 27 March 2011 at 14:02

@yched - @pillarsdotnet pretty much summed it up very well there. The only thing I would add is that since text_summary is ran on every field that requires it on each page view, we really should keep an eye at performance on this, or (even better) work out a way to cache the teaser/summary text.

Log in or register to post comments

Comment #95

yched commented 27 March 2011 at 14:51

@pillarsdotnet : Thanks for the summary. The former node_teaser() indeed has been moved more or less as is within text.module, but I can hardly claim any expertise on this area of code, and I'm afraid I'm not really relevant to review this :-/. Not sure who would - maybe a 'git blame' on the D6 branch would give a couple hints on the people who contributed to the original node_teaser() function ?

Log in or register to post comments

Comment #96

pillarsdotnet commented 27 March 2011 at 21:26

@ysched --

Perhaps it would be more accurate to say that the patch in #76 is in "RTBC" status, as I believe it has been reviewed and found to work correctly by intoxination, AlexisWilke, NancyDru, rkodrupal, and myself. There is a related issue with WYSIWYG, but I at least can attest that the patch in #76 does not render WYSIWYG unusable.

Certainly it is an improvement over the current code.

In cases like this, where we all agree an an incremental improvement, but are still discussing further improvements, should the extended discussion move to a separate issue?

Log in or register to post comments

Comment #97

pillarsdotnet commented 27 March 2011 at 17:40

Status:

Needs review

» Reviewed & tested by the community

Gonna take that giant leap. Somebody slap me if I'm wrong.

Log in or register to post comments

Comment #98

greg.harvey

English

Occitanie, France

commented 27 March 2011 at 22:08

Status:

Needs review

» Reviewed & tested by the community

Personally, I can't fault the logic in #96. Don't let the great get in the way of the good, etc. etc. =)

Log in or register to post comments

Comment #99

yched commented 27 March 2011 at 22:39

Status:

Reviewed & tested by the community

» Needs review

Sorry, but this didn't get any real code review, and would also really need an approval from some core devs (no offense).

An explanation of what the patch actually does, where and why it modifies the existing code, might also help :-)

Log in or register to post comments

Comment #100

bryancasler commented 28 March 2011 at 03:54

subscribe

Log in or register to post comments

Comment #101

AlexisWilke commented 28 March 2011 at 04:24

I'm not using D7 so I don't know whether the latest patch works. 8-)

However, I wanted to make a note about the <object> tag. I agree that you have to either put it in full or remove it altogether. If you only include a couple of <param> tags inside the object, it won't serve any purpose.

Not only that, this applies to <textarea> and <form>.

There are possibly a few others. This makes me think of a module I had a problem with... Yes! Footnotes.

The other tag to handle properly is <script>, although with WYSIWYG you probably won't have much problems with that one.

Footnotes also handles <style>, but that tag should never appear anywhere else than in the header so I'm much less worried about it.

Thank you.
Alexis

Log in or register to post comments

Comment #102

nancydru

she/her/hers

English

Boston

commented 28 March 2011 at 14:47

Yes, even with WYSIWYG, users find ways to stick scripts in. At my last customer, they kept trying to stuff their own GA code into the nodes. And I kept writing better filters to remove them. ;-) Oh, and don't forget that <script> is often accompanied by <noscript>, which can have anything you can imagine in it.

Log in or register to post comments

Comment #103

bryancasler commented 14 April 2011 at 00:35

Is there anything that would be holding up #76 from being committed?

Log in or register to post comments

Comment #104

pillarsdotnet commented 2 May 2011 at 01:45

@animelion Yes. See #1050616: Figure out backport workflow from Drupal 8 to Drupal 7

EDIT: Marked #220783: HTML markup is counted in for the length of trimmed posts as a dup of this issue.

Log in or register to post comments

Comment #105

lars toomre commented 19 April 2011 at 08:12

Version:	7.x-dev	» 8.x-dev
Issue tags:		+Needs backport to D6, +Needs backport to D7

I really like the pluggable approach suggested in #83 with the suggestion that the middle hook be renamed 'filter' instead of 'strip'. I use quite a bit of embedded HTML in my node bodies and have had considerable problems in D6 with broken HTML tags from the default split. That has required me to manually place  in many places and led me to this issue discussion this morning.

I suggest the use of 'filter' since what Alexis started this thread with, what NancyDru has done for her previous client and what I want to do is filter out some portion of the characters in a node body are tied to linking or display rather than the content that is presented to the user. I also really like the idea of the limit hook since quite likely one might like to display a different 'teaser' length based on the type of device being served to (desktop/mobile/XML/JSON) or type of feed being produced for Web Service/Twitter/Facebook/RSS.

Before the current patch is applied, I would hope that we could address the following issues in the default filtering of this patch code:

From #76 - I would expect the filter of the <object>...</object> text would have zero length rather than one character. Perhaps I misunderstood what was reported in this item??
From #101 - It seems that we should have a list of HTML tags to filter out by default if the full result does not fit within the size limit (like <object>, <textarea> and <form>).
From #102 - Perhaps there should be another list of always filter HTML tags (like <SCRIPT> and <NOSCRIPT>)?
Looking at the function text_summary() in the patch from #76, I am unsure why the calling parameters were changed since it appears that $summary_length is never referenced in the subroutine. Was I missing something there?

I am not sure whether minor versions of D7 will allow the introduction of new hooks. Certainly D6 will not at this point. Hence, whatever we design here for D8 will need to be condensed into one function that can be "swapped" into D6 and probably D7. My hope is that a fix for this behavior can be back-ported to both D7 and D6 soon.

My two cents ...

Log in or register to post comments

Comment #106

nancydru

she/her/hers

English

Boston

commented 19 April 2011 at 20:51

According to Dries, if no existing API is modified, a new API may be allowed into current versions. We would have to fight hard to convince him though. And I agree with both parts of that policy. I doubt this will be introduced to D6 ever.

Log in or register to post comments

Comment #107

AlexisWilke commented 19 April 2011 at 23:04

My fix does not change the API. It enhanced it only. Meaning that any module using the old API is compatible with my change.

And my fix is in my D6 install. I don't really care much whether it gets in Core on Drupal.org... 8-)

Thank you.
Alexis

Log in or register to post comments

Comment #108

pillarsdotnet commented 20 April 2011 at 02:23

I'm about 30% of the way through implementing the approach I suggested in #83 as a contrib module. When I'm done, I should have d6/d7/d8 versions available.

Log in or register to post comments

Comment #109

pillarsdotnet commented 21 April 2011 at 21:02

You can help this issue by posting a review:

Decide whether all code and comments comply with Drupal coding standards.
Decide whether any included tests are necessary and sufficient.
Decide whether the patch actually solves the problem. There should be a reproducible test where the current code fails and the patched code succeeds.

A review that affirms success in all of the above should also:

Change the issue status from "Needs Review" to "Reviewed and Tested By the Community".

Otherwise, the review should:

Explain what is wrong with the proposed patch and/or supply a corrections to it.
Change the issue status from "Needs Review" to "Needs Work".

Log in or register to post comments

Comment #110

njardim72 commented 25 April 2011 at 00:36

hi,

I'm in core D7. is this issue fixed? if so could you please point the right direction?

I need to trim the feeds to 200 chars

thanks,
njardim

Log in or register to post comments

Comment #111

pillarsdotnet commented 2 May 2011 at 01:43

Status	File	Size
new	text_summary-221257-111.patch	15.15 KB

Gratuitous re-roll of #76 above.

Log in or register to post comments

Comment #112

AlexisWilke commented 2 May 2011 at 02:13

Assigned:

AlexisWilke

» Unassigned

njardim72,

It's not fixed in any version.

pillarsdotnet is working on a D8 version and I offered a patch for D6. I don't think anyone posted a patch for D7 yet.

Good luck.
Alexis

Log in or register to post comments

Comment #113

pillarsdotnet commented 2 May 2011 at 03:01

The d8 patch should apply cleanly to d7. There isn't much difference between d7 and d8 yet.

Log in or register to post comments

Comment #114

sun

German

Karlsruhe

commented 2 May 2011 at 14:23

Status:

Needs review

» Needs work

Unfortunately, this patch is far from being RTBC.

+++ b/modules/field/modules/text/text.module
@@ -327,6 +327,10 @@ function _text_sanitize($instance, $langcode, $item, $column) {
+ * @note
+ *   This function uses strlen(), strpos(), etc. rather than their multibyte
+ *   equivalents where doing so increases speed without affecting output.

1) All information in phpDoc are notes. The @note directive is therefore useless and shouldn't be used. Use regular phpDoc paragraphs instead.

2) Everything that goes through text_summary() is supposed to be cached by the caller, so speed does not matter. Additionally, there's only a marginal difference between Drupal's PHP wrappers and directly calling the native functions. We _always_ use Drupal's PHP wrappers, unless there's a valid technical reason for ignoring UTF-8 byte sequences. If there is such a reason, it needs to be clearly stated and explained in an inline comment.

+++ b/modules/field/modules/text/text.module
@@ -344,7 +348,7 @@ function _text_sanitize($instance, $langcode, $item, $column) {
-function text_summary($text, $format = NULL, $size = NULL) {
+function text_summary($text, $format = NULL, $size = NULL, &$summary_length = NULL) {

Missing phpDoc for the new function argument. It doesn't seem to be used anywhere?

+++ b/modules/field/modules/text/text.module
@@ -380,67 +385,133 @@ function text_summary($text, $format = NULL, $size = NULL) {
+  $text_length = strlen($text);
...
-  // Store the actual length of the UTF8 string -- which might not be the same
-  // as $size.
-  $max_rpos = strlen($summary);

The new code could use some more inline comments about what it is doing and more importantly, why.

+++ b/modules/field/modules/text/text.module
@@ -380,67 +385,133 @@ function text_summary($text, $format = NULL, $size = NULL) {
+  $position = 0;
+  $length = 0;
+  $stack = array();
+  while ($position < $text_length && $length < $size) {
+    $last_tag = FALSE;
+    $offset = strpos($text, '<', $position);

Overall, it looks like using preg_split() or a real DOMDocument would be more appropriate for the purpose of this function. If this has been tried already or purposively wasn't done for some reason, then it should be documented in an inline comment.

+++ b/modules/field/modules/text/text.module
@@ -380,67 +385,133 @@ function text_summary($text, $format = NULL, $size = NULL) {
+      // There are no more tags, so find the UTF-8 string length.
+      $additional = drupal_strlen(substr($text, $position, $text_length - $position));

I don't understand why substr() is not UTF-8 safe but the strlen() is. This will have unexpected consequences.

Note that the old code used an entirely different approach, for which it may have been valid to ignore UTF-8 byte sequences. The old code might perfectly be buggy with regard to not using PHP wrappers - we don't know, since the existing tests do not cover UTF-8 at all.

+++ b/modules/field/modules/text/text.module
@@ -380,67 +385,133 @@ function text_summary($text, $format = NULL, $size = NULL) {
+        switch ($tag_name) { // Ignore empty tags that were not properly closed.
+        case 'br':
+        case 'hr':
+        case 'img':
+        case 'input':
+          break;
+        default:
+          $stack[] = $tag_name;
+          $last_tag = TRUE;
+          break;
+        }

1) Wrong indentation.

2) The code assumes that HTML tags are lowercase, which may not be the case.

+++ b/modules/field/modules/text/text.module
@@ -380,67 +385,133 @@ function text_summary($text, $format = NULL, $size = NULL) {
+                break;
+              case '.':

There should be a blank line between switch cases, unless a case "falls through" to another one (which should be explained in an inline comment).

+++ b/modules/field/modules/text/text.module
@@ -380,67 +385,133 @@ function text_summary($text, $format = NULL, $size = NULL) {
+              if ($text[$offset] == '"') {
+                ++$offset;
+                break 2;
+              }
+              break;
+              case "\n":

Wrong indentation.

+++ b/modules/field/modules/text/text.module
@@ -380,67 +385,133 @@ function text_summary($text, $format = NULL, $size = NULL) {
-  // If the htmlcorrector filter is present, apply it to the generated summary.
-  if (isset($filters['filter_htmlcorrector'])) {
-    $summary = _filter_htmlcorrector($summary);
+  $summary = substr($text, 0, $position);
+  if (!empty($stack)) {
+    // If closing tags are missing, we add an ellipsis and closing tags.
+    $summary .= t('...');
+    do {
+      $summary .= '</' . array_pop($stack) . '>';
+    } while (!empty($stack));
   }

It looks like you're unconditionally tying to do something along the lines of the HTML Corrector filter. In that case, it would make much more sense to call _filter_htmlcorrector() instead, as it's much more reliable (and also covered by tests).

Additionally, _filter_htmlcorrector() uses a DOMDocument for parsing, which in turn makes us circle back into the question whether text_summary() shouldn't use DOMDocument, too.

+++ b/modules/field/modules/text/text.test
@@ -287,87 +287,87 @@ class TextSummaryTestCase extends DrupalWebTestCase {
-    $expected = array(
...
+    $expected = array (

Coding style error.

+++ b/modules/field/modules/text/text.test
@@ -287,87 +287,87 @@ class TextSummaryTestCase extends DrupalWebTestCase {
     // And using a text format WITH the line-break and htmlcorrector filters.
...
-      "",
-      "<p></p>",
-      "<p></p>",
-      "<p></p>",
-      "<p></p>",
-      "<p></p>",
-      "<p>\nHi</p>",
...
+      1 => "<p>",
+      2 => "<p>",
+      3 => "<p>",
+      4 => "<p>\nHi",
+      5 => "<p>\nHi\n</p>",
+      6 => "<p>\nHi\n</p>\n<p>...</p>",

The changed expectations are wrong. With enabled HTML corrector, the expectations 1-4 are invalid, since tags are not properly closed.

Furthermore, our expectation for 6+ is that the text summary ends on a paragraph.

In turn, it looks like we've lost the essential idea and functionality of text_summary(). This needs to be retained.

Lastly, no tests have been added that actually verify and prove the desired behavior regarding HTML.

I'd also like to know why a simpler approach like in #220783: HTML markup is counted in for the length of trimmed posts isn't sufficient.

Powered by Dreditor.

Log in or register to post comments

Comment #115

brightbold

she/her

English

Boston, MA

commented 3 May 2011 at 03:01

Subscribing, since the other issue was closed as dupe.

Log in or register to post comments

Comment #116

AlexisWilke commented 3 May 2011 at 04:00

sun,

The @note is from Doxygen... 8-) It can make a note appear differently.

One reason for using the strpos(), strlen(), etc. directly is speed. Even if you have caches, speed is always an issue, especially if you have hundreds of users accessing your website.

Now, there is another reason for using them: the Drupal API does not offer all the functions used here. So unless you'd accept to also add all the missing functions, the safest was to use the "simple" functions.

Finally, it doesn't make any difference in this case because what we skip with the strpos() and strlen() and substr() calls are tags which cannot be UTF-8 (since we ignore the attributes.) Where you absolutely need to use UTF-8 is to count the number of characters between tags.

This should answer this worry:

I don't understand why substr() is not UTF-8 safe but the strlen() is. This will have unexpected consequences.

If you ask, I'll give you detailed explanation about UTF-8. I can assure you that what I wrote is not only correct, it makes things work a lot better.

+        switch ($tag_name) { // Ignore empty tags that were not properly closed.

Good catch! We should have a strtolower() here. Again, you don't need UTF-8 since tag names are ASCII.

The stack restoration, that you say is equivalent to HTML Corrector, is A LOT FASTER than calling that filter after we're done (at least under D6). Plus, we already have all the information we need to have. Why should we call another function instead of unwinding our own stack?

The main problem of the existing Core version is that the current version does this:

  $teaser = substr($body, 0, $max_teaser_length);

And the result in $teaser is completely whatever, including a tag cut in half.

This being said, in D8, using DOM could be a good solution. After all, you can then go through the DOM and cut everything after the max. size for your teaser was reached. It's probably much slower on very large pages, but that's not the majority of the pages out there.

Thank you.
Alexis

P.S. I offered the D6 version, the D7 patch is not from me, but some of the code is a copy from my own corrections. 8-)

Log in or register to post comments

Comment #117

pillarsdotnet commented 13 May 2011 at 13:05

Status:

Needs work

» Needs review

Status	File	Size
new	text_summary-221257-117.patch	18.93 KB

Okay, this is completely new code, relying on the PHP DOM extension.

Log in or register to post comments

Comment #118

pillarsdotnet commented 13 May 2011 at 13:09

Status	File	Size
new	text_summary-221257-118.patch	15.13 KB

Bah. Didn't mean to include my custom changes to the standard profile...

Log in or register to post comments

Comment #119

pillarsdotnet commented 13 May 2011 at 13:17

Status	File	Size
new	text_summary-221257-119.patch	15.27 KB

Couple of minor changes, trying to anticipate objections from @sun...

Log in or register to post comments

Comment #120

sun

German

Karlsruhe

commented 13 May 2011 at 13:56

Status:

Needs review

» Needs work

Yay, this looks much better now :)

+++ b/modules/field/modules/text/text.module
@@ -380,68 +369,79 @@ function text_summary($text, $format = NULL, $size = NULL) {
+  foreach ($body_node->childNodes as $node) {
+    if ($size <= 0) {
       break;
     }
+    _text_summarize($node, $size, $summary_node, $summary_doc);
   }

I think it would increase readability if the _text_summarize() helper function would return the new $size instead of changing it by reference.

+++ b/modules/field/modules/text/text.module
@@ -380,68 +369,79 @@ function text_summary($text, $format = NULL, $size = NULL) {
+  foreach ($summary_doc->getElementsByTagName('body')->item(0)->childNodes as $node) {
+    $output .= $summary_doc->saveXml($node);
   }

Any reason why we can't simply use $summary_node->saveXml() instead? If so, please document inline, not on issue.

+++ b/modules/field/modules/text/text.module
@@ -380,68 +369,79 @@ function text_summary($text, $format = NULL, $size = NULL) {
+  // If the original format was plaintext, the summary should be, too.
+  if ($text === strip_tags($text)) {
+    $output = strip_tags($output);
+  }

This doesn't make sense to me. I'd at least like to know the reasons for introducing it (as inline comment), but actually, it looks like a major bug if you pass no HTML into something and suddenly deal with HTML along the way.

Might be caused by the unconditional invocation of check_markup(), or alternatively, because the DOM documents are loading text nodes into the body without a wrapping block-level element (DIV/P), so DOM might automatically inject one.

+++ b/modules/field/modules/text/text.test
@@ -286,95 +287,52 @@ class TextSummaryTestCase extends DrupalWebTestCase {
-      $this->callTextSummary($text, $expected[$i],    NULL, $i);
...
+      $this->callTextSummary($text, $expected[$i], 'filtered_html', $i);

I think we want and need to retain the text_summary() functionality for no format. It is useful for text [field] contents that does not have a text format associated, but still needs to be potentially rendered in "teaser" use-cases. The complex phrase detection logic in text_summary() is actually the only way in Drupal core to extract proper summaries from text strings.

Also note that the old text_summary() contained some more international punctuation characters, which should be retained.

+++ b/modules/field/modules/text/text.test
@@ -383,7 +341,14 @@ class TextSummaryTestCase extends DrupalWebTestCase {
+    $this->assertIdentical(
+      $summary, $expected, t('Generated summary "@summary" matches expected "@expected".',
+        array(

Odd wrapping/indentation here; all except of the array keys should be on one line.

Powered by Dreditor.

Log in or register to post comments

Comment #121

pillarsdotnet commented 13 May 2011 at 15:05

Status:

Needs work

» Needs review

Status	File	Size
new	text_summary-221257-121.patch	15.54 KB

I think it would increase readability if the _text_summarize() helper function would return the new $size instead of changing it by reference.

Done. I also added a doxygen header for the helper function.

+++ b/modules/field/modules/text/text.module
@@ -380,68 +369,79 @@ function text_summary($text, $format = NULL, $size = NULL) {
+  foreach ($summary_doc->getElementsByTagName('body')->item(0)->childNodes as $node) {
+    $output .= $summary_doc->saveXml($node);
   }

Any reason why we can't simply use $summary_node->saveXml() instead? If so, please document inline, not on issue.

Added inline comment noting that there is no DOMNode::saveXml() function.

+++ b/modules/field/modules/text/text.module
@@ -380,68 +369,79 @@ function text_summary($text, $format = NULL, $size = NULL) {
+  // If the original format was plaintext, the summary should be, too.
+  if ($text === strip_tags($text)) {
+    $output = strip_tags($output);
+  }
This doesn't make sense to me. I'd at least like to know the reasons for introducing it (as inline comment), but actually, it looks like a major bug if you pass no HTML into something and suddenly deal with HTML along the way.

Might be caused by the unconditional invocation of check_markup(), or alternatively, because the DOM documents are loading text nodes into the body without a wrapping block-level element (DIV/P), so DOM might automatically inject one.

As you suspected, DOM is automatically injecting <p> tags. Revised the inline comment to make this clearer.

+++ b/modules/field/modules/text/text.test
@@ -286,95 +287,52 @@ class TextSummaryTestCase extends DrupalWebTestCase {
-      $this->callTextSummary($text, $expected[$i],    NULL, $i);
...
+      $this->callTextSummary($text, $expected[$i], 'filtered_html', $i);
I think we want and need to retain the text_summary() functionality for no format. It is useful for text [field] contents that does not have a text format associated, but still needs to be potentially rendered in "teaser" use-cases. The complex phrase detection logic in text_summary() is actually the only way in Drupal core to extract proper summaries from text strings.

The revised function works for plaintext, as tested by testLongSentence().

However, the example text in testLength() isn't very useful for the default plaintext format.

Calling check_markup("<p>\nHi\n</p>\n<p>\nfolks\n<br />\n!\n</p>", NULL)
yields "<p><br />\nHi<br />\n</p><br />\n<p><br />\nfolks<br />\n<br /><br />\n!<br />\n</p>".

If you really think it's useful to summarize that mess, I'll put it back.

Also note that the old text_summary() contained some more international punctuation characters, which should be retained.

Added the missing character, but I don't know what it is. Do you?

+++ b/modules/field/modules/text/text.test
@@ -383,7 +341,14 @@ class TextSummaryTestCase extends DrupalWebTestCase {
+    $this->assertIdentical(
+      $summary, $expected, t('Generated summary "@summary" matches expected "@expected".',
+        array(

Odd wrapping/indentation here; all except of the array keys should be on one line.

Revised, but we may have to agree to disagree here. I refuse to write code that extends past column 80 unless it is absolutely required by Drupal coding standards.

Log in or register to post comments

Comment #122

pillarsdotnet commented 13 May 2011 at 15:11

Status	File	Size
new	text_summary-221257-122.patch	15.54 KB

Grrr... Ignore the previous patch.

Log in or register to post comments

Comment #123

pillarsdotnet commented 13 May 2011 at 15:13

Status	File	Size
new	text_summary-221257-123.patch	15.56 KB

... And that one, too. I *really* need to get some sleep...

Log in or register to post comments

Comment #124

pillarsdotnet commented 15 May 2011 at 16:52

I see one minor problem:

+    // Using plain text format:
+    $expected = array (
+      0 => "<p>Hi</p>\n<p>folks</p>\n<p>!</p>",

That should read, // Using filtered_html format: but I'll wait for @sun's review.

If anyone from the International community would like to propose tests of non-English sentence parsing, I for one would welcome the assistance.

Log in or register to post comments

Comment #125

AlexisWilke commented 15 May 2011 at 06:21

I think we want and need to retain the text_summary() functionality for no format.

Note that this is something I mentioned. For a fix in D6, I don't think that function should be changed, but it is the right place to generate the teaser. Only problem, it either needs to understand HTML or have another function that does and calls text_summary() to handle the actual text to be cut for teaser purposes.

Log in or register to post comments

Comment #126

pillarsdotnet commented 15 May 2011 at 12:07

@AlexisWilke, @sun:

Of course the patch in #123 cannot be backported to D6 because D6 does not require the PHP DOM extension. For D6, an approach such as the one taken by #111 is required.

In D7 and D8, when you set the $format argument to NULL, that's the same as specifying the filter_fallback_format() filter, which is 'plain_text'.

The patch in #123 correctly handles such cases by calling check_markup() before looking for sentence and line breaks.

You seem to be requiring that text_summary() should treat a NULL $format argument as if 'full_html' had been specified, instead.

While that may be more consistent with the original code, I believe that it is a bug and should be treated as such.

A further advantage of this "unconditional invocation" of check_markup() is that text_summary() will correctly handle cases where the format includes such filters as Markdown, Image Assist, or Bbcode.

How can text_summary() be made HTML-aware without first invoking the required format filters to transform its input into HTML?

Log in or register to post comments

Comment #127

pillarsdotnet commented 15 May 2011 at 16:51

Status	File	Size
new	text_summary-221257-127.patch	16.02 KB

Corrected problem noted in #124 and also explicitly added the filter module to dependencies.

Log in or register to post comments

Comment #128

AlexisWilke commented 15 May 2011 at 20:44

@pillarsdotnet,

I did not see the code sun was talking about. However, something such as:

¡Hello!

would require "special" handling in that the first exclamation mark should not be viewed to end the previous sentence. It actually starts the next sentence. That's one special case I know of.

Another thing is the dash. In general, it isn't much of a problem in English. However, in French, cutting a word with a dash can be a problem. Although I guess the code was supposed to not cut words but on spaces. So that would not be a concern.

As for the text_summary(), I was referencing the problem in D6. The teaser function in there would call the text_summary() as is, whether the body had HTML or not. Anyway, the main reason for my current solution is to change a minimal number of things and possible all in the same file (which it is right now.)

Thank you.
Alexis Wilke

Log in or register to post comments

Comment #129

pillarsdotnet commented 16 May 2011 at 00:14

However, something such as:

¡Hello!

would require "special" handling in that the first exclamation mark should not be viewed to end the previous sentence. It actually starts the next sentence. That's one special case I know of.

The '¡' character is not an exclamation point. And even if it were, my code wouldn't use it as a break point. The regex I'm using breaks after an end-of-sentence character only if it is preceded by a word character and followed by a space character. However, it would break after common abbreviations such as 'Dr.' and 'Mr.', so I'm open to suggestions for a smarter algorithm, if you have any. For instance, how would you write code that splits at the end of a sentence, but does not split in the middle of the following sentence?

'They were like Dr. Jekyll and Mr. Hyde.'

As for the text_summary(), I was referencing the problem in D6.

The patch in question is not being submitted against D6. It is being submitted against D8. Are you suggesting that a patch against D8 be rejected on the grounds that it is unsuitable for D6? If so, this and most other D8 patches should simply be "Closed (won't fix)".

Log in or register to post comments

Comment #130

AlexisWilke commented 16 May 2011 at 05:26

No, not exactly... text_summary() is properly labeled and does what it is supposed to do. It just cannot be used the way it is (at least in D6.)

We should have an html_summary() function that does the same thing as text_summary() but has knowledge of HTML. The fact is that views and CCK call text_summary() to create teasers, just the same as the normal core node_teaser() implementation does and... it breaks those teasers as badly as the node teasers.

Any module should be able to call a very simple function to generate a teaser from HTML text they have (i.e. the Taxonomy could do that with descriptions.) But if you put any specialization in the node_teaser() directly, it won't be available to other modules and that's a shame.

As for periods after Mr. Ms. Sr. ... I don't know anything about those.

Now, note that European are not unlikely to put spaces all over the place: ¡ Hola ! is probably common in Spain. Just saying. I understand that if you don't find a full stop, ! or ?, then you ignore anyway so this isn't a problem.

Thank you.
Alexis Wilke

Log in or register to post comments

Comment #131

pillarsdotnet commented 16 May 2011 at 15:49

@AlexisWilke:

The patch in #127 causes all tests to pass and addresses all the objections raised so far. I would like for it to eventually reach "RTBC" status, however optimistic that may seem.

If you have an additional objection to the proposed patch against D8, please:

Clearly state your objection.
Suggest an additional test that illustrates your objection by failing in connection with the proposed patch.
Suggest a revised patch that overcomes the objection and enables the additional test to pass.

If, however, you are still talking about D6, or about D6 contrib modules, please hold your comments until this issue once again reaches D6 backport status. By that time the comment count will no doubt have reached three hundred or more, and I wouldn't want your insights to be lost in the middle.

Log in or register to post comments

Comment #132

pillarsdotnet commented 16 May 2011 at 14:30

Status	File	Size
new	text_summary-221257-132.patch	15.95 KB

Another slight tweak. Since the splitting regex only contains assertions, there is no need for the additional parameters to preg_split().

~~Note that the approach used cannot be extended to include Unicode end-of-sentence marks unless we require PHP 5.1.0 or later.~~

~~See http://php.net/manual/regexp.reference.unicode.php~~

EDIT: Never mind; there isn't a Sentence_Terminal character class yet.

See http://unicode.org/review/pr-23.html

Log in or register to post comments

Comment #133

pillarsdotnet commented 16 May 2011 at 14:31

Status	File	Size
new	text_summary-221257-133.patch	17.45 KB

Tried to add support for Unicode Sentence_Terminal characters by explicit list. Wonder if it will work?

Log in or register to post comments

Comment #134

Jamie Holly commented 16 May 2011 at 13:05

D7 requires PHP 5.2.5+. D8 will probably be the same, or could be PHP 5.3+ - that's currently in discussion. I would go ahead and use the Unicode, and then if it makes it to a D6 back port status, a work around could be worked on.

However, I kind of wonder if we aren't look at this all wrong. I can think of other instances where an end-of-sentence character might not signify an actual end of sentence, for example in code blocks:

$foo = (! $bar);

(Yeah it's a sucky coding standard, but you get the idea. Add to that the other variations in different coding languages, and there is probably many more examples out there.)

Maybe instead of end-of-sentence, a better approach would be splitting at end-of-block-elements or even br's.

(Numerous times I've seen where teaser breaks have been put in automatically with only one sentence left in the article (I've seen this in Wordpress as well). It really leaves the impression that the site is just looking to drive-up page views. )

Going this route would seem a lot simpler and with much better performance, plus make automatically placed breaks seem much more logical, especially to the reader/visitor. On top of that, it would be a much more "fool proof" approach, though probably not 100% yet.

Of course this is a very significant change from how the function operated before, so getting it into even D7 might be near impossible.

Log in or register to post comments

Comment #135

pillarsdotnet commented 16 May 2011 at 13:21

Status	File	Size
new	text_summary-221257-135.patch	17.48 KB

Perhaps this will better address @sun's objections to nested indentation.

EDIT: Sorry about the crosspost.

@#134, I mostly agree, despite the fact that your proposed example wouldn't force a break anyway. The open parenthesis character is not part of the [:word:] class.

Log in or register to post comments

Comment #136

pillarsdotnet commented 16 May 2011 at 13:26

Status	File	Size
new	text_summary-221257-136.patch	15.5 KB

Okay, here's a patch that avoids breaking in the middle of a XML_TEXT_NODE, as suggested in #134. I expect some of the tests to fail.

Hmm... As an intermediate approach, we could turn off sentence-splitting if any of the ancestor tags are 'code'.

According to @sun, performance is a non-issue. This function can take as long as it wants to, since the results are cached.

;-)

Log in or register to post comments

Comment #137

pillarsdotnet commented 16 May 2011 at 15:23

Status	File	Size
new	text_summary-221257-137.patch	17.77 KB

Same as #135, but avoids splitting text nodes within code blocks. A test is needed.

Log in or register to post comments

Comment #138

pillarsdotnet commented 16 May 2011 at 14:18

Status	File	Size
new	text_summary-221257-138.patch	17.78 KB

Interesting. Converting UTF entities to characters is not supported by our testbots. What, I wonder, is the proper way to create Unicode characters? For now, I'll just quote them verbatim, but the result is unreadable in my text editor.

This one passes tests locally; hopefully the testbot will agree.

Log in or register to post comments

Comment #139

pillarsdotnet commented 16 May 2011 at 14:27

Status	File	Size
new	text_summary-221257-139.patch	17.78 KB

Found and fixed a spelling error in a comment.

Log in or register to post comments

Comment #140

pillarsdotnet commented 16 May 2011 at 15:21

Status	File	Size
new	text_summary-221257-140.patch	17.79 KB

Forgot to add the $parents parameter to the recursive call. Still need a test that would otherwise split in the middle of a code block.

Log in or register to post comments

Comment #141

pillarsdotnet commented 19 May 2011 at 01:32

Status	File	Size
new	text_summary-221257-141.patch	19.75 KB

Added tests for code blocks and for Unicode Sentence_Terminal characters. Note that the test failures in #140 have nothing to do with this patch. Also included in #823380-111: Read More link is always visible on teaser.

Log in or register to post comments

Comment #142

pillarsdotnet commented 19 May 2011 at 03:51

Status:

Needs review

» Closed (duplicate)

Marking this as a dup since it's rolled into #823380-111: Read More link is always visible on teaser.

Log in or register to post comments

Comment #143

AlexisWilke commented 19 May 2011 at 07:20

pillarsdotnet,

Is that a way to close this issue on D7/D6 users?!

Or will that other issue be back ported instead?

Thank you.
Alexis

Log in or register to post comments

Comment #144

pillarsdotnet commented 19 May 2011 at 07:28

Don't worry; I want this backported as much as you do.

Log in or register to post comments

Comment #145

sun

German

Karlsruhe

commented 19 May 2011 at 14:12

Status:

Closed (duplicate)

» Needs review

#823380: Read More link is always visible on teaser. is not related, does not conflict, and is not a duplicate of this issue.

Log in or register to post comments

Comment #146

pillarsdotnet commented 19 May 2011 at 15:53

@sun -- It's not a dup, but the patches are related; they do overlap; they do break each other; they were merged at the suggestion of intoxination in #823380-90: Read More link is always visible on teaser., and that issue is much shorter and hence more likely to actually receive a review.

Log in or register to post comments

Comment #147

zilverdistel commented 24 May 2011 at 08:21

Subscribing. And I also don't think this is a duplicate of #823380.

--EDIT--
I'd like +1 for the need to backport this to D7.

Log in or register to post comments

Comment #148

pillarsdotnet commented 24 May 2011 at 11:33

@zilverdistel: See #109.

Log in or register to post comments

Comment #149

pillarsdotnet commented 28 May 2011 at 04:22

Issue tags:

-Needs backport to D6, -teaser, -Needs backport to D7

#141: text_summary-221257-141.patch queued for re-testing.

Log in or register to post comments

Comment #150

pillarsdotnet commented 1 June 2011 at 22:28

Issue tags:

+Needs backport to D6, +teaser, +Needs backport to D7

#141: text_summary-221257-141.patch queued for re-testing.

Log in or register to post comments

Comment #151

sun

German

Karlsruhe

commented 6 June 2011 at 20:01

Category:

bug

» task

I like the new implementation. Need to do another in-depth review soon.

However, this issue existed in the past 2-3 Drupal core versions already, and the new code implements an entirely new approach to hopefully resolve the issue finally in a rock solid way, but this also means we're improving the previously existing code in Drupal core. A major annoyance, but not really a bug.

Log in or register to post comments

Comment #152

pillarsdotnet commented 13 June 2011 at 01:18

#141: text_summary-221257-141.patch queued for re-testing.

Log in or register to post comments

Comment #153

fmosca commented 14 June 2011 at 12:31

subscribing, +1 for the need to backport this to D7

Log in or register to post comments

Comment #154

pillarsdotnet commented 26 June 2011 at 03:14

Issue tags:

-Needs backport to D6, -teaser, -Needs backport to D7

#141: text_summary-221257-141.patch queued for re-testing.

Log in or register to post comments

Comment #155

jct commented 23 July 2011 at 20:20

Issue tags:

+Needs backport to D6, +teaser, +Needs backport to D7

#141: text_summary-221257-141.patch queued for re-testing.

Log in or register to post comments

Comment #156

jct commented 23 July 2011 at 20:21

Attempted to add issue summary

Log in or register to post comments

Comment #156.0

jct commented 23 July 2011 at 20:21

Issue summary:

View changes

Adding issue summary

Log in or register to post comments

Comment #156.1

jct commented 23 July 2011 at 20:39

Issue summary:

View changes

Changing my summary to one provided in a comment

Log in or register to post comments

Comment #157

mangelp commented 4 August 2011 at 08:50

Hi, I've tried this patch in a drupal 7 site and it seems to work fine, but I've found that if the text to summarize starts with a sentence longer than the maximum length the summary is returned empty because _text_summarize stop processing sentences when it finds a sentence with a length longer than the maximum allowable length left.

Is that a bug or is the expected behaviour?

Log in or register to post comments

Comment #158

pillarsdotnet commented 5 August 2011 at 03:12

... _text_summarize stop processing sentences when it finds a sentence with a length longer than the maximum allowable length left.

Is that a bug or is the expected behaviour?

Possibly sub-optimal behavior, depending on whether you are willing to have a summary that stops in the middle of a sentence.

Log in or register to post comments

Comment #159

nancydru

she/her/hers

English

Boston

commented 6 August 2011 at 16:34

From the Merriam-Webster dictionary for "teaser:"

an advertising or promotional device intended to arouse interest or curiosity especially in something to follow

Stopping in the middle of a sentence is certainly going to arouse more interest than blanks.

Log in or register to post comments

Comment #160

Akaoni commented 23 August 2011 at 00:51

Subscribing.
+1 for D7 backport.

Log in or register to post comments

Comment #161

zouz747 commented 6 September 2011 at 10:04

Version:	8.x-dev	» 7.8
Category:	task	» feature
Priority:	Major	» Normal

I all, I'm new to Drupal and I see the utf8_truncate function being called in text_summary() handles the add_ellipsis parameter which BTW could be added in the UI configuration screen for teasers (Y/N) for setting. Assuming it's been set somewhere, shouldn't the parameter be passed to text_summary() from text_field_formatter_view()?

Log in or register to post comments

Comment #162

dddave commented 6 September 2011 at 10:08

Version:	7.8	» 8.x-dev
Category:	feature	» task
Priority:	Normal	» Major

Being new is fine, fiddling wildly with the issue settings not. ;)

Log in or register to post comments

Comment #163

rkodrupal commented 28 September 2011 at 18:08

sorry i haven't been keeping up with this thread since #49 but i recently discovered how to 'break' the code ... have a teaser that exceeds the length of an em string, then re-edit the article (shortening it) so it breaks in the middle of the em string.

again i'm still on drupal 6.0 and using the code mentioned in #49, but I thought I'd share this in case it's helpful for further code development.

here's the offending code that broke on re-editing:
Our in-house <em>judiciary</em><a id="/sites/default/files/snapshot/1499.jpg" class="imgtip" href="/node/1499"><img title="Click here for more information." alt="" src="/sites/all/themes/fordrupal/zincout/icons/internal_link_icon.gif"></a> gets to

Log in or register to post comments

Comment #164

pillarsdotnet commented 28 September 2011 at 18:41

@rkodrupal:

since #49 but i recently discovered how ... it breaks in the middle of the em string.

The current code in #141 makes this impossible. If you think it is necessary, I will add your example to the tests to prove it.

@NancyDru:

Stopping in the middle of a sentence is certainly going to arouse more interest than blanks.

The Unicode line-break algorithm is ... complicated.

I'm not sure it's worth implementing. Are you?

Log in or register to post comments

Comment #165

rkodrupal commented 28 September 2011 at 19:15

not a big deal on my part ... the breakage is quite evident on the screen and i can manually edit the teaser. just sharing this in case it was worthwhile. we do strive for perfection heh?

is the #141 code drupal 6 friendly? i'm losing track of the versions.

k

Log in or register to post comments

Comment #166

pillarsdotnet commented 28 September 2011 at 21:20

Title:

text_summary() should be made HTML-aware.

» text_summary() should output valid HTML and Unicode text.

Status	File	Size
new	text_summary-221257-166-d6.patch	13.24 KB
new	text_summary-221257-166-d7.patch	21.16 KB
new	text_summary-221257-166.patch	21.16 KB

@rkodrupal:

is the #141 code drupal 6 friendly? i'm losing track of the versions.

No. I can roll a D6 patch, but I'm sure that the API changes will not be accepted. Please let me know how it works for you; I'm getting weird results in my tests.

@NancyDru:

Stopping in the middle of a sentence

Okay; implemented as a fallback only if no valid sentence break is found:

+    // If no suitable sentence break is found, we split before any Unicode
+    // separator character.
+    $word_splitter = '/(?=\p{Z})/u';

Log in or register to post comments

Comment #167

pillarsdotnet commented 28 September 2011 at 21:40

Status	File	Size
new	text_summary-221257-167-d6.patch	13.28 KB

@rkodrupal:

I'm getting weird results in my tests.

This version makes the weirdness go away, for d6. The problem is that d6 code assumes that:

Javascript is enabled.
The teaser is an exact substring of the body.

Log in or register to post comments

Comment #168

28 September 2011 at 21:56

Status:

Needs review

» Needs work

The last submitted patch, text_summary-221257-166.patch, failed testing.

Log in or register to post comments

Comment #169

pillarsdotnet commented 28 September 2011 at 22:30

Status	File	Size
new	text_summary-221257-169-d6.patch	13.34 KB
new	text_summary-221257-169-d7.patch	21.16 KB
new	text_summary-221257-169.patch	21.16 KB

Fixed failing test. (When did that line disappear from filter_dom_serialize()?)

Log in or register to post comments

Comment #170

rkodrupal commented 5 October 2011 at 01:08

sorry for delay. i just updated to drupal 6.22 thinking it would be a good test of text_summary-221257-169-d6.patch.

if i create a new node it splits the teaser fine. if i re-edit that same node it inserts an annoying manual break ... duplicating everything within the teaser above and below the break.

so i was unable to execute the test from #163 because i never made it that far.

before you also included an update for node.pages.inc. is that still needed with this latest patch for d6?

thanks for all you great work on this.

k

Log in or register to post comments

Comment #171

pillarsdotnet commented 5 October 2011 at 01:23

Status:

Needs work

» Needs review

if i create a new node it splits the teaser fine. if i re-edit that same node it inserts an annoying manual break ... duplicating everything within the teaser above and below the break.

Sorry, but that's just part of the broken-by-design d6 javascript-powered teaser-break weirdness, and a good example of why that featurebug was removed in d7.

I don't know how to fix it so that it works as expected, both:

When the teaser summary is included with the node body.

and:

When the teaser summary is not included with the node body.

before you also included an update for node.pages.inc. is that still needed with this latest patch for d6?

I'm sorry, but looking back through the last few versions, I don't see one that updated node.pages.inc. Can you provide a link?

Furthermore, this issue is currently targeting version 8.x. It's nice to get the 7.x and 6.x backports out of the way, but problems with them should not distract from getting the patch reviewed and into 8.x, which must happen first.

Therefore, setting back to needs review.

Log in or register to post comments

Comment #172

rkodrupal commented 5 October 2011 at 15:38

I vote for making it work when the teaser summary is not included with the node body. There should be no reason for D6 to add in a manual teaser break on its own, and that was the main benefit for the fix done in http://drupal.org/node/221257#comment-3735536 (#49 above)

... the patch at #49 also includes the node.pages.inc change.

Thanks

Log in or register to post comments

Comment #173

pillarsdotnet commented 5 October 2011 at 16:00

@#172 by rkodrupal:

the patch at #49 also includes the node.pages.inc change.

Okay, I'll try to remember that the next time I roll a backport. For now, I'm just trying to get this accepted into 8.x, which lacks a node_body_field() in its node.pages.inc file.

See: http://drupal.org/update/modules/6/7#body_teaser_fields
And: #372743: Body and teaser as fields

Log in or register to post comments

Comment #173.0

pillarsdotnet commented 5 October 2011 at 16:00

Issue summary:

View changes

Corrected pointer to current patch.

Log in or register to post comments

Comment #174

AlexisWilke commented 5 October 2011 at 20:39

Actually, in D8 you probably want to change the scheme.

A much better idea is to have a flag to tell you whether the teaser was generated or not and completely forget about the comparison between the beginning of the body and the teaser (which is a crazy idea.)

Once you have that flag, you know when you edit whether the teaser is part of the body or not.

That would not work for D7 and D6, of course.

Good luck.
Alexis

Log in or register to post comments

Comment #175

pillarsdotnet commented 5 October 2011 at 21:21

@#174 by AlexisWilke:

A much better idea is to have a flag to tell you whether the teaser was generated or not and completely forget about the comparison between the beginning of the body and the teaser (which is a crazy idea.)

Did you mean to comment on #823380: Read More link is always visible on teaser.?

Log in or register to post comments

Comment #176

AlexisWilke commented 6 October 2011 at 02:58

I'm the one who wrote the patch in #49. You don't need to have this function:

node_compare_teaser_and_body()

if you have a flag to know that the teaser was auto-generated.

Unfortunately, in D6, that was the easiest way to fix the problem, but in a new version that should not be done that way... The Read More link has nothing to do with that, although a flag to mark that the teaser == body is a good idea too. 8-)

Thank you.
Alexis Wilke

Log in or register to post comments

Comment #177

pillarsdotnet commented 6 October 2011 at 03:32

@#176 by AlexisWilke on October 6, 2011 at 2:58am:

I'm the one who wrote the patch in #49.

Okay; I apologize for not recognizing you as an early contributor to this issue.

You don't need to have this function:

node_compare_teaser_and_body()

The patch from #169 does not contain such a function, nor can I find one anywhere in core.

if you have a flag to know that the teaser was auto-generated.

Such a flag, if it were to exist, would having nothing to do with this patch. The text_summary() function does not deal with node or entity objects. It only deals with plain text.

Unfortunately, in D6, that was the easiest way to fix the problem, but in a new version that should not be done that way...

And the new version is not done that way, which is why I am having difficulty understanding the relevance of your commentary. Have you actually read the code? If not, please do. The only thing holding up this issue is the lack of a thorough code review by someone other than myself.

The Read More link has nothing to do with that, although a flag to mark that the teaser == body is a good idea too. 8-)

Again, this is why I thought you were referring to the other issue, where such commentary would be more relevant.

Log in or register to post comments

Comment #178

rkodrupal commented 6 October 2011 at 17:10

for other folks reading this thread i'm stickin' with the patch in #49 for drupal 6.22 ... i did need to split the patch between the .module and the .inc ... the patch for .module works well in drupal 6.22, but not the patch for .inc ... the .inc patch is only 1 line so just manually make the change.

of course this doesn't fix the minor issue i raised in #163.

unfortunately delays in updates to the contrib modules are keeping me from going to drupal 7, so any worries about drupal 8 are far off into my future.

k

Log in or register to post comments

Comment #179

David_Rothstein commented 21 November 2011 at 05:07

Status:

Needs review

» Needs work

A couple comments after reading through this issue:

```
 function text_summary($text, $format = NULL, $size = NULL) {
...
+    return trim(check_markup(substr($text, 0, $delimiter), $format));
```
The check_markup() function has already been called on the text before it was ever passed to text_summary(). Therefore, this code is going to result in check_markup() being called twice. (See also: #1347920: text_summary() checks for specific filters, but it should be agnostic to filters.)
There were comments above stating that we don't have to worry about the performance of text_summary() because it's cached.... Unfortunately, though, it's actually not cached (although it probably should be). See #1347910: The result of text_summary() should be cached rather than calculated every time a teaser is viewed. So unless that's fixed, we do have to think a bit about performance here.

Log in or register to post comments

Comment #180

effulgentsia commented 21 December 2011 at 02:14

Thanks for everyone's work on this. In addition to the feedback in #179,

+++ b/modules/field/modules/text/text.module
@@ -380,68 +369,147 @@ function text_summary($text, $format = NULL, $size = NULL) {
+  if ($body->hasChildNodes()) {
+    $node = $summary->appendChild($doc->createElement($body->tagName));
+    $parents[] = $body->tagName;
+    foreach ($body->childNodes as $child) {
+      if ($size > 0) {
+        $size = _text_summarize($child, $size, $node, $doc, $parents);
+      }
+      else {
+        break;
+      }
+    }
   }

This fails to include self-closing tags (like images) in the summary. It also fails to include tag attributes (like href).

For anyone interested, I added a Improved Text Trim sandbox project that alters the text field trimming formatters to use a trimming function modeled on #169 with #179 and the above incorporated.

@pillarsdotnet: #169 needs to be rerolled for the D8 "core" folder introduction. Do you want to do that and incorporate the fixes from my sandbox that you approve of, or would you like me to post a patch here with the fixes?

Log in or register to post comments

Comment #181

effulgentsia commented 21 December 2011 at 02:20

Title:

text_summary() should output valid HTML and Unicode text.

» text_summary() should output valid HTML and Unicode text, and not count markup characters as part of the text length

Log in or register to post comments

Comment #182

andrey-z commented 6 January 2012 at 23:29

Status:	Needs work	» Needs review
Issue tags:	-Needs backport to D6, -teaser, -Needs backport to D7

#169: text_summary-221257-169.patch queued for re-testing.

Log in or register to post comments

Comment #183

7 January 2012 at 03:47

Status:	Needs review	» Needs work
Issue tags:		+Needs backport to D6, +teaser, +Needs backport to D7

The last submitted patch, text_summary-221257-169.patch, failed testing.

Log in or register to post comments

Comment #184

pillarsdotnet commented 23 January 2012 at 18:57

@effulgentsia -- Sorry; I've been offline for a long time and not likely to get back to this soon. Please post whatever patches you feel appropriate.

Log in or register to post comments

Comment #185

NROTC_Webmaster commented 25 February 2012 at 00:39

sub

Log in or register to post comments

Comment #186

pillarsdotnet commented 6 March 2012 at 08:40

@#179 Posted by David_Rothstein on November 21, 2011 at 5:07am:

The check_markup() function has already been called on the text before it was ever passed to text_summary(). ... See #1347920: text_summary() checks for specific filters, but it should be agnostic to filters

That has not been the case in my tests. Has something changed in core? (checking...) Wow. I guess it has. Re-rolling accordingly.

See #1347910: The result of text_summary() should be cached rather than calculated every time a teaser is viewed. So unless that's fixed, we do have to think a bit about performance here.

I'm glad there is another issue to worry about performance and caching so that I don't have to worry about it in this issue. (grin!)

Log in or register to post comments

Comment #187

pillarsdotnet commented 6 March 2012 at 09:16

Status	File	Size
new	text_summary-221257-187.patch	22.68 KB

@#180 Posted by effulgentsia on December 21, 2011 at 2:14am

Thanks.

Re-rolled using your sandbox as a guide.

Log in or register to post comments

Comment #188

pillarsdotnet commented 6 March 2012 at 09:17

Status:

Needs work

» Needs review

Grr....

Log in or register to post comments

Comment #189

pillarsdotnet commented 6 March 2012 at 09:18

Also needs test-cases for the shortcomings pointed out by Effulgentsia in #180:

This fails to include self-closing tags (like images) in the summary. It also fails to include tag attributes (like href).

Log in or register to post comments

Comment #190

6 March 2012 at 09:53

Status:

Needs review

» Needs work

The last submitted patch, text_summary-221257-187.patch, failed testing.

Log in or register to post comments

Comment #191

pillarsdotnet commented 6 March 2012 at 10:48

Status	File	Size
new	text_summary-221257-191.patch	21.46 KB

No clue why the fatal error occurred. Re-rolling with some trivial changes and re-uploading.

Log in or register to post comments

Comment #192

pillarsdotnet commented 6 March 2012 at 10:48

Status:

Needs work

» Needs review

Go, testbot!

Log in or register to post comments

Comment #193

6 March 2012 at 12:45

Status:	Needs review	» Needs work
Issue tags:	-Needs backport to D6, -teaser, -Needs backport to D7

The last submitted patch, text_summary-221257-191.patch, failed testing.

Log in or register to post comments

Comment #194

pillarsdotnet commented 6 March 2012 at 15:02

Status:

Needs work

» Needs review

#191: text_summary-221257-191.patch queued for re-testing.

Log in or register to post comments

Comment #195

6 March 2012 at 15:45

Status:	Needs review	» Needs work
Issue tags:		+Needs backport to D6, +teaser, +Needs backport to D7

The last submitted patch, text_summary-221257-191.patch, failed testing.

Log in or register to post comments

Comment #196

Mark_L6n commented 19 May 2012 at 01:05

Determining sentence boundaries is a complex issue, even if only English is considered. Periods are frequently used for purposes other than sentence boundaries, particularly abbreviations. The difficulty can be seen by the number of academic papers on it...take a quick look at what Google Scholar returns for "sentence boundaries". People have used rule-based, statistical, neural-network, machine-learning approaches, etc. to determine whether a stop is a sentence boundary or not.
You could do a quick-and-dirty processor (that wouldn't always be accurate) by searching for a stop followed by white space followed by a capital letter, disallowing Mr., Mrs., Dr. etc.
However, that's only for English and still incomplete.
So of course using the nearest markup boundary is much better. Perhaps a basic sentence boundary detector could be implemented, but allowed only as an option, with a note explaining it won't always be accurate.

Log in or register to post comments

Comment #197

pillarsdotnet commented 19 May 2012 at 15:30

@Mark123 -- Patches are always welcome.

Log in or register to post comments

Comment #198

dozymoe commented 7 August 2012 at 15:50

Maybe you can think of text_summary as another text-formats filter, that is inserted right before the last filter, which is usually html corrector, or if user choose, htmLawed.

If it can be seen like that, then the call to _filter_htmlcorrector() is outside of text_summary() function's contract.

Log in or register to post comments

Comment #199

tim.plunkett

he/him

English

Philadelphia

commented 21 December 2012 at 23:54

Category:

task

» feature

This is a feature request.

Log in or register to post comments

Comment #200

jenlampton

she / her

commented 22 December 2012 at 23:41

Patch at #76 still applies cleanly to 7.18. In case anyone else still needs a fix today :) And I'm not sure why this isn't a bug report.

Log in or register to post comments

Comment #201

David_Rothstein commented 24 December 2012 at 19:48

Category:	feature	» bug
Priority:	Major	» Normal

Yeah, as written, this is definitely a bug report (regardless of whether the patch here also goes into feature request territory)...

However, I'm not convinced the main bug (faulty HTML being produced in teasers) exists anywhere except Drupal 6? That's because of #1235062: text_summary() ignores filter status - basically, in Drupal 7 and Drupal 8, the two bugs should cancel each other out and ensure the HTML corrector filter always runs.

I think we should split the difference and call it a non-major bug. Even in Drupal 6, I would think that enabling the HTML corrector filter gets around the main part of this bug.

Log in or register to post comments

Comment #202

lightsurge commented 8 January 2013 at 12:27

I can't find any reference to text_summary() in 8.x, has it been removed and that was why this issue was put to feature request?

Log in or register to post comments

Comment #203

lightsurge commented 8 January 2013 at 12:31

Appears here:
http://drupalcontrib.org/api/drupal/drupal%21modules%21field%21modules%2...

But not here:
http://api.drupal.org/api/drupal/modules!field!modules!text!text.module/...

Log in or register to post comments

Comment #204

David_Rothstein commented 8 January 2013 at 14:18

text_summary() is still in Drupal 8 (in core/modules/text/text.module).

On api.drupal.org it only appears for Drupal 7 right now; I'm not sure why but I believe that hasn't been rebuilt in a while and the module has moved around recently.

Log in or register to post comments

Comment #205

jenlampton

she / her

commented 7 March 2013 at 08:30

Priority:

Normal

» Major

Bumping back up to major because I'm still seeing this problem in D7.

I'm not sure which of the two things that are supposed to "cancel each other out" is failing, but without the patch (I'm using this one from #76) my site is still hosed.

Above mentioned patch still applies to 7.21, for others who are also currently hosed :)

Log in or register to post comments

Comment #206

dcam commented 24 May 2013 at 04:06

Issue tags:

+Needs issue summary update

http://drupal.org/node/1427826 contains instructions for updating the issue summary with the summary template.

The summary may need to be updated with information from comments.

Log in or register to post comments

Comment #207

ttkaminski commented 18 July 2013 at 21:41

Not sure why no one brought this up, but it seems that if you have a post that has a div tag that spans the  marker, then the teaser summary will contain invalid HTML. Example:

<div>
  <p>Hello</p>
  <!--break-->
  <p>there</o>
</div>

text_summary() will return:

<div>
  <p>Hello</p>

Which breaks the entire layout of the site because of the missing div close tag. My recommendation is to include scenario in the tests and also perform HTML correction after the text is trimmed via the 

Log in or register to post comments

Comment #208

swentel commented 23 September 2013 at 12:16

Component:

field system

» text.module

Moving to text module

Log in or register to post comments

Comment #209

Stalski commented 25 September 2013 at 12:08

Just tried to read this huge issue. To go forward with this, can anyone give a status update summary?

Especially :
- What's the work that needs to be done?
- Which patch is the most probable to start with to move forward?
- As lots of issues change the testFirstSentence test case, I think it might be a good idea to have a separate test for this issue, no?

Log in or register to post comments

Comment #210

alexverb commented 23 October 2013 at 21:45

I don't want to sidetrack this issue. But I've provided a contrib solution for D7 that might be of interest to this issue: https://drupal.org/node/2118995. It provides a way to trim text without taking HTML tags into account in about 20 lines of code:

/**
 * Returns trimmed HTML without counting HTML tags.
 */
function _field_html_trim_get_trimmed_html($safe_value, $trim_length, $ellipsis) {

  // Get text without any formatting.
  $text = strip_tags(str_replace('<', ' <', $safe_value));
  $text = preg_replace('/\n|\r|\t/m', ' ', $text);
  $text = str_replace('&nbsp;', ' ', $text);
  $text = str_replace("\xc2\xa0", ' ', $text);
  $text = trim(preg_replace('/\s\s+/', ' ', $text));

  if (strlen($text) > $trim_length) {
    // Shorten the text.
    $short_text = truncate_utf8($text, $trim_length, 1);
    // Break text up in words.
    $words = str_word_count($short_text, 1);
    // Build regular expression pattern.
    $pattern = implode("(.*?)", $words);
    $pattern = '/(.*?)' . $pattern . '/';
    // Find the word pattern in our HTML.
    if (preg_match($pattern, $safe_value, $result)) {
      // Close any unclosed HTML tags and append ellipsis.
      $safe_value = _filter_htmlcorrector($result[0] . $ellipsis);
    }
  }

  return $safe_value;
}

I have not read this entire issue. But I've looked at some patches and see that DOMDocument is used. Probably to get to sentence level for cutting off. My solution does not take that into account, but might be an interesting approach to the HTML problem.

Log in or register to post comments

Comment #210.0

alexverb commented 23 October 2013 at 21:45

Issue summary:

View changes

Updated info, links, and status.

Log in or register to post comments

Comment #211

jtbayly commented 29 December 2013 at 22:28

After close to a year of struggling to figure out how our entire homepage layout gets destroyed every few months, I finally tracked it down to this bug.

Isn't this really (at least) two different bugs?

1. "Correct faulty and chopped off HTML" is completely ignored on Summary field, thereby breaking sites on a regular basis. (If you use a WYSIWYG tool of any sort and use a break to determine the summary, there are all kinds of ways to cause problems. Any tag that isn't closed prior to the break will destroy your site wherever that summary is displayed.) Don't we have a fix for this, and can't we just apply it? I know it will be painful for a bit, but really it's a catch 22. We can't wait forever.

2. The teaser isn't quite as long as it should be sometimes, because HTML code and is getting counted as words.

Would it help to split them up into 2 different issues?

-Joseph

Log in or register to post comments

Comment #212

tarekdj commented 12 September 2014 at 16:42

Issue summary:	View changes
Status:	Needs work	» Needs review

Status	File	Size
new	text_summary-221257-212.patch	3.48 KB

Patch based on #210.

Log in or register to post comments

Comment #213

12 September 2014 at 17:36

Status:

Needs review

» Needs work

The last submitted patch, 212: text_summary-221257-212.patch, failed testing.

Log in or register to post comments

Comment #214

valderama commented 23 September 2014 at 12:27

Until this gets fixed in core one can use smart_trim module with a patch from this issue applied: https://www.drupal.org/node/1916060

Log in or register to post comments

Comment #215

jenlampton

she / her

commented 16 October 2014 at 07:16

Since the most recent patch no longer applies cleanly to D7 also, still applying the patch in #76 to 7.32

@jtbayly I think you are exactly right with your explanation of the two problems, and we did have two (or more) issues, but since this one had a working patch, the others were closed and marked a dupe of this one.

Log in or register to post comments

Comment #216

jenlampton

she / her

commented 17 January 2015 at 01:49

Patch in #76 still applies clealy to 7.34. Tried the patch in #169 again also, but no luck there.

Log in or register to post comments

Comment #217

jenlampton

she / her

commented 1 April 2015 at 03:19

Issue summary:

View changes

removing current issue status from issue

Log in or register to post comments

Comment #218

bburg

he/him

Washington, DC

commented 15 May 2015 at 19:42

Issue summary:

View changes

Updating broken link.

Log in or register to post comments

Comment #219

bburg

he/him

Washington, DC

commented 15 May 2015 at 19:42

Confirming this is still issue in D8 head.

Log in or register to post comments

Comment #220

mdrayer commented 15 May 2015 at 22:40

Issue tags:

+Triaged at DrupalCon Los Angeles 2015

This is a triage at the Los Angeles Sprints. @bburg and I did the triage.

We reproduced the issue against Drupal 8 head with the following steps (similar to steps in summary):

Set text field format to 'Summary or trimmed' with Trimmed limit set to 200 characters
Create new content, insert the following text into the Source for the text field:
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce diam est, finibus a lobortis ac, mattis imperdiet sapien. Sed non libero ac augue aliquam volutpat. Praesent sollicitudin ipsum ac amet.</p>
The trim will count the HTML markup in the character count; so instead of displaying "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce diam est, finibus a lobortis ac, mattis imperdiet sapien. Sed non libero ac augue aliquam volutpat. Praesent sollicitudin ipsum ac amet." (199 characters), the field outputs as "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce diam est, finibus a lobortis ac, mattis imperdiet sapien. Sed non libero ac augue aliquam volutpat."

Note that HTML entities and Unicode characters seemed to get decoded before the trim count went into effect, so those passed our testing. HTML markup seems to be the only problem here. The HTML corrector step in the summary is unnecessary and seems like a whole other issue.

Log in or register to post comments

Comment #221

bburg

he/him

Washington, DC

commented 15 May 2015 at 22:42

Status	File	Size
new	text_summary-221257-221.patch	3.49 KB

Rerolled patch against current head. Resolved conflict generated from #2361745: Remove usage of truncate_utf8(). This will fail tests.

Log in or register to post comments

Comment #222

kgoel commented 16 May 2015 at 00:19

Status:

Needs work

» Needs review

Log in or register to post comments

Comment #223

bburg

he/him

Washington, DC

commented 16 May 2015 at 00:34

While several tests need to be rewritten to address the new behavior being added by this issue, at least one appears to be legitimately failing. namely testFirstSentenceQuestion(). This is because the new function added by this patch _html_trim_get_trimmed_html() strips out the ending sentence character.

Log in or register to post comments

Comment #224

16 May 2015 at 00:48

Status:

Needs review

» Needs work

The last submitted patch, 221: text_summary-221257-221.patch, failed testing.

Log in or register to post comments

Comment #225

xjm

she/her

English

commented 20 May 2015 at 19:28

(Saving proposed issue credit for discussion and triage participants at LA.)

Log in or register to post comments

Comment #226

jtbayly commented 12 June 2015 at 16:50

"The HTML corrector step in the summary is unnecessary and seems like a whole other issue."

I'm not sure exactly what you mean by that. The issue title starts out with "text_summary() should output valid HTML..." Are you saying that that should be a different issue? I would say it's the more important of the two bugs under discussion here. Teaser length being off by a little bit isn't nearly as big of a deal as outputting bad HTML onto the homepage.

Regardless, if you don't intend to address that part of it, let's change the title here and I'll create another issue. Let me know.

And thanks for working on this.

-Joseph

Log in or register to post comments

Comment #227

bburg

he/him

Washington, DC

commented 13 June 2015 at 14:44

Status	File	Size
new	TextSummary.txt	2.02 KB

@jtbayly, To recreate the reported issue, one must turn off the HTML corrector filter. So the reported problem essentially boils down to: If we turn off the feature that was created to fix this bug in the first place, then the bug occurs. Which doesn't make a whole lot of sense as an issue to me. Now we can discuss whether this should just be default behavior for the summary text formatter in the first place? That's what we were thinking by "The HTML corrector step in the summary is unnecessary and seems like a whole other issue."

I did start work on a TextSummary class to create summaries while preserving markup integrity. Unfortunately, I haven't had much time since the week after DrupalCon to work on this. It's not remotely ready, hence why I haven't posted a patch, but I am only sharing this now to start a discussion on the direction I was thinking about heading in. Rather than relying on regular expressions to manipulate HTML, I was thinking of constructing the snippet as a DOM and traversing through the node tree recursively. Example attached (beware, I haven't looked at this in almost a month).

Log in or register to post comments

Comment #228

wim leers

Ghent 🇧🇪🇪🇺

commented 15 June 2015 at 15:54

Devil's advocate:

Shouldn't we just remove auto-truncating? It's a bad practice anyway. Have a separate teaser field instead.

Log in or register to post comments

Comment #229

bburg

he/him

Washington, DC

commented 16 June 2015 at 13:11

@wim-leers, Good question. Because content editors are lazy? I've used auto-truncating as a backup on most of my projects. But I definitely get that as a general practice, it's better to curate your own summaries.

Great session at DrupalCon btw.

Log in or register to post comments

Comment #230

jtbayly commented 16 June 2015 at 16:01

I'm pretty sure you've missed one part of this bug, which is the one that I happen to care about. Either that, or I'm majorly confused.

I don't use auto-truncation. I use a WYSIWYG tool, which allows me to set a breakpoint and auto-fills the summary field with the content found prior to the breakpoint. The problem is that we end up with broken HTML in the summary field, even when the HTML corrector is on. In other words you *don't* need to turn off the HTML corrector to reproduce this bug. You need to turn it on and put bad HTML in the summary field.

Here is *my* summary of the two parts to this issue, as I wrote in #211 above:

1. "Correct faulty and chopped off HTML" is completely ignored on Summary field, thereby breaking sites on a regular basis.
2. The teaser isn't quite as long as it should be sometimes, because HTML code and is getting counted as words.

And like I said before, #1 is a much bigger issue, because it completely DESTROYS every page where the summary field is displayed with broken HTML. On the other hand, it seems to me you are only working on #2?

That's why I want to see this broken into two issues. I see how they are related, but I really think that it's confusing to try to deal with two problems at once, especially when one of them is major and the other (in my opinion) should be marked as minor.

Can't somebody just write a patch that makes the "Correct faulty and chopped off HTML" actually work on the summary field? It seems so simple... (as a non-coder).

Warmly, and on my knees begging...
-Joseph

Log in or register to post comments

Comment #231

geerlingguy commented 11 January 2016 at 21:13

Shouldn't we just remove auto-truncating? It's a bad practice anyway. Have a separate teaser field instead. (— #228)

For my own sites (where I'm the only one managing content), I typically use a summary field and use that. For almost any project where there's no OCD-level content administrator, summary fields become next to worthless, as admins rarely type in a good summary, and instead copy and paste a line of text (which makes no sense out of context), type in a couple words to fill the field, or do something similarly unhelpful.

I think the main thing we're aiming for is something like a google search summary result: https://www.google.com/?q=test (or even the kind of summary text you get through default Solr search results with Apachesolr or Search API Solr).

In the results, the summaries are all of similar length, but they are all relevant text (through the use of a slightly more intelligent filter). For my needs, something like Smart Trim usually works, and it would be nice to have that option in core for any fields where Wysiwyg and/or images/files are allowed, especially on larger multi-author or community sites.

Also, IIRC, Drupal 6 used to promote the 'summary/body' split a bit more prominently, and it was still with us in Drupal 7 ('show summary' for body fields), but was removed from Drupal 8 because the UX is always awkward, people rarely used summaries correctly, and content updates would often result in incorrect/outdated summaries anyways :(

Log in or register to post comments

Comment #232

lightsurge commented 19 January 2016 at 12:17

I agree with @geerlingguy to get admins to use summary fields takes training and re-training, which isn't always practical. To keep things simple if a teaser can be extracted from the main body of text, in many cases it's more successful. Fine to have a separate summary field, but admins wouldn't necessarily want it accessible to all users, and would want it auto-magically filled in where it's empty.

Something like Google search result summaries would be great, the thing is that and solr sort of rely on having a search keyword/phrase etc. in order to pick out what they show? Seems like any clever summary is going to be the realm of contrib and even plugins to third party software. For example, it would be good to have some kind of wrapper module for Open Text Summarizer http://libots.sourceforge.net/ which despite the ancient last release date seems to still have good support in modern distros http://packages.ubuntu.com/wily/libots0 and there's existing php wrappers to look at https://github.com/znck/summarizer

Edit: I had a go with OTS at creating google-like summaries https://www.drupal.org/node/2652726 I quite like the result although obviously it doesn't really help this core issue.

Log in or register to post comments

Comment #233

Jamie Holly commented 19 January 2016 at 14:36

Given the importance of a dedicated summary field, especially in terms of SEO, social media and design, I can't help but wonder if this 8 year old issue is that important anymore? Layout has changed a lot since this issue first emerged. Back then using teasers (summaries) was pretty much common place, but design trends over the past several years have gone more card or summary based.

Auto text summarization is something I've played with a lot, trying out numerous systems in various languages, as well as service based. I believe the problem with that is that it becomes a crutch. A computer isn't going to write an effective summary anymore than it will write an effective headline, and summaries are almost as important as headlines anymore.

Back to the issue at hand, I think we really need to re-evaluate. First off, a backport to D6 is basically not going to happen at this point, given we're less than a month away from D6's EOL. Looking forward, is having a Body+Summary field really the best path forward? Perhaps in future versions of Drupal, the better method would be a dedicated summary field and then body, leaving the Body+Summary in for legacy purposes for one version, then moving it to contrib after that.

Thinking that, the best route forward for this issue will be to fix the actual issue at hand, ie: getting valid HTML. I don't think spending time on reworking the actual functionality of this field is really a priority, given I can actually see a future Drupal without it.

Log in or register to post comments

Comment #234

geerlingguy commented 19 January 2016 at 20:05

Issue tags:

-Needs backport to D6

Yeah, definitely not heading back to D6 at this point.

Log in or register to post comments

Comment #235

alexverb commented 27 February 2016 at 07:10

In my mind the easiest solution is to grab a hold of CakePHP's truncate function like I did for Field HTML Trim.

This function has several options.

To include or exclude HTML from trim count.
Adding an ellipsis.
Unicode support.
Cut of at character level or word level.
Properly closes unclosed HTML tags.

I've rigorously tested this function and compared it's performance against a DOMDocument approach. I think it's the best way to go. I'd be happy to write a patch and adjust the function to Core's needs if the maintainer is interested in this approach.

Log in or register to post comments

Comment #236

jenlampton

she / her

commented 6 March 2016 at 23:08

Patch in #76 still applies clealy to 7.43.

Log in or register to post comments

Comment #237

dqd

London | N.Y.C | Paris | Hamburg | Berlin

commented 4 April 2016 at 04:25

(EDIT) Firsts things first: there is an issue with D8.1 beta body field trimmed view (e.g. teaser, frontpage) which renders the formatted and trimmed text with a starting but not with a closing html <p> tag. I assume that this is part of this issue (?), if not, we need further testing for a new issue to set up.

Now back to topic:

Some thoughts on the summary yes/no paradigm discussion above: both main points are valid and do not neutralize each other. Automated teaser text IS and ever WILL BE important in bigger projects or web applications to prevent some data inconsistency in many editorial desk concepts. (" Dear editor, if you change the article, please don't forget to update the summary"). And of course it would mean a lot of loss to remove the summary option for those who are picky about would should be extracted for a preview. I think, to have both options at hand isn't obsolete for a data contructs building CMF like Drupal is.

Another thought on this: Despite of that, Views is in Core now for D8 and is the default frontpage provider (but with teaser view by default). We may should think about merging forces here since Views also supports field length trimming? And does this issue affects Views trimming text also? Maybe we should add a Views tag to get attention of Views maintainers here? Just thinking loud ...

@alexverb: interesting solution. Thanks for this contribution. Any performance concerns to consider for further testing under D8 ?

Log in or register to post comments

Comment #238

4 April 2016 at 04:25

Version:

8.0.x-dev

» 8.1.x-dev

Drupal 8.0.6 was released on April 6 and is the final bugfix release for the Drupal 8.0.x series. Drupal 8.0.x will not receive any further development aside from security fixes. Drupal 8.1.0-rc1 is now available and sites should prepare to update to 8.1.0.

Bug reports should be targeted against the 8.1.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Log in or register to post comments

Comment #239

jenlampton

she / her

commented 31 August 2016 at 02:58

Patch in #76 still applies clealy to 7.50.

Log in or register to post comments

Comment #240

31 August 2016 at 02:58

Version:

8.1.x-dev

» 8.2.x-dev

Drupal 8.1.9 was released on September 7 and is the final bugfix release for the Drupal 8.1.x series. Drupal 8.1.x will not receive any further development aside from security fixes. Drupal 8.2.0-rc1 is now available and sites should prepare to upgrade to 8.2.0.

Bug reports should be targeted against the 8.2.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Log in or register to post comments

Comment #241

31 August 2016 at 02:58

Version:

8.2.x-dev

» 8.3.x-dev

Drupal 8.2.6 was released on February 1, 2017 and is the final full bugfix release for the Drupal 8.2.x series. Drupal 8.2.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.3.0 on April 5, 2017. (Drupal 8.3.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.3.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Log in or register to post comments

Comment #242

31 August 2016 at 02:58

Version:

8.3.x-dev

» 8.4.x-dev

Drupal 8.3.6 was released on August 2, 2017 and is the final full bugfix release for the Drupal 8.3.x series. Drupal 8.3.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.4.0 on October 4, 2017. (Drupal 8.4.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.4.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Log in or register to post comments

Comment #243

jenlampton

she / her

commented 29 March 2018 at 03:30

Patch in #76 still applies clealy to 7.56.

Log in or register to post comments

Comment #244

dqd

London | N.Y.C | Paris | Hamburg | Berlin

commented 13 September 2017 at 03:56

Oh Dear ... another long holding one ... Lost it completely out of my focus. Maybe I should raise voice for it on https://www.drupal.org/node/2905741 ? But I think its too late already.

@jenlampton: heh thanks for frequently reporting that the patch still applies cleanly :)

Just to come back again to @Wim Leers on #228

Shouldn't we just remove auto-truncating? It's a bad practice anyway. Have a separate teaser field instead.

if this became a show stopper for anyone to make any decision or to push this issue forward, let me remember that this functionality not only applies for "teasing" any content but also for any listings with shortened text in different lengths and admin previews etc. I think this function needs to work properly, also for admin content use cases. And just to make sure that I do not mix things up here: text_summary() is doing the trimming in most of the core trimmed text previews, correct?

#62: The Views are using a copy of the Core function to extract the Teaser. They do not have the problem of comparing teaser and body as the Core does, but they do the same really bad job at not closing the tags... (this is the case for any truncate function in the Views.)

Is this still the case? I should take a short chat with @dawehner on this ...

(EDIT: Cutted last part to paste it for a new comment since this is a completely different discussion)

Log in or register to post comments

Comment #245

dqd

London | N.Y.C | Paris | Hamburg | Berlin

commented 13 September 2017 at 04:20

Somebody mentioned an example of when using an embed video is breaking the expected behavior. ~~I wanted to quote it but I can't find it anymore in the rush.~~ I found it:

#66: I actually just noticed this yesterday on a D7 site. It cut off a YouTube embed code in the middle. That's pretty bad, as it affects a lot of block-level elements (including forms).

I wanted to demonstrate that we maybe need a new clear road map about what should be counted and how? (Another issue) I think this was one of the main reasons for the function to work for some not like expected (not talking about the bug but about the all over behavior often stated in this issue here). The POV and expectations of each user on this is very different and shouldn't be underrated. It's not as clear as it looks like at the first glance: "Is a video part of a summary and if, how much should it count (items)? Do we need different counter types like a word counter, a character counter and HTML element counter? Can we mix them?" But this should be part of another issue.

But as #83 already stated (as I found out later, errm):

Thinking about this some more, there could be all kinds of additional conditions on summary generation.
Maximum display length (not including markup)
Maximum storage length (including markup)
Maximum line width / word-wrap
Maximum number of lines
Allowed tags/styles/fonts/etc.

Probably it would be best to allow some sort of hook allowing modules or themes to add conditions of their own.

But as I sad this is part of another issue if it doesn't still exist and it also should take contrib area into account. e.g. extending Smart_Trim?

Log in or register to post comments

Comment #246

13 September 2017 at 04:20

Version:

8.4.x-dev

» 8.5.x-dev

Drupal 8.4.4 was released on January 3, 2018 and is the final full bugfix release for the Drupal 8.4.x series. Drupal 8.4.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.5.0 on March 7, 2018. (Drupal 8.5.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.5.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Log in or register to post comments

Comment #247

jenlampton

she / her

commented 29 March 2018 at 03:33

@diqidoq you're welcome! I sometimes wonder if it's too much 'noise' but I do appreciate being able to easily find the patch I have to apply to each core update, and I hope that helps others too. So...

Patch in #76 still applies (with offset) to 7.58.

Log in or register to post comments

Comment #248

jenlampton

she / her

commented 25 April 2018 at 20:13

Patch in #76 still applies (with offset) to 7.59.

Log in or register to post comments

Comment #249

25 April 2018 at 20:13

Version:

8.5.x-dev

» 8.6.x-dev

Drupal 8.5.6 was released on August 1, 2018 and is the final bugfix release for the Drupal 8.5.x series. Drupal 8.5.x will not receive any further development aside from security fixes. Sites should prepare to update to 8.6.0 on September 5, 2018. (Drupal 8.6.0-rc1 is available for testing.)

Bug reports should be targeted against the 8.6.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Log in or register to post comments

Comment #250

jenlampton

she / her

commented 17 January 2019 at 04:01

Patch in #76 still applies (with offset) to 7.63.

Log in or register to post comments

Comment #251

calebtr commented 5 March 2019 at 23:01

I want to add that, while patch #76 does indeed apply, it is also buggy: when text is clipped in the middle of a closing tag so that </ are the last two characters, the tag is left open. Sure, it is an edge case, but it means that the original problem of the auto-summary producing invalid HTML still exists.

I like the DOMDocument approach in #141 better overall, but unfortunately that one is not applying for me. Echoing comment #231, for Drupal 7 this is handled nicely by contrib - and is perhaps the best way forward for Drupal 8+ as well.

Log in or register to post comments

Comment #252

wim leers

Ghent 🇧🇪🇪🇺

commented 4 July 2019 at 13:15

We're running into this again over at #2940029 — see #2940029-89: Add an input filter to display embedded Media entities and #2940029-90: Add an input filter to display embedded Media entities.

Log in or register to post comments

Comment #253

effulgentsia commented 11 July 2019 at 18:04

FYI: I opened #3067116: text_summary() returns malformed (not normalized) HTML for basic_html and other formats that use filter_html instead of filter_htmlcorrector and #3067124: text_summary() returns a plain string, even if passed a MarkupInterface object. Their scope is smaller than what's in this issue, but linking them here because of the partial overlap, especially the former issue.

Log in or register to post comments

Comment #254

norman.lol

he/him

German

Berlin

commented 6 August 2019 at 13:20

Log in or register to post comments

Comment #255

6 August 2019 at 13:20

Version:

8.6.x-dev

» 8.8.x-dev

Drupal 8.6.x will not receive any further development aside from security fixes. Bug reports should be targeted against the 8.8.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.9.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Log in or register to post comments

Comment #256

6 August 2019 at 13:20

Version:

8.8.x-dev

» 8.9.x-dev

Drupal 8.8.7 was released on June 3, 2020 and is the final full bugfix release for the Drupal 8.8.x series. Drupal 8.8.x will not receive any further development aside from security fixes. Sites should prepare to update to Drupal 8.9.0 or Drupal 9.0.0 for ongoing support.

Bug reports should be targeted against the 8.9.x-dev branch from now on, and new development or disruptive changes should be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Log in or register to post comments

Comment #257

6 August 2019 at 13:20

Version:

8.9.x-dev

» 9.2.x-dev

Drupal 8 is end-of-life as of November 17, 2021. There will not be further changes made to Drupal 8. Bugfixes are now made to the 9.3.x and higher branches only. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Log in or register to post comments

Comment #258

6 August 2019 at 13:20

Version:

9.2.x-dev

» 9.3.x-dev

Log in or register to post comments

Comment #259

larowlan

🇦🇺🏝.au GMT+10

commented 25 February 2022 at 01:16

Issue tags:

+Bug Smash Initiative

This really could use an issue summary update now that #3067116: text_summary() returns malformed (not normalized) HTML for basic_html and other formats that use filter_html instead of filter_htmlcorrector is in.

What remains to be done here?

Log in or register to post comments

Comment #260

25 February 2022 at 01:16

Version:

9.3.x-dev

» 9.4.x-dev

Drupal 9.3.15 was released on June 1st, 2022 and is the final full bugfix release for the Drupal 9.3.x series. Drupal 9.3.x will not receive any further development aside from security fixes. Drupal 9 bug reports should be targeted for the 9.4.x-dev branch from now on, and new development or disruptive changes should be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Log in or register to post comments

Comment #261

25 February 2022 at 01:16

Version:

9.4.x-dev

» 9.5.x-dev

Drupal 9.4.9 was released on December 7, 2022 and is the final full bugfix release for the Drupal 9.4.x series. Drupal 9.4.x will not receive any further development aside from security fixes. Drupal 9 bug reports should be targeted for the 9.5.x-dev branch from now on, and new development or disruptive changes should be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Log in or register to post comments

Comment #262

25 February 2022 at 01:16

Version:

9.5.x-dev

» 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Log in or register to post comments

Comment #263

leeksoup commented 2 March 2024 at 21:20

Status	File	Size
new	Screenshot from 2024-03-02 15-05-54.png	80.76 KB
new	Screenshot from 2024-03-02 15-13-50.png	16.27 KB

@larowlan Re #259

In D 10.2.3, text_summary (trimmed) output is still much shorter than the selected number of characters in "Trimmed limit." It appears that it still counts markup within the character count.

E.g. I set a Trimmed limit to 1500 characters but get this for trimmed output:
trimmed output 900 chars
This is only 900 characters.

If the node contains something with more markup, such as bullet points, the trimmed value is even shorter. e.g. with the same filler text put into a numbered list, like this:

Meeting recording:
[video embed]

Your questions -- answered!

This is some filler text that I am using to test the trimmed text function.

This is some filler text that I am using to test the trimmed text function. This is some filler text that I am using to test the trimmed text function.

This is some filler text that I am using to test the trimmed text function. This is some filler text that I am using to test the trimmed text function.

This is some filler text that I am using to test the trimmed text function. This is some filler text that I am using to test the trimmed text function. This is some filler text that I am using to test the trimmed text function.

This is some filler text that I am using to test the trimmed text function.

This is some filler text that I am using to test the trimmed text function.

This is some filler text that I am using to test the trimmed text function.

And continues on with more paragraphs of text here.

Then the trimmed output looks like this:
only 139 characters this time
With the same text as in the first screenshot, this time the trimmed value only has 139 characters.

Log in or register to post comments

Comment #264

larowlan

🇦🇺🏝.au GMT+10

commented 4 March 2024 at 01:02

Thanks @leeksoup - can you update the issue summary with remaining tasks etc?

Log in or register to post comments

Comment #265

leeksoup commented 5 March 2024 at 22:09

Issue summary:

View changes

@larowlan - Does that work?

Log in or register to post comments

Comment #266

leeksoup commented 19 March 2024 at 21:54

@larowlan - Do the remaining items need to be split off into a new / separate issue?

Log in or register to post comments

Comment #267

larowlan

🇦🇺🏝.au GMT+10

commented 20 March 2024 at 18:45

Issue tags:

-Needs issue summary update, -teaser

I think this issue is fine, thank you for updating the issue summary!

Log in or register to post comments

Comment #268

pameeela commented 26 April 2024 at 02:48

Issue summary:

View changes

Should this be split into two issues? One for the valid HTML and one for excluding markup from character count? I think the character count part of it can't really be called a bug since it is explicitly tested to work that way, meaning it is intentional behaviour. I do agree that it makes sense to exclude it but that seems like a feature request.

I also think this would need to be opt-in for existing sites because it will change what is displayed for some sites.

Updated IS to be a bit more clear.

Log in or register to post comments

Comment #269

pameeela commented 26 April 2024 at 02:49

Issue summary:

View changes

Log in or register to post comments

Comment #270

anybody

German

Porta Westfalica

commented 4 September 2024 at 15:53

@pameeela I agree the focus should be to fix the broken HTML. Not counting the HTML characters can be a less relevant follow-up feature!

Log in or register to post comments

Comment #271

pameeela commented 10 September 2024 at 06:20

Title:	text_summary() should output valid HTML and Unicode text, and not count markup characters as part of the text length	» text_summary() should output valid HTML and Unicode text
Status:	Needs work	» Postponed (maintainer needs more info)

Updating this issue to reflect that the bug reported was about invalid markup. However, I'm unable to reproduce it on D11, so I think maybe it's fixed in CKE5? I can't reproduce it using basic or full HTML regardless of whether 'Correct faulty and chopped off HTML' is enabled.

Marking postponed in case I'm missing something obvious. I will create a separate issue for excluding tags from trimming, as already noted.

Log in or register to post comments

Comment #272

pameeela commented 10 September 2024 at 06:26

Status:

Postponed (maintainer needs more info)

» Closed (duplicate)

Actually, I just noticed #3067116: text_summary() returns malformed (not normalized) HTML for basic_html and other formats that use filter_html instead of filter_htmlcorrector so I think it was fixed there. but by then this issue had already expanded to include the trimming. So going to close this one.

Log in or register to post comments

Comment #273

pameeela commented 10 September 2024 at 06:36

Created #3473241: Strip tags from formatted text before trimming

Log in or register to post comments

text_summary() should output valid HTML and Unicode text

Problem/Motivation

Steps to reproduce

Proposed resolution

Remaining tasks

User interface changes

API changes

Data model changes

Release notes snippet

Original report by gpk

Steps to reproduce

Remaining tasks

Comments