-
In D6, the function to auto-generate a shortened (summary or teaser) version of a node is called node_teaser(), and is found in node.module.
-
In D7, this function was renamed text_summary() and moved to text.module.
See: http://drupal.org/update/modules/6/7#body_teaser_fields
And: #372743: Body and teaser as fields -
The current code generates invalid HTML, such as summaries that end in the middle of a closing tag. In fact, current tests require this mis-behavior, as does the WYSIWYG module.
(See: #741606: Teaser splitter / text fields with summary support
And: #934628: You specified that the summary should not be shown) -
The patch under review ensures that auto-generated summaries contain valid HTML, and alters the tests to match.
-
Related issue: #823380: Read More link is always visible on teaser.
Original report by gpk
Steps to reproduce
As requested here: http://drupal.org/node/220783#comment-728258.
Set trimmed length to 200.
Remove the HTML corrector from the Full HTML input format (or create a new input format without HTML input corrector). (See http://drupal.org/node/221252.)
Create a new page, Full HTML input format (or whichever you created), with this content:
The maximum number of characters used in the trimmed version of a post. <!-- <p>Drupal will use this setting to determine at which offset long posts should be trimmed.</p> The trimmed version of a post is typically used as a teaser when displaying the post on the main page, in XML feeds, etc. To disable teasers, set to 'Unlimited'. --> Note that this setting will only affect new or updated content and will not affect existing teasers.
Preview the page, look at the result and the HTML source ...
Also see #263
Remaining tasks
- Add handling of body content to remove markup before counting characters to test against selected trimmed length.
- Review and test.
- Add / modify tests to coordinate with this change.
User interface changes
None.
API changes
None.
Data model changes
Unknown.
Release notes snippet
Trimmed text value handling fixed to not count markup characters as part of the text length.
Comment | File | Size | Author |
---|---|---|---|
#263 | Screenshot from 2024-03-02 15-13-50.png | 16.27 KB | leeksoup |
#263 | Screenshot from 2024-03-02 15-05-54.png | 80.76 KB | leeksoup |
#227 | TextSummary.txt | 2.02 KB | bburg |
#221 | text_summary-221257-221.patch | 3.49 KB | bburg |
#212 | text_summary-221257-212.patch | 3.48 KB | tarekdj |
Comments
Comment #1
gpk CreditAttribution: gpk commentedOK I'm assuming this now gets fixed first in 7.x-dev?
Comment #3
AlexisWilke CreditAttribution: AlexisWilke commentednode_teaser() is still very badly "broken" (written?)
It truncates tags and comments anywhere and it counts characters in UTF-8 and in bytes. It's all wrong.
Not only that, it won't close tags properly, so if you do not use the HTML corrector on ALL filters, teasers will break your themes.
Then on Edit, it computes the $include variable like this:
Here we can see that the teaser is expected to be an exact copy of the body. If not, this test breaks. Guess what, when you properly close tags at the end of the teaser, that $include computation breaks big time.
By the way, the variable should be named "$included" with a D at the end. Personal point of view though. 8-)
There is my version of node_teaser() that works great, every time!
Here is my sample code that breaks the default code (as of 6.19):
<p>The Internet is littered with thousands of directories and yet there still people who create new ones... like me here on <a href="/" target="_blank" title="Snap! Websites"><strong>Snap! Websites</strong></a>. Well... I like the idea of a central repository of directories that I could use and thought that it would be nice to share them with you and even better, I offer your to <a href="http://snapwebsites.info/node/add/directory?destination=thank-you-submit..." target="_blank" title="Click now and add your directory to my list on Snap! Websites.">submit your own directories</a> to my list so you too can use the list to check out your favorite directories.</p>
The core version will truncate even before looping through the content and that truncates inside one of the anchors tag. That means you do not have a correct tag at all (it cuts in the middle of an attribute value.)
WARNING: My version properly counts the number of characters VISIBLE (to the very end users, well... it counts spaces that we don't really see per se. 8-) ). When you say 600, it won't be 600 bytes as your version. We can do it either way, it just makes a lot more sense to me to count visible characters. We could also have two limits if necessary. But the current version is not working for people who dare enter any kind of HTML in their posts.
I'll be back later with a patch that fixes the $include variable too.
Thank you.
Alexis Wilke
Comment #4
AlexisWilke CreditAttribution: AlexisWilke commentedtagging
Comment #5
AlexisWilke CreditAttribution: AlexisWilke commentedOkay, the fix for the teaser test in the node.pages.inc was easy enough. 8-)
I certainly hope you'll consider fixing your teaser code in D6... If not in D6, at least in D7. It is clear that your first call:
is already wrong because that can cut a comment or tag in half and thus you cannot know whether the last < character is part of a tag or a little mistake from the user (since the corresponding > is no present! so if you find "</p" you won't know that it is the end of the P tag... and you may actually keep that in there!)
Thank you.
Alexis Wilke
Comment #6
AlexisWilke CreditAttribution: AlexisWilke commentedAh! I see another fix to the truncate_utf8() function...
I'm not wondering if the truncate should not be in that truncate_utf8() function and not in node_teaser()!
#768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug)
That would mean all the code I write to fix the node_teaser() function should be moved inside the truncate_utf8() which should probably take a flag $html to know whether the string may contain HTML or not.
Thank you.
Alexis Wilke
Comment #7
NancyDruMarked #248842: Teaser breaks are picking up parts of html code as duplicate.
Alexis, does this remove the tendency to let
<p>
dominate the breakpoint selection as the current node_teaser has?+1 for back-porting to D6. Definitely for D7. So shouldn't this be tagged? And maybe made "critical?"
Comment #8
dddave CreditAttribution: dddave commentedAh, the joys of teaser creation. ;) A constant source for support requests and headscratchers. I think this warrants an upgrade to major.
The attached patch is for D6, right? Has this any chance of inclusion without being solved for D7 first? If not and a D7 patch is built first I totally vote for back.porting to D6.
Comment #9
AlexisWilke CreditAttribution: AlexisWilke commented@dddave,
D6 is definitively broken in that regard... so I suspect D7 is too.
The patch I provided is for D6, indeed (hence ...-6.x-... in the name).
Now, as I mentioned in #6 we need to know/understand whether the truncation should occur in the truncate_utf8() function or in the node_teaser. It seems to me that the truncate function would make a lot more sense. Also, to be noted, the "..." does not appear in my version (nor in any teaser that was shrunk further by the "old" node_teaser().)
In D7, the "..." should actually be defined in the Post settings because someone may want to put something else there. But that's a different topic. 8-)
@NancyDru,
Hey! Nancy! I've noticed you've been working on D7 quite a bit these days...
Breaks... My code does not consider <p> as a specific break point. Actually, the node_teaser() was not really working that way (only seemingly.) Instead, I parse all the characters from the start and skip tags at once (anything between < and > is left totally untouched.) And just like the HTML filter used to fix HTML, I manage a stack with all the tags currently opened (taking in account empty tags such as IMG, BR and HR—HR the forgotten!).
When the parser finds our that enough text characters were found, it truncates around the last space that makes the result shorter or equal in length as defined in the Post settings (i.e. 600 by default.)
There are interesting side effects to that method (to me). For instance, the ALT, TITLE, HREF, etc. attributes do not affect the number of characters that the user will see. Nor the IMG, EMBED of such tags (although that might be a problem in some circumstances like an RSS feed?)
Now, if you think that we should cut at a paragraph end and if the 1st paragraph is too long then revert to this algorithm, it would be easy enough to implement. I don't see a problem in cutting any one paragraph in half as long as it is done correctly HTML wise. 8-)
Thank you.
Alexis
Comment #10
NancyDruI don't have a problem cutting paragraph at then end of a sentence, but not in the middle of one. However, I DO have a problem cutting a list item up; I want the whole thing or none of it. I have modified the node_teaser() function to do that. But I still have the tag-breaking bug.
I (and my customer) consider this an absolute positive. I do want to count the visible part of a link (
<a ...>visible</a>
) I don't allow IMG or EMBED in my feeds, so that's still positive.My customer's request (demand): "We want teasers to be relatively consistent - about 15 lines." Yes, I know there are a lot of variables about what constitutes a "line."
It does sound like you could easily do a better job of correction than the HTML Corrector filter too.
-
Now if you could fix multiple blanks in a row... And the evil
<p> </p>
Comment #11
AlexisWilke CreditAttribution: AlexisWilke commented@NancyDru,
Hmmm... not cutting a list sounds like a good idea, but a page that starts with a list would "suffer"...
This page, for instance, starts with a table of contents list...
http://linux.m2osw.com/booting
Of course, in a way, it's a bad example since the corresponding teaser would not include the table, but many people write blogs starting with a list of say... the 10 reasons why I don't like teasers in Drupal.
And we could tell people, on those pages, add a "manual" teaser. The problem is when you hit save you generally do not see how the teaser looks like.
Also, no one commented about #6...
Thank you.
Alexis Wilke
P.S. "evil"
<p> </p>
is something I took care of with my own filter a long time ago 8-) See here http://www.m2osw.com/mo_trimmer_filterComment #12
NancyDruPersonally I think #6 should be a "clean up" issue after this gets fixed.
Comment #13
AlexisWilke CreditAttribution: AlexisWilke commentedNancy,
Hmmm... The current node_teaser() does this (more or less):
1. Any content? No return ''
2. Any <!--break--> in the content? got the teaser
3. Is content smaller in total # of chars to the max. length allowed in the teaser? Yes return body as is
4. Truncate content using truncate_utf8()
5. [try to] Fix the truncated content to fix the max. # of chars (remove the possible '...' in the process)
Point 4. is gone in my version... If we want to have a truncate function that can truncate HTML (because many modules do a similar work as the Core teaser, for instance the Views may be used to "generate a teaser"!)
So may be we should have a truncate_html(), in which case I guess we can first fix this function and later extract the code and create that separate truncate_html().
Is that your idea?
Thank you.
Alexis Wilke
Comment #14
NancyDruBTW, there has always been a flaw with
in that<!--break-->
is removed. This results in a "read more" when the break is placed at the end of the node to force a full node to appear (generally seen on shorter nodes that are slightly longer than a normal teaser). There is no need to remove the break because it is a comment to browsers.Comment #15
AlexisWilke CreditAttribution: AlexisWilke commentedWe need to have a separate function that compares the teaser and the body properly. I fixed the $include definition but that test should be in a function instead and extended to support more capabilities. See sample below.
I'm also wondering about a break <!--break--> at the very beginning because then the teaser is empty and thus the node_teaser() is called each time. Since that should be really rare, we should be fine. But it could be that an empty teaser means don't show the teaser? I guess the user can turn off the corresponding flag if he/she does not want that to appear.
Otherwise I just noticed that trim() is being used all over the place. Maybe we should have a drupal_trim() that removes the <p> </p> too. That way it would be fixed in many situations!
Comment #16
NancyDruI wouldn't call it drupal_trim if it does more than what PHP's trim will do (but I guess trim could do tags if you list them). Having a drupal_ form of a PHP function implies that it be a multibyte safe version (e.g. drupal_substr, drupal_strtolower).
Comment #17
AlexisWilke CreditAttribution: AlexisWilke commentedOkay, for now there is a patch that takes the more... in account and also adds a "..."
This fixes the comparison between teaser and body. But it renders the include flag computation quite complex. I'm wondering if the node_teaser() should not actually return the teaser AND the body. We could pass the include flag (i.e.
isset($node->teaser_include) && !$node->teaser_include
) and determine within the function what needs to be returned instead of recomputing how to cut the body outside.My concern here is that would change the return value of the node_teaser() function which some other module might be using...
Thank you.
Alexis
P.S. The case of the list is not yet taken in account.
Comment #18
NancyDruNope, that would change the API and guarantee it would not be accepted.
Comment #19
AlexisWilke CreditAttribution: AlexisWilke commentedNancy,
I know. Although we may want to do that in 7.x...
Have you tested my patch yet? 8-)
Thank you.
Alexis
Comment #20
Evan Leeson CreditAttribution: Evan Leeson commentedThank you all for working on this issue. I spent some time today struggling with this because the writers have started putting a lot of links in their first sentences and this was causing this issue to manifest. Using the teaser break solves it but it will be much better when they don't have to.
Comment #21
AlexisWilke CreditAttribution: AlexisWilke commentedlectric,
Did you test #17? 8-)
Thank you.
Alexis
Comment #22
NancyDruI have asked for permission to test it (on the customer's clock). Even if they say no, I may test replacing my version on my own time (of which I have very little).
One possible way to get this in (probably 8.x) would be to open an issue to "Make node_teaser() pluggable," which seems to be a popular way to say, "I have a better idea, but you probably won't like it."
Comment #23
greg.harveyMarked #363889: Strange behaviour when node teaser and first part of node body don't match perfectly as duplicate - not actually certain this is the case, but the issue is very old and no way of testing any more, so no sense leaving it open.
Comment #24
brisath CreditAttribution: brisath commentedSubscribe, same problem with teasers.
Comment #25
AlexisWilke CreditAttribution: AlexisWilke commented@NancyDru (and others),
Btw, in regard to lists... How would you handle the case where a post starts with a list and that list is too long to fit?
1. Return an empty teaser
2. Cut the list...
3. Return the whole list (which could be REALLY big)
In cases 1 and 3 we could also emit a warning, but since any function could call the node_teaser() function, the warning may not automatically be appropriate.
Thank you.
Alexis
Comment #26
NancyDruOn my site, my code cuts the list. And my customer is happy with that.
Comment #27
NancyDruAll my work is waiting for better specs, so I'm going to put this into my code and test it. I am rereading in detail.
#10/11/25 - I didn't say not cutting a list - that I already do. I don't want a list ITEM cut. I tried that and it got real ugly real fast. So the only time it might be a problem is when the first list item is too big. I haven't encountered that (yet).
#13 Truncate_html is probably a good idea but will guarantee waiting until D8. It is interesting that even though the API for drupal_validate_utf8 states "All functions designed to filter input should use drupal_validate_utf8 to ensure they operate on valid UTF-8 strings...", only one core function actually calls it.
#15 Given that trim() appears to not be multibyte aware, and some users use something other than utf8 for their databases, it seems that there is indeed a need for drupal_trim(), that would validate and drupal_convert_to_utf8(). However that is a separate issue because it would be forced into D8 even though I would call it a bug in both D6 and D7.
Comment #28
NancyDruWhy are you using substr() rather than drupal_substr()?
The addition to
seems superfluous with an immediate return.
If Dries ever looks at this, he will want more comments and properly formatted (start with capital, end with punctuation).
Comment #29
NancyDruLooks pretty good. We can definitely decrease our teaser length setting. This version produced an even longer one than my version because of links. It did, however end at the beginning of a list item, so I get
9. ...
The ending "..." will definitely classify this as a changed API. As much as I kind of like it, you might want to remove that.
Comment #30
NancyDruI don't know if it's this change or a change I made to the editor, but I am now getting duplicated content when I edit a node.
I am also now getting:
UPDATE: #934628: You specified that the summary should not be shown seems to be caused by this code. The generated teaser (complete with ...) is stuck on the front of the body in the edit textarea.
Comment #31
NancyDruComment #32
AlexisWilke CreditAttribution: AlexisWilke commentedNancy,
Thank you for taking the time (you could have taken a break instead! 8-) )
#27 - #10/11/25 - okay, that makes more sense. Still, not cutting a certain tag makes it a special case! Now, we can use the stack to know where we're at. If we find an <li> tag, then we would need to remove everything from it on. I guess that requires a position linked with the stack so we know where it's located.
#27 - #13 - truncate_html() would make sense, I think. And the truncate_utf8() is being worked on for other reasons... but that does not take HTML in account.
#27 - #15 - trim() probably does not need to be UTF-8 compatible unless we want to trim special spaces in which case yes, we'd need a UTF-8 equivalent. Otherwise the characters removed are controls (i.e. "\n") and the space (" "). Those are safe in UTF-8. We could verify that it does not remove the character though since that one uses code 0xA0 which is a UTF-8 encoding character. From the PHP documentation, is not removed.
#28 - drupal_substr() - I have to look into it, I do not think it makes any difference because I have < and > as delimiters which cannot be confounded with UTF-8 characters. But I guess that does not matter much.
#28 - PHP handling - if you are asking me to fix old code, I'm fine with it, but I thought that patches were supposed to only change what needed to be changed... I know that my patch changes so many things that it's hard to see what was there before and what's new. This PHP test I kept as it was.
#29 - ... - from what I understand, the truncate_utf8() was expected to add the '...', but then the node_teaser() was likely to remove it as it would have to shorten the data some more. So it was never there! See #30 too.
#30 - duplicate teaser - this is one of the problems I've been running into. I may not have posted my latest fix that removes the '...' in case it was added because that does not match the body (which may be another reason why the '...' was not included with truncate_utf8()!)
#31 - I'm on it. 8-)
Comment #33
NancyDru#28: It's not the check for php, it's setting the $teaser_len that I wonder about, because you are immediately going to return, so what is the point? IIRC this is done again a few lines down.
#30: Please read the WYSIWYG maintainers comments in #934628: You specified that the summary should not be shown. Even altered HTML (like closing an unclosed tag) could trigger the same problem. Apparently, there is a exact match comparison that is causing the problem.
Comment #34
AlexisWilke CreditAttribution: AlexisWilke commentedNancy,
#28 because $teaser_len is an [out] variable. Which is why this whole thing is so ugly. The auto-teaser function should either not save the teaser or save a flag marking the teaser as an auto-teaser and not try to be smart and compare potentially really large strings together.
Yes. For #30 I have a function now. That one single function does a "smart" compare, but I did not take the '...' in account in the last version I proposed here. So I don't have an issue with it on my end anymore but here that's a problem. (and this issue is the whole point of fixing the teaser so it all works including that sort of a problem!)
Thank you.
Alexis
Comment #35
NancyDruHaving to call a new function means "API change" which guarantees that this will never make it into 6.x or 7.x. That's a shame because this algorithm produces much better teasers.
Comment #36
rkodrupal CreditAttribution: rkodrupal commentedThe code in #3 works great! I even tried a very long
Comment #37
rkodrupal CreditAttribution: rkodrupal commentedsorry that double quote in the prior post was originally the word
Comment #38
rkodrupal CreditAttribution: rkodrupal commented...aargh ... bl0kqu0te
Comment #39
dddave CreditAttribution: dddave commentedPlease don't change the category.
Comment #40
AlexisWilke CreditAttribution: AlexisWilke commentedOkay, I finally got a new patch. I still include the ellipsis because I'm 99% sure it was intended to be there just not done because it would break everything with the old code.
I added some documentation and comments to make sure people understand why this and that. Maybe a little more would be required... Especially, in regard to the drupal_strlen(), drupal_substr() calls.
1) All the mb_str...() functions would need a corresponding drupal_str...() call, for instance using strpos() and then drupal_substr() is not going to work well.
2) In my case, I check the string with $body[$o] which is a byte access which totally ignores UTF-8.
2.1) It is okay because I use drupal_strlen() to compute the length of the visible text and compares that against the $size argument
2.2) It is required unless you offer me drupal_strpos() and a way to check a string character by character (and I don't recall seeing that in PHP anyway, although there is mb_split() ... good luck on bodies that are 100Kb in size!)
3) It is now documented, and I already had a comment about the UTF-8 checks when there are required in the algorithm!
Final word:
Personally, whether this patch is accepted or not is not a concern for me. My teasers work. 8-)
If someone from Core thinks it would be a good idea to share... then apply the patch! (after we had a few RTBC, of course) And since it can actually destroy the content of your node, I would strong advise that it be applied. But again, not my problem. I'm safe on that one now!
@rkodrupal, note that you can click on Edit to fix your posts instead of posting 2 or 3 fixes one after another.
Comment #42
AlexisWilke CreditAttribution: AlexisWilke commented#40: node-6.x-html_teaser3.patch queued for re-testing.
Comment #44
AlexisWilke CreditAttribution: AlexisWilke commented"Ensure the patch applies to the lastest checkotu of the code-base."
Why is that failing all the time?! I've tested against the latest and it works perfectly for me... How can it be bogus on that test server? One command line???
Comment #45
greg.harveyI haven't tried to apply this yet, but maybe the testbot is unhappy with
node.module.orig
? This is a stab in the dark, but perhaps changing those first two lines will work? Another attempt attached, nothing changed except the initial diff lines.Comment #47
rkodrupal CreditAttribution: rkodrupal commentedi used the patch in #45 and it applied successfully to node.module. The patch in #5 previously worked too, and it patched both node.module and node.pages.inc.
Was the intention for the patch in #45 to only change node.module?
Again, thanks for this great work. it was a huge timesaver for me.
... oh and I have a very small screen and the 'edit' button in this new version of drupal.org is off the page ... but i see it now.
Comment #48
AlexisWilke CreditAttribution: AlexisWilke commentedThank you greg for looking into it. I checked another thread where a patch worked, checking with an extra line at the very beginning... I don't think that matters, but we'll see.
There is the exact error from the tests goes like this:
Comment #49
AlexisWilke CreditAttribution: AlexisWilke commentedGood catch rkodrupal, there is still a one line fix in the node.pages.inc file.
There is yet another patch. This time there is a line before and no .orig extension. But I don't think that's the problem... 8-}
Comment #51
AlexisWilke CreditAttribution: AlexisWilke commentedThere we go...
http://drupal.org/node/961172#comment-3716348
#961172: All D6 Core patches are failing
Comment #52
rkodrupal CreditAttribution: rkodrupal commentedpatch at #17 doesn't seem to work with blockquote ... i found it impossible to successfully break the following for a teaser ...
Domino Causality sets the guiding principles in the search for innovative drugs. We only can improve productivity by constant attention to the entire line of causality, of the domino chain. Merely optimizing or improving any single link is not helpful.
"<"blockquote">"Dr. Moriarity, a metaphysician, sets up fifty numbered dominoes standing in a straight line with their dots facing the same way on a table in a room, but puts a blind in front of the dominoes so that only numbers one and fifty are visible. You enter the room, and observe that domino number one and domino number fifty are lying flat with their tops pointing in the same direction… Does this mean that either domino caused the other to fall? Not necessarily ... Dr. Moriarty could have pushed over only dominoes number one and fifty, or bumped the table in a way that only these two dominoes fell, or that all the dominoes fell at once. It is essential to remove the blind and look at the intervening dominoes... Do their positions suggest they fell in sequence rather than being bumped or shaken? Did any reliable observers hear the telltale sound of dominoes slapping one another in sequence? From the positions of all the dominoes, can we eliminate rival causal mechanisms, such as earthquakes and wind, as well as human intervention? If the dominoes fell in sequence, can we tell by looking at whether the dots are face up whether the direction of the sequence was from number one to number fifty or the reverse? – "<"a href="/node/879"">"Bennett, George (1997). Process Tracing in Case Study Research"<"/a">""<"/blockquote">"
(note: doesn't do too well with "<"a">" passages either).
Comment #53
AlexisWilke CreditAttribution: AlexisWilke commentedrkodrupal,
I tried on one of my sites that includes #49 and it worked fine. You can see the teaser here:
http://gallery.m2osw.com/node
You may want to try with the latest patch?
Thank you.
Alexis
Comment #54
rkodrupal CreditAttribution: rkodrupal commentedconfirmed. stupid typo on my part. sorry for the false alarm.
Comment #55
AlexisWilke CreditAttribution: AlexisWilke commented#49: node-6.x-html_teaser3.patch queued for re-testing.
Comment #57
NancyDruA) Have you tested with the WYSIWYG module? Anything that alters the teaser from the first xxx bytes of the body is going to cause problems and actual node modifications.
B) Is there any hope that this would be in D7, let alone D6?
Comment #58
rkodrupal CreditAttribution: rkodrupal commentedfor what it's worth i've been using it extensively (100+ text-laden nodes with lots of uncommon tags) and it hasn't failed yet. i use wysiwyg.
that being said, the patch does cause intermittent problems when the teaser has already been set by the earlier node.module and you go back and change a node ... typically when this happens there's some goofy message about the changing ... a tip-off to carefully check the results.
Comment #59
Jukebox CreditAttribution: Jukebox commented#49: node-6.x-html_teaser3.patch queued for re-testing.
Comment #61
Jukebox CreditAttribution: Jukebox commentedThis is sort of off topic, but may be helpful to some.
I'm using views to display node titles and teasers. One node had an italicised paragraph at the beginning. Since the teaser is limited to a handful of words, it didn't seem to close the italics tag correctly. Also, random italic tags were inserted in the html. To fix this, I just went over to the teaser field and checked the "Strip HTML Tags" option and everything was back to normal.
Comment #62
AlexisWilke CreditAttribution: AlexisWilke commentedJukebox,
The Views are using a copy of the Core function to extract the Teaser. They do not have the problem of comparing teaser and body as the Core does, but they do the same really bad job at not closing the tags... (this is the case for any truncate function in the Views.)
What you are seeing (many <i> tags) is probably due to your browser "cleaning up" after the Views output.
You are more than welcome to report this problem to the Views people and have them close tags properly of any HTML truncated data buffer.
Thank you.
Alexis
Comment #63
pillarsdotnet CreditAttribution: pillarsdotnet commentedForward-ported to D7, cleaned up comments, and changed variable names to be (hopefully) more descriptive.
Comment #64
pillarsdotnet CreditAttribution: pillarsdotnet commentedTitle update
Comment #65
pillarsdotnet CreditAttribution: pillarsdotnet commentedSee also #823380: Read More link is always visible on teaser.
Comment #66
Jamie Holly CreditAttribution: Jamie Holly commentedI actually just noticed this yesterday on a D7 site. It cut off a YouTube embed code in the middle. That's pretty bad, as it affects a lot of block-level elements (including forms).
Comment #68
pillarsdotnet CreditAttribution: pillarsdotnet commented#63: text_summary-221257-63.patch queued for re-testing.
Comment #70
pillarsdotnet CreditAttribution: pillarsdotnet commentedTrying again, with typo-corrections:
Comment #71
pillarsdotnet CreditAttribution: pillarsdotnet commentedPatch named incorrectly?
Comment #72
pillarsdotnet CreditAttribution: pillarsdotnet commentedOh. duh. silly me again.
Comment #74
AlexisWilke CreditAttribution: AlexisWilke commentedPillarsdotnet,
Thank you for working on this one. 8-)
As a temporary solution people can forcibly use the <!--break--> comment... but go explain that to end users!
Anyway, your point about broken YouTube embedded video is probably not currently correctly handled by my patch, unless it's a single tag. (Is it just one <embed> tag or do they use the <object> tag?) It is however a good point because such an object, or a form, should never be cut. It either gets included or it remains excluded, but no cut! I could imagine a form cut just before the "Send Info" button. Too bad. 8-)
Thank you.
Alexis Wilke
Comment #75
pillarsdotnet CreditAttribution: pillarsdotnet commentedHacked text.test to make all tests pass.
Probably should add an embedded Youtube Video test...
Comment #76
pillarsdotnet CreditAttribution: pillarsdotnet commentedFixed an "Uninitialized string offset" warning.
Just for drill, I did test an embedded video -- looks like the entire video counts as a single character, using this algorithm. I'm not sure whether this is desirable or not. Maybe all
<object>...</object>
sequences should be stripped from a text_summary?Comment #77
NancyDruPillars and Alexis, there are always going to be exceptions that are not well handled by coding and the end-user is going to have to be told to use a teaser break technique. Considering how long an embed/object tag sequence can be, this may be one of those cases. No matter how good your code is, there's someone out there who will find a way to break it.
Comment #78
AlexisWilke CreditAttribution: AlexisWilke commentedNancy,
I don't disagree that we'll find others. However, by not fixing anything, it's not going to be better. I've not seen any such discussion on other systems, especially WordPress. So some people know how to make it work right, in code.
Thank you.
Alexis
Comment #79
NancyDruI hope that Fields challenges WYSIWYG to stop doing what they were currently doing that broke this technique when I had a chance to test it. Perhaps you can "borrow" from WP if that is a much better algorithm. It is abundantly clear that this issue needs attention. I tried my hand at fixing it within the D6 framework and only partially succeeded. Your code produced a much better teaser than even mine did - core's was a very distant 3rd. Unfortunately, WYSIWYG does something that I think is wrong and that broke your code as it was at that time.
Comment #80
pillarsdotnet CreditAttribution: pillarsdotnet commentedI posted a combination of the patch in #76 above with the patch from #87 of #823380: Read More link is always visible on teaser., and intoxination proposed merging these two issues.
Comment #81
Jamie Holly CreditAttribution: Jamie Holly commented@NancyDru - actually a lot of the fixed in WP comes from TinyMCE. For example - if you put a more break within an object/embed code block, then switch back to the WYSIWYG view, TinyMCE will actually strip it out. That's due to the way TinyMCE parses out embed codes and replaces it with their own markup for the visual display, then again replaces it on clean-up when the final HTML is generated.
When it comes to actually balancing out tags on teaser/more views, I find WP much worse. I can't even begin to count the times I had to fix a broken front page on a client's WP site simply because someone put the more break in a weird spot.
Comment #82
NancyDruGood, that read more link has been bothering me since I started with Drupal and was told it wasn't worth fixing, even though it was simple to do so. I despise all those wysiwyg editors, but some people seem to think they have to have them.
Comment #83
pillarsdotnet CreditAttribution: pillarsdotnet commentedThinking about this some more, there could be all kinds of additional conditions on summary generation.
Probably it would be best to allow some sort of hook allowing modules or themes to add conditions of their own.
So the (pseudo-)code would look something like:
Comment #84
pillarsdotnet CreditAttribution: pillarsdotnet commentedNow that we have a real 8.x branch, I should code a patch.
Comment #85
NancyDruLine width and number of lines depends on CSS and the displaying device. Unless you're much smarter than I, such details are pretty close to impossible to determine at teaser time. For example 5 lines on a full width HD monitor for a PC may very well overflow the entire screen on a smart phone.
Comment #86
pillarsdotnet CreditAttribution: pillarsdotnet commentedSee http://drupal.org/project/mobile_theme
Comment #87
Jamie Holly CreditAttribution: Jamie Holly commentedOne thing that should be considered is that text_summary is always run in teaser view on D7, unlike in D6, where node_teaser was only run on node saves and then that value stored in the node_revisions table. That means text_summary can have a serious impact on performance if it gets to heavy processing wise.
Having said that, I got to give a big +1 to implementing a hookable system to allow modules to alter what happens in text_summary, such as deciding what block level elements should/shouldn't be cut. We just need to keep performance in mind when coming up with something.
Comment #88
NancyDruIs that also true to cached anonymous users? And does it mean that filters are also always called?
Comment #89
Jamie Holly CreditAttribution: Jamie Holly commentedNot for anonymous users, since the caching there is on the entire page.
Comment #90
NancyDruWould it be possible to add a teaser (or summary) cache to mitigate a potential performance issue? Perhaps something as simple as doing an MD5 on the text and saving the resulting teaser? This would be similar to the way filters cache data. Actually, if the teaser/summary is a field, why not turn this into a filter?
Comment #91
Jamie Holly CreditAttribution: Jamie Holly commentedI really think that should be in there. It would be nice to get a core maintainer to give us some feedback on the pros/cons of implementing that. Putting that caching in should be relatively simple.
Comment #92
yched CreditAttribution: yched commentedLong thread - could someone post an issue summary (as well as an explanation of why this lies in the 'field system' queue) ?
Comment #93
pillarsdotnet CreditAttribution: pillarsdotnet commentedSummary requested by ysched:
Comment #94
Jamie Holly CreditAttribution: Jamie Holly commented@yched - @pillarsdotnet pretty much summed it up very well there. The only thing I would add is that since text_summary is ran on every field that requires it on each page view, we really should keep an eye at performance on this, or (even better) work out a way to cache the teaser/summary text.
Comment #95
yched CreditAttribution: yched commented@pillarsdotnet : Thanks for the summary. The former node_teaser() indeed has been moved more or less as is within text.module, but I can hardly claim any expertise on this area of code, and I'm afraid I'm not really relevant to review this :-/. Not sure who would - maybe a 'git blame' on the D6 branch would give a couple hints on the people who contributed to the original node_teaser() function ?
Comment #96
pillarsdotnet CreditAttribution: pillarsdotnet commented@ysched --
Perhaps it would be more accurate to say that the patch in #76 is in "RTBC" status, as I believe it has been reviewed and found to work correctly by intoxination, AlexisWilke, NancyDru, rkodrupal, and myself. There is a related issue with WYSIWYG, but I at least can attest that the patch in #76 does not render WYSIWYG unusable.
Certainly it is an improvement over the current code.
In cases like this, where we all agree an an incremental improvement, but are still discussing further improvements, should the extended discussion move to a separate issue?
Comment #97
pillarsdotnet CreditAttribution: pillarsdotnet commentedGonna take that giant leap. Somebody slap me if I'm wrong.
Comment #98
greg.harveyPersonally, I can't fault the logic in #96. Don't let the great get in the way of the good, etc. etc. =)
Comment #99
yched CreditAttribution: yched commentedSorry, but this didn't get any real code review, and would also really need an approval from some core devs (no offense).
An explanation of what the patch actually does, where and why it modifies the existing code, might also help :-)
Comment #100
bryancasler CreditAttribution: bryancasler commentedsubscribe
Comment #101
AlexisWilke CreditAttribution: AlexisWilke commentedI'm not using D7 so I don't know whether the latest patch works. 8-)
However, I wanted to make a note about the <object> tag. I agree that you have to either put it in full or remove it altogether. If you only include a couple of <param> tags inside the object, it won't serve any purpose.
Not only that, this applies to <textarea> and <form>.
There are possibly a few others. This makes me think of a module I had a problem with... Yes! Footnotes.
The other tag to handle properly is <script>, although with WYSIWYG you probably won't have much problems with that one.
Footnotes also handles <style>, but that tag should never appear anywhere else than in the header so I'm much less worried about it.
Thank you.
Alexis
Comment #102
NancyDruYes, even with WYSIWYG, users find ways to stick scripts in. At my last customer, they kept trying to stuff their own GA code into the nodes. And I kept writing better filters to remove them. ;-) Oh, and don't forget that <script> is often accompanied by <noscript>, which can have anything you can imagine in it.
Comment #103
bryancasler CreditAttribution: bryancasler commentedIs there anything that would be holding up #76 from being committed?
Comment #104
pillarsdotnet CreditAttribution: pillarsdotnet commented@animelion Yes. See #1050616: Figure out backport workflow from Drupal 8 to Drupal 7
EDIT: Marked #220783: HTML markup is counted in for the length of trimmed posts as a dup of this issue.
Comment #105
Lars Toomre CreditAttribution: Lars Toomre commentedI really like the pluggable approach suggested in #83 with the suggestion that the middle hook be renamed 'filter' instead of 'strip'. I use quite a bit of embedded HTML in my node bodies and have had considerable problems in D6 with broken HTML tags from the default split. That has required me to manually place <!--break--> in many places and led me to this issue discussion this morning.
I suggest the use of 'filter' since what Alexis started this thread with, what NancyDru has done for her previous client and what I want to do is filter out some portion of the characters in a node body are tied to linking or display rather than the content that is presented to the user. I also really like the idea of the limit hook since quite likely one might like to display a different 'teaser' length based on the type of device being served to (desktop/mobile/XML/JSON) or type of feed being produced for Web Service/Twitter/Facebook/RSS.
Before the current patch is applied, I would hope that we could address the following issues in the default filtering of this patch code:
I am not sure whether minor versions of D7 will allow the introduction of new hooks. Certainly D6 will not at this point. Hence, whatever we design here for D8 will need to be condensed into one function that can be "swapped" into D6 and probably D7. My hope is that a fix for this behavior can be back-ported to both D7 and D6 soon.
My two cents ...
Comment #106
NancyDruAccording to Dries, if no existing API is modified, a new API may be allowed into current versions. We would have to fight hard to convince him though. And I agree with both parts of that policy. I doubt this will be introduced to D6 ever.
Comment #107
AlexisWilke CreditAttribution: AlexisWilke commentedMy fix does not change the API. It enhanced it only. Meaning that any module using the old API is compatible with my change.
And my fix is in my D6 install. I don't really care much whether it gets in Core on Drupal.org... 8-)
Thank you.
Alexis
Comment #108
pillarsdotnet CreditAttribution: pillarsdotnet commentedI'm about 30% of the way through implementing the approach I suggested in #83 as a contrib module. When I'm done, I should have d6/d7/d8 versions available.
Comment #109
pillarsdotnet CreditAttribution: pillarsdotnet commentedYou can help this issue by posting a review:
Decide whether all code and comments comply with Drupal coding standards.
Decide whether any included tests are necessary and sufficient.
Decide whether the patch actually solves the problem. There should be a reproducible test where the current code fails and the patched code succeeds.
A review that affirms success in all of the above should also:
Change the issue status from "Needs Review" to "Reviewed and Tested By the Community".
Otherwise, the review should:
Explain what is wrong with the proposed patch and/or supply a corrections to it.
Change the issue status from "Needs Review" to "Needs Work".
Comment #110
njardim72 CreditAttribution: njardim72 commentedhi,
I'm in core D7. is this issue fixed? if so could you please point the right direction?
I need to trim the feeds to 200 chars
thanks,
njardim
Comment #111
pillarsdotnet CreditAttribution: pillarsdotnet commentedGratuitous re-roll of #76 above.
Comment #112
AlexisWilke CreditAttribution: AlexisWilke commentednjardim72,
It's not fixed in any version.
pillarsdotnet is working on a D8 version and I offered a patch for D6. I don't think anyone posted a patch for D7 yet.
Good luck.
Alexis
Comment #113
pillarsdotnet CreditAttribution: pillarsdotnet commentedThe d8 patch should apply cleanly to d7. There isn't much difference between d7 and d8 yet.
Comment #114
sunUnfortunately, this patch is far from being RTBC.
1) All information in phpDoc are notes. The @note directive is therefore useless and shouldn't be used. Use regular phpDoc paragraphs instead.
2) Everything that goes through text_summary() is supposed to be cached by the caller, so speed does not matter. Additionally, there's only a marginal difference between Drupal's PHP wrappers and directly calling the native functions. We _always_ use Drupal's PHP wrappers, unless there's a valid technical reason for ignoring UTF-8 byte sequences. If there is such a reason, it needs to be clearly stated and explained in an inline comment.
Missing phpDoc for the new function argument. It doesn't seem to be used anywhere?
The new code could use some more inline comments about what it is doing and more importantly, why.
Overall, it looks like using preg_split() or a real DOMDocument would be more appropriate for the purpose of this function. If this has been tried already or purposively wasn't done for some reason, then it should be documented in an inline comment.
I don't understand why substr() is not UTF-8 safe but the strlen() is. This will have unexpected consequences.
Note that the old code used an entirely different approach, for which it may have been valid to ignore UTF-8 byte sequences. The old code might perfectly be buggy with regard to not using PHP wrappers - we don't know, since the existing tests do not cover UTF-8 at all.
1) Wrong indentation.
2) The code assumes that HTML tags are lowercase, which may not be the case.
There should be a blank line between switch cases, unless a case "falls through" to another one (which should be explained in an inline comment).
Wrong indentation.
It looks like you're unconditionally tying to do something along the lines of the HTML Corrector filter. In that case, it would make much more sense to call _filter_htmlcorrector() instead, as it's much more reliable (and also covered by tests).
Additionally, _filter_htmlcorrector() uses a DOMDocument for parsing, which in turn makes us circle back into the question whether text_summary() shouldn't use DOMDocument, too.
Coding style error.
The changed expectations are wrong. With enabled HTML corrector, the expectations 1-4 are invalid, since tags are not properly closed.
Furthermore, our expectation for 6+ is that the text summary ends on a paragraph.
In turn, it looks like we've lost the essential idea and functionality of text_summary(). This needs to be retained.
Lastly, no tests have been added that actually verify and prove the desired behavior regarding HTML.
I'd also like to know why a simpler approach like in #220783: HTML markup is counted in for the length of trimmed posts isn't sufficient.
Powered by Dreditor.
Comment #115
BrightBoldSubscribing, since the other issue was closed as dupe.
Comment #116
AlexisWilke CreditAttribution: AlexisWilke commentedsun,
The @note is from Doxygen... 8-) It can make a note appear differently.
One reason for using the strpos(), strlen(), etc. directly is speed. Even if you have caches, speed is always an issue, especially if you have hundreds of users accessing your website.
Now, there is another reason for using them: the Drupal API does not offer all the functions used here. So unless you'd accept to also add all the missing functions, the safest was to use the "simple" functions.
Finally, it doesn't make any difference in this case because what we skip with the strpos() and strlen() and substr() calls are tags which cannot be UTF-8 (since we ignore the attributes.) Where you absolutely need to use UTF-8 is to count the number of characters between tags.
This should answer this worry:
If you ask, I'll give you detailed explanation about UTF-8. I can assure you that what I wrote is not only correct, it makes things work a lot better.
Good catch! We should have a strtolower() here. Again, you don't need UTF-8 since tag names are ASCII.
The stack restoration, that you say is equivalent to HTML Corrector, is A LOT FASTER than calling that filter after we're done (at least under D6). Plus, we already have all the information we need to have. Why should we call another function instead of unwinding our own stack?
The main problem of the existing Core version is that the current version does this:
And the result in $teaser is completely whatever, including a tag cut in half.
This being said, in D8, using DOM could be a good solution. After all, you can then go through the DOM and cut everything after the max. size for your teaser was reached. It's probably much slower on very large pages, but that's not the majority of the pages out there.
Thank you.
Alexis
P.S. I offered the D6 version, the D7 patch is not from me, but some of the code is a copy from my own corrections. 8-)
Comment #117
pillarsdotnet CreditAttribution: pillarsdotnet commentedOkay, this is completely new code, relying on the PHP DOM extension.
Comment #118
pillarsdotnet CreditAttribution: pillarsdotnet commentedBah. Didn't mean to include my custom changes to the standard profile...
Comment #119
pillarsdotnet CreditAttribution: pillarsdotnet commentedCouple of minor changes, trying to anticipate objections from @sun...
Comment #120
sunYay, this looks much better now :)
I think it would increase readability if the _text_summarize() helper function would return the new $size instead of changing it by reference.
Any reason why we can't simply use $summary_node->saveXml() instead? If so, please document inline, not on issue.
This doesn't make sense to me. I'd at least like to know the reasons for introducing it (as inline comment), but actually, it looks like a major bug if you pass no HTML into something and suddenly deal with HTML along the way.
Might be caused by the unconditional invocation of check_markup(), or alternatively, because the DOM documents are loading text nodes into the body without a wrapping block-level element (DIV/P), so DOM might automatically inject one.
I think we want and need to retain the text_summary() functionality for no format. It is useful for text [field] contents that does not have a text format associated, but still needs to be potentially rendered in "teaser" use-cases. The complex phrase detection logic in text_summary() is actually the only way in Drupal core to extract proper summaries from text strings.
Also note that the old text_summary() contained some more international punctuation characters, which should be retained.
Odd wrapping/indentation here; all except of the array keys should be on one line.
Powered by Dreditor.
Comment #121
pillarsdotnet CreditAttribution: pillarsdotnet commentedDone. I also added a doxygen header for the helper function.
Added inline comment noting that there is no
DOMNode::saveXml()
function.As you suspected, DOM is automatically injecting
<p>
tags. Revised the inline comment to make this clearer.The revised function works for plaintext, as tested by
testLongSentence()
.However, the example text in
testLength()
isn't very useful for the default plaintext format.Calling
check_markup("<p>\nHi\n</p>\n<p>\nfolks\n<br />\n!\n</p>", NULL)
yields
"<p><br />\nHi<br />\n</p><br />\n<p><br />\nfolks<br />\n<br /><br />\n!<br />\n</p>"
.If you really think it's useful to summarize that mess, I'll put it back.
Added the missing character, but I don't know what it is. Do you?
Revised, but we may have to agree to disagree here. I refuse to write code that extends past column 80 unless it is absolutely required by Drupal coding standards.
Comment #122
pillarsdotnet CreditAttribution: pillarsdotnet commentedGrrr... Ignore the previous patch.
Comment #123
pillarsdotnet CreditAttribution: pillarsdotnet commented... And that one, too. I *really* need to get some sleep...
Comment #124
pillarsdotnet CreditAttribution: pillarsdotnet commentedI see one minor problem:
That should read,
// Using filtered_html format:
but I'll wait for @sun's review.If anyone from the International community would like to propose tests of non-English sentence parsing, I for one would welcome the assistance.
Comment #125
AlexisWilke CreditAttribution: AlexisWilke commentedI think we want and need to retain the text_summary() functionality for no format.
Note that this is something I mentioned. For a fix in D6, I don't think that function should be changed, but it is the right place to generate the teaser. Only problem, it either needs to understand HTML or have another function that does and calls text_summary() to handle the actual text to be cut for teaser purposes.
Comment #126
pillarsdotnet CreditAttribution: pillarsdotnet commented@AlexisWilke, @sun:
Of course the patch in #123 cannot be backported to D6 because D6 does not require the PHP DOM extension. For D6, an approach such as the one taken by #111 is required.
In D7 and D8, when you set the
$format
argument toNULL
, that's the same as specifying thefilter_fallback_format()
filter, which is'plain_text'
.The patch in #123 correctly handles such cases by calling
check_markup()
before looking for sentence and line breaks.You seem to be requiring that
text_summary()
should treat aNULL
$format
argument as if'full_html'
had been specified, instead.While that may be more consistent with the original code, I believe that it is a bug and should be treated as such.
A further advantage of this "unconditional invocation" of
check_markup()
is thattext_summary()
will correctly handle cases where the format includes such filters as Markdown, Image Assist, or Bbcode.How can
text_summary()
be made HTML-aware without first invoking the required format filters to transform its input into HTML?Comment #127
pillarsdotnet CreditAttribution: pillarsdotnet commentedCorrected problem noted in #124 and also explicitly added the filter module to dependencies.
Comment #128
AlexisWilke CreditAttribution: AlexisWilke commented@pillarsdotnet,
I did not see the code sun was talking about. However, something such as:
¡Hello!
would require "special" handling in that the first exclamation mark should not be viewed to end the previous sentence. It actually starts the next sentence. That's one special case I know of.
Another thing is the dash. In general, it isn't much of a problem in English. However, in French, cutting a word with a dash can be a problem. Although I guess the code was supposed to not cut words but on spaces. So that would not be a concern.
As for the text_summary(), I was referencing the problem in D6. The teaser function in there would call the text_summary() as is, whether the body had HTML or not. Anyway, the main reason for my current solution is to change a minimal number of things and possible all in the same file (which it is right now.)
Thank you.
Alexis Wilke
Comment #129
pillarsdotnet CreditAttribution: pillarsdotnet commentedThe
'¡'
character is not an exclamation point. And even if it were, my code wouldn't use it as a break point. The regex I'm using breaks after an end-of-sentence character only if it is preceded by a word character and followed by a space character. However, it would break after common abbreviations such as'Dr.'
and'Mr.'
, so I'm open to suggestions for a smarter algorithm, if you have any. For instance, how would you write code that splits at the end of a sentence, but does not split in the middle of the following sentence?'They were like Dr. Jekyll and Mr. Hyde.'
The patch in question is not being submitted against D6. It is being submitted against D8. Are you suggesting that a patch against D8 be rejected on the grounds that it is unsuitable for D6? If so, this and most other D8 patches should simply be "Closed (won't fix)".
Comment #130
AlexisWilke CreditAttribution: AlexisWilke commentedNo, not exactly... text_summary() is properly labeled and does what it is supposed to do. It just cannot be used the way it is (at least in D6.)
We should have an html_summary() function that does the same thing as text_summary() but has knowledge of HTML. The fact is that views and CCK call text_summary() to create teasers, just the same as the normal core node_teaser() implementation does and... it breaks those teasers as badly as the node teasers.
Any module should be able to call a very simple function to generate a teaser from HTML text they have (i.e. the Taxonomy could do that with descriptions.) But if you put any specialization in the node_teaser() directly, it won't be available to other modules and that's a shame.
As for periods after Mr. Ms. Sr. ... I don't know anything about those.
Now, note that European are not unlikely to put spaces all over the place: ¡ Hola ! is probably common in Spain. Just saying. I understand that if you don't find a full stop, ! or ?, then you ignore anyway so this isn't a problem.
Thank you.
Alexis Wilke
Comment #131
pillarsdotnet CreditAttribution: pillarsdotnet commented@AlexisWilke:
The patch in #127 causes all tests to pass and addresses all the objections raised so far. I would like for it to eventually reach "RTBC" status, however optimistic that may seem.
If you have an additional objection to the proposed patch against D8, please:
If, however, you are still talking about D6, or about D6 contrib modules, please hold your comments until this issue once again reaches D6 backport status. By that time the comment count will no doubt have reached three hundred or more, and I wouldn't want your insights to be lost in the middle.
Comment #132
pillarsdotnet CreditAttribution: pillarsdotnet commentedAnother slight tweak. Since the splitting regex only contains assertions, there is no need for the additional parameters to
preg_split()
.Note that the approach used cannot be extended to include Unicode end-of-sentence marks unless we require PHP 5.1.0 or later.See http://php.net/manual/regexp.reference.unicode.phpEDIT: Never mind; there isn't a Sentence_Terminal character class yet.
See http://unicode.org/review/pr-23.html
Comment #133
pillarsdotnet CreditAttribution: pillarsdotnet commentedTried to add support for Unicode Sentence_Terminal characters by explicit list. Wonder if it will work?
Comment #134
Jamie Holly CreditAttribution: Jamie Holly commentedD7 requires PHP 5.2.5+. D8 will probably be the same, or could be PHP 5.3+ - that's currently in discussion. I would go ahead and use the Unicode, and then if it makes it to a D6 back port status, a work around could be worked on.
However, I kind of wonder if we aren't look at this all wrong. I can think of other instances where an end-of-sentence character might not signify an actual end of sentence, for example in code blocks:
(Yeah it's a sucky coding standard, but you get the idea. Add to that the other variations in different coding languages, and there is probably many more examples out there.)
Maybe instead of end-of-sentence, a better approach would be splitting at end-of-block-elements or even br's.
(Numerous times I've seen where teaser breaks have been put in automatically with only one sentence left in the article (I've seen this in Wordpress as well). It really leaves the impression that the site is just looking to drive-up page views. )
Going this route would seem a lot simpler and with much better performance, plus make automatically placed breaks seem much more logical, especially to the reader/visitor. On top of that, it would be a much more "fool proof" approach, though probably not 100% yet.
Of course this is a very significant change from how the function operated before, so getting it into even D7 might be near impossible.
Comment #135
pillarsdotnet CreditAttribution: pillarsdotnet commentedPerhaps this will better address @sun's objections to nested indentation.
EDIT: Sorry about the crosspost.
@#134, I mostly agree, despite the fact that your proposed example wouldn't force a break anyway. The open parenthesis character is not part of the
[:word:]
class.Comment #136
pillarsdotnet CreditAttribution: pillarsdotnet commentedOkay, here's a patch that avoids breaking in the middle of a
XML_TEXT_NODE
, as suggested in #134. I expect some of the tests to fail.Hmm... As an intermediate approach, we could turn off sentence-splitting if any of the ancestor tags are
'code'
.According to @sun, performance is a non-issue. This function can take as long as it wants to, since the results are cached.
;-)
Comment #137
pillarsdotnet CreditAttribution: pillarsdotnet commentedSame as #135, but avoids splitting text nodes within
code
blocks. A test is needed.Comment #138
pillarsdotnet CreditAttribution: pillarsdotnet commentedInteresting. Converting UTF entities to characters is not supported by our testbots. What, I wonder, is the proper way to create Unicode characters? For now, I'll just quote them verbatim, but the result is unreadable in my text editor.
This one passes tests locally; hopefully the testbot will agree.
Comment #139
pillarsdotnet CreditAttribution: pillarsdotnet commentedFound and fixed a spelling error in a comment.
Comment #140
pillarsdotnet CreditAttribution: pillarsdotnet commentedForgot to add the
$parents
parameter to the recursive call. Still need a test that would otherwise split in the middle of a code block.Comment #141
pillarsdotnet CreditAttribution: pillarsdotnet commentedAdded tests for code blocks and for Unicode Sentence_Terminal characters. Note that the test failures in #140 have nothing to do with this patch. Also included in #823380-111: Read More link is always visible on teaser.
Comment #142
pillarsdotnet CreditAttribution: pillarsdotnet commentedMarking this as a dup since it's rolled into #823380-111: Read More link is always visible on teaser.
Comment #143
AlexisWilke CreditAttribution: AlexisWilke commentedpillarsdotnet,
Is that a way to close this issue on D7/D6 users?!
Or will that other issue be back ported instead?
Thank you.
Alexis
Comment #144
pillarsdotnet CreditAttribution: pillarsdotnet commentedDon't worry; I want this backported as much as you do.
Comment #145
sun#823380: Read More link is always visible on teaser. is not related, does not conflict, and is not a duplicate of this issue.
Comment #146
pillarsdotnet CreditAttribution: pillarsdotnet commented@sun -- It's not a dup, but the patches are related; they do overlap; they do break each other; they were merged at the suggestion of intoxination in #823380-90: Read More link is always visible on teaser., and that issue is much shorter and hence more likely to actually receive a review.
Comment #147
zilverdistel CreditAttribution: zilverdistel commentedSubscribing. And I also don't think this is a duplicate of #823380.
--EDIT--
I'd like +1 for the need to backport this to D7.
Comment #148
pillarsdotnet CreditAttribution: pillarsdotnet commented@zilverdistel: See #109.
Comment #149
pillarsdotnet CreditAttribution: pillarsdotnet commented#141: text_summary-221257-141.patch queued for re-testing.
Comment #150
pillarsdotnet CreditAttribution: pillarsdotnet commented#141: text_summary-221257-141.patch queued for re-testing.
Comment #151
sunI like the new implementation. Need to do another in-depth review soon.
However, this issue existed in the past 2-3 Drupal core versions already, and the new code implements an entirely new approach to hopefully resolve the issue finally in a rock solid way, but this also means we're improving the previously existing code in Drupal core. A major annoyance, but not really a bug.
Comment #152
pillarsdotnet CreditAttribution: pillarsdotnet commented#141: text_summary-221257-141.patch queued for re-testing.
Comment #153
fmosca CreditAttribution: fmosca commentedsubscribing, +1 for the need to backport this to D7
Comment #154
pillarsdotnet CreditAttribution: pillarsdotnet commented#141: text_summary-221257-141.patch queued for re-testing.
Comment #155
jct CreditAttribution: jct commented#141: text_summary-221257-141.patch queued for re-testing.
Comment #156
jct CreditAttribution: jct commentedAttempted to add issue summary
Comment #156.0
jct CreditAttribution: jct commentedAdding issue summary
Comment #156.1
jct CreditAttribution: jct commentedChanging my summary to one provided in a comment
Comment #157
mangelp CreditAttribution: mangelp commentedHi, I've tried this patch in a drupal 7 site and it seems to work fine, but I've found that if the text to summarize starts with a sentence longer than the maximum length the summary is returned empty because _text_summarize stop processing sentences when it finds a sentence with a length longer than the maximum allowable length left.
Is that a bug or is the expected behaviour?
Comment #158
pillarsdotnet CreditAttribution: pillarsdotnet commentedPossibly sub-optimal behavior, depending on whether you are willing to have a summary that stops in the middle of a sentence.
Comment #159
NancyDruFrom the Merriam-Webster dictionary for "teaser:"
Stopping in the middle of a sentence is certainly going to arouse more interest than blanks.
Comment #160
Akaoni CreditAttribution: Akaoni commentedSubscribing.
+1 for D7 backport.
Comment #161
zouz747 CreditAttribution: zouz747 commentedI all, I'm new to Drupal and I see the utf8_truncate function being called in text_summary() handles the add_ellipsis parameter which BTW could be added in the UI configuration screen for teasers (Y/N) for setting. Assuming it's been set somewhere, shouldn't the parameter be passed to text_summary() from text_field_formatter_view()?
Comment #162
dddave CreditAttribution: dddave commentedBeing new is fine, fiddling wildly with the issue settings not. ;)
Comment #163
rkodrupal CreditAttribution: rkodrupal commentedsorry i haven't been keeping up with this thread since #49 but i recently discovered how to 'break' the code ... have a teaser that exceeds the length of an em string, then re-edit the article (shortening it) so it breaks in the middle of the em string.
again i'm still on drupal 6.0 and using the code mentioned in #49, but I thought I'd share this in case it's helpful for further code development.
here's the offending code that broke on re-editing:
Our in-house <em>judiciary</em><a id="/sites/default/files/snapshot/1499.jpg" class="imgtip" href="/node/1499"><img title="Click here for more information." alt="" src="/sites/all/themes/fordrupal/zincout/icons/internal_link_icon.gif"></a> gets to
Comment #164
pillarsdotnet CreditAttribution: pillarsdotnet commented@rkodrupal:
The current code in #141 makes this impossible. If you think it is necessary, I will add your example to the tests to prove it.
@NancyDru:
The Unicode line-break algorithm is ... complicated.
I'm not sure it's worth implementing. Are you?
Comment #165
rkodrupal CreditAttribution: rkodrupal commentednot a big deal on my part ... the breakage is quite evident on the screen and i can manually edit the teaser. just sharing this in case it was worthwhile. we do strive for perfection heh?
is the #141 code drupal 6 friendly? i'm losing track of the versions.
k
Comment #166
pillarsdotnet CreditAttribution: pillarsdotnet commented@rkodrupal:
No. I can roll a D6 patch, but I'm sure that the API changes will not be accepted. Please let me know how it works for you; I'm getting weird results in my tests.
@NancyDru:
Okay; implemented as a fallback only if no valid sentence break is found:
Comment #167
pillarsdotnet CreditAttribution: pillarsdotnet commented@rkodrupal:
This version makes the weirdness go away, for d6. The problem is that d6 code assumes that:
Comment #169
pillarsdotnet CreditAttribution: pillarsdotnet commentedFixed failing test. (When did that line disappear from filter_dom_serialize()?)
Comment #170
rkodrupal CreditAttribution: rkodrupal commentedsorry for delay. i just updated to drupal 6.22 thinking it would be a good test of text_summary-221257-169-d6.patch.
if i create a new node it splits the teaser fine. if i re-edit that same node it inserts an annoying manual break ... duplicating everything within the teaser above and below the break.
so i was unable to execute the test from #163 because i never made it that far.
before you also included an update for node.pages.inc. is that still needed with this latest patch for d6?
thanks for all you great work on this.
k
Comment #171
pillarsdotnet CreditAttribution: pillarsdotnet commentedSorry, but that's just part of the broken-by-design d6 javascript-powered teaser-break weirdness, and a good example of why that featurebug was removed in d7.
I don't know how to fix it so that it works as expected, both:
When the teaser summary is included with the node body.
and:
When the teaser summary is not included with the node body.
I'm sorry, but looking back through the last few versions, I don't see one that updated node.pages.inc. Can you provide a link?
Furthermore, this issue is currently targeting version 8.x. It's nice to get the 7.x and 6.x backports out of the way, but problems with them should not distract from getting the patch reviewed and into 8.x, which must happen first.
Therefore, setting back to needs review.
Comment #172
rkodrupal CreditAttribution: rkodrupal commentedI vote for making it work when the teaser summary is not included with the node body. There should be no reason for D6 to add in a manual teaser break on its own, and that was the main benefit for the fix done in http://drupal.org/node/221257#comment-3735536 (#49 above)
... the patch at #49 also includes the node.pages.inc change.
Thanks
Comment #173
pillarsdotnet CreditAttribution: pillarsdotnet commented@#172 by rkodrupal:
Okay, I'll try to remember that the next time I roll a backport. For now, I'm just trying to get this accepted into 8.x, which lacks a
node_body_field()
in itsnode.pages.inc
file.See: http://drupal.org/update/modules/6/7#body_teaser_fields
And: #372743: Body and teaser as fields
Comment #173.0
pillarsdotnet CreditAttribution: pillarsdotnet commentedCorrected pointer to current patch.
Comment #174
AlexisWilke CreditAttribution: AlexisWilke commentedActually, in D8 you probably want to change the scheme.
A much better idea is to have a flag to tell you whether the teaser was generated or not and completely forget about the comparison between the beginning of the body and the teaser (which is a crazy idea.)
Once you have that flag, you know when you edit whether the teaser is part of the body or not.
That would not work for D7 and D6, of course.
Good luck.
Alexis
Comment #175
pillarsdotnet CreditAttribution: pillarsdotnet commented@#174 by AlexisWilke:
Did you mean to comment on #823380: Read More link is always visible on teaser.?
Comment #176
AlexisWilke CreditAttribution: AlexisWilke commentedI'm the one who wrote the patch in #49. You don't need to have this function:
node_compare_teaser_and_body()
if you have a flag to know that the teaser was auto-generated.
Unfortunately, in D6, that was the easiest way to fix the problem, but in a new version that should not be done that way... The Read More link has nothing to do with that, although a flag to mark that the teaser == body is a good idea too. 8-)
Thank you.
Alexis Wilke
Comment #177
pillarsdotnet CreditAttribution: pillarsdotnet commented@#176 by AlexisWilke on October 6, 2011 at 2:58am:
Okay; I apologize for not recognizing you as an early contributor to this issue.
The patch from #169 does not contain such a function, nor can I find one anywhere in core.
Such a flag, if it were to exist, would having nothing to do with this patch. The
text_summary()
function does not deal with node or entity objects. It only deals with plain text.And the new version is not done that way, which is why I am having difficulty understanding the relevance of your commentary. Have you actually read the code? If not, please do. The only thing holding up this issue is the lack of a thorough code review by someone other than myself.
Again, this is why I thought you were referring to the other issue, where such commentary would be more relevant.
Comment #178
rkodrupal CreditAttribution: rkodrupal commentedfor other folks reading this thread i'm stickin' with the patch in #49 for drupal 6.22 ... i did need to split the patch between the .module and the .inc ... the patch for .module works well in drupal 6.22, but not the patch for .inc ... the .inc patch is only 1 line so just manually make the change.
of course this doesn't fix the minor issue i raised in #163.
unfortunately delays in updates to the contrib modules are keeping me from going to drupal 7, so any worries about drupal 8 are far off into my future.
k
Comment #179
David_Rothstein CreditAttribution: David_Rothstein commentedA couple comments after reading through this issue:
The check_markup() function has already been called on the text before it was ever passed to text_summary(). Therefore, this code is going to result in check_markup() being called twice. (See also: #1347920: text_summary() checks for specific filters, but it should be agnostic to filters.)
Comment #180
effulgentsia CreditAttribution: effulgentsia commentedThanks for everyone's work on this. In addition to the feedback in #179,
This fails to include self-closing tags (like images) in the summary. It also fails to include tag attributes (like href).
For anyone interested, I added a Improved Text Trim sandbox project that alters the text field trimming formatters to use a trimming function modeled on #169 with #179 and the above incorporated.
@pillarsdotnet: #169 needs to be rerolled for the D8 "core" folder introduction. Do you want to do that and incorporate the fixes from my sandbox that you approve of, or would you like me to post a patch here with the fixes?
Comment #181
effulgentsia CreditAttribution: effulgentsia commentedComment #182
andrey-z CreditAttribution: andrey-z commented#169: text_summary-221257-169.patch queued for re-testing.
Comment #184
pillarsdotnet CreditAttribution: pillarsdotnet commented@effulgentsia -- Sorry; I've been offline for a long time and not likely to get back to this soon. Please post whatever patches you feel appropriate.
Comment #185
NROTC_Webmaster CreditAttribution: NROTC_Webmaster commentedsub
Comment #186
pillarsdotnet CreditAttribution: pillarsdotnet commented@#179 Posted by David_Rothstein on November 21, 2011 at 5:07am:
That has not been the case in my tests. Has something changed in core? (checking...) Wow. I guess it has. Re-rolling accordingly.
I'm glad there is another issue to worry about performance and caching so that I don't have to worry about it in this issue. (grin!)
Comment #187
pillarsdotnet CreditAttribution: pillarsdotnet commented@#180 Posted by effulgentsia on December 21, 2011 at 2:14am
Thanks.
Re-rolled using your sandbox as a guide.
Comment #188
pillarsdotnet CreditAttribution: pillarsdotnet commentedGrr....
Comment #189
pillarsdotnet CreditAttribution: pillarsdotnet commentedAlso needs test-cases for the shortcomings pointed out by Effulgentsia in #180:
Comment #191
pillarsdotnet CreditAttribution: pillarsdotnet commentedNo clue why the fatal error occurred. Re-rolling with some trivial changes and re-uploading.
Comment #192
pillarsdotnet CreditAttribution: pillarsdotnet commentedGo, testbot!
Comment #194
pillarsdotnet CreditAttribution: pillarsdotnet commented#191: text_summary-221257-191.patch queued for re-testing.
Comment #196
Mark_L6n CreditAttribution: Mark_L6n commentedDetermining sentence boundaries is a complex issue, even if only English is considered. Periods are frequently used for purposes other than sentence boundaries, particularly abbreviations. The difficulty can be seen by the number of academic papers on it...take a quick look at what Google Scholar returns for "sentence boundaries". People have used rule-based, statistical, neural-network, machine-learning approaches, etc. to determine whether a stop is a sentence boundary or not.
You could do a quick-and-dirty processor (that wouldn't always be accurate) by searching for a stop followed by white space followed by a capital letter, disallowing Mr., Mrs., Dr. etc.
However, that's only for English and still incomplete.
So of course using the nearest markup boundary is much better. Perhaps a basic sentence boundary detector could be implemented, but allowed only as an option, with a note explaining it won't always be accurate.
Comment #197
pillarsdotnet CreditAttribution: pillarsdotnet commented@Mark123 -- Patches are always welcome.
Comment #198
dozymoe CreditAttribution: dozymoe commentedMaybe you can think of text_summary as another text-formats filter, that is inserted right before the last filter, which is usually html corrector, or if user choose, htmLawed.
If it can be seen like that, then the call to _filter_htmlcorrector() is outside of text_summary() function's contract.
Comment #199
tim.plunkettThis is a feature request.
Comment #200
jenlamptonPatch at #76 still applies cleanly to 7.18. In case anyone else still needs a fix today :) And I'm not sure why this isn't a bug report.
Comment #201
David_Rothstein CreditAttribution: David_Rothstein commentedYeah, as written, this is definitely a bug report (regardless of whether the patch here also goes into feature request territory)...
However, I'm not convinced the main bug (faulty HTML being produced in teasers) exists anywhere except Drupal 6? That's because of #1235062: text_summary() ignores filter status - basically, in Drupal 7 and Drupal 8, the two bugs should cancel each other out and ensure the HTML corrector filter always runs.
I think we should split the difference and call it a non-major bug. Even in Drupal 6, I would think that enabling the HTML corrector filter gets around the main part of this bug.
Comment #202
lightsurge CreditAttribution: lightsurge commentedI can't find any reference to text_summary() in 8.x, has it been removed and that was why this issue was put to feature request?
Comment #203
lightsurge CreditAttribution: lightsurge commentedAppears here:
http://drupalcontrib.org/api/drupal/drupal%21modules%21field%21modules%2...
But not here:
http://api.drupal.org/api/drupal/modules!field!modules!text!text.module/...
Comment #204
David_Rothstein CreditAttribution: David_Rothstein commentedtext_summary() is still in Drupal 8 (in core/modules/text/text.module).
On api.drupal.org it only appears for Drupal 7 right now; I'm not sure why but I believe that hasn't been rebuilt in a while and the module has moved around recently.
Comment #205
jenlamptonBumping back up to major because I'm still seeing this problem in D7.
I'm not sure which of the two things that are supposed to "cancel each other out" is failing, but without the patch (I'm using this one from #76) my site is still hosed.
Above mentioned patch still applies to 7.21, for others who are also currently hosed :)
Comment #206
dcam CreditAttribution: dcam commentedhttp://drupal.org/node/1427826 contains instructions for updating the issue summary with the summary template.
The summary may need to be updated with information from comments.
Comment #207
ttkaminski CreditAttribution: ttkaminski commentedNot sure why no one brought this up, but it seems that if you have a post that has a div tag that spans the
<!--break-->
marker, then the teaser summary will contain invalid HTML. Example:text_summary()
will return:Which breaks the entire layout of the site because of the missing div close tag. My recommendation is to include scenario in the tests and also perform HTML correction after the text is trimmed via the
<!--break-->
Comment #208
swentel CreditAttribution: swentel commentedMoving to text module
Comment #209
Stalski CreditAttribution: Stalski commentedJust tried to read this huge issue. To go forward with this, can anyone give a status update summary?
Especially :
- What's the work that needs to be done?
- Which patch is the most probable to start with to move forward?
- As lots of issues change the testFirstSentence test case, I think it might be a good idea to have a separate test for this issue, no?
Comment #210
alexverb CreditAttribution: alexverb commentedI don't want to sidetrack this issue. But I've provided a contrib solution for D7 that might be of interest to this issue: https://drupal.org/node/2118995. It provides a way to trim text without taking HTML tags into account in about 20 lines of code:
I have not read this entire issue. But I've looked at some patches and see that DOMDocument is used. Probably to get to sentence level for cutting off. My solution does not take that into account, but might be an interesting approach to the HTML problem.
Comment #210.0
alexverb CreditAttribution: alexverb commentedUpdated info, links, and status.
Comment #211
jtbayly CreditAttribution: jtbayly commentedAfter close to a year of struggling to figure out how our entire homepage layout gets destroyed every few months, I finally tracked it down to this bug.
Isn't this really (at least) two different bugs?
1. "Correct faulty and chopped off HTML" is completely ignored on Summary field, thereby breaking sites on a regular basis. (If you use a WYSIWYG tool of any sort and use a break to determine the summary, there are all kinds of ways to cause problems. Any tag that isn't closed prior to the break will destroy your site wherever that summary is displayed.) Don't we have a fix for this, and can't we just apply it? I know it will be painful for a bit, but really it's a catch 22. We can't wait forever.
2. The teaser isn't quite as long as it should be sometimes, because HTML code and is getting counted as words.
Would it help to split them up into 2 different issues?
-Joseph
Comment #212
tarekdj CreditAttribution: tarekdj commentedPatch based on #210.
Comment #214
valderama CreditAttribution: valderama commentedUntil this gets fixed in core one can use smart_trim module with a patch from this issue applied: https://www.drupal.org/node/1916060
Comment #215
jenlamptonSince the most recent patch no longer applies cleanly to D7 also, still applying the patch in #76 to 7.32
@jtbayly I think you are exactly right with your explanation of the two problems, and we did have two (or more) issues, but since this one had a working patch, the others were closed and marked a dupe of this one.
Comment #216
jenlamptonPatch in #76 still applies clealy to 7.34. Tried the patch in #169 again also, but no luck there.
Comment #217
jenlamptonremoving current issue status from issue
Comment #218
bburgUpdating broken link.
Comment #219
bburgConfirming this is still issue in D8 head.
Comment #220
mdrayer CreditAttribution: mdrayer at Forum One commentedThis is a triage at the Los Angeles Sprints. @bburg and I did the triage.
We reproduced the issue against Drupal 8 head with the following steps (similar to steps in summary):
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce diam est, finibus a lobortis ac, mattis imperdiet sapien. Sed non libero ac augue aliquam volutpat. Praesent sollicitudin ipsum ac amet.</p>
Note that HTML entities and Unicode characters seemed to get decoded before the trim count went into effect, so those passed our testing. HTML markup seems to be the only problem here. The HTML corrector step in the summary is unnecessary and seems like a whole other issue.
Comment #221
bburgRerolled patch against current head. Resolved conflict generated from #2361745: Remove usage of truncate_utf8(). This will fail tests.
Comment #222
kgoel CreditAttribution: kgoel at Forum One commentedComment #223
bburgWhile several tests need to be rewritten to address the new behavior being added by this issue, at least one appears to be legitimately failing. namely testFirstSentenceQuestion(). This is because the new function added by this patch _html_trim_get_trimmed_html() strips out the ending sentence character.
Comment #225
xjm(Saving proposed issue credit for discussion and triage participants at LA.)
Comment #226
jtbayly CreditAttribution: jtbayly commented"The HTML corrector step in the summary is unnecessary and seems like a whole other issue."
I'm not sure exactly what you mean by that. The issue title starts out with "text_summary() should output valid HTML..." Are you saying that that should be a different issue? I would say it's the more important of the two bugs under discussion here. Teaser length being off by a little bit isn't nearly as big of a deal as outputting bad HTML onto the homepage.
Regardless, if you don't intend to address that part of it, let's change the title here and I'll create another issue. Let me know.
And thanks for working on this.
-Joseph
Comment #227
bburg@jtbayly, To recreate the reported issue, one must turn off the HTML corrector filter. So the reported problem essentially boils down to: If we turn off the feature that was created to fix this bug in the first place, then the bug occurs. Which doesn't make a whole lot of sense as an issue to me. Now we can discuss whether this should just be default behavior for the summary text formatter in the first place? That's what we were thinking by "The HTML corrector step in the summary is unnecessary and seems like a whole other issue."
I did start work on a TextSummary class to create summaries while preserving markup integrity. Unfortunately, I haven't had much time since the week after DrupalCon to work on this. It's not remotely ready, hence why I haven't posted a patch, but I am only sharing this now to start a discussion on the direction I was thinking about heading in. Rather than relying on regular expressions to manipulate HTML, I was thinking of constructing the snippet as a DOM and traversing through the node tree recursively. Example attached (beware, I haven't looked at this in almost a month).
Comment #228
Wim LeersDevil's advocate:
Shouldn't we just remove auto-truncating? It's a bad practice anyway. Have a separate teaser field instead.
Comment #229
bburg@wim-leers, Good question. Because content editors are lazy? I've used auto-truncating as a backup on most of my projects. But I definitely get that as a general practice, it's better to curate your own summaries.
Great session at DrupalCon btw.
Comment #230
jtbayly CreditAttribution: jtbayly commentedI'm pretty sure you've missed one part of this bug, which is the one that I happen to care about. Either that, or I'm majorly confused.
I don't use auto-truncation. I use a WYSIWYG tool, which allows me to set a breakpoint and auto-fills the summary field with the content found prior to the breakpoint. The problem is that we end up with broken HTML in the summary field, even when the HTML corrector is on. In other words you *don't* need to turn off the HTML corrector to reproduce this bug. You need to turn it on and put bad HTML in the summary field.
Here is *my* summary of the two parts to this issue, as I wrote in #211 above:
1. "Correct faulty and chopped off HTML" is completely ignored on Summary field, thereby breaking sites on a regular basis.
2. The teaser isn't quite as long as it should be sometimes, because HTML code and is getting counted as words.
And like I said before, #1 is a much bigger issue, because it completely DESTROYS every page where the summary field is displayed with broken HTML. On the other hand, it seems to me you are only working on #2?
That's why I want to see this broken into two issues. I see how they are related, but I really think that it's confusing to try to deal with two problems at once, especially when one of them is major and the other (in my opinion) should be marked as minor.
Can't somebody just write a patch that makes the "Correct faulty and chopped off HTML" actually work on the summary field? It seems so simple... (as a non-coder).
Warmly, and on my knees begging...
-Joseph
Comment #231
geerlingguy CreditAttribution: geerlingguy as a volunteer commentedFor my own sites (where I'm the only one managing content), I typically use a summary field and use that. For almost any project where there's no OCD-level content administrator, summary fields become next to worthless, as admins rarely type in a good summary, and instead copy and paste a line of text (which makes no sense out of context), type in a couple words to fill the field, or do something similarly unhelpful.
I think the main thing we're aiming for is something like a google search summary result: https://www.google.com/?q=test (or even the kind of summary text you get through default Solr search results with Apachesolr or Search API Solr).
In the results, the summaries are all of similar length, but they are all relevant text (through the use of a slightly more intelligent filter). For my needs, something like Smart Trim usually works, and it would be nice to have that option in core for any fields where Wysiwyg and/or images/files are allowed, especially on larger multi-author or community sites.
Also, IIRC, Drupal 6 used to promote the 'summary/body' split a bit more prominently, and it was still with us in Drupal 7 ('show summary' for body fields), but was removed from Drupal 8 because the UX is always awkward, people rarely used summaries correctly, and content updates would often result in incorrect/outdated summaries anyways :(
Comment #232
lightsurge CreditAttribution: lightsurge commentedI agree with @geerlingguy to get admins to use summary fields takes training and re-training, which isn't always practical. To keep things simple if a teaser can be extracted from the main body of text, in many cases it's more successful. Fine to have a separate summary field, but admins wouldn't necessarily want it accessible to all users, and would want it auto-magically filled in where it's empty.
Something like Google search result summaries would be great, the thing is that and solr sort of rely on having a search keyword/phrase etc. in order to pick out what they show? Seems like any clever summary is going to be the realm of contrib and even plugins to third party software. For example, it would be good to have some kind of wrapper module for Open Text Summarizer http://libots.sourceforge.net/ which despite the ancient last release date seems to still have good support in modern distros http://packages.ubuntu.com/wily/libots0 and there's existing php wrappers to look at https://github.com/znck/summarizer
Edit: I had a go with OTS at creating google-like summaries https://www.drupal.org/node/2652726 I quite like the result although obviously it doesn't really help this core issue.
Comment #233
Jamie Holly CreditAttribution: Jamie Holly commentedGiven the importance of a dedicated summary field, especially in terms of SEO, social media and design, I can't help but wonder if this 8 year old issue is that important anymore? Layout has changed a lot since this issue first emerged. Back then using teasers (summaries) was pretty much common place, but design trends over the past several years have gone more card or summary based.
Auto text summarization is something I've played with a lot, trying out numerous systems in various languages, as well as service based. I believe the problem with that is that it becomes a crutch. A computer isn't going to write an effective summary anymore than it will write an effective headline, and summaries are almost as important as headlines anymore.
Back to the issue at hand, I think we really need to re-evaluate. First off, a backport to D6 is basically not going to happen at this point, given we're less than a month away from D6's EOL. Looking forward, is having a Body+Summary field really the best path forward? Perhaps in future versions of Drupal, the better method would be a dedicated summary field and then body, leaving the Body+Summary in for legacy purposes for one version, then moving it to contrib after that.
Thinking that, the best route forward for this issue will be to fix the actual issue at hand, ie: getting valid HTML. I don't think spending time on reworking the actual functionality of this field is really a priority, given I can actually see a future Drupal without it.
Comment #234
geerlingguy CreditAttribution: geerlingguy as a volunteer commentedYeah, definitely not heading back to D6 at this point.
Comment #235
alexverb CreditAttribution: alexverb as a volunteer commentedIn my mind the easiest solution is to grab a hold of CakePHP's truncate function like I did for Field HTML Trim.
This function has several options.
I've rigorously tested this function and compared it's performance against a DOMDocument approach. I think it's the best way to go. I'd be happy to write a patch and adjust the function to Core's needs if the maintainer is interested in this approach.
Comment #236
jenlamptonPatch in #76 still applies clealy to 7.43.
Comment #237
dqd(EDIT) Firsts things first: there is an issue with D8.1 beta body field trimmed view (e.g. teaser, frontpage) which renders the formatted and trimmed text with a starting but not with a closing
html <p>
tag. I assume that this is part of this issue (?), if not, we need further testing for a new issue to set up.Now back to topic:
Some thoughts on the summary yes/no paradigm discussion above: both main points are valid and do not neutralize each other. Automated teaser text IS and ever WILL BE important in bigger projects or web applications to prevent some data inconsistency in many editorial desk concepts. (" Dear editor, if you change the article, please don't forget to update the summary"). And of course it would mean a lot of loss to remove the summary option for those who are picky about would should be extracted for a preview. I think, to have both options at hand isn't obsolete for a data contructs building CMF like Drupal is.
Another thought on this: Despite of that, Views is in Core now for D8 and is the default frontpage provider (but with teaser view by default). We may should think about merging forces here since Views also supports field length trimming? And does this issue affects Views trimming text also? Maybe we should add a Views tag to get attention of Views maintainers here? Just thinking loud ...
@alexverb: interesting solution. Thanks for this contribution. Any performance concerns to consider for further testing under D8 ?
Comment #239
jenlamptonPatch in #76 still applies clealy to 7.50.
Comment #243
jenlamptonPatch in #76 still applies clealy to 7.56.
Comment #244
dqdOh Dear ... another long holding one ... Lost it completely out of my focus. Maybe I should raise voice for it on https://www.drupal.org/node/2905741 ? But I think its too late already.
@jenlampton: heh thanks for frequently reporting that the patch still applies cleanly :)
Just to come back again to @Wim Leers on #228
if this became a show stopper for anyone to make any decision or to push this issue forward, let me remember that this functionality not only applies for "teasing" any content but also for any listings with shortened text in different lengths and admin previews etc. I think this function needs to work properly, also for admin content use cases. And just to make sure that I do not mix things up here: text_summary() is doing the trimming in most of the core trimmed text previews, correct?
Is this still the case? I should take a short chat with @dawehner on this ...
(EDIT: Cutted last part to paste it for a new comment since this is a completely different discussion)
Comment #245
dqdSomebody mentioned an example of when using an embed video is breaking the expected behavior.
I wanted to quote it but I can't find it anymore in the rush.I found it:I wanted to demonstrate that we maybe need a new clear road map about what should be counted and how? (Another issue) I think this was one of the main reasons for the function to work for some not like expected (not talking about the bug but about the all over behavior often stated in this issue here). The POV and expectations of each user on this is very different and shouldn't be underrated. It's not as clear as it looks like at the first glance: "Is a video part of a summary and if, how much should it count (items)? Do we need different counter types like a word counter, a character counter and HTML element counter? Can we mix them?" But this should be part of another issue.
But as #83 already stated (as I found out later, errm):
But as I sad this is part of another issue if it doesn't still exist and it also should take contrib area into account. e.g. extending Smart_Trim?
Comment #247
jenlampton@diqidoq you're welcome! I sometimes wonder if it's too much 'noise' but I do appreciate being able to easily find the patch I have to apply to each core update, and I hope that helps others too. So...
Patch in #76 still applies (with offset) to 7.58.
Comment #248
jenlamptonPatch in #76 still applies (with offset) to 7.59.
Comment #250
jenlamptonPatch in #76 still applies (with offset) to 7.63.
Comment #251
calebtr CreditAttribution: calebtr commentedI want to add that, while patch #76 does indeed apply, it is also buggy: when text is clipped in the middle of a closing tag so that
</
are the last two characters, the tag is left open. Sure, it is an edge case, but it means that the original problem of the auto-summary producing invalid HTML still exists.I like the DOMDocument approach in #141 better overall, but unfortunately that one is not applying for me. Echoing comment #231, for Drupal 7 this is handled nicely by contrib - and is perhaps the best way forward for Drupal 8+ as well.
Comment #252
Wim LeersWe're running into this again over at #2940029 — see #2940029-89: Add an input filter to display embedded Media entities and #2940029-90: Add an input filter to display embedded Media entities.
Comment #253
effulgentsia CreditAttribution: effulgentsia at Acquia commentedFYI: I opened #3067116: text_summary() returns malformed (not normalized) HTML for basic_html and other formats that use filter_html instead of filter_htmlcorrector and #3067124: text_summary() returns a plain string, even if passed a MarkupInterface object. Their scope is smaller than what's in this issue, but linking them here because of the partial overlap, especially the former issue.
Comment #254
leymannxComment #259
larowlanThis really could use an issue summary update now that #3067116: text_summary() returns malformed (not normalized) HTML for basic_html and other formats that use filter_html instead of filter_htmlcorrector is in.
What remains to be done here?
Comment #263
leeksoup CreditAttribution: leeksoup as a volunteer commented@larowlan Re #259
In D 10.2.3, text_summary (trimmed) output is still much shorter than the selected number of characters in "Trimmed limit." It appears that it still counts markup within the character count.
E.g. I set a Trimmed limit to 1500 characters but get this for trimmed output:
This is only 900 characters.
If the node contains something with more markup, such as bullet points, the trimmed value is even shorter. e.g. with the same filler text put into a numbered list, like this:
Then the trimmed output looks like this:
With the same text as in the first screenshot, this time the trimmed value only has 139 characters.
Comment #264
larowlanThanks @leeksoup - can you update the issue summary with remaining tasks etc?
Comment #265
leeksoup CreditAttribution: leeksoup as a volunteer commented@larowlan - Does that work?
Comment #266
leeksoup CreditAttribution: leeksoup as a volunteer commented@larowlan - Do the remaining items need to be split off into a new / separate issue?
Comment #267
larowlanI think this issue is fine, thank you for updating the issue summary!