Problem/Motivation
The text_summary() function in modules/field/modules/text/text.module cuts text off in the middle of a word when the requested summary length (passed in $size) is shorter than the first sentence or paragraph. Also, when sentences are long, the cut-off point can land far from the requested size.
Proposed resolution
Cut at word boundaries instead of sentences. This is a problem we already solved for Drupal (not 100% standard, as the standard is insanely hard to implement, but it's good enough): #604002: Poor search support of some Unicode scripts contains the relevant regular expression, and #768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug) moved it into unicode.inc so it's available.
Remaining tasks
Use a regular expression to find the last PREG_CLASS_UNICODE_WORD_BOUNDARY plus the few tags being used now, port to D8, write test.
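The approach described above can be sketched roughly as follows. This is an illustration in Python, not the Drupal patch: the function name `truncate_at_word_boundary` is hypothetical, and the `\W+` pattern is only a stand-in for PREG_CLASS_UNICODE_WORD_BOUNDARY, so scripts that do not separate words with punctuation or spaces are not handled here. The HTML tag break points mentioned above are left out as well.

```python
import re

# Assumption: any run of non-word characters counts as a word boundary.
# Drupal's PREG_CLASS_UNICODE_WORD_BOUNDARY is a different, larger class.
WORD_BOUNDARY = re.compile(r"\W+")


def truncate_at_word_boundary(text: str, size: int) -> str:
    """Return at most `size` characters of `text`, never ending mid-word."""
    if len(text) <= size:
        return text
    # Look one character past the limit so that a word ending exactly
    # at `size` is kept whole.
    window = text[: size + 1]
    last_boundary = None
    for match in WORD_BOUNDARY.finditer(window):
        last_boundary = match.start()
    if not last_boundary:
        # No usable boundary inside the window (or only at position 0):
        # fall back to a hard cut.
        return text[:size]
    return window[:last_boundary]


if __name__ == "__main__":
    sample = "The quick brown fox jumps"
    print(repr(sample[:12]))                             # naive cut
    print(repr(truncate_at_word_boundary(sample, 12)))   # boundary-aware cut
```

With a size of 12, the naive slice ends inside "brown", while the boundary-aware version backs up to the end of the previous word. The hard-cut fallback is a design choice for this sketch only; a real implementation might prefer to return an empty summary or the whole first word.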
User interface changes
API changes
Data model changes
Comment | File | Size | Author |
---|---|---|---|
#19 | 1482178_19.patch | 8.9 KB | chx |
#7 | text-summary-word-break.patch | 1.22 KB | ezheidtmann |
#5 | 1482174-text-summary-word-break.patch | 701 bytes | longwave |
#3 | 1482174-text-summary-word-break.patch | 701 bytes | longwave |
Comments
Comment #1
pounard
Wrong project issue queue; I'm assigning it to the right one. beepy, please assign the right core version.
Comment #2
beepy
Thanks for fixing it; sorry about my errors. I don't know what core version to assign; I discovered the bug in an install of 7.8, but the bug still appears to exist in the latest version of 8.x I could find.
Comment #3
longwave
Confirmed that this is an issue with long first paragraphs and short summary trim lengths.
I realise this should be fixed in 8.x and backported, but here's a 7.x patch for testbot to chew on to start with.
Comment #5
longwave
Uploaded the patch with 0 instead of 1, let's try that again (editing text that switches to RTL is hard!)
Comment #6
longwave
Changing version, adding tags.
Comment #7
ezheidtmann
Hitting this bug in 7.x; here's one alternate approach for the testbot.
Comment #8
ezheidtmann
Passed testbot; bumping back up to 8.x
Comment #9
neRok
There is a different, but tightly coupled, issue: #1620104: text_summary() returns very small summaries if no stop characters are hit.
I have tested the above patches on Drupal 7.22.
I tested the patch in #7, and it did not work for the other issue.
I tested the patch in #5, and it did work for the other issue.
Merging the features of the two patches, I came up with the following code. You would have to read both patches to understand the context of the code (where it goes). I am unable to make a patch at the moment; hopefully someone can work out my clues and test the code. I also didn't bother to understand the $break_points array of arrays, but mashed it together in a way that hopefully works for both issues.
snip
Comment #10
swentel
Moving to text module
Comment #11
hefox
#7 works for this specific problem in D7. What's the weird character in #5?
Comment #12
DamienMcKenna
@hefox: It seems like the code is being mangled slightly by git: the code in the file is correct, it just doesn't convert properly to a git patch for some reason. Maybe that code should be converted to use the chr() function or something?
Comment #13
chx (Smartsheet)
This whole function looks like garbage to me. Premature optimization perhaps. Or just too clever. Maybe there was a time when this wanted to produce, I dunno, best summary? but users expect, you know, if they set this to 250 to get a roughly 250 characters long summary, while not cutting in the middle of a word. But what is the code, I do not even. I have PCRE and I am not afraid to use it unlike whoever authored this originally.
There's some truly strange oddity here: for some demented reason we picked "IDEOGRAPHIC FULL STOP" and "ARABIC QUESTION MARK" as sentence endings. These are "Punctuation, Other" (Po) and there are 513 such characters in Unicode. Why these two?? I have no idea, but in a stable release we shall keep them. Perhaps Drupal 8.1 could add all the other Po characters, or perhaps all of P, which already has a neat PCRE in search... Please find a Drupal 7 patch attached.
Edit: this code is definitely prehistory. These Unicode characters were added in 2004 (back then this was in node_teaser); breaking on various HTML tags was added in late 2002. I think we can safely ignore whatever the intent was then.
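For readers who want to check the category claim in comment #13, here is a small sketch using Python's standard `unicodedata` module. The helper names `categories_of` and `count_in_category` are made up for illustration. The count it reports depends on the Unicode version bundled with the interpreter, so it may not match the figure quoted above.

```python
import sys
import unicodedata

# The two sentence-ending characters named in comment #13.
SENTENCE_ENDERS = ("IDEOGRAPHIC FULL STOP", "ARABIC QUESTION MARK")


def categories_of(names):
    """Map each Unicode character name to its two-letter general category."""
    return {name: unicodedata.category(unicodedata.lookup(name)) for name in names}


def count_in_category(category):
    """Count code points in `category` for this interpreter's Unicode data."""
    return sum(
        1
        for code_point in range(sys.maxunicode + 1)
        if unicodedata.category(chr(code_point)) == category
    )


if __name__ == "__main__":
    print(categories_of(SENTENCE_ENDERS))
    print(count_in_category("Po"))
```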
Comment #14
dawehner
What does the bot say!
Comment #16
longwave
Comment #18
chx (Smartsheet)
I believe the policy is 8.0.x first but I didn't bother porting and testing until the community says "yes this is what we want" which can be done rather easily from the D7 patch. Also, testing will be *very* easy for this one ;)
Comment #19
chx (Smartsheet)
Still a D7 patch, just to make sure someone has clean ground when writing new tests; the current tests are completely meaningless.
Comment #20
chx (Smartsheet)
Comment #21
chx (Smartsheet)
Comment #22
chx (Smartsheet)
I did code this for D8 but I feel it changes text_summary too much. Posted #2661632: Trim to word boundary when using character count for the Smart Trim project.
Comment #23
Wim Leers
We should simply remove the automatic trimming and summary features from Drupal 8. They are broken by design.
This issue clearly demonstrates how trimming is broken for technical reasons. But it's also broken at a more fundamental level: trimming can never be smart enough (until we have strong AI in Drupal 8 core, which will never ever make sense).
Summaries are also broken, for similar reasons: #2671162: Also use text editor (CKEditor) for "summary" of a text field was mentioned in IRC and sparked a related discussion. See below.
Comment #36
smustgrave (Mobomo)
This appears to be a duplicate of #2835615: Trimmed Formatter no work with html code space. which seems further along.
Will move over credit.