Problem / Motivation

Currently, the author column in {aggregator_items} is a 255-character varchar. However, there are definitely feed items out there where the author entry is longer than 255 characters. An example is an article authored by multiple people, as per the example below:

PDOException: SQLSTATE[22001]: String data, right truncated: 1406 Data too long for column 'author' at row 1: INSERT INTO {aggregator_item} (title, link, author, description, guid, timestamp, fid) VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3, :db_insert_placeholder_4, :db_insert_placeholder_5, :db_insert_placeholder_6); Array ( [:db_insert_placeholder_0] => Biodiversity loss and its impact on humanity [:db_insert_placeholder_1] => http://feeds.nature.com/~r/nature/rss/current/~3/R0f1-VRuwlw/nature11148 [:db_insert_placeholder_2] => Bradley J. Cardinale J. Emmett Duffy Andrew Gonzalez David U. Hooper Charles Perrings Patrick Venail Anita Narwani Georgina M. Mace David Tilman David A. Wardle Ann P. Kinzig Gretchen C. Daily Michel Loreau James B. Grace Anne Larigauderie Diane S. Srivastava Shahid Naeem [:db_insert_placeholder_3] => Biodiversity loss and its impact on humanity Nature 486, 7401 (2012). doi:10.1038/nature11148 Authors: Bradley J. Cardinale, J. Emmett Duffy, Andrew Gonzalez, David U. Hooper, Charles Perrings, Patrick Venail, Anita Narwani, Georgina M. Mace, David Tilman, David A. Wardle, Ann P. Kinzig, Gretchen C. Daily, Michel Loreau, James B. Grace, Anne Larigauderie, Diane S. Srivastava & Shahid Naeem. The most unique feature of Earth is the existence of life, and the most extraordinary feature of life is its diversity. Approximately 9 million types of plants, animals, protists and fungi inhabit the Earth. So, too, do 7 billion people. Two decades ago, at the first Earth Summit, the vast majority of the world’s nations declared that human actions were dismantling the Earth’s ecosystems, eliminating genes, species and biological traits at an alarming rate. This observation led to the question of how such loss of biological diversity will alter the functioning of ecosystems and their ability to provide society with the goods and services needed to prosper. Only local images are allowed. [:db_insert_placeholder_4] => [:db_insert_placeholder_5] => 1338933600 [:db_insert_placeholder_6] => 33 ) in aggregator_save_item() (line 156 of .../modules/aggregator/aggregator.processor.inc).

Proposed resolutions

Truncate the author list to 255 characters. This is the approach recommended as a general rule in #1284658: MySQL strict mode is causing PDO Exceptions when lengths of varchar are exceeded. We took this approach with the aggregator_item title field in #1244116: Feed items with title longer than 255 characters fail to insert into aggregator_item as titles were being encountered with more than 255 characters.

Remaining tasks

#1811458: User should be able to set the length of the aggregator field e.g. the author field is a followup issue to discuss whether we should increase the size of the database field to allow longer author texts.

Files: 
CommentFileSizeAuthor
#19 aggregator-test-only-1622974-19.patch3.57 KBiflista
FAILED: [[SimpleTest]]: [MySQL] 39,463 pass(es), 24 fail(s), and 14 exception(s).
[ View ]
#18 aggregator-1622974-18.patch4.55 KBiflista
PASSED: [[SimpleTest]]: [MySQL] 39,503 pass(es).
[ View ]
#18 interdiff.txt4.7 KBiflista
#15 interdiff-14.txt1.34 KBxjm
#14 aggregator-1622974-14.patch5.03 KBxjm
PASSED: [[SimpleTest]]: [MySQL] 42,127 pass(es).
[ View ]
#14 interdiff.txt0 bytesxjm
#13 aggregator-1622974-13-tests.patch4.02 KBxjm
FAILED: [[SimpleTest]]: [MySQL] 42,077 pass(es), 24 fail(s), and 2 exception(s).
[ View ]
#13 aggregator-1622974-13-combined.patch5.02 KBxjm
PASSED: [[SimpleTest]]: [MySQL] 42,115 pass(es).
[ View ]
#13 interdiff.txt905 bytesxjm
#11 aggregator-1622974-11-test.patch3.13 KBxjm
FAILED: [[SimpleTest]]: [MySQL] 42,067 pass(es), 24 fail(s), and 2 exception(s).
[ View ]
#11 aggregator-1622974-11-combined.patch4.13 KBxjm
FAILED: [[SimpleTest]]: [MySQL] 42,097 pass(es), 12 fail(s), and 0 exception(s).
[ View ]
#10 aggregator-1622974-10-tests.patch3.2 KBxjm
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch aggregator-1622974-10-tests.patch. Unable to apply patch. See the log in the details link for more information.
[ View ]
#10 aggregator-1622974-10-combined.patch4.13 KBxjm
FAILED: [[SimpleTest]]: [MySQL] 42,097 pass(es), 12 fail(s), and 0 exception(s).
[ View ]

Comments

DocuAnt’s picture

To truncate the author field seems to be the right way for me because of the similarity to the title field.
Thanks in advance for the patch. :-)

marcingy’s picture

Title was truncated as the spec states the maximum length is 100 characters. I suppose the key question is the likelihood of someone querying against author but in the case of the example above it is 'authors'. I am incline to say make it a text field because in reality the above example shows that in truncating would invalidate a search anyway.

Pomliane’s picture

Priority:Normal» Major
Issue tags:+Needs backport to 6.x

Changing priority and tags with reference to #218004: Allow aggregator feed URLs longer than 255 characters

marcingy’s picture

Priority:Major» Normal

This is not a major bug.

marcingy’s picture

Priority:Normal» Major

Note to self read linked issue first ;)

cweagans’s picture

xjm’s picture

In this case I think we should truncate it. The URL is a special case.

joevansteen’s picture

Truncation is the approach I used to get a quick and dirty fix for myself under D7. I ran into this same scenario with a different "Nature" article. Scientific journals would seem more the culprits here. The only thing that might be a little cleaner is to try to avoid chopping a fraction of a name, and maybe inserting a "[more]" notice at the end to replace whatever is at the back of the field.

xjm’s picture

Assigned:Unassigned» xjm
Issue tags:+Needs tests

I'll write a test.

xjm’s picture

Assigned:xjm» Unassigned
Status:Active» Needs review
Issue tags:-Needs tests
StatusFileSize
new4.13 KB
FAILED: [[SimpleTest]]: [MySQL] 42,097 pass(es), 12 fail(s), and 0 exception(s).
[ View ]
new3.2 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch aggregator-1622974-10-tests.patch. Unable to apply patch. See the log in the details link for more information.
[ View ]

@joevansteen, the truncate_utf8() function takes care of that for us (adding an ellipsis).

Tests and fix.

xjm’s picture

StatusFileSize
new4.13 KB
FAILED: [[SimpleTest]]: [MySQL] 42,097 pass(es), 12 fail(s), and 0 exception(s).
[ View ]
new3.13 KB
FAILED: [[SimpleTest]]: [MySQL] 42,067 pass(es), 24 fail(s), and 2 exception(s).
[ View ]

Oopsie, broken tests patch above. Fixed here.

Status:Needs review» Needs work

The last submitted patch, aggregator-1622974-11-combined.patch, failed testing.

xjm’s picture

Status:Needs work» Needs review
StatusFileSize
new905 bytes
new5.02 KB
PASSED: [[SimpleTest]]: [MySQL] 42,115 pass(es).
[ View ]
new4.02 KB
FAILED: [[SimpleTest]]: [MySQL] 42,077 pass(es), 24 fail(s), and 2 exception(s).
[ View ]

Adjusting another test that relies on the number of items in the sample feed.

xjm’s picture

StatusFileSize
new0 bytes
new5.03 KB
PASSED: [[SimpleTest]]: [MySQL] 42,127 pass(es).
[ View ]
+++ b/core/modules/aggregator/lib/Drupal/aggregator/Tests/FeedParserTest.phpundefined
@@ -38,6 +38,15 @@ function testRSS091Sample() {
+    $this->assertText('Long link feed item title');
+    $this->assertText('Long link feed item title');
...
+    $this->assertText('Long author feed item title');
+    $this->assertLinkByHref('http://example.com/long/author');
+    $this->assertText('Long author feed item title');

Oops, bad copypasta. Fixed in attached.

xjm’s picture

StatusFileSize
new1.34 KB

Upload failed apparently.

Berdir’s picture

Status:Needs review» Reviewed & tested by the community

Simple fix, has tests. Let's get this in.

+++ b/core/modules/aggregator/lib/Drupal/aggregator/Tests/FeedParserTest.phpundefined
@@ -38,6 +38,15 @@ function testRSS091Sample() {
+    $this->assertLinkByHref('http://example.com/tomorrow/and/tomorrow/and/tomorrow/creeps/in/this/petty/pace/from/day/to/day/to/the/last/syllable/of/recorded/time/and/all/our/yesterdays/have/lighted/fools/the/way/to/dusty/death/out/out/brief/candle/life/is/but/a/walking/shadow/a/poor/player/that/struts/and/frets/his/hour/upon/the/stage/and/is/heard/no/more/it/is/a/tale/told/by/an/idiot/full/of/sound/and/fury/signifying/nothing');

+++ b/core/modules/aggregator/tests/aggregator_test_rss091.xmlundefined
@@ -26,5 +26,16 @@
+      <author>I wanted to get out and walk eastward toward the park through the soft twilight, but each time I tried to go I became entangled in some wild, strident argument which pulled me back, as if with ropes, into my chair. Yet high over the city our line of yellow windows must have contributed their share of human secrecy to the casual watcher in the darkening streets, and I was him too, looking up and wondering. I was within and without, simultaneously enchanted and repelled by the inexhaustible variety of life.</author>

o.0

webchick’s picture

Version:8.x-dev» 7.x-dev
Status:Reviewed & tested by the community» Patch (to be ported)

Agreed with the fix, and thanks for the literature lesson in the process of the tests. ;)

Committed and pushed to 8.x. Moving to 7.x.

iflista’s picture

Assigned:Unassigned» iflista
Status:Patch (to be ported)» Needs review
StatusFileSize
new4.7 KB
new4.55 KB
PASSED: [[SimpleTest]]: [MySQL] 39,503 pass(es).
[ View ]

Backported to D7. Needs review.

iflista’s picture

StatusFileSize
new3.57 KB
FAILED: [[SimpleTest]]: [MySQL] 39,463 pass(es), 24 fail(s), and 14 exception(s).
[ View ]

Added test only version of patch.

Status:Needs review» Needs work
Issue tags:-needs backport to D7, -Needs backport to 6.x

The last submitted patch, aggregator-test-only-1622974-19.patch, failed testing.

xjm’s picture

Status:Needs work» Needs review
Issue tags:+needs backport to D7, +Needs backport to 6.x

#18: aggregator-1622974-18.patch queued for re-testing.

Berdir’s picture

Status:Needs review» Reviewed & tested by the community

Backport looks good too, RTBC (Patch in #18, #19 is test only)

no_longer_active_13’s picture

Status:Reviewed & tested by the community» Needs work

Truncating the author names is sorta not done in the publishing world AFAIK. It could lead to people not getting enough credit for their work. For example with research publications it is not uncommon that there are more than 3 authors.

It should really accommodate for all authors of a (research) publication. If there are multiple author names added to a publication then they should all be listed. How would you feel if your name got lobbed of a project because of an IT 'feature'? Not to happy, I imagine. ;-)

Could use a text field for example, and possibly limit this to for example 1500 characters. If people still run into trouble after that spec could always be upped. Having some kind of cap might be good though, for example standard allow for 30 really long names (50 names of 30 characters is 1500 characters). Perhaps make the number of characters user configurable? This could still leave the issue that names will get cut off.

Also if there is not already there should be some text in for example a readme.txt or at entry level that specifies how many characters are allowed per field. This way if the text is being lobbed off then it is more easy to discover why.

Edit from what I read here:
http://stackoverflow.com/questions/1303476/is-a-varchar20000-valid-in-mysql

In MySQL 5.0.3 and later versions varchar is allowed up to 65,535 characters. If that is the case could just set the varchar() to 1500 characters for example. This may be the least amount of work, while providing the solution. Could also do the same for the URL field. This needs further testing though.

Drupal 7 has as minimum MySQL spec 5.0.15.
What will the minimal MySQL spec for Drupal 8 be? It could be standardly implemented from 8. This would then still need a fix for 7.

xjm’s picture

Assigned:iflista» Unassigned
Status:Needs work» Reviewed & tested by the community

Thanks @design_dolphin. See #218004: Allow aggregator feed URLs longer than 255 characters for extensive discussion of why we are limiting the varchar, and why we used a text there. Also keep in mind that not all sites use MySQL for their storage.

My reasoning for truncation was that in each case, the author data is just a summary of the RSS item. The aggregator item still has a link to the actual item, so the full title, author, etc. information will be included there.

The current behavior when you have an author field over 255 characters isn't that the author is partially imported--it's that there's a fatal error that prevents any feed data from being imported at all. That's why this is a major bug. So let's commit #18 as a stopgap to resolve this major bug on production sites impacted by it, and then open a followup issue to discuss increasing the allowed length?

no_longer_active_13’s picture

Status:Reviewed & tested by the community» Needs work

o.k., I can see the reasoning behind that, and the need for a solution.

However, AFAIK there is no way to tell how the names are coming through the author field. (I find RSS feed a pain at times because they get improperly filled at the source.)

So how will varchar know if it is cutting of the persons first name or last name?
That might lead to strange displays of data. For example:
'John Smith, Jane Smith, Dr. Joe'

Would adding a text that states something like "Please see article for the full list of authors." if the author text exceeds the varchar limit be a cleaner solution?

For example something like this alfa pseudocode:

if (strlen($item['author']) => 255) {
$item['author'] = t('Please see article for the full list of authors.');
}

This won't solve the problem, however, in situations where the publisher insists on having the author names there.

How sure is it that the string coming through is UTF8? I've seen RSS feeds with all sorts of encoding. Won't this mess special characters up then? (I couldn't find any definitive info on aggregator encoding at the moment.)

If we're changing the field in the database to text, then a better solution may be to at the same time increase the string length entered for the field. Showing all the author names would solve the problem as well.

Edit:
Changed some text.

xjm’s picture

Status:Needs work» Reviewed & tested by the community

Please open a followup issue?

xjm’s picture

Reiterating, this issue is not changing any feature of Drupal 7. It's only stopping the whole feed import from breaking. Whether to lengthen the field is worth discussing, but it should not block this.

webchick’s picture

Version:7.x-dev» 6.x-dev
Status:Reviewed & tested by the community» Patch (to be ported)
Issue tags:-needs backport to D7

While I can definitely appreciate design_dolphin's point, xjm's correct that in this particular instance, I think truncating is the right thing to do, because:

- Aggregator items are ephemeral; reloading http://drupal.org/planet twice a week is going to come up with a completely different list of stuff. These are not intended to be reference pieces that stick around for a long time. AFAIK they don't even have their own URLs.
- Aggregator items are specifically intended to be a jumping off point to the "real" work that exists somewhere else, and will contain the full article, along with its full title and list of authors.
- Changing the data type of a field is not as easy as it sounds, due to cross-database concerns as well as query performance concerns.

If we were talking about something like Feeds module, whose primary purpose is to import this external content as actual site data, then I think the truncation argument makes a lot less sense. But that's not what we're talking abut here.

So, in the interest of closing down a major bug whose consequence is a database failure, Committed and pushed to 7.x. Thanks!

Now, to 6.x. :)

no_longer_active_13’s picture

I can understand the necessity for the current implementation, but I cannot help feel it is broken. Author credit can be a touchy subject in publishing. This is certainly the case when publishing content through a feed.

I don't have a whole lot of experience with the issue cue, so learning as I'm going (after having read the manual).

From what I understand the general consensus is that a follow up issue is necessary.
The follow up issue can be found here:
http://drupal.org/node/1811458

Please feel free to improve this issue where needed.

Have a nice day. :-)

xjm’s picture

Thanks @design_dolphin, that followup looks good. I'll link it in this issue's summary so folks investigating this issue can find it easily.

dcam’s picture

Should this issue be backported to 6.x? The related 255-character title field did not get ported to 6.x.

To be honest, I want to make certain because the 6.x code is quite a bit different than it is in 7.x.

webchick’s picture

Status:Patch (to be ported)» Postponed (maintainer needs more info)

Actually, that's a good point. Can you test it and see if a feed with a super long author field causes the site to die with a horrible database failure like in the original post? I suspect not, since D6 just uses db_query() wrappers and not PDO. If not, I think we can move back to 7.x and fixed. Sorry, I was just following the tags here. :)

Berdir’s picture

Yes, I believe this error only happens in 7.x because we enable the mysql strict mode, which isn't on in 6.x so MySQL should actually just anything that's too long. On the other side, it might break PostgreSQL, which 6.x kinda supports.

David_Rothstein’s picture

Issue tags:+7.17 release notes

Adding this one to the release notes and CHANGELOG.txt.

David_Rothstein’s picture

Issue summary:View changes

Updated issue summary.