Problem

The D7 Field API always creates revision database tables for fields, even though the nodes do not have revisioning turned on.
As a result, every field table in the database has exactly the same content as its revision table. My field_data_body table is 18.4 MB, and so is field_revision_body, and the data inside is identical. I don't have a single revision on my website, so why is there redundant information in these tables? As the database grows, this will only add overhead. Is there a reason for these tables to hold duplicate data? Is it safe to empty the revision tables?

Advantages of resolution

This nearly doubles SQL writes for nodes and hurts performance. Here is a case study showing that node write performance roughly doubles when using https://drupal.org/project/field_sql_norevisions.

Related issues

#120967: Separate revisions from node.module

Comments

agmin’s picture

The field_revision tables store one row per revision id, and rows there never change. The field_data tables store the node's current revision id and do not store old revision data; basically, the field_data tables only ever contain one entry per node. The field_revision tables will contain as many entries as the revisions you've created, preserving the revision data. So if you never use revisions, then yes, the tables will contain identical data.

Since revisions are a core part of Drupal I don't think there's anything you can do about this. You can probably delete the contents of the field_revision tables (please test first, don't take my word for it!), but as soon as you start creating/editing nodes the field_revision tables will repopulate.
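Before emptying anything, a sanity check along these lines should confirm the duplication. This is only a sketch against the stock D7 schema for the body field; the table and column names assume the default field_sql_storage layout:

```sql
-- Field data rows with no matching revision row. On a site that has
-- never created a revision this should come back as 0, i.e. every
-- field_data_body row has an identical counterpart in field_revision_body.
SELECT COUNT(*)
FROM field_data_body d
LEFT JOIN field_revision_body r
  ON d.entity_id = r.entity_id
  AND d.revision_id = r.revision_id
  AND d.delta = r.delta
  AND d.language = r.language
WHERE r.entity_id IS NULL;
```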

graytoby’s picture

That's what I was afraid of. I emptied these tables on a test instance with no visible side effects so far. They also don't repopulate when editing nodes.

Perhaps in Drupal 8, revisions could be disabled if the end user wishes. I will probably never use revisions, so the extra content in the database is not desired.

damien tournoud’s picture

Status: Active » Closed (works as designed)

This is by design. field_revision* is the canonical storage. field_data* is a denormalization to make it easier to query the data in some cases.

giorgio79’s picture

Hello,

If we have revisions turned off, the revisions table seems to be just a duplicate of the field table. In my case I have a pretty large table of several GB. Is it safe to truncate the revisions table if I have revisions turned off?

kscheirer’s picture

Status: Closed (works as designed) » Active

Sorry to reopen this, but I think this is very odd behavior as well.

We de-normalized our data and intentionally created duplicates in order to "make it easier to query the data in some cases". If all we're avoiding is a database table join, that doesn't sound like a great benefit to me; databases are good at that kind of thing. I also have a situation with a large number of nodes that have no revisions, so this duplication is costing several extra GB of database space that's just wasted.

If the decision to duplicate is final, could we abstract the Revisions code into a core module, and allow folks to disable this functionality completely if they never want revisions on their site?

kscheirer’s picture

Version: 7.8 » 8.x-dev
Category: support » feature
sinn’s picture

I need to store a few thousand rows in the database, and each row is around 4 MB. It seems absurd to store the same data in both field_data and field_revision, so I think I should use my own table for it. That isn't convenient, though, because I need it as a field.

js’s picture

As Drupal scales to suit larger sites, this becomes a greater problem. I have built hundreds of sites and never thought about this issue because the largest site to date had less than 30,000 nodes and lots of hardware.

My current site has over 200k nodes and will be growing substantially. I just moved one field into a /project/data table and saved 30 GB. I should probably have used a MongoDB field, but I haven't gone through that learning curve yet.

I have a few other fields, each with over 1 GB of redundant table data, and they will be growing as well.

It would be nice to reduce this DB load. I would like to use revisions at the field level; having them globally would not be good for this site. Only a few fields can be updated by users, and they are not the biggest fields.

It is obvious to me that this will be a feature in the future, likely as entity designs become more advanced for hybrid backends.

kevinquillen’s picture

Just noticed this as well.

A site we are building imports a data feed and has 120 fields, times two for the revision tables. We don't use, or plan to use, revisioning. There are thousands of records inserted into the data tables, which is fine, but since revisioning copies it all, there are half a million records across over a dozen revision tables, totalling over 50 MB each. This took the database from 100 MB to over 600 MB before it has even gone into production.

Would it be okay to simply write a cron hook, get field instances for custom entities, and run a db truncate against those revision tables? Would that affect anything?

js’s picture

It sounds like it.

Perhaps, for the sites that care, a snippet in a hook to NULL the revision.* fields.
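A minimal sketch of that idea as a D7 cron hook ("mymodule" is a placeholder; this assumes the default field_sql_storage backend and a site where nothing ever reads the revision tables). Truncation is destructive, so it would need testing on a copy of the database first:

```php
<?php

/**
 * Implements hook_cron().
 *
 * Hypothetical sketch only: empties the field_revision_* table of every
 * field that uses the default SQL storage. All field revision history
 * is lost, so test on a database copy before using anything like this.
 */
function mymodule_cron() {
  foreach (field_info_fields() as $field_name => $field) {
    // Only fields using the default SQL storage backend have
    // field_revision_* tables.
    if ($field['storage']['type'] == 'field_sql_storage') {
      $table = 'field_revision_' . $field_name;
      if (db_table_exists($table)) {
        db_truncate($table)->execute();
      }
    }
  }
}
```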

kscheirer’s picture

I found a post from Damien explaining more about the decision to denormalize. I'll link it here when I have time to track it down. As far as I can tell, he's the only person who knows much about when/why this happened.

kevinquillen’s picture

If you aren't using revisions, it just seems quite unnecessary...

peterx’s picture

The node table has the current version. You can access the field_revision tables at almost exactly the same speed as the field_data tables. You could drop the field_data tables completely and use only the field_revision tables. I cannot find a use case anywhere for the field_data tables from the access or performance perspectives.

In fact the whole database would be faster because you are reducing the number of tables and indexes. If the revision column in the index is a performance issue anywhere, move the revision column to the bottom of the primary index so the column is not used when you access unrevisioned data.

When reorganising the primary index, move delta below language. If a site uses language, the most common query would select by language and accept all deltas.
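The reordering suggested above might look like this hypothetical sketch; for reference, the stock D7 primary key on a field_revision_* table is (entity_type, entity_id, revision_id, deleted, delta, language):

```sql
ALTER TABLE field_revision_body
  DROP PRIMARY KEY,
  -- revision_id moved to the bottom and delta moved below language, so
  -- queries on unrevisioned data never need to touch the revision column.
  ADD PRIMARY KEY (entity_type, entity_id, deleted, language, delta, revision_id);
```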

The node_revision table contains only one column not in node, the log column. If the log column was changed to a field, the node_revision table could be dropped.

js’s picture

@peterx, that sounds ideal.

On a large site I am updating some core tables directly, and it would be nice not to deal with doubled tables, not to mention the extra space. I moved a couple of large fields to "data" tables to reduce the number of fields and save about 30 GB of database space (expensive, performance-oriented, and backup-sensitive space). I will be replicating the database, so that doubles the saving again. The site grows daily.

It would be nice to make changes to D7, since we will be using it for the next few years, considering the very slow uptake of D7 from D6.

kscheirer’s picture

Category: feature » bug
Priority: Normal » Major

Elevating to major; maybe this will get some more eyes. In Drupal 6 and below, we kept all the node body information in a revision table, and we should go back to that. As many have noted, this is a serious database bloat problem for large sites and gives us no performance benefit.

marcingy’s picture

Category: bug » feature
Priority: Major » Normal

Priorities have meanings: this is not major, and it is not a bug; things work as designed.

marcingy’s picture

Category: feature » task

Maybe it should be a task.

kevinquillen’s picture

I agree. My database is now at 532 tables because I am using around 150 different fields across multiple entities.

kscheirer’s picture

Major priority is also used for tasks or features which consensus has agreed are important (such as improving performance or code refactoring), but which are not functional bugs. From http://drupal.org/node/45111.

So what kind of consensus do we need?

js’s picture

@marcingy I disagree with you. This is a major design flaw for some sites, and it becomes more apparent with scale. For instance, a current project has pushed my database past 80 GB, which at Rackspace has doubled the cost of that server instance, plus doubled the cost of the slave, plus pushed the size past what Rackspace will back up.

Connecting to a large database with phpMyAdmin is very slow, and it is equally slow changing pages to view the additional tables. Having 50 to 100 extra tables exacerbates the problem for no benefit.

On large sites, custom code working directly with tables can be faster, since it avoids loading nodes (bundles); limiting the queries is both a cost and a performance benefit.

For the vast majority of sites, these factors are not a concern.

David_Rothstein’s picture

Component: database system » field system

Pretty sure this is entirely internal to the field_sql_storage module, meaning that (in theory) in Drupal 7 you could write an alternate storage engine in contrib that did it differently.

Doesn't look like field_sql_storage has its own component in the issue queue, but this has nothing to do with the database API either, so I'm moving the issue accordingly.

peterx’s picture

"Entirely internal to the field_sql_storage module"? If that is true and the storage engine is structured the right way, you should be able to modify the code in one place. Looking through just one website, I found 70 references to field_data and 14 references to field_revision in field storage. Building a modified field storage module looks doable.

In another site, there are hundreds of references in the add-on modules. By the look of most references, they are due to a lack of easy, fast access functions in Fields. The add-on could add a few extra functions to make it the preferred module for module developers.

kevinquillen’s picture

I am still of the mindset that if you are not using any revisioning, the tables should not be created; thus, the option should be present in configuration when using the default field_sql_storage, to prevent this from getting away from you.

That said, was the suggestion to duplicate field_sql_storage and remove revisions from the copy?

peterx’s picture

@kevinquillen, patching the field storage module would be a big change and could change the API. API changes are not allowed during a release. A separate add-on module can do anything immediately, which is the main reason for developing an alternative module.

The best approach would be to branch field storage and build a working version. If the working version includes an API change, it would have to stay a contributed module.

If there are no API changes, it could replace the module in core or be retrofitted. You would have to perform lots of testing. I expect most core developers will reject it in favour of all the other things happening in D8. Some site owners will still jump at it as a D7 add-on enhancement and some significant conversions to the add-on module will help push the change into D8.

mototribe’s picture

I'm also very interested in a contrib module to turn off revisions. Ideally, it should give a selection for each content type to enable or disable revisions.
If the revisions for a content type are turned off it would also turn off the revisions for any fields in this content type.

kevinquillen’s picture

Someone already beat me to it:

http://drupal.org/project/field_sql_norevisions

I want revisions off entirely; we find needing them a very rare use case.

gokul.muralidharan’s picture

Though there is only one duplicated record per field when revisions are not enabled, it is still a lot for high-volume sites. This is a large database load we need to carry around in memory. Has anyone had good success with field_sql_norevisions?

modstore’s picture

I have a site with a database of ~9 GB, and looking at table sizes, the database could basically be halved by not storing the duplicate node revision data.

I am hesitant, though, to try the field_sql_norevisions module, due to the possibility of future problems with contrib modules. Has anyone given it a go with lots of contrib modules?

andypost’s picture

I suppose it would be better to move node_revision into its own module (which looks like D9 material) and implement field_sql_norevisions-style storage that could be shipped with core.

kevinquillen’s picture

protools’s picture

laurentcooper’s picture

Version: 8.x-dev » 7.20

http://drupal.org/project/field_sql_norevisions would be a good solution, but:

It has bugs that break some sites.
It cannot be uninstalled if something breaks, because uninstalling destroys all content.

On the other hand, it has a major performance-increasing effect.

I can understand that these revision tables are there for a reason, but when revisioning is disabled, what is the reason?
This single problem makes me reconsider using Drupal, not because I am angry or anything, but because the difference these duplicate fields make in database size and table count is HUGE. I have many content types, and even more fields, with around 200k nodes so far. Please fix this somehow, or at least create a workaround that does not break Views :(

In the meantime I will try to make a module similar to field_sql_norevisions, without it being so destructive, but since I am not an experienced programmer, my chances are small. Anyway, I am following this thread; if anyone has an answer or ideas, please write it up.

kevinquillen’s picture

I think one of the reasons is that you cannot disable revisioning at a global level; modules like Drupal Commerce implement revisioning, other modules leverage the revisioning provided by the Entity API, and, as you said, there's Views. It seems messy.

A band-aid solution I came up with on a previous project was a cron task that simply truncated the tables that were far too large. It cut my database down by 75%. It does not solve the speed issue, though, for all the other contrib modules using revision tables.

I can't look for it right now, but I am sure there were discussions about the fact that if you have a site with 300 fields, you have 600 tables (not counting all the other non-Field tables). If you give a site editor the ability to create fields, they're going to do it, and it is going to get messy and hard to maintain.

Most people may not mind a database that is 3, 5, or 20 GB, but for an average Drupal site that is quite large, particularly if you have a dev team that needs to move it around daily (prod to stage, to local, to dev, etc.) to keep everyone moving.

Another option is to use something like MongoDB, but it is not really proven yet for these sites, and not many hosts will provide it for you.

marcingy’s picture

Version: 7.20 » 8.x-dev

Please do not change versions.

Triskelion’s picture

Drupal revision tracking should be rethought completely. Even on a small site, if hardware resources are limited, the data bloat can impact performance.

Revision tracking should be done on a field-by-field basis. Presently the vid in the node table is reflected by the revision_id in every field table. That means that every revision generates additional records in every field revision table, even if the field is unchanged.

If the revision_id is unique to the field, then the field_data_... tables would not have to store the field values, just the revision_id pointing to the current revision in the field_revision_... tables. A datestamp or hash could be added to the field revisions to group revisions that happened simultaneously, and a revision history could be established for each field. The uid of the user responsible for the revision could also be added for even better tracking, and the database bloat would be eliminated.
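A rough sketch of the scheme described above (hypothetical DDL for the body field, not the actual D7 schema; the column names are illustrative):

```sql
-- field_data_* shrinks to a pointer at the field's current revision.
CREATE TABLE field_data_body (
  entity_type VARCHAR(128) NOT NULL,
  entity_id   INT UNSIGNED NOT NULL,
  revision_id INT UNSIGNED NOT NULL,  -- current revision of this field
  PRIMARY KEY (entity_type, entity_id)
);

-- field_revision_* holds each value once, plus tracking columns.
CREATE TABLE field_revision_body (
  revision_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  entity_type VARCHAR(128) NOT NULL,
  entity_id   INT UNSIGNED NOT NULL,
  body_value  LONGTEXT,
  created     INT NOT NULL,           -- datestamp to group simultaneous edits
  uid         INT UNSIGNED NOT NULL,  -- user responsible for the revision
  PRIMARY KEY (revision_id)
);
```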

Drupal has been getting progressively heavier, and I think that we should be looking very seriously at performance issues like this before developers move away from Drupal because of these inefficiencies.

restyler’s picture

Mandatory revision tables mean a ~50% increase in writes during inserts and updates. That's a lot. We have a project with lots of new nodes created every day and a large number of fields, and writes (the node creation/update process) are definitely not fast, even on our dedicated server.

Triskelion’s picture

@restyler - Not if the data is stored only once, which is why I feel the field_data_... tables should be nothing more than a join between the entity and the field_revision_... tables. That way a single write would suffice, particularly if revision tracking is not implemented.

peterx’s picture

For some sites, revisions should be off as part of the site configuration. 95% of my sites would have them off at the site level if the option were available.

For the rest, the current system, or revisions by content type, would be best. If a content type has revisions, it is faster to store all the data at the same revision. The alternative would be a horrible join on revisions, fetching the most recent revision less than or equal to the current revision. What you save on the initial write, you would lose several times over on every read.
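That "horrible join" might look roughly like this sketch (assuming the stock D7 tables, with fields only writing rows for revisions where they actually changed):

```sql
-- For each node, fetch the newest body revision at or below the node's
-- current revision; a correlated subquery like this would run on every read.
SELECT r.*
FROM node n
JOIN field_revision_body r
  ON r.entity_id = n.nid
WHERE r.revision_id = (
  SELECT MAX(r2.revision_id)
  FROM field_revision_body r2
  WHERE r2.entity_id = n.nid
    AND r2.revision_id <= n.vid
);
```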

A bigger saving for revisions would be merging multiple columns into one row. Assume you have content types A, B, C, etc., and fields X, Y, and Z appear only in B and not in any other content type. The fastest access would be to place all those fields in the same table. Addressfield does that. For some of my sites, I replaced multiple field columns with Addressfield-style modules and the speed difference was noticeable.

Switching off revisions at the content type level would work, because the code already does special things at the content type level. The only extra logic needed would be to work out whether field X needs revisions. When field X is used in a content type that needs revisions, you have to look in the revision tables. The result would be that field X would always have a revision table, but that table would not be accessed for content types not using revisions. You reduce reads but are stuck with duplicate tables.

When a site needs point-in-time auditing, you need revisions on far more than fields; you need them on roles, taxonomy, everywhere. Revisions should have been a separate module with an API callable for any table. Now that we have the entity API, we should have an entity revision API.

The other thing that should be easy to integrate is the Workbench Moderation module. If we had a revision API and WM worked through revision moderation, we could handle the case where content is revised and roles are adjusted to change access to the revised content. A site manager could see both previous revisions and who could see those revisions at the time they were published.

+1 vote for an Entity revision API module.

xandeadx’s picture

any updates?

andypost’s picture

Version: 8.x-dev » 9.x-dev
Category: task » feature

This is a feature that was not RTBC before Feb 16, 2013, so it has been moved to the D9 queue.
Let's grow this in contrib during the D8 cycle.

pancho’s picture

Title: Duplicate data in field_data* and field_revision* tables » Deeply decouple revisions from Entity API and Node module
Version: 9.x-dev » 8.x-dev
Category: feature » task
Issue tags: +Performance, +node revisions

Re #40:
Why would this be a feature? It's a task that could dramatically improve performance for sites without revisions.
Still I absolutely agree that this looks like D9 material, simply because it's pretty much impossible to refactor the whole node storage that late in D8's release cycle.

However, if it turns out that D8 doesn't reach decent performance, we'd have to come back and look at major performance improvements like this one. D8 is still very slow, and while it will certainly improve, that may well not be enough. Therefore I'm wary of deferring this to D9 too early, even though there is no clear path to getting it into D8.

Otherwise I agree with #29:

I suppose it would be better to move node_revision into its own module (which looks like D9 material) and implement field_sql_norevisions-style storage that could be shipped with core.

Even if it turns out we can't decouple node_revision's backend in the D8 cycle, for now it would make sense to start with decoupling the UI; see #120967: Separate revisions from node.module. Some 1,000 fewer lines of code are at least something.

And yes, #38:

+1 vote for an Entity revision API module.

this would be the correct scope.

pere orga’s picture

Priority: Normal » Major

I don't know if 8 or 9, but I agree that this would be a major improvement.

swentel’s picture

Priority: Major » Normal
pancho’s picture

Status: Active » Postponed

Oh, #1497374: Switch from Field-based storage to Entity-based storage is still on the D8 track!
So great, let's postpone this one to it, to be a possible followup.

giorgio79’s picture

giorgio79’s picture

Issue summary: View changes

clarification

giorgio79’s picture

Issue summary: View changes

more

jibran’s picture

swentel’s picture

swentel’s picture

Issue summary: View changes

even more

giorgio79’s picture

nithinkolekar’s picture

Rephrasing the OP's original issue:

D7 Field API always creates revision db field tables even though the nodes do not have revisioning turned on.

Wouldn't it be nice to be able to disable that option itself, if this won't be fixed or added as a feature in D7? I still agree with kscheirer's comment: any option provided in core that does not function as it should and raises a performance issue should be considered a bug rather than a feature/support/task.

berdir’s picture

Status: Active » Closed (works as designed)

If you look at D8, HEAD already no longer creates the revision field tables if the *entity type* does not support revisions. See fields on users for example.

D7 Field API always creates revision db field tables even though the nodes do not have revisioning turned on.

There is no setting to turn revisions on or off on nodes. The only thing you can control is if a new revision should be created *by default* when saving a node. It is always possible to create a revision manually if you have the necessary permission.

giorgio79’s picture

berdir:

Closed (works as designed)

Yes, works as designed, but that design is not necessarily good as explained in the issue. Unnecessary duplicate sql tables are bad.

#2083451: Reconsider the separate field revision data tables outlines the issues perfectly, especially comment https://www.drupal.org/node/2083451#comment-7841725

It may be sufficient to keep one of the issues open.

root66’s picture

Is it possible to change the revision data tables from InnoDB to MyISAM in order to decrease the InnoDB database size, so it fits completely into RAM?

soul88’s picture

The short answer is that you wouldn't want to do that, as MyISAM doesn't support transactions. The long answer is that such questions should be asked and discussed in the forums (https://www.drupal.org/forum) or other support channels (https://www.drupal.org/support), not in the D8 core issue queue.

hkoosha’s picture

What's the status on this in 2019?

kscheirer’s picture

Looks like all the discussion never went anywhere. #2083451: Reconsider the separate field revision data tables petered out around 4 years ago. https://www.drupal.org/project/field_sql_norevisions never got a D8 port.