Problem/Motivation
This idea was initially proposed as part of #3440578: [PP-2] JSON-based data storage proposal for component-based page building but is split out into this separate issue per discussion in that issue.
Suppose you have a site with long-form articles where $node->body->value averages 5KB (~800 words with no formatting, or fewer words if there are also bytes taken up by HTML tags and attributes). If a given node has:
- 20 language translations
- 50 revisions per translation = 1000 total revisions
This now adds up to 5KB * 20 * 1000 = 100MB of body field data just for that one node. If you have 1000 such articles, you now have 100GB of just body field data.
Note that this scales with the square of the number of translations, because generally if you make a revision in one language, you also tend to revise the other translations to match, while at the same time each revision stores its own copy of the data for every language.
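To spell that arithmetic out with the numbers above (an illustrative sketch only; the variable names are made up for this example):

```php
<?php
// Illustrative arithmetic for the example above, in KB.
$kb_per_body_value = 5;          // ~5KB average body value per language
$languages = 20;                 // translations of the node
$revisions = 50 * $languages;    // 50 revisions per translation = 1000 revisions
// Each revision stores a field record for every language, so for one node:
$per_node_kb = $kb_per_body_value * $languages * $revisions;  // 100,000 KB ≈ 100MB
$site_total_kb = $per_node_kb * 1000;                         // ≈ 100GB for 1000 such nodes
```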
Meanwhile, a lot of this data is duplicated. At a minimum, each revision is storing a copy of the data for every language, including the 19 languages not changed by that revision.
But it's not only the duplication of the non-active translations; consider also the case of a multi-valued field. Suppose instead of a 5KB single-item body field, you have a multi-valued field with 10 items each containing 500 bytes, and for the majority of content edits (new revisions), you only change the value of 1 of those items. Now, in addition to each revision duplicating the field values of non-active translations, it's also duplicating the values of the 9 unedited items.
In other words, for a node with X field items and Y translations, a new revision where only 1 field item in 1 language is modified ends up storing X*Y field item values, where all but one are duplicates from other revisions.
Proposed resolution
Use an approach similar to string interning or hash consing, where instead of storing the large field value in the field item record, you store a hash of it there, and have a separate table that maps the hash value to the large value.
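As a minimal sketch of that idea, assuming a hypothetical side table named field_interned_value and using Drupal's database merge API (this is not the proposed schema, just the shape of the technique):

```php
<?php
/**
 * Interns a large field value: stores the hash => value mapping once and
 * returns the hash, which is what the field item record would then store.
 */
function example_intern_field_value(string $value): string {
  // xxh128 needs PHP 8.1+; any fast, collision-resistant hash would do.
  $hash = hash('xxh128', $value);
  // Write the mapping only if it doesn't exist yet; later revisions that
  // produce the same value simply re-use this row.
  \Drupal::database()->merge('field_interned_value')
    ->key('hash', $hash)
    ->fields(['value' => $value])
    ->execute();
  return $hash;
}
```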
The question then is how to decide when to apply this interning logic and when not to.
- Option 1: Make it the responsibility of the field type. For example, in addition to our current text_long field type, we could add a text_long_interned field type. And similarly for other field types that can hold large values, like field_union fields and json fields.
- Option 2: Make it similar to a field storage setting, but one that's managed by the field system rather than by each field type.
Data model changes
Whether done as new field types or as a new field storage setting, it would be opt-in, so no changes to existing sites. However, it's not yet clear whether we can integrate this into Views and EntityQuery in a fully encapsulated way, or whether the decision of whether to intern or not would leak into various other code.
Remaining tasks
Discuss and/or prototype.
Issue fork drupal-3469082
Comments
Comment #2
effulgentsia commented:
When I first posted this issue, I didn't know how feasible option 2 (implementing it at a central but opt-in level) would be, but having thought a bit more on it, I think it's quite feasible.
The key is deciding at what granularity to intern: the whole FieldItemList, each FieldItem, or each property (e.g., for a text_long field, intern the value property but not the format property). Of these, I think that:
Given this, here's a rough outline of a possible implementation:
Let's assume that $node->body is opted into interning via some configuration that's similar to a field storage setting. Then we want to modify the entity storage/schema handler(s) in the following way:
- Instead of adding body_value, body_summary, and body_format columns to the node_revision__body table, we only add a body_hash column.
- We add two new tables: node_revision_interned_item__body and node_revision_interned_item_list__body.
- The columns of node_revision_interned_item__body would be: entity_id, hash, body_value, body_summary, and body_format.
- The columns of node_revision_interned_item_list__body would be: entity_id, list_hash, delta, item_hash.
Now let's say $node->body has two items in it and it's time to save that. What we would write to the database is:
- For each item, a record in node_revision_interned_item__body mapping the hash of [value, summary, format] to those values.
- If the hash of item 0 is xxxxxx and the hash of item 1 is yyyyyy, then we calculate a single hash for ['xxxxxx', 'yyyyyy']. Let's say that's zzzzzz. Assuming this is for node 100, what we'd store in node_revision_interned_item_list__body is one row per delta: (entity_id 100, list_hash zzzzzz, delta 0, item_hash xxxxxx) and (entity_id 100, list_hash zzzzzz, delta 1, item_hash yyyyyy).
- In the node_revision__body table, we only store a single delta with 'zzzzzz' for the body_hash.
The entity_id column in the two interned tables isn't strictly necessary, since the hash is enough to uniquely map to the underlying values, but I'm suggesting it to make garbage collection easier. For example, it means when an entity is deleted, we can safely delete all of its interned records. Or, if we want to do more fine-grained garbage collection, we can do bulk/batch operations on a subset of entities at a time.
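For illustration, the two tables could be described with Drupal schema arrays along these lines (a sketch only; the column types, lengths, and indexes are guesses, not part of the proposal):

```php
<?php
// Hypothetical schema for the two interning tables described above.
$schema['node_revision_interned_item__body'] = [
  'fields' => [
    'entity_id'    => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE],
    'hash'         => ['type' => 'varchar_ascii', 'length' => 32, 'not null' => TRUE],
    'body_value'   => ['type' => 'text', 'size' => 'big'],
    'body_summary' => ['type' => 'text', 'size' => 'big'],
    'body_format'  => ['type' => 'varchar_ascii', 'length' => 255],
  ],
  'primary key' => ['hash'],
  'indexes' => ['entity_id' => ['entity_id']],
];
$schema['node_revision_interned_item_list__body'] = [
  'fields' => [
    'entity_id' => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE],
    'list_hash' => ['type' => 'varchar_ascii', 'length' => 32, 'not null' => TRUE],
    'delta'     => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE],
    'item_hash' => ['type' => 'varchar_ascii', 'length' => 32, 'not null' => TRUE],
  ],
  'primary key' => ['list_hash', 'delta'],
  'indexes' => ['entity_id' => ['entity_id']],
];
```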
Comment #3
longwave commented:
I think this could be simpler if we do this at FieldItem level?
- Instead of adding body_value, body_summary, and body_format columns to the node_revision__body table, we only add a body_hash column.
- We add a single interned storage table, node_revision__body__interned, with columns body_hash, body_value, body_summary, and body_format.
At insert/update time we just calculate the hash, create the interned storage row if it doesn't exist already, and store the hash in the revision table.
Loading/searching the data just becomes an additional SQL join to the revision table.
Garbage collection can still be done by deleting rows from the interned storage table where the hash does not appear in the revision table; this can be done out-of-band (on cron) because, apart from disk space, it doesn't matter if we have orphaned hashes.
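For illustration, that cron-time garbage collection could be a single delete with a subquery against the revision table (a hedged sketch using Drupal's database API and the table/column names suggested above):

```php
<?php
// Delete interned rows whose hash no longer appears in the revision table.
// Safe to run out-of-band: orphaned hashes only cost disk space until collected.
$connection = \Drupal::database();
$in_use = $connection->select('node_revision__body', 'r')
  ->fields('r', ['body_hash'])
  ->distinct();
$connection->delete('node_revision__body__interned')
  ->condition('body_hash', $in_use, 'NOT IN')
  ->execute();
```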
I had a thought that this could even be more generalised by having a single interned storage table per field type, which might provide slightly better effective compression ratios in some cases where the same data may appear in multiple fields, but the tradeoff there is that garbage collection is much more difficult (we would have to consider hashes from multiple tables), and I don't think the savings will be worth it.
Are we also assuming that this will only be used for revision storage? i.e. node__body will not use the interned strings?
This technique could also be useful in a related but slightly different way for the message column in the watchdog table, which can also grow quite large on some sites.
Comment #5
effulgentsia commented:
The MR is looking great! I'm thrilled to see this starting to come to life.
I think these are the key differences between #3 (what the MR is implementing) and what I suggested in #2:
- The table is named node_revision__body__interned instead of node_revision_interned_item__body. +1 to this. Much better name :)
- There is no entity_id column in node_revision__body__interned, the hypothesis being that we can do efficient enough garbage collection without it. +1 to this; that seems like a reasonable hypothesis.
- Interning happens only at the FieldItem level, without a 2nd level of interning at the FieldItemList level. Perhaps this is an okay place to start. My thinking in #2 for wanting to intern at both levels is for something like #3468272: Store the ComponentTreeStructure field property one row per component instance. If we store that as a multi-valued field, each item wouldn't be very large, but there could be a lot of items. And you might have many revision edits where all you're doing is changing prop values and not changing the layout at all. So in that case, having each revision only store a single hash instead of a hash per component seems beneficial. But this is less critical than the FieldItem level for prop values (and non-XB use cases), so I'd be fine with punting the FieldItemList level to a followup.
Yeah, perhaps I'm wrong, but I'm not foreseeing it having much benefit for the non-revision tables. Where we know duplication happens a lot is across revisions. I don't think duplication across nodes is a significant contributor to database size. And by not doing it for the non-revision tables, it means less disruption to existing Views, EntityQuery, etc. code, except where those things query the revision tables.
Oh, that's a cool idea! Separate issue please, but that would be neat if it would help there. Do you think it would make much impact if it's only that column, but not also the variables column?
Comment #6
nicxvan commented:
I think there may be a mistake in your math, or I'm missing something.
5KB * 20 * 1000 = 100MB
You already multiplied the 20 languages by 50 revisions to get 1000. Why are you multiplying by 20 again?
If I understand correctly then the math should be:
Your body size should be 5KB * 1000, which is 5MB, and 1000 articles is 5GB.
Still massive, for no particular benefit, but I think the scale is off by an order of magnitude.
This would still save GB in a db and is definitely worth it!
Let me know if I'm missing something, if not I can update the IS.
Comment #7
effulgentsia commented:
Thanks for checking the math! I appreciate having more eyes on this.
Because even though any given revision only edits one translation at a time, the revision table stores records for every language for each revision. For example, say you create a brand new node and translate it into 20 languages. Conceptually you've only made one revision per language, but Drupal stores it in the database as 20 revisions, and the 20th revision has records for all 20 languages. If you now edit the node to make a small change (say, add another sentence to $node->body), and then edit every translation to make that same change for each language, Drupal will have stored an additional 400 records, because each translation's change gets stored as a new revision, and each of those new revisions copies over the records for every language.
So if you then make 49 more changes, and translate each of those changes into each language, you'll end up with 1000 revisions and 20,000 records.
Comment #8
longwave commented:
Some more thoughts on this after sleeping on it:
Comment #9
nicxvan commented:
Ah, I was missing that detail. This will be a great change.
Will this have any impact on entity reference revisions? Paragraphs already get weird with translations and revisions.
Comment #10
longwave commented:
@nicxvan this should be suitable for any large field in any entity type that stores duplicated data, so it will help reduce database size with Paragraphs as well as Experience Builder.
Comment #11
joachim commented:
> Because even though any given revision only edits one translation at a time, the revision table stores records for every language for each revision
What's the reason for the translation/revision system doing this? Is it documented anywhere?
Comment #12
effulgentsia commented:
Not sure if/where it's documented, but I think the reason is so that every revision is complete. For example, to be able to visit en/node/1/revisions/1/view for any language and any revision ID and be able to display that by only loading the data for that revision, regardless of whether the language you're viewing it in matches the language that was edited by that revision.
Comment #13
nicxvan commented:
Great to hear this will help there too!
Comment #14
joachim commented:
> Not sure if/where it's documented, but I think the reason is so that every revision is complete
Instead of fiddling at the storage level, we could say that when you load revision N, if it doesn't have a particular language at that revision, you get the largest revision ID of that language < N.
Though I don't actually think it's very useful for editors to have the system the way it is. If you're the content editor working in German, for instance, your view of your translation is littered with pointless identical revisions. Why should you care if your colleague in the French office made 6 edits to the French translation?
Comment #15
catch commented:
Hashing: xxHash is probably what we should use these days - non-cryptographic but collision resistant.
On the overall reason why the storage is like this: like most things, it evolved over time and happened somewhat by default. We started off with translations in different entities altogether (D6) and the tnid column. Drupal 7 added language to field tables, but without an entity translation UI or support for base fields in core. Drupal 8 added the rest.
When we load and save an entity, we load and save the entire entity, not a slice of it. We would have had to implement per-language revision storage on top of all of the other things to make entity translation work, and I don't think it really came up.
This would need to work for entity revision queries and views too, including revision queries across different languages. It's likely to result in more complex and hard to index queries than what's being discussed here. e.g. here we're looking at an extra join on fields that are rarely included in an entity query. I think selective revision saving/loading would require MAX(vid) for every language or something. Imagine a condition on a field value in any language, or querying with both a translatable field value and a non-translatable field value as conditions. Maybe it could be simplified if the latest revision holds all languages, but then there is workspaces too.
An advantage of the approach here is it's isolated to field types that are rarely queried, and it would work for monolingual sites where the body is large and the only change is publishing/unpublishing or adding a term reference or similar.
Comment #16
longwave commented:
Back to my thoughts in #8:
Comment #17
longwave commented:
Some test failures around installed config; while I sort those out and add new test coverage, marking as NR to see if anyone has any early comments on the approach so far.
Will also need an upgrade path for existing config if we go this route.
Comment #18
nicxvan commented:
Is this trading runtime efficiency for storage efficiency?
I'm concerned this makes saving and loading fields slower for the majority of sites, due to hashing revisions. It also seems it will make querying revisions more complex.
I have a client with 10 languages and many nodes. When I shared this with them, they raised concerns about the performance trade-off, since CPU is more expensive than storage.
Comment #19
effulgentsia commented:
It's opt-in: a per-field configuration that defaults to false. So on each site you'll be able to choose which fields, if any, to apply it to.
For fields that opt into it, it adds an indexed JOIN to loading and querying revisions.
For fields that opt into it, it adds having to compute a hash for every field item for every translation when writing a revision (saving an entity). It's possible that this could be expensive (and we might want to consider future optimizations to track which items/translations are dirty and need re-hashing), but xxHash is very fast (so dirty tracking might not be needed).
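For illustration, the per-item hashing at save time could look roughly like this (a hedged sketch, assuming $node is a loaded content entity; using serialize() to get a stable byte string is just illustrative, and real code would live in the SQL storage handler):

```php
<?php
// For every translation and every field item, compute a hash of the item's
// stored properties; the revision table row would store only this hash.
foreach ($node->getTranslationLanguages() as $langcode => $language) {
  $translation = $node->getTranslation($langcode);
  foreach ($translation->get('body') as $delta => $item) {
    // xxh128 needs PHP 8.1+.
    $hash = hash('xxh128', serialize($item->getValue()));
    // ... write $hash to the revision table, and the full values to the
    // interned table only if this hash hasn't been seen before.
  }
}
```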
Comment #20
effulgentsia commented:
Just to set expectations, this might help some with Paragraphs but probably not as much as it will help with Experience Builder. For Paragraphs, it will help with individual fields that have long content (if those fields opt into it, per #19), such as text_long fields that content editors populate with a lot of text. However, for some sites, Paragraphs don't necessarily have a ton of content in any one field, they just have a lot of fields, and in those cases, the work in this issue won't help much with that.
Because Experience Builder will combine values for all fields within a component into a single JSON field or Field Union field, this issue will help with the "lots of fields with small content" use case as well as the "single fields with large content" use case.
Comment #21
effulgentsia commented:
I agree with 128-bit rather than 64-bit. https://en.wikipedia.org/wiki/Birthday_attack has a table that shows that with 64 bits, you'd have more than a 1 in a million chance of collision once your table grows above 6 million records.
However, instead of a 32 character hex string, we could reduce it to a 24 character base64 string.
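For illustration (assuming PHP 8.1+, which added the xxh128 algorithm to hash()): a 128-bit hash is 32 hex characters, or 24 characters when its 16 raw bytes are base64-encoded.

```php
<?php
$value = str_repeat('example body text ', 300);       // any large field value
$hex = hash('xxh128', $value);                        // 32-character hex string
$b64 = base64_encode(hash('xxh128', $value, TRUE));   // 16 raw bytes => 24-character base64 string
var_dump(strlen($hex), strlen($b64));                 // int(32), int(24)
```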
Comment #22
nicxvan commented:
That resolves our concerns!
I missed: https://git.drupalcode.org/project/drupal/-/merge_requests/9417/diffs#83...
Comment #23
smustgrave commented:
Seems the MR needs a manual rebase.
Comment #24
catch commented:
I think we should consider whether there might be better savings from using native MySQL compression, e.g. from #2857359: Make MySQL ROW_FORMAT configurable - we could potentially set that by default for new sites. Or whether both MySQL compression and this issue would combine to mean even greater space savings.
Comment #25
pere orga commented:
#2857359: Make MySQL ROW_FORMAT configurable is nice because it doesn't increase complexity in Drupal. But because MySQL compression is applied per page (typically 16KB, though it could be smaller), space would be saved only when the repeated data is located within the same page.
Also, it wouldn't reduce the size of database backups.
(And of course, it's MySQL-specific)
Comment #27
andyf commented:
Thanks all! Rebased against 211722d90c2fdedb17308f4e7999275c2b3df495.
I tested it out locally on a site with some large revisions (max 5MB) but is otherwise simple: one language, only the (single-value) body field gets large, < 20k nodes. I saw the dump size reduce to ~50% of its original size (uncompressed), and the MySQL (5.7) DB data directory fall to about 68% of the size, measured immediately after doing a fresh DB import.
Comment #28
smustgrave commented:
Believe this still needs an upgrade path. Also tagging for a change record to explain what the new key is doing.