Problem/Motivation
This idea was initially proposed as part of #3440578: [PP-2] JSON-based data storage proposal for component-based page building but is split out into this separate issue per discussion in that issue.
Suppose you have a site with long-form articles where $node->body->value averages 5KB (~800 words with no formatting, or fewer words if there are also bytes taken up by HTML tags and attributes). If a given node has:
- 20 language translations
- 50 revisions per translation = 1000 total revisions
This now adds up to 5KB * 20 * 1000 = 100MB of body field data just for that one node. If you have 1000 such articles, you now have 100GB of just body field data.
Note that this scales with the square of the number of translations, because generally if you make a revision in one language, you also tend to revise the other translations to match, while at the same time each revision stores its own copy of the data for every language.
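To spell that arithmetic out with the numbers above (an illustrative sketch only; the variable names are made up for this example):

```php
<?php
// Illustrative arithmetic for the example above, in KB.
$kb_per_body_value = 5;          // ~5KB average body value per language
$languages = 20;                 // translations of the node
$revisions = 50 * $languages;    // 50 revisions per translation = 1000 revisions
// Each revision stores a field record for every language, so for one node:
$per_node_kb = $kb_per_body_value * $languages * $revisions;  // 100,000 KB ≈ 100MB
$site_total_kb = $per_node_kb * 1000;                         // ≈ 100GB for 1000 such nodes
```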
Meanwhile, a lot of this data is duplicated. At a minimum, each revision is storing a copy of the data for every language, including the 19 languages not changed by that revision.
But it's not only the duplication of the non-active translations; consider also the case of a multi-valued field. Suppose instead of a 5KB single-item body field, you have a multi-valued field with 10 items each containing 500 bytes, and for the majority of content edits (new revisions), you only change the value of 1 of those items. Now, in addition to each revision duplicating the field values of non-active translations, it's also duplicating the values of the 9 unedited items.
In other words, for a node with X field items and Y translations, a new revision where only 1 field item in 1 language is modified ends up storing X*Y field item values, where all but one are duplicates from other revisions.
Proposed resolution
Use an approach similar to string interning or hash consing, where instead of storing the large field value in the field item record, you store a hash of it there, and have a separate table that maps the hash value to the large value.
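As a minimal sketch of that idea, assuming a hypothetical side table named field_interned_value and using Drupal's database merge API (this is not the proposed schema, just the shape of the technique):

```php
<?php
/**
 * Interns a large field value: stores the hash => value mapping once and
 * returns the hash, which is what the field item record would then store.
 */
function example_intern_field_value(string $value): string {
  // xxh128 needs PHP 8.1+; any fast, collision-resistant hash would do.
  $hash = hash('xxh128', $value);
  // Write the mapping only if it doesn't exist yet; later revisions that
  // produce the same value simply re-use this row.
  \Drupal::database()->merge('field_interned_value')
    ->key('hash', $hash)
    ->fields(['value' => $value])
    ->execute();
  return $hash;
}
```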
The question then is how to decide when to apply this interning logic and when not to.
- Option 1: Make it the responsibility of the field type. For example, in addition to our current text_long field type, we could add a text_long_interned field type. And similarly for other field types that can hold large values, like field_union fields and json fields.
- Option 2: Make it similar to a field storage setting, but one that's managed by the field system rather than by each field type.
Data model changes
Whether done as new field types or as a new field storage setting, it would be opt-in, so no changes to existing sites. However, it's not yet clear whether we can integrate this into Views and EntityQuery in a fully encapsulated way, or whether the decision of whether to intern or not would leak into various other code.
Remaining tasks
Discuss and/or prototype.
Issue fork drupal-3469082
Comments
Comment #2
effulgentsia commented:
When I first posted this issue, I didn't know how feasible option 2 (implementing it at a central but opt-in level) would be, but having thought a bit more on it, I think it's quite feasible.
The key is deciding at what granularity to intern: the whole FieldItemList, each FieldItem, or each property (e.g., for a text_long field, intern the value property but not the format property). Of these, I think that:
Given this, here's a rough outline of a possible implementation:
Let's assume that $node->body is opted into interning via some configuration that's similar to a field storage setting. Then we want to modify the entity storage/schema handler(s) in the following way:
- Instead of adding body_value, body_summary, and body_format columns to the node_revision__body table, we only add a body_hash column.
- We add two new tables: node_revision_interned_item__body and node_revision_interned_item_list__body.
- The columns of node_revision_interned_item__body would be: entity_id, hash, body_value, body_summary, and body_format.
- The columns of node_revision_interned_item_list__body would be: entity_id, list_hash, delta, item_hash.
Now let's say $node->body has two items in it and it's time to save that. What we would write to the database is:
- For each item, a record in node_revision_interned_item__body mapping the hash of [value, summary, format] to those values.
- If the hash of item 0 is xxxxxx and the hash of item 1 is yyyyyy, then we calculate a single hash for ['xxxxxx', 'yyyyyy']. Let's say that's zzzzzz. Assuming this is for node 100, what we'd store in node_revision_interned_item_list__body is one row per delta: (entity_id 100, list_hash zzzzzz, delta 0, item_hash xxxxxx) and (entity_id 100, list_hash zzzzzz, delta 1, item_hash yyyyyy).
- In the node_revision__body table, we only store a single delta with 'zzzzzz' for the body_hash.
The entity_id column in the two interned tables isn't strictly necessary, since the hash is enough to uniquely map to the underlying values, but I'm suggesting it to make garbage collection easier. For example, it means when an entity is deleted, we can safely delete all of its interned records. Or, if we want to do more fine-grained garbage collection, we can do bulk/batch operations on a subset of entities at a time.
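For illustration, the two tables could be described with Drupal schema arrays along these lines (a sketch only; the column types, lengths, and indexes are guesses, not part of the proposal):

```php
<?php
// Hypothetical schema for the two interning tables described above.
$schema['node_revision_interned_item__body'] = [
  'fields' => [
    'entity_id'    => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE],
    'hash'         => ['type' => 'varchar_ascii', 'length' => 32, 'not null' => TRUE],
    'body_value'   => ['type' => 'text', 'size' => 'big'],
    'body_summary' => ['type' => 'text', 'size' => 'big'],
    'body_format'  => ['type' => 'varchar_ascii', 'length' => 255],
  ],
  'primary key' => ['hash'],
  'indexes' => ['entity_id' => ['entity_id']],
];
$schema['node_revision_interned_item_list__body'] = [
  'fields' => [
    'entity_id' => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE],
    'list_hash' => ['type' => 'varchar_ascii', 'length' => 32, 'not null' => TRUE],
    'delta'     => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE],
    'item_hash' => ['type' => 'varchar_ascii', 'length' => 32, 'not null' => TRUE],
  ],
  'primary key' => ['list_hash', 'delta'],
  'indexes' => ['entity_id' => ['entity_id']],
];
```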
Comment #3
longwave commented:
I think this could be simpler if we do this at FieldItem level?
- Instead of adding body_value, body_summary, and body_format columns to the node_revision__body table, we only add a body_hash column.
- We add a single interned storage table, node_revision__body__interned, with columns body_hash, body_value, body_summary, and body_format.
At insert/update time we just calculate the hash, create the interned storage row if it doesn't exist already, and store the hash in the revision table.
Loading/searching the data just becomes an additional SQL join to the revision table.
Garbage collection can still be done by deleting rows from the interned storage table where the hash does not appear in the revision table; this can be done out-of-band (on cron) because, apart from disk space, it doesn't matter if we have orphaned hashes.
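For illustration, that cron-time garbage collection could be a single delete with a subquery against the revision table (a hedged sketch using Drupal's database API and the table/column names suggested above):

```php
<?php
// Delete interned rows whose hash no longer appears in the revision table.
// Safe to run out-of-band: orphaned hashes only cost disk space until collected.
$connection = \Drupal::database();
$in_use = $connection->select('node_revision__body', 'r')
  ->fields('r', ['body_hash'])
  ->distinct();
$connection->delete('node_revision__body__interned')
  ->condition('body_hash', $in_use, 'NOT IN')
  ->execute();
```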
I had a thought that this could even be more generalised by having a single interned storage table per field type, which might provide slightly better effective compression ratios in some cases where the same data may appear in multiple fields, but the tradeoff there is that garbage collection is much more difficult (we would have to consider hashes from multiple tables), and I don't think the savings will be worth it.
Are we also assuming that this will only be used for revision storage? i.e. node__body will not use the interned strings?
This technique could also be useful in a related but slightly different way for the message column in the watchdog table, which can also grow quite large on some sites.
Comment #5
effulgentsia commented:
The MR is looking great! I'm thrilled to see this starting to come to life.
I think these are the key differences between #3 (what the MR is implementing) and what I suggested in #2:
- The table is named node_revision__body__interned instead of node_revision_interned_item__body. +1 to this. Much better name :)
- There is no entity_id column in node_revision__body__interned, the hypothesis being that we can do efficient enough garbage collection without it. +1 to this; that seems like a reasonable hypothesis.
- Interning happens only at the FieldItem level, without a 2nd level of interning at the FieldItemList level. Perhaps this is an okay place to start. My thinking in #2 for wanting to intern at both levels is for something like #3468272: Store the ComponentTreeStructure field property one row per component instance. If we store that as a multi-valued field, each item wouldn't be very large, but there could be a lot of items. And you might have many revision edits where all you're doing is changing prop values and not changing the layout at all. So in that case, having each revision only store a single hash instead of a hash per component seems beneficial. But this is less critical than the FieldItem level for prop values (and non-XB use cases), so I'd be fine with punting the FieldItemList level to a followup.
Yeah, perhaps I'm wrong, but I'm not foreseeing it having much benefit for the non-revision tables. Where we know duplication happens a lot is across revisions. I don't think duplication across nodes is a significant contributor to database size. And by not doing it for the non-revision tables, it means less disruption to existing Views, EntityQuery, etc. code, except where those things query the revision tables.
Oh, that's a cool idea! Separate issue please, but that would be neat if it would help there. Do you think it would make much impact if it's only that column, but not also the variables column?
Comment #6
nicxvan commented:
I think there may be a mistake in your math, or I'm missing something.
5KB * 20 * 1000 = 100MB
You already multiplied the 20 languages by 50 revisions to get 1000. Why are you multiplying by 20 again?
If I understand correctly then the math should be:
Your body size should be 5KB * 1000, which is 5MB, and 1000 articles is 5GB.
Still massive, for no particular benefit, but I think the scale is off by an order of magnitude.
This would still save GB in a db and is definitely worth it!
Let me know if I'm missing something, if not I can update the IS.
Comment #7
effulgentsia commented:
Thanks for checking the math! I appreciate having more eyes on this.
Because even though any given revision only edits one translation at a time, the revision table stores records for every language for each revision. For example, say you create a brand new node and translate it into 20 languages. Conceptually you've only made one revision per language, but Drupal stores it in the database as 20 revisions, and the 20th revision has records for all 20 languages. If you now edit the node to make a small change (say, add another sentence to $node->body), and then edit every translation to make that same change for each language, Drupal will have stored an additional 400 records, because each translation's change gets stored as a new revision, and each of those new revisions copies over the records for every language.
So if you then make 49 more changes, and translate each of those changes into each language, you'll end up with 1000 revisions and 20,000 records.
Comment #8
longwave commented:
Some more thoughts on this after sleeping on it:
Comment #9
nicxvan commented:
Ah, I was missing that detail. This will be a great change.
Will this have any impact on entity reference revisions? Paragraphs already get weird with translations and revisions.
Comment #10
longwave commented:
@nicxvan this should be suitable for any large field in any entity type that stores duplicated data, so it will help reduce database size with Paragraphs as well as Experience Builder.
Comment #11
joachim commented:
> Because even though any given revision only edits one translation at a time, the revision table stores records for every language for each revision
What's the reason for the translation/revision system doing this? Is it documented anywhere?
Comment #12
effulgentsia commented:
Not sure if/where it's documented, but I think the reason is so that every revision is complete. For example, to be able to visit en/node/1/revisions/1/view for any language and any revision ID and be able to display that by only loading the data for that revision, regardless of whether the language you're viewing it in matches the language that was edited by that revision.
Comment #13
nicxvan commented:
Great to hear this will help there too!
Comment #14
joachim commented:
> Not sure if/where it's documented, but I think the reason is so that every revision is complete
Instead of fiddling at the storage level, we could say that when you load revision N, if it doesn't have a particular language at that revision, you get the largest revision ID of that language < N.
Though I don't actually think it's very useful for editors to have the system the way it is. If you're the content editor working in German, for instance, your view of your translation is littered with pointless identical revisions. Why should you care if your colleague in the French office made 6 edits to the French translation?
Comment #15
catch commented:
Hashing: xxHash is probably what we should use these days - non-cryptographic but collision resistant.
On the overall reason why the storage is like this: like most things, it evolved over time and happened somewhat by default. We started off with translations in different entities altogether (D6) and the tnid column. Drupal 7 added language to field tables, but without an entity translation UI or support for base fields in core. Drupal 8 added the rest.
When we load and save an entity, we load and save the entire entity, not a slice of it. We would have had to implement per-language revision storage on top of all of the other things to make entity translation work, and I don't think it really came up.
This would need to work for entity revision queries and views too, including revision queries across different languages. It's likely to result in more complex and hard to index queries than what's being discussed here. e.g. here we're looking at an extra join on fields that are rarely included in an entity query. I think selective revision saving/loading would require MAX(vid) for every language or something. Imagine a condition on a field value in any language, or querying with both a translatable field value and a non-translatable field value as conditions. Maybe it could be simplified if the latest revision holds all languages, but then there is workspaces too.
An advantage of the approach here is it's isolated to field types that are rarely queried, and it would work for monolingual sites where the body is large and the only change is publishing/unpublishing or adding a term reference or similar.
Comment #16
longwave commented:
Back to my thoughts in #8:
Comment #17
longwave commented:
Some test failures around installed config; while I sort those out and add new test coverage, marking as NR to see if anyone has any early comments on the approach so far.
Will also need an upgrade path for existing config if we go this route.
Comment #18
nicxvan commented:
Is this trading runtime efficiency for storage efficiency?
I'm concerned this makes saving and loading fields slower for the majority of sites, due to hashing revisions. It also seems it will make querying revisions more complex.
I have a client with 10 languages and many nodes. When I shared this with them, they raised concerns about the performance trade-off, since CPU is more expensive than storage.
Comment #19
effulgentsia commented:
It's opt-in: a per-field configuration that defaults to false. So on each site you'll be able to choose which fields, if any, to apply it to.
For fields that opt into it, it adds an indexed JOIN to loading and querying revisions.
For fields that opt into it, it adds having to compute a hash for every field item for every translation when writing a revision (saving an entity). It's possible that this could be expensive (and we might want to consider future optimizations to track which items/translations are dirty and need re-hashing), but xxHash is very fast (so dirty tracking might not be needed).
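For illustration, the per-item hashing at save time could look roughly like this (a hedged sketch, assuming $node is a loaded content entity; using serialize() to get a stable byte string is just illustrative, and real code would live in the SQL storage handler):

```php
<?php
// For every translation and every field item, compute a hash of the item's
// stored properties; the revision table row would store only this hash.
foreach ($node->getTranslationLanguages() as $langcode => $language) {
  $translation = $node->getTranslation($langcode);
  foreach ($translation->get('body') as $delta => $item) {
    // xxh128 needs PHP 8.1+.
    $hash = hash('xxh128', serialize($item->getValue()));
    // ... write $hash to the revision table, and the full values to the
    // interned table only if this hash hasn't been seen before.
  }
}
```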
Comment #20
effulgentsia commented:
Just to set expectations, this might help some with Paragraphs but probably not as much as it will help with Experience Builder. For Paragraphs, it will help with individual fields that have long content (if those fields opt into it, per #19), such as text_long fields that content editors populate with a lot of text. However, for some sites, Paragraphs don't necessarily have a ton of content in any one field, they just have a lot of fields, and in those cases, the work in this issue won't help much with that.
Because Experience Builder will combine values for all fields within a component into a single JSON field or Field Union field, this issue will help with the "lots of fields with small content" use case as well as the "single fields with large content" use case.
Comment #21
effulgentsia commented:
I agree with 128-bit rather than 64-bit. https://en.wikipedia.org/wiki/Birthday_attack has a table that shows that with 64 bits, you'd have more than a 1 in a million chance of collision once your table grows above 6 million records.
However, instead of a 32 character hex string, we could reduce it to a 24 character base64 string.
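For illustration (assuming PHP 8.1+, which added the xxh128 algorithm to hash()): a 128-bit hash is 32 hex characters, or 24 characters when its 16 raw bytes are base64-encoded.

```php
<?php
$value = str_repeat('example body text ', 300);       // any large field value
$hex = hash('xxh128', $value);                        // 32-character hex string
$b64 = base64_encode(hash('xxh128', $value, TRUE));   // 16 raw bytes => 24-character base64 string
var_dump(strlen($hex), strlen($b64));                 // int(32), int(24)
```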
Comment #22
nicxvan commented:
That resolves our concerns!
I missed: https://git.drupalcode.org/project/drupal/-/merge_requests/9417/diffs#83...
Comment #23
smustgrave commented:
Seems the MR needs a manual rebase.
Comment #24
catch commented:
I think we should consider whether there might be better savings from using native MySQL compression, e.g. from #2857359: Make MySQL ROW_FORMAT configurable - we could potentially set that by default for new sites. Or whether both MySQL compression and this issue would combine to mean even greater space savings.
Comment #25
pere orga commented:
#2857359: Make MySQL ROW_FORMAT configurable is nice because it doesn't increase complexity in Drupal. But because MySQL compression is applied per page (typically 16KB, though it could be smaller), space would be saved only when the repeated data is located within the same page.
Also, it wouldn't reduce the size of database backups.
(And of course, it's MySQL-specific)
Comment #27
andyf commented:
Thanks all! Rebased against 211722d90c2fdedb17308f4e7999275c2b3df495.
I tested it out locally on a site with some large revisions (max 5MB) but is otherwise simple: one language, only the (single-value) body field gets large, < 20k nodes. I saw the dump size reduce to ~50% of its original size (uncompressed), and the MySQL (5.7) DB data directory fall to about 68% of the size, measured immediately after doing a fresh DB import.
Comment #28
smustgrave commented:
Believe this still needs an upgrade path. Also tagging for a change record to explain what the new key is doing.