I'm not too fond of the way we currently handle the 'sanitize' step (where field module should run check_markup() and stuff if needed) - currently on field_attach_view() , before building the render arrays in field_default_view().
It might be better suited as a pre_render step.

Leaving this as a reminder for further pondering.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

yched’s picture

We might want to piggyback what comes out of #369011: Performance: do check_markup() in text_field_load()

jenlampton’s picture

Title: move field data 'sanitize' to #pre_render ? » move field data sanitizeation out of hook_field_load
Category: Task » Bug report
Issue summary: View changes
FileSize
509.93 KB

It looks like the sanitization got moved all the way up into hook_field_load() which is way too early, and is causing me some serious recursion and performance headaches for lots of different scenarios.

Here's my use case: I have a site that's using "shortcodes" (read: custom input filter) to render various different things inside the body field of a node. These things can be single field off a same node (inline images) a collection of fields off the same node (image / credit / caption), or sometimes, a teaser for another node on the site, like so:

Scenario 1

Only a single node involved, but pulling content from a second field into the body also causes headaches.
As this node A is being loaded, here's what happens:
* node_load(node A)
* add fields for node A
* run check_markup on body field for node A
* * replacement pattern calls node_load(node A)
* * add fields for node A
* * run check_markup on body field for node A
* * * replacement pattern calls node_load(node A)
* * * add fields for node A
* * * run check_markup on body field for node A
etc, etc, (recursion)

Since this node has not yet FINISHED loading by the time the second node_load() is called, there is nothing in the cache yet, so, this causes infinite recursion. I added a recursion-detection stop into the custom input filter, but now we have another problem. The first time the node is loaded, recursion is detected and rendering is aborted. The embedded content does not appear in the node body. Reload the page a second time, and the embedded content appears in the body. Clear the cache: broken. Reload page: fine. There's no way we can get the embedded content to appear in the first page load! :(

Scenario 2

If two nodes reference each other via shortcodes, we also have a problem. The first node (call it node A) often references a second node (call it node B) and commonly, node B also references node A back.

As this node A is being loaded, here's what happens:
* node_load(node A)
* add fields for node A
* run check_markup on body field for node A
* * replacement pattern calls node_load(node B)
* * add fields for node B
* * run check_markup on body field for node B
* * * replacement pattern calls node_load(node A)
* * * add fields for node A
* * * run check_markup on body field for node A
* * * * replacement pattern calls node_load(node B)
etc, etc (recursion)

since neither node has FINISHED loading yet by the time the filters are run this causes infinite recursion.

The solution here seems simple: don't filter the content on load.

To me, this current order of things is architecturally a little bonkers - "load" is supposed to mean "get the thing out of the database". Sanitization should happen much later in the process, as part of node_view() or maybe even in prepare. Swapping around the order of things to get a performance gain seems a little like cheating.

In my case, the only valid work-around was to disable the cacheability of my filter. This is what the previous developer on this site had done, and the site was suffering from extreme performance problems because of it. The site loads upwards of 10 nodes per (node) page, and without being able to cache *any* of the filtered output - that's very expensive.

jenlampton’s picture

Issue summary: View changes
jenlampton’s picture

Title: move field data sanitizeation out of hook_field_load » remove field data sanitization from hook_field_load
Status: Active » Needs review
FileSize
1.94 KB

Attached is a patch that removes the early sanitization. So far my testing hasn't revealed any issues caused by this. I'll keep testing on my production site and report back.

jenlampton’s picture

I just read through all of #369011: Performance: do check_markup() in text_field_load() and though I agree with a 10% performance gain on principle, I disagree with moving view-logic into the load-logic in order to get it. That's how we end up with bugs like this.

I would love to hear some feedback from @catch @yched or anyone who participated in #369011: Performance: do check_markup() in text_field_load() here, now that we know what kinds of problems this is causing.

yched’s picture

That seems like a drastic change for D7 at this point :-/
Maybe there would be a way to avoid the recursion by deferring the invocation of hook_field_load() until after all field values have been read from the database ?

jenlampton’s picture

Isn't the point of hook_field_load() to read field values from the database?

yched’s picture

No, the storage layer is in charge of loading the values. hook_field_load() runs after that, and allows the field type to do extra massaging or load additional stuff on top of the raw values loaded from the db.

For example, for D7 file fields, file_field_load() receives the stored $item['fid'], and adds the info from the {file} records.

jenlampton’s picture

If running the sanitization in *_load() is already too early, how does having another earlier phase (storage layer) help here? I'm happy to try another approach, but I'm not sure I understand how to do that just yet.

jurgenhaas’s picture

I came across this issue on a customer site that has some pretty expensive filters (highlighting terms and phrases from a big taxonomy) which is extremely slowing down the performance of the site. Even worse if nodes get exported where a huge number of nodes need to be processed all at once.

It comes down to the sanitization in the text module and I'm not getting it why _text_sanitizes() calls check_markup() without setting $cache to TRUE. The $cache variable in the signature of check_markup defaults to FALSE and hence, the filter result of text fields is never cached, right?

I was looking around for other field loaders or sanitizers in core and contrib and they all seem to be setting cache to TRUE. Why is the text module of core not doing it?

In my use case it accelerates the site massively and I checked a lot of other things and it doesn't seem to break anything. Even from a logical point of view it seems that this is the correct way of handling it.

Am I wrong or am I missing something?

jenlampton’s picture

Patch in #4 still applies cleanly to Drupal 7.58

@jurgenhaas I think your suggestion is a good one, but is perhaps unrelated to this issue. Here we're talking about problems with running the sanitization (and token replacement) before nodes are finished loading, causing incomplete replacements since the data we need hasn't been created yet. I think you might want to open a new issue about the cache flag.

jenlampton’s picture

Patch in #4 still applies cleanly to Drupal 7.59.

amit.drupal’s picture

Patch in #4 is working fine.

jenlampton’s picture

Patch in #4 still applies cleanly to Drupal 7.63.

jenlampton’s picture

Patch in #4 still applies cleanly to Drupal 7.69, through 7.94.