Second step in making search results pages cacheable: make sure each entity's search plugin bubbles up cache tags [#2265065]

Problem/Motivation

This is blocked on #2241249: First step in making search results pages cacheable: add the associated SearchPage's cache tag. Once that lands, we could make it so that search pages are cacheable in the page cache, but special care is needed, to make sure we avoid re-running all search queries, which would negate (or at least significantly reduce) the benefit of cached search pages.

The reason that for example the cache tags for Node entity search results do not already have corresponding cache tags, is this (from NodeSearch)):

      // Render the node.
      /** @var \Drupal\node\NodeInterface $node */
      $node = $node_storage->load($item->sid)->getTranslation($item->langcode);
      $build = $node_render->view($node, 'search_result', $item->langcode);
      unset($build['#theme']);
      $node->rendered = drupal_render($build);

The rendering is happening too early: this causes HTML strings to be bubbled up rather than render arrays containing cache tags. This is wrong.

Proposed resolution

TBD

Remaining tasks

TBD

User interface changes

TBD

API changes

TBD

Comments

Comment #1

catch

he/him

English

commented 12 May 2014 at 10:32

That's related to #1166114: Manage Displays Search Results doesn't manage the display of the search results - the search results don't end up showing the results of the search result view mode. The other issue made the existing situation clearer in the UI, but eventually gave up resolving the functionality mis-match.

Comment #2

catch

he/him

English

commented 12 May 2014 at 10:45

To actually cache the query and invalidate it only on search indexing, it would be feasible to add cache tags for each individual search term and invalidate them when items are indexed.

So if I search for 'Drupal', then the result gets tagged with 'Drupal'.

Every time a node is indexed that contains the word 'Drupal', that cache tag would get cleared.

This has bad scaling implications though, since you would end up with as many cache tag invalidations as there are words in the search index - most of which would be for tags that will never actually be applied to any cached items, leading to a very bloated {cache_tags} table.

I can't currently think of a way to optimize the 'pointless cache tag invalidations' - or definitely not with the checksum implementation/for most backends.

Another option would be to tag all search result listings with a single 'search_result' cache tag (or one per search plugin), then invalidate that each time indexing happens (i.e. at the end of each queue run). Then invalidation happens with the same frequency as the search index queue is run (assuming content has been posted/updated in the meantime).

Comment #3

wim leers

Ghent 🇧🇪🇪🇺

commented 12 May 2014 at 13:14

So if I search for 'Drupal', then the result gets tagged with 'Drupal'.

That sounds very scary indeed!
But, I don't think that's quite necessary? Node search pages could be tagged with Node::getListCacheTags(), hence whenever a node is modified, all search pages would be invalidated.

Another option would be to tag all search result listings […] Then invalidation happens with the same frequency as the search index queue is run

Yes! This would be the "more optimal yet still easy to implement" option compared to what I just described. Just have a per-entity type "last search indexation" cache tag, that gets invalidated at the end of each indexation run.
That results in this sane sounding end result: "search result pages for an entity type can be cached, until the SearchPage config entity for that entity type is modified (i.e. changed ranking factors), or until a new indexation of that entity type has occurred".

Comment #4

jhodgdon

she/her

English

commented 12 May 2014 at 15:23

So... As I noted on the other issue, really any time any node gets touched in any way (updated, deleted, added), any search results that include a keyword that either was previously or currently in that node (whether or not the node was part of the previous search results) would be invalid. Technically, the results wouldn't be invalid until this change gets reflected in the search index.

So really, any time the node_search part of the search index gets touched in any way, I think the only safe thing to do would be to invalidate all node_search search results.

Comment #5

jhodgdon

she/her

English

commented 12 May 2014 at 15:24

Also, do we need a second issue to deal with User search results, or a generic way for plugins providing search page types to deal with their own types of search results?

Comment #6

catch

he/him

English

commented 12 May 2014 at 20:55

@jhodgdon, that's what I'm suggesting in #2 - except that we should do a single cache invalidation at the end of the search index queue runner (i.e. not for every indexing operation since that could lead to stampedes).

The user search plugin should never be exposed to anonymous users and hence the page cache, so sorting out cache tags for that could happen in a non-critical follow-up I think.

Comment #7

catch

he/him

English

commented 12 May 2014 at 21:13

I thought for a moment that this issue didn't need to be critical, because in 7.x search indexing does not invalidate the page cache so there'd be zero regression.

However we do invalidate the page cache via cache_clear_all(). We must not have tests for the case of search results + page cache + content updates since otherwise the other issue would fail. I also realised there's a bug in 7.x

So 7.x:

- cache_clear_all() invalidates the page cache

- If no-one searches before the search index is updated, you're OK.

- if a search is run before the search index is updated, then you'll fill the page cache with search results using the old index. Nothing happens when the indexing run is finished.

8.x:

- unless we invalidate the 'rendered' or 'content' cache tag every content update, it'd be a regression compared to 7.x (but doing so would be a performance regression compared to CACHE_TEMPORARY - which is why the other issue is critical). Regression is only not obvious because of missing test coverage.

-if we clear the cache tag on indexing, we fix both the 7.x race condition and the regression.

Wim also reminded me in irc that this issue isn't about the search indexing at all, only the rendered search highlight stuff, so this is all step 3...

I'm not sure if the regression justifies this issue being critical - IMO we should stop treating all regressions as critical, and just use whatever issue priority they'd get if they were new issues (if there's a critical underlying problem for the regression, then it should be critical of course), but not changing that just yet.

Comment #8

jhodgdon

she/her

English

commented 12 May 2014 at 21:33

RE #6 ...

So... For one thing, non-admin users can absolutely use User search, and on some sites, anonymous users can. Anyone with "use search" and "view user accounts" permission (or whatever they're actually called) can go to search/user and search for users by user name (with wildcards allowed). Admins (administer users permission) can also search by email addresses (with wildcards allowed). This is the UserSearch plugin and it's configured when the user and search modules are enabled, by default. So basically, any time a user name changes or is added/deleted, any UserSearch searches would need to be invalidated. I think the only feasible way to do that is "Invalidate all UserSearch searches whenever any user CRUD happens.

For node searches, I agree that the main feasible way to do this is "Invalidate all NodeSearch searches whenever the node_search index is updated". A more careful, less invalidating method would be complicated -- you would have to do something like you said in #2, but also be careful when you were *deleting* stuff from the index with a particular keyword (not just adding and updating). Easier just to say "Index is updated, all searches using the index are now invalid".

As a note though... there are functions in search.module that clear out nodes from the index when they're deleted. This should also invalidate searches, not just when the UpdateIndex method is called.

Comment #9

catch

he/him

English

commented 12 May 2014 at 21:52

Anyone with "use search" and "view user accounts" permission (or whatever they're actually called) can go to search/user and search for users by user name (with wildcards allowed).

Good point!

As a note though... there are functions in search.module that clear out nodes from the index when they're deleted. This should also invalidate searches, not just when the UpdateIndex method is called.

That works for me - deletion should be rare compared to creation/updates so we could just directly invalidate.

Comment #10

jhodgdon

she/her

English

commented 13 May 2014 at 00:22

Status:

Postponed

» Active

OK. The related issue was committed, so this can be active.

I had some more thoughts... NodeSearch... there is also Advanced search, which can go on things like taxonomy term, etc. So really, unless we want to get **really** tricky (and I don't think we do), all NodeSearch search results should probably be invalidated when any node is updated or deleted, and during indexing if any node is added that hadn't been indexed before.

The point is: I really don't think that adding the individual node IDs that are displayed on the page to the cache tags really buys us much (or anything), because updates to nodes that are not displayed can also make the page invalid... unless, like I said, we want to get **really** tricky and try to cache based on keywords... but given that Advanced Search exists, I don't really think we want to go down that road.

So I think our plan in this issue should be:

NodeSearch:
- invalidate all cached search results on any node delete or update
- invalidate all cached search results when a new node is added to the index during a cron/indexing run

UserSearch:
- invalidate all cached search results if any user is added/edited/deleted

Thoughts?

Comment #11

catch

he/him

English

commented 13 May 2014 at 10:06

there is also Advanced search, which can go on things like taxonomy term, etc.

Isn't that also using the search index though? In which case clearing all of the node_search tagged entries on index would still be fine.

The point is: I really don't think that adding the individual node IDs that are displayed on the page to the cache tags really buys us much (or anything), because updates to nodes that are not displayed can also make the page invalid...

The cache tags wouldn't just be for the node IDs, but also for the author, any other reference fields etc. So for example if the node search result page is themed to show a user avatar next to search results they authored, then when the user gets updated for a new image, that would invalidate the user cache tag, which is applied to the search page. afaik user updates do not invalidate the search index for the nodes they authored.

Comment #12

jhodgdon

she/her

English

commented 13 May 2014 at 12:50

RE #11 ...

The "Advanced" search query conditions are outside the index. So hypothetically if the node is edited to change its tags, and someone had done an advanced search on the tags, that search might be invalid before we got to index time (whether or not the previous result included that particular node, because now that it's tagged differently, it might be in the search results this time).

So... I guess my point is that editing node A in any way could change which nodes should be returned in search results, hence invalidating any search (because Advanced searching uses information that is not in the search index tables), whether node A was in the results for the search or not. Also, I just remembered also that methods of *ranking* search results (ordering them) also use information that is not in the search index.

So... I guess we have a few strategies for solving this for NodeSearch (UserSearch is relatively straightforward):

a) Make a serious attempt to make sure that everything that could possibly affect search results or their display gets a cache tag:
- All nodes that are displayed as part of the results, and any other entities involved in building/displaying them (users, referenced entities, taxonomy terms, other fields of referenced entities depending on the display settings of reference fields, etc.). The NodeSearch plugin builds the node and then runs the search_excerpt function to display a highlighted subset of the node before returning that to the Search module to display on the results page, which is why the tags are not currently being carried through -- we'd need to do something about that strategy to make the tags generated during node build/display are "bubbled up", the original intent of this issue.
- All entities involved in "advanced" searching for this particular search -- e.g., if advanced searching by author or taxonomy term, the user ID or taxonomy term would need to be tagged too.
- All entities involved in the ranking part of the query. This could involve contrib modules implementing hook_ranking -- how would we handle that?

Even if we did this, all node searches would still need to be invalidated during search indexing.

b) Don't try to be so thorough, and just invalidate all node searches on node CRUD and during indexing. This would potentially miss edits to referenced entities; on the other hand, so would search indexing. (Node A references entity B. Entity B is edited, affecting the rendered display of node A. Node A should be reindexed, but the NodeSearch plugin doesn't detect this and doesn't reindex node A, so the search index is actually invalid. This problem is independent of the problem of search results page caching though, and is the reason we have the "reindex" button on the search settings page. ... Do we need to be doing something smarter to decide whether to reindex node A? Separate issue I guess...)

c) Don't cache search results at all, because too much can affect them.

Comment #13

wim leers

Ghent 🇧🇪🇪🇺

commented 13 May 2014 at 14:13

Summarizing the conversation

To summarize everything discussed so far: counterintuitively, the set of entities that match a search query does not depend only on the search index, but also on the entities themselves (i.e. the search index is solely for text matches). This means that what was assumed previously is wrong, i.e. that a rendered search results page depends on:

the entity type's search index (search_index:<entity type> cache tag)
the rendered entity's cache tags (<entity type>:<entity ID> plus cache tags of rendered fields, e.g. entity author cache tag, entity author picture file cache tag, entity taxonomy terms cache tags …)

Instead, the full list of dependencies is this:

the entity type's search index (search_index:<entity type> cache tag)
the entity type's list cache tag (<entity type>s:1), because both the filtering (query match) and sorting (ranking) may depend on data not stored in the index, but on the entity itself, which is precisely the reason that each entity type has a "list" cache tag
the rendered entity's cache tags (<entity type>:<entity ID> plus cache tags of rendered fields, e.g. entity author cache tag, entity author picture file cache tag, entity taxonomy terms cache tags …)

FYI, this truly is precisely why each entity type has a "list" cache tag. "Search results pages" are very comparable to "Views results pages", after all, regarding how it is determined which entities are rendered.
From Drupal\Core\Entity\Entity, see how the comments describe exactly the scenarios that we've been describing in this issue:

  protected function invalidateTagsOnSave($update) {
    // An entity was created or updated: invalidate its list cache tags. (An
    // updated entity may start to appear in a listing because it now meets that
    // listing's filtering requirements. A newly created entity may start to
    // appear in listings because it did not exist before.)
    $tags = $this->getListCacheTags();
    …
  }

and

  protected static function invalidateTagsOnDelete(array $entities) {
    $tags = array();
    foreach ($entities as $entity) {
      // An entity was deleted: invalidate its own cache tag, but also its list
      // cache tags. (A deleted entity may cause changes in a paged list on
      // other pages than the one it's on. The one it's on is handled by its own
      // cache tag, but subsequent list pages would not be invalidated, hence we
      // must invalidate its list cache tags as well.)
      $tags = NestedArray::mergeDeepArray(array($tags, $entity->getCacheTag(), $entity->getListCacheTags()));
      …
    }
  }

#12

a) Great point — I did not know you could search by taxonomy term or author ID. You're right that this immensely complicates things.

On filtering by term. However… if I look at NodeSearch, then, yes, I see that searching by taxonomy term is supported. If I build a URL that uses it manually, it works. But it's not exposed in the UI. Nor is it in Drupal 7. So I can only wonder how many people are actually using this?
I think it's generally much saner to keep node search about searching nodes only. It's more maintainable, it's more logical, etc.
If you disagree with removing that functionality, even then there is not really a problem; it just means that whenever a user does filter by taxonomy term, we also have to add the taxonomy term's list cache tag, that's it.

On filtering by author. Secondly, searching by author is similarly hidden in the UI, but it's information that lives on the node itself, not in some other DB table. So I think there's more of a case to be made to keep this, but here also I'd question the usefulness, since it's never been exposed in the UI all this time! In any case, no extra cache tags are necessary here, since 1) changing the node's author would invalidate the node list cache tag, 2) changing the author user itself would invalidate the user entity's cache tag, which would already be set thanks to 3 (the author is a rendered field, which automatically comes with that user entity's cache tag).

b) Wow, I didn't realize that we are indexing nodes based on the rendered output in a special view mode! I thought the indexing only covered textual fields! As long as you don't have to care about cache invalidation, that's nice, but now we do care about that all across core, so now it's problematic.
But apparently this was broken in D7 also. So fixing that is out of scope. Which means it's also ok for search results to be briefly invalid while the reindexing still needs to happen.

Conclusion

So, my take is simple, I think merely setting the cache tags of all dependencies should yield correct results:

Point 1 has to be implemented. Once node reindexing is done, all search results pages will be invalidated, and the updated index will be used. This is similar to the Drupal 7 behavior AFAICT.
Point 2 is a trivial addition, correctness of that cache tag already is guaranteed with full test coverage.
Point 3 merely is a bugfix from a Drupal 8 POV, the current code was correct in Drupal 7, but we do things slightly differently now, to leverage render caching; and here too we already have full test coverage.

P.S.

Finally… thanks to this issue, I tried the Search module in Drupal 8 and looked at it in some detail, and I'm scared about how buggy it is. If you look at the data stored in the search_dataset table, you'll see that it's storing not only actual content (good), but also HTML class attributes ("permalink"), PHP serialization ("array size 1", "string basichtml") and even completely irrelevant node "links" ("login or register to post comments").
Hence searching for "array" or "permalink" results in completely bogus search results.

This is starting to look like a release risk. It's almost certainly because we introduced regressions along the way that we didn't notice because there's not enough test coverage :(

I strongly hope I'm wrong!

Comment #14

jhodgdon

she/her

English

commented 13 May 2014 at 14:33

The fact that the advanced search by taxonomy term is currently broken in 8 is a bug (known) -- it worked in 7 and 6. We have many issues in search.module... we should not build a caching system that will break when we fix the bugs. In any case, the Advanced search mechanism and hook_ranking() definitely allow search queries to involve other tables besides the search index, and if you're going to do caching right, those need to be included.

And yes, the Search module is *very* buggy. Most of the bugs are filed, and some have been around since D5 due to lack of maintainership for Search...I'm not sure where the PHP serialization is coming from -- haven't seen that one. "permalink" is in the index because comment module is adding links -- it is not an HTML class attribute but actual text that is shown and there is an issue for that and the other comment links being shown (in other words, that is a known bug). We should file a bug about PHP serialization though, especially if you can figure out what field it is coming from. Should obviously not be there.

Comment #15

jhodgdon

she/her

English

commented 14 May 2014 at 18:25

Priority:

Critical

» Normal

Um... by the way, why is this issue marked Critical? It doesn't seem like it completely kills the ability to use Drupal or even searching, just because the search results pages are not cacheable currently.

Comment #16

jhodgdon

she/her

English

commented 14 May 2014 at 18:29

Related issues:

+#242187: Cache search results

Comment #17

catch

he/him

English

commented 14 May 2014 at 19:36

Priority:

Normal

» Major

@jhodgdon it's because once #2124957: Replace 'content' cache tag with 'rendered' and use it sparingly is committed search results in the page cache won't be invalidated at all - only when the cache entry expires. If we had test coverage for search results + page cache + indexing that issue would be failing tests.

I agree though that this does not make it critical, it is a regression from 7.x though so let's split the difference at major.

Comment #18

pwolanin commented 20 May 2014 at 14:58

So, discussed this with a call with Wim.

For me, it would be acceptable to just invalidate all cached search result pages when the search indexing process runs (usually on cron).

Comment #19

jhodgdon

she/her

English

commented 20 May 2014 at 15:33

#18 (invalidate during indexing) is I think acceptable for NodeSearch page results, given that doing it completely right would be very difficult.

For UserSearch page results, we still need to invalidate for any user CRUD, because these searches do not use the search index (it's a direct query on users table). Technically, we would only need to invalidate if the user name or email address was edited, not any other fields/properties.

Comment #20

pwolanin commented 20 May 2014 at 18:56

Well, assuming anonymous users won't have access to search by email (isn't than an admin perm) it would just be username?

Comment #21

jhodgdon

she/her

English

commented 20 May 2014 at 21:16

#20 - True, since pages are not cached for logged-in users... I assume that is still true... the UserSearch plugin only allows search by email for users with "administer users" permission, which one could assume would never be granted to anonymous users.

Comment #22

jhodgdon

she/her

English

commented 3 April 2015 at 13:54

Whoops, we forgot this was here and filed duplicate issues (two parts of this):
#2464409: Search results should bubble rendered entity cache tags and set list cache tags (to do) and #2460911: Search reindexing should invalidate cache tags (done)

Comment #23

wim leers

Ghent 🇧🇪🇪🇺

commented 13 April 2015 at 14:17

Status:	Active	» Closed (duplicate)
Related issues:		+#2464409: Search results should bubble rendered entity cache tags and set list cache tags

Let's close this in favor of #2464409: Search results should bubble rendered entity cache tags and set list cache tags. As you say, part of the scope of this issue was already done, and the other issue is more focused. So closing this one.

Second step in making search results pages cacheable: make sure each entity's search plugin bubbles up cache tags

Problem/Motivation

Proposed resolution

Remaining tasks

User interface changes

API changes

Comments

Summarizing the conversation

#12

Conclusion

P.S.

Parent issue

Related issues