Problem/Motivation

We have an entity with *a lot* of usages. 56 pages of them.

Entity usage currently loads all the usages and only pages them in the UI, which is super slow and uses a ton of resources.

Proposed resolution

Find a way to do paging in the query. I still believe it would make sense to group usages per entity which might or might not make that easier :)
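
As a rough illustration of what "paging in the query" could mean, here is a sketch only, assuming the entity_usage table keeps its current column names and with $target standing in for the entity whose usages we list:

    use Drupal\Core\Database\Query\PagerSelectExtender;

    // Let the database page the usage records instead of loading all of them
    // and paging only in the UI.
    $query = \Drupal::database()->select('entity_usage', 'e')
      ->fields('e', ['source_type', 'source_id', 'source_langcode', 'source_vid'])
      ->condition('e.target_type', $target->getEntityTypeId())
      ->condition('e.target_id', $target->id())
      ->orderBy('e.source_type')
      ->orderBy('e.source_id', 'DESC');
    $rows = $query->extend(PagerSelectExtender::class)
      ->limit(25)
      ->execute()
      ->fetchAll();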

Remaining tasks

User interface changes

API changes

Data model changes

Comments

Berdir created an issue. See original summary.

marcoscano’s picture

I think the paging in the query might be tricky because records in the DB table are currently indistinguishable with respect to what we want to show in the UI (e.g. only show sources that have a label (excluding orphan paragraphs), indicate whether the source is a default or past revision / translation, skip sources that don't exist, etc.). Doing all this as part of the query might be challenging.

Maybe caching could reduce the severity of the issue; the last time we tweaked the UI I opened #2985265: Cache viewable results when generating usage page for that. It won't solve the slowness the first time the list is calculated, though.

Another option is to start storing in the DB table whether the usage record comes from a default revision of the source entity or not. It would mean adding complexity to keep this info always up to date, but it would likely make the logic for a paged query easier. However, it wouldn't solve skipping sources without a label (orphan paragraphs) or indicating whether the usage is only in a non-default translation...

Another crazy idea is to do all the \Drupal\entity_usage\Controller\ListUsageController::getRows() calculation on a periodic background process (cron), and cache the results.

Thoughts?

Berdir’s picture

The most obvious problem is that the label/link building as well as access checking happens *before* paging. That means it might resolve things for *many* pages but then only show a small subset. Fixing that alone should already *massively* help with the performance.

And I'm honestly not happy about how the list works, but not going to complain about something when I don't have time to do it better :) Maybe, some day...

marcoscano’s picture

This affects several issues, but I'm posting here since I believe this is the most important thing to solve.

I've given this some thought from the perspective of solving things in this order:
1) Refactor the writing of records to DB, so that we don't issue several DB queries and events for every relationship we want to track. This would solve the performance problem mentioned in #3002332: Track composite (e.g. paragraph) entities directly and only as their host.
2) The above would then set the foundations for skipping paragraphs, if necessary/wanted (the other part of #3002332: Track composite (e.g. paragraph) entities directly and only as their host ).
3) With that in place, we would be able to think about how to improve the usage list controller (this issue). For example, if there are no paragraphs in the equation, things get a lot easier when deciding what to show there.

However, while exploring approaches for the refactor in 1), I'm having a hard time figuring out a reasonable way to implement it. Basically, since usage is recorded when the source entity's hooks are triggered, we are storing relationships in a top-down approach. However, it seems to be quite tricky to do the same and cleanly calculate (potentially nested) descendants. In other words, while it's easy for us at display-time to "get the first non-paragraph label" going bottom up (because we have $entity->getParent()), it isn't straightforward to figure out "all non-paragraph descendants" when writing things to the DB, since this would involve looping over all fields and calling all plugins again. Maybe it's possible, but I haven't seen how to get there yet... :/

Another approach that I'm starting to consider is:
- We mitigate the performance issue on node save by moving the calculation+events to the background ( #3015287: Allow usage records to be registered in background ). At this point I think we could actually make the background processing introduced there the default behavior.
- We keep registering all relationships, no matter the entities involved.
- We create a new DB table with a similar structure to the original entity_usage table, but intended to store only records that are supposed to be displayed in the UI. This way the controller can perform straight paged queries against this table, and just post-process (adding links, etc.) the records on the current page.
- Before saving records to this new table, we invoke an alter hook, so modules can alter the way the information is stored (for example, changing the parent of an entity referenced from a paragraph item, or directly preventing paragraph item records from being stored there)

The challenge here is to make sure the information is always in sync between the two tables, but maybe it's worth a try.
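
A rough sketch of how that could look; everything here is hypothetical (the entity_usage_display table and the entity_usage_display_record alter hook don't exist, they just illustrate the idea):

    // Build a display-oriented record for one relationship.
    $record = [
      'target_type' => $target->getEntityTypeId(),
      'target_id' => $target->id(),
      'source_type' => $source->getEntityTypeId(),
      'source_id' => $source->id(),
    ];
    // Let other modules re-parent the record (e.g. replace a paragraph item
    // with its host node) or empty it to prevent it from being stored.
    \Drupal::moduleHandler()->alter('entity_usage_display_record', $record);
    if (!empty($record)) {
      \Drupal::database()->insert('entity_usage_display')
        ->fields($record)
        ->execute();
    }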

Thoughts?

marcoscano’s picture

After having explored #4 for a short while, I no longer like it enough to keep pursuing it. It's too hacky, and there are several edge cases to handle.

Right now I may just fall back to a better implementation of the current approach... Instead of foreach-ing over ALL rows and then displaying just some of them, maybe we should start looping until the number of visible items equals the number of items per page, and store in a query argument the position we started from on that page. I'll come back after I explore that idea a bit better.
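
Something like this, as a sketch of that idea ($all_rows is the unfiltered result set, $start is the raw offset carried in a query argument, and prepareRow() is a made-up stand-in for the label/link/access building):

    $page_rows = [];
    $position = $start;
    $total = count($all_rows);
    // Keep consuming raw rows until the page is full (or we run out).
    while ($position < $total && count($page_rows) < $items_per_page) {
      // prepareRow() returns NULL for rows we skip (orphan paragraphs,
      // sources that no longer exist, ...).
      $row = $this->prepareRow($all_rows[$position]);
      if ($row !== NULL) {
        $page_rows[] = $row;
      }
      $position++;
    }
    // $position is the offset the "next page" link would carry.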

marcoscano’s picture

More brainstorming; a new idea came up:
- Defer all usage calculation to a background process (using DestructableInterface)
- Add some new columns to the DB so we can store the concept of a "top-level source entity". Each row would contain information about its "immediate" source entity _and_ its "top-level" source entity. We can only do that in a background process, otherwise the top-level entity is not yet saved when the descendants are being saved.
- Change the usage page so that:
-- We group usages per "top-level entity type"
-- We no longer display the field information on this first page
-- We add a new "More information" (or something like that) link that allows users to go to a separate page where more information (including fields, etc.) can be shown, but this time only for that referencing source

This would allow us to join the top-level source entity table and exclude past revisions, so direct paged queries could become feasible.
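
As a sketch of the kind of query that would enable (the top_source_* columns are the hypothetical new columns, and the join assumes nodes as top-level sources, using the node base table to keep only default revisions):

    $query = \Drupal::database()->select('entity_usage', 'e');
    // Only keep rows whose stored top-level revision is the node's current
    // default revision.
    $query->innerJoin('node', 'n', 'n.nid = e.top_source_id AND n.vid = e.top_source_vid');
    $query->fields('e', ['top_source_type', 'top_source_id'])
      ->condition('e.target_type', $target->getEntityTypeId())
      ->condition('e.target_id', $target->id())
      ->condition('e.top_source_type', 'node')
      ->groupBy('e.top_source_type')
      ->groupBy('e.top_source_id');
    $sources = $query->extend('Drupal\Core\Database\Query\PagerSelectExtender')
      ->limit(25)
      ->execute()
      ->fetchAll();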

marcoscano’s picture

New iteration:

- We actually don't need to add anything new to the DB: if we assume we don't care about intermediate (paragraph-like) relationships (as suggested in #3002332: Track composite (e.g. paragraph) entities directly and only as their host ), we can assume all sources are "top-level" and all targets are "bottom-level". This would actually _remove_ columns (method, field_name and count).

I'm seriously considering radically simplifying things in a more opinionated, but hopefully more performant, approach in a new branch of this module. If we do that, not only this issue but also several others can be fixed at once. I've opened #3060802: Refactor module architecture in a simpler, opinionated and more performant approach to explore these new ideas.

Berdir’s picture

On holiday, just one comment on the DestructableInterface thing: note that it only runs after the response on PHP-FPM; other environments need to wait for it.

marcoscano’s picture

Posted a first POC of this change in #3060802-2: Refactor module architecture in a simpler, opinionated and more performant approach , will continue to update that issue with the progress.

just one comment on the DestructableInterface thing: note that it only runs after the response on PHP-FPM; other environments need to wait for it

Thanks for the heads up! One thing is not entirely clear to me though: does this mean that on non-PHP-FPM environments this would break/not work, or that the request would just need to wait until the service finishes its execution? If it's the latter, I think it doesn't really make things worse than they are now for those environments (while it still improves the experience where it does run in the background), right? If things don't work as expected though, then yes, we would need to reconsider that part.

Thank you!
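
For reference, a minimal sketch of the destructable-service pattern being discussed; the class and method names here are made up (the real work happens in #3015287), and the service would be tagged with needs_destruction in the module's services.yml:

    namespace Drupal\entity_usage;

    use Drupal\Core\DestructableInterface;

    class DeferredUsageTracker implements DestructableInterface {

      /**
       * Entities queued for usage (re)calculation during this request.
       */
      protected array $queue = [];

      public function add(string $entity_type_id, $entity_id): void {
        $this->queue[] = [$entity_type_id, $entity_id];
      }

      /**
       * Runs after the response on PHP-FPM; other SAPIs wait for it.
       */
      public function destruct(): void {
        foreach ($this->queue as [$entity_type_id, $entity_id]) {
          // Recalculate and write the usage records for this entity here.
        }
      }

    }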

marcoscano’s picture

I've updated the issue summary of #3060802: Refactor module architecture in a simpler, opinionated and more performant approach with the plan of action there. If that idea moves forward, we would probably close this issue as won't fix in the 2.x branch.

Debasish147’s picture

I have updated the pagination logic for the 2.x branch by following the logic from #3060802: Refactor module architecture in a simpler, opinionated and more performant approach, which is in the 3.x branch. It enhances pagination performance for the 2.x branch. Before applying this patch, we keep one record per entity and keep records nicely up to date on update/delete by following the approaches discussed in the links below:
Lock wait timeout exceeded
Updating existing revisions doesn't purge previously created usage
This patch and 3007752-13.patch were created from 8.x-2.0-beta2, not 8.x-2.x-dev, and failed. But 3007752-12 was created on 8.x-2.x-dev and it passed.

Debasish147’s picture

Attached a file (18.84 KB).

Debasish147’s picture

Attached a file (18.21 KB).

Debasish147’s picture

Attached a file (18.77 KB).

This is working for beta-3

weekbeforenext made their first commit to this issue’s fork.

weekbeforenext’s picture

Focusing specifically on making the Usage tab more performant without all of the changes from #3060802: Refactor module architecture in a simpler, opinionated and more performant approach, I created the MR from scratch.

Refactored EntityUsage.php, listSources():

  • Added the ability to query the count of unique source records (the total number of records used for paging) via an optional count parameter.
  • Added the ability to query all usage records for a page of unique source records via an optional offset parameter.
  • There is now a secondary query (called $sub_query) that selects the unique sources, and a where() was added to the main query to restrict it to those sources while retaining all of their records.

Refactored ListUsageController.php:

  • Added an entity variable to prevent loading the entity multiple times if possible.
  • Changed getRows() to getPageRows() and refactored calls to the listSources() method to take advantage of the changes there.
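
A loose sketch of the subquery approach described above (not the actual MR code; $target, $offset and $limit are assumed inputs, and an inner join is used here where the MR adds a where()):

    $connection = \Drupal::database();

    // One page of unique (source_type, source_id) pairs.
    $sub_query = $connection->select('entity_usage', 'eu');
    $sub_query->fields('eu', ['source_type', 'source_id'])
      ->condition('eu.target_type', $target->getEntityTypeId())
      ->condition('eu.target_id', $target->id())
      ->groupBy('eu.source_type')
      ->groupBy('eu.source_id')
      ->orderBy('eu.source_type')
      ->orderBy('eu.source_id', 'DESC')
      ->range($offset, $limit);

    // All usage records, restricted to the sources on that page.
    $query = $connection->select('entity_usage', 'e');
    $query->join($sub_query, 'page_sources', 'e.source_type = page_sources.source_type AND e.source_id = page_sources.source_id');
    $query->fields('e')
      ->condition('e.target_type', $target->getEntityTypeId())
      ->condition('e.target_id', $target->id());
    $records = $query->execute()->fetchAll();
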
weekbeforenext’s picture

Status: Active » Needs review
marcoscano’s picture

Thanks for working on this!

My initial thoughts:

- About the $count parameter:

Instead of bundling this into ::listSources(), wouldn't it be better to have a specific method for this? Something like ::listUniqueSourceCounts(), or better naming...

- About the $offset parameter:

I think we should either expand its docblock a lot and explain everything that happens when this parameter is passed in, or (even better) create a new method for this. I'm thinking something along the lines of ::listSourcesNextPage(). Or is there a reason this needs to be inside ::listSources()?
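
Roughly what I'm picturing, with illustrative signatures only (the interface name, return types and parameters are guesses):

    namespace Drupal\entity_usage;

    use Drupal\Core\Entity\EntityInterface;

    interface EntityUsagePagedQueriesInterface {

      /**
       * Counts the unique source entities referencing a given target.
       */
      public function listUniqueSourceCounts(EntityInterface $target_entity): int;

      /**
       * Returns the usage records for one page of unique source entities.
       */
      public function listSourcesNextPage(EntityInterface $target_entity, int $offset, int $limit): array;

    }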

- About:


    $total = $this->getRowsCount();
    if (!$total) {
      return [
        '#markup' => $this->t('There are no recorded usages for entity of type: @type with id: @id', ['@type' => $entity_type, '@id' => $entity_id]),
      ];
    }

Unless I'm reading this wrong, this empty-message would be shown only if there are really no sources, right? In the previous implementation, I think it showed up if there were no "real" usages (in other words, if a media item is used only in an orphan paragraph, or in an entity where $this->getSourceEntityLink($source_entity) returns empty, that usage would be discarded), which doesn't seem to be the case in this new approach. (This probably reveals we have missing test coverage for this :P, unless I'm misinterpreting it.)

- Related to the above, the fact that we are changing how ::getRows() and ::getPageRows() work seems to potentially cause issues in those edge-case scenarios as well. The main complexity of this code is that we want to provide a "full" page after taking out of the list:
* Source entities that don't exist or can't be loaded for any reason (I know, very edge case, but...)
* Source entities that can't provide a link (this is less edge-case, mostly happens with orphan paragraphs)
Paging at the query level means that if we do this cleanup later, we might end up with a page with fewer items than we think (or even a page with no rows while we think there are some, because the query returned non-empty).

I understand these are arguably edge cases that are less important than having a huge performance issue, but one could also argue that not all sites have the performance issue and they do benefit from us handling these edge cases as well... I'd like to think more about this trade-off...

Another (radically different) option could be to cache the result of our expensive calculation of the initial list:

    if (!empty($this->allRows)) {
      return $this->allRows;
      // @todo Cache this based on the target entity, invalidating the cached
      // results every time records are added/removed to the same target entity.
    }

We would need to make sure the cache is invalidated appropriately, and it wouldn't solve it for the first time it's requested (cold cache), but all subsequent requests would probably be very fast.
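
A sketch of that caching idea; the cache tag name and the buildAllRows() helper are made up, and the invalidation would live wherever usage records are written:

    use Drupal\Core\Cache\CacheBackendInterface;

    $cid = "entity_usage:usage_page:{$target_type}:{$target_id}";
    if ($cached = \Drupal::cache()->get($cid)) {
      $this->allRows = $cached->data;
    }
    else {
      // The expensive full calculation, done once per target until invalidated.
      $this->allRows = $this->buildAllRows($target_type, $target_id);
      \Drupal::cache()->set($cid, $this->allRows, CacheBackendInterface::CACHE_PERMANENT, [
        "entity_usage_target:{$target_type}:{$target_id}",
      ]);
    }
    // Wherever records are added/removed for this target:
    // Cache::invalidateTags(["entity_usage_target:{$target_type}:{$target_id}"]);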

weekbeforenext’s picture

Updated MR17 to separate the new "subquery" solution from the original. There is now a configuration setting, usage_controller_subquery, that enables the use of the subquery logic, and I copied tests to verify the results match for the new option. The new setting is false by default.
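
For anyone who wants to try it, a settings.php override is probably the quickest way to turn it on (assuming the setting lives in the module's entity_usage.settings config object):

    // settings.php: opt in to the paged subquery code path (off by default).
    $config['entity_usage.settings']['usage_controller_subquery'] = TRUE;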

recrit’s picture

This update looks promising for improving the performance of the page. The changes in the MR will still have the translation issue noted in #3304304: Usage list page does not show translations.

thomas.fleming made their first commit to this issue’s fork.

partyka made their first commit to this issue’s fork.