Problem/Motivation

If you need to load a lot of entities and iterate over them (pretty common), your memory usage could easily go through the roof.

For example, consider updating all nodes for some reason is a post_update hook. You want to load them all and save them, but loading all 10000000000 nodes will kill your machine. You would need to load them in batches to keep memory usage at a sane level.

This could also affect core directly, as we may need to update content (or config) entities in the upgrade path, but have no way of knowing how many items a site will have. This could cause problems and easily balloon over the PHP requirements we set to run Drupal.

Proposed resolution

Introduce an EntityIterator class that uses a generator to break the IDs to load into chunks. Then load just a set amount each time, clearing the static cache before each load so it only contains the items currently being iterated over.

We can get clever and use a generator. This internally creates an iterator so we can implement \IteratorAggregate and just return our generator!!

Remaining tasks

Add some tests

User interface changes

N/A - Developer only

API changes

Addition of new Iterator

Data model changes

N/A

Comments

damiankloip created an issue. See original summary.

dawehner’s picture

Love the idea! When we have a paging generator for the query we can achieve quite some stuff.

damiankloip’s picture

Issue summary: View changes
damiankloip’s picture

Status: Active » Needs review
Issue tags: +Needs tests
FileSize
2.17 KB
PASSED: [[SimpleTest]]: [PHP 5.5 MySQL] 113,490 pass(es). View

This works pretty well initially.

dawehner’s picture

+++ b/core/lib/Drupal/Core/Entity/ChunkedIterator.php
@@ -0,0 +1,80 @@
+        yield $id => $entity;

generator are everywhere!

damiankloip’s picture

search api has a use case for this as well. e.g. try to start tracking a certain entity bundle with 50.000 entiites. good luck with that

- webflo (IRC) 2015

dawehner’s picture

Category: Feature request » Task

IMHO this is at least a task

bojanz’s picture

Great idea!

webflo’s picture

I found a related project in D7 contrib https://www.drupal.org/project/entity_process_callback

catch’s picture

Should be useful for CMI config entities as well as config ones. I wrote an iterator for 6.x Views at #853864: views_get_default_view() - race conditions and memory usage back in 2011 which still hasn't been committed :P

We've fixed the bulk of the issues in Views that patch was trying to solve, but for config updates we'll still need to load all and save, and there are potentially places like field info it could help as well.

I think this get more and more useful as we start to move away from batch API. For CLI you want to do everything in one process, and ideally wrap large operations like this in a transaction. So if we can write code that works in long running processes correctly like this, then use batch only to handle timeouts and user feedback, we'll be in a much better state overall.

catch’s picture

damiankloip’s picture

Issue tags: -Needs tests
FileSize
7.3 KB
PASSED: [[SimpleTest]]: [PHP 5.5 MySQL] 113,497 pass(es). View
5.13 KB

With some quick unit tests.

Wim Leers’s picture

  1. +++ b/core/lib/Drupal/Core/Entity/ChunkedIterator.php
    @@ -0,0 +1,80 @@
    +class ChunkedIterator implements \IteratorAggregate, \Countable {
    

    I wonder if there is a common term for this in the world of software outside of Drupal?

  2. +++ b/core/lib/Drupal/Core/Entity/ChunkedIterator.php
    @@ -0,0 +1,80 @@
    +   * @param int $cache_limit
    ...
    +    $this->cacheLimit = (int) $cache_limit;
    

    Wouldn't something like chunkSize make more sense?

damiankloip’s picture

FileSize
7.37 KB
PASSED: [[SimpleTest]]: [PHP 5.5 MySQL] 113,871 pass(es). View
1.75 KB

We can see if there is a more widespread name for this kind of iterator. Here is the property rename for now - makes sense to me.

catch’s picture

I did a quick google for 'chunked iterator' and there at least people using that term on the internet, especially the last one is exactly the same issue we're dealing with here.

http://stackoverflow.com/questions/7926908/chunking-an-iterable

https://vinaybalamuru.wordpress.com/2012/05/14/basic-chunked-list-iterator/

I didn't send much time looking for other terms, worth doing but it wouldn't only be us at least if we stick with chunked iterator.

dawehner’s picture

What I don't understand is why we need an iterator at all. Can't we just make a function which has the yield statement, aka. acts directly as generator and be done with it?
This generator would then implement \Traversable which is mostly all we need? Countable could then be something on top of that.

damiankloip’s picture

If we use an iterator we can inject the storage. Otherwise aren't we in \Drupal::entityManager hell? I would rather have it this way I think... maybe I don't understand what you're saying completely.

damiankloip’s picture

Oh yeah, I guess we could just pass it as another parameter. count() is then awkward though, as you would need to do all the actual loading etc.. Plus I think adding functions is kind of against what people are expecting now. An iterator class will feel more at home I think?

dawehner’s picture

Status: Needs review » Reviewed & tested by the community

Well sure, let's get this in, its progress!

I think you should be able to achieve the same by using a class with a method in which you yield all the things.

xjm’s picture

Version: 8.0.x-dev » 8.1.x-dev

As an API addition, this should now go into 8.1.x as per https://www.drupal.org/core/d8-allowed-changes#minor. Thanks @damiankloip and @dawehner!

alexpott’s picture

Status: Reviewed & tested by the community » Needs review
+++ b/core/lib/Drupal/Core/Entity/ChunkedIterator.php
@@ -0,0 +1,83 @@
+    }
...
+    // Reset any previously loaded entities then load the current set of IDs.
+    $this->entityStorage->resetCache();
+    return $this->entityStorage->loadMultiple($ids);

Would it not just be better to have a way of loading entities without static caching at all? I think it is weird to blow the entire static cache even for entities not loaded through the ChunkedIterator.

damiankloip’s picture

That would be nice. Trying to add that at this stage seems like quite a bit of work though? Would we add a new method? I guess we would have to. But then any existing implementations would not implement it etc... who knows what people are already doing with entities.

This was going for just a simple helper that people can use if they like. The ChunkedIterator could just reset the cache of the items it loaded..

dawehner’s picture

Here is an idea, let's clone the entity storage object in the constructor. With that we no longer clear the cache on that particular object, so it doesn't affect the actual mostly used entity storage object.

damiankloip’s picture

I like that idea, let's try that.

damiankloip’s picture

FileSize
7.47 KB
1.69 KB

Also tested manually, seems to work fine.

aheimlich’s picture

\Drupal\Core\Entity\ContentEntityStorageBase::resetCache() clears both the static cache and the persistent cache. I can understand the iterator wanting to clear the static cache between chunks, but is clearing the persistent cache as well a good idea? I would think not, for the same reasons as #23.

damiankloip’s picture

Huh? When the hell did it start doing that? That method was always meant for clearing static caches only.

EDIT: looks like the docs support that thinking too.

aheimlich’s picture

damiankloip’s picture

Thorough. Thanks!

The docs on the interface are still completely misleading then. Which means the content entity implementation is not really playing by the rules.

It seems seriously restricting (and wrong) to not be able to clear the static cache without clearing the persistent cache.

https://api.drupal.org/api/drupal/core!lib!Drupal!Core!Entity!ContentEnt...

"Resets the internal, static entity cache."

bojanz’s picture

We should definitely have a way to only reset the static entity cache. I had no clue this resulted in a persistent cache reset.

damiankloip’s picture

Yeah, it's pretty bad IMO. A misuse of the current interface.

damiankloip’s picture

I guess it could be solved by having a whole static cache backend injected or something. That will still get awkward to invalidate though.

It's really* frustrating that all of these problems have spawned from a complete mistake in the content storage implementation. Either way I think that needs to be fixed. The current working is wrong but also makes things really inconsistent between entity implementations.

catch’s picture

So only the storage and tests themselves should have to worry about the persistent entity cache - just saving and updating field definitions etc. shouldn't need to reset caches at all. Feels like something we can straightforwardly change in a minor release at least.

So I think it's worth just deleting that code, moving it into ->save() somewhere, and seeing what breaks - probably separate issue to tackle that bit?

damiankloip’s picture

In theory that sounds workable. I'll have a look and see what breaks... :)

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.0-beta1 was released on March 2, 2016, which means new developments and disruptive changes should now be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.0-beta1 was released on August 3, 2016, which means new developments and disruptive changes should now be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.0-alpha1 will be released the week of January 30, 2017, which means new developments and disruptive changes should now be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.