Problem/Motivation

If you need to load a lot of entities and iterate over them (pretty common), your memory usage could easily go through the roof.

For example, consider updating all nodes for some reason is a post_update hook. You want to load them all and save them, but loading all 10000000000 nodes will kill your machine. You would need to load them in batches to keep memory usage at a sane level.

This could also affect core directly, as we may need to update content (or config) entities in the upgrade path, but have no way of knowing how many items a site will have. This could cause problems and easily balloon over the PHP requirements we set to run Drupal.

Proposed resolution

Introduce an EntityIterator class that uses a generator to break the IDs to load into chunks. Then load just a set amount each time, clearing the static cache before each load so it only contains the items currently being iterated over.

We can get clever and use a generator. This internally creates an iterator so we can implement \IteratorAggregate and just return our generator!!

Remaining tasks

Add some tests

User interface changes

N/A - Developer only

API changes

Addition of new Iterator

Data model changes

N/A

Comments

damiankloip created an issue. See original summary.

dawehner’s picture

Love the idea! When we have a paging generator for the query we can achieve quite some stuff.

damiankloip’s picture

Issue summary: View changes
damiankloip’s picture

Status: Active » Needs review
Issue tags: +Needs tests
FileSize
2.17 KB
PASSED: [[SimpleTest]]: [PHP 5.5 MySQL] 113,490 pass(es). View

This works pretty well initially.

dawehner’s picture

+++ b/core/lib/Drupal/Core/Entity/ChunkedIterator.php
@@ -0,0 +1,80 @@
+        yield $id => $entity;

generator are everywhere!

damiankloip’s picture

search api has a use case for this as well. e.g. try to start tracking a certain entity bundle with 50.000 entiites. good luck with that

- webflo (IRC) 2015

dawehner’s picture

Category: Feature request » Task

IMHO this is at least a task

bojanz’s picture

Great idea!

webflo’s picture

I found a related project in D7 contrib https://www.drupal.org/project/entity_process_callback

catch’s picture

Should be useful for CMI config entities as well as config ones. I wrote an iterator for 6.x Views at #853864: views_get_default_view() - race conditions and memory usage back in 2011 which still hasn't been committed :P

We've fixed the bulk of the issues in Views that patch was trying to solve, but for config updates we'll still need to load all and save, and there are potentially places like field info it could help as well.

I think this get more and more useful as we start to move away from batch API. For CLI you want to do everything in one process, and ideally wrap large operations like this in a transaction. So if we can write code that works in long running processes correctly like this, then use batch only to handle timeouts and user feedback, we'll be in a much better state overall.

catch’s picture

damiankloip’s picture

Issue tags: -Needs tests
FileSize
7.3 KB
PASSED: [[SimpleTest]]: [PHP 5.5 MySQL] 113,497 pass(es). View
5.13 KB

With some quick unit tests.

Wim Leers’s picture

  1. +++ b/core/lib/Drupal/Core/Entity/ChunkedIterator.php
    @@ -0,0 +1,80 @@
    +class ChunkedIterator implements \IteratorAggregate, \Countable {
    

    I wonder if there is a common term for this in the world of software outside of Drupal?

  2. +++ b/core/lib/Drupal/Core/Entity/ChunkedIterator.php
    @@ -0,0 +1,80 @@
    +   * @param int $cache_limit
    ...
    +    $this->cacheLimit = (int) $cache_limit;
    

    Wouldn't something like chunkSize make more sense?

damiankloip’s picture

FileSize
7.37 KB
PASSED: [[SimpleTest]]: [PHP 5.5 MySQL] 113,871 pass(es). View
1.75 KB

We can see if there is a more widespread name for this kind of iterator. Here is the property rename for now - makes sense to me.

catch’s picture

I did a quick google for 'chunked iterator' and there at least people using that term on the internet, especially the last one is exactly the same issue we're dealing with here.

http://stackoverflow.com/questions/7926908/chunking-an-iterable

https://vinaybalamuru.wordpress.com/2012/05/14/basic-chunked-list-iterator/

I didn't send much time looking for other terms, worth doing but it wouldn't only be us at least if we stick with chunked iterator.

dawehner’s picture

What I don't understand is why we need an iterator at all. Can't we just make a function which has the yield statement, aka. acts directly as generator and be done with it?
This generator would then implement \Traversable which is mostly all we need? Countable could then be something on top of that.

damiankloip’s picture

If we use an iterator we can inject the storage. Otherwise aren't we in \Drupal::entityManager hell? I would rather have it this way I think... maybe I don't understand what you're saying completely.

damiankloip’s picture

Oh yeah, I guess we could just pass it as another parameter. count() is then awkward though, as you would need to do all the actual loading etc.. Plus I think adding functions is kind of against what people are expecting now. An iterator class will feel more at home I think?

dawehner’s picture

Status: Needs review » Reviewed & tested by the community

Well sure, let's get this in, its progress!

I think you should be able to achieve the same by using a class with a method in which you yield all the things.

xjm’s picture

Version: 8.0.x-dev » 8.1.x-dev

As an API addition, this should now go into 8.1.x as per https://www.drupal.org/core/d8-allowed-changes#minor. Thanks @damiankloip and @dawehner!

alexpott’s picture

Status: Reviewed & tested by the community » Needs review
+++ b/core/lib/Drupal/Core/Entity/ChunkedIterator.php
@@ -0,0 +1,83 @@
+    }
...
+    // Reset any previously loaded entities then load the current set of IDs.
+    $this->entityStorage->resetCache();
+    return $this->entityStorage->loadMultiple($ids);

Would it not just be better to have a way of loading entities without static caching at all? I think it is weird to blow the entire static cache even for entities not loaded through the ChunkedIterator.

damiankloip’s picture

That would be nice. Trying to add that at this stage seems like quite a bit of work though? Would we add a new method? I guess we would have to. But then any existing implementations would not implement it etc... who knows what people are already doing with entities.

This was going for just a simple helper that people can use if they like. The ChunkedIterator could just reset the cache of the items it loaded..

dawehner’s picture

Here is an idea, let's clone the entity storage object in the constructor. With that we no longer clear the cache on that particular object, so it doesn't affect the actual mostly used entity storage object.

damiankloip’s picture

I like that idea, let's try that.

damiankloip’s picture

FileSize
7.47 KB
1.69 KB

Also tested manually, seems to work fine.

aheimlich’s picture

\Drupal\Core\Entity\ContentEntityStorageBase::resetCache() clears both the static cache and the persistent cache. I can understand the iterator wanting to clear the static cache between chunks, but is clearing the persistent cache as well a good idea? I would think not, for the same reasons as #23.

damiankloip’s picture

Huh? When the hell did it start doing that? That method was always meant for clearing static caches only.

EDIT: looks like the docs support that thinking too.

aheimlich’s picture

damiankloip’s picture

Thorough. Thanks!

The docs on the interface are still completely misleading then. Which means the content entity implementation is not really playing by the rules.

It seems seriously restricting (and wrong) to not be able to clear the static cache without clearing the persistent cache.

https://api.drupal.org/api/drupal/core!lib!Drupal!Core!Entity!ContentEnt...

"Resets the internal, static entity cache."

bojanz’s picture

We should definitely have a way to only reset the static entity cache. I had no clue this resulted in a persistent cache reset.

damiankloip’s picture

Yeah, it's pretty bad IMO. A misuse of the current interface.

damiankloip’s picture

I guess it could be solved by having a whole static cache backend injected or something. That will still get awkward to invalidate though.

It's really* frustrating that all of these problems have spawned from a complete mistake in the content storage implementation. Either way I think that needs to be fixed. The current working is wrong but also makes things really inconsistent between entity implementations.

catch’s picture

So only the storage and tests themselves should have to worry about the persistent entity cache - just saving and updating field definitions etc. shouldn't need to reset caches at all. Feels like something we can straightforwardly change in a minor release at least.

So I think it's worth just deleting that code, moving it into ->save() somewhere, and seeing what breaks - probably separate issue to tackle that bit?

damiankloip’s picture

In theory that sounds workable. I'll have a look and see what breaks... :)

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.0-beta1 was released on March 2, 2016, which means new developments and disruptive changes should now be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.0-beta1 was released on August 3, 2016, which means new developments and disruptive changes should now be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.0-alpha1 will be released the week of January 30, 2017, which means new developments and disruptive changes should now be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

EclipseGc’s picture

So, no movement on this issue in quite a while, but I'd very much like to see us approach this topic again as yields and generators are totally something we should be actively taking advantage of.

Phenaproxima and I were discussing this earlier today and in order to keep BC, he suggested a "streamMultiple()" method that can be used in leu of the current methods when yields are more favorable than returns. I'd also suggest a "streamMultipleByProperties()" method and just use raw entityFieldQuery with both of these and side-step all typical entity loading operations. I'm a little uncertain of the implications of the caching here, but since this use case is most beneficial in the scenario where we'd exhaust memory on the system by loading all the entities into memory in the first place, it seems beneficial to me to side step it completely if possible. Please feel free to disabuse me of that notion of it's senseless, I don't claim to understand all the nuance here.

Eclipse

phenaproxima’s picture

+1 what @EclipseGc described. The idea of a specialized, chunked entity iterator, IMHO, is utterly obviated by the existence of generators. We should take advantage of this wonderful, criminally underused feature of PHP.

I propose we add an interface that can be optionally implemented by entity storages, to maintain BC:

interface StreamingEntityLoadInterface {

  public function streamMultiple($ids = NULL);

  public function streamMultipleByProperties(array $properties);

}

Both of these methods would return generators.

dawehner’s picture

+1 what @EclipseGc described. The idea of a specialized, chunked entity iterator, IMHO, is utterly obviated by the existence of generators. We should take advantage of this wonderfully, criminally underused feature of PHP.

Did you had a look at the actual patch? This is really just using a generator, while it allows you to specify the chunk size. Putting more things onto the same object (entity the storage) makes it harder to maintain BC etc.

The tricky bit is really all about the reset of the caches, so you don't end up with infinite amount of used memory :)

damiankloip’s picture

Yeah, the current approach already uses a generator! :) A traversable instance is returned there anyway. So maybe both look at the work that's been done here already and read the previous comments. IMO it is still better keeping this functionality in a standalone class like it is currently, it is unit tested and simple to use. If we really wanted these new methods (although entities are ridiculously overloaded already), the implementation is as easy as:

public function streamMultiple(array $ids = []) {
  return new ChunkedIterator($this, $ids);
}

And you get your traversable object back to use.

As Daniel mentioned, it is indeed the caching that is the problem for core (all above too) - that is the crux of this issue, iterating is easy. Just to iterate a set of content entities, you currently need to clear the persistent cache too.... For an example workaround to circumvent the persistent cache clearing, see http://blog.damiankloip.net/2016/d8-entity-iterator that's an example of what you have to do due to the way the current 'API' works.

EclipseGc’s picture

So, at the risk of getting pedantic...

Yes I read the patch. Maybe we shouldn't lead with "did you read the patch"? It's kind of belittling.

In my comment, I never suggested that the patch wasn't using generators, just that we should see movement on this basic idea again because of the value of generators, thus the reason I even bothered to comment.

To the meat of the patch itself, my comment was really just trying to say "this is great but I feel like a simpler implementation might get quicker traction." and "Is it possible to side-step the caching completely?" (which is the conversation I'd prefer to be having) I'm not trying to start fights here, just trying to figure out what the mvp for a generator based solution to entity loading might be.

As catch explained to me in irc, loadMultiple exists for a reason, and that getting individual entities is more costly from a time perspective on long runs. I totally appreciate that. However, as Damian's own comment points out, whether we're loading one entity at a time, or 50... is sort of an implementation detail in this instance, so the least effort to get a generator of any sort into this pipe-line is probably a worthwhile effort.

If we don't want to add additional methods to the Entity Storage, that's fine. Again, implementation detail.

Eclipse

dawehner’s picture

I have a hard time extracting what you try to tell us :)

EclipseGc’s picture

lol ok tl:dr;

  • I never suggested the patch wasn't using generators, obviously it is.
  • I want to use generators asap.
  • Is this patch truly the MVP to do so? Or can it be simpler and get in faster?

Eclipse

dawehner’s picture

Well, keep in mind that in Drupal core MVP is tricky, as well, you have to support it for years. Putting some thoughts into the implementation, which @damiankloip did, is not a bad idea.

damiankloip’s picture

It would be nice to keep the iterator that we have now, as it does encapsulate the logic nicely, and then can be re-used whenever/wherever needed, in pretty much any context. It's just that darn caching. Having an implementation coupled in the controller itself could allow us to directly clear the statically cached items but then we would need new interface and more methods etc.. In an ideal world I would totally stick with the current approach.

I'll have a little play around.

catch’s picture

Reminder that I have a very old issue that would fix the caching problem here. Adding to related issues.

damiankloip’s picture

catch’s picture

Yep that issue, I tried to add it as a related issue but didn't stick, trying again.