Problem/Motivation

Drupal 7 introduced multiple loading of entities and the controller classes. This means that especially when building lists of content, the best case can be a single multiget from cache to get all the fully populated entity objects (when using entitycache module) or that all nodes are loaded with a flat number of database queries regardless of the number of nodes. This improved on the situation in Drupal 6 where each load was loaded individually, and each node_load() could execute a decent number of database queries each.

For this to work, there is a general pattern of "get the IDs, multiple load the nodes based on the IDs, do stuff with them" - this is great for content listing, however it does not work if for example you need to load 7 different nodes in seven different blocks - those still get grabbed one at a time.

For the WSCCI iniatiative (especially after the irc meeting last night and discussions in #1177246: Context values: Keys or objects?), there is a desire to have the publicly accessible context properties be either literals or objects. In the case of entities, this would be $context['node'] or $context['user'] most likely.

There is also a desire to improve block caching, so that from just context, you can build a cache key and load a block from cache (core already allows you to do this via drupal_render(), #cache, and pre_render callbacks although it is only partly formalized).

This means that if we just got an $nid from somewhere, use the $nid as part of a cache key, then get a cache hit, we won't actually need to load the node (or a 'parent' page won't need to load the node, but an ESI callback might on a cache miss in a different PHP process altogether).

But... passing around $node objects, and yet only needing $nid to generate a cache key seem mutually exclusive I hear you say!

Proposed resolution

There has been a lot of discussion about making entities into proper classes (rather than stdClass) with methods etc. Additionally, that we should keep the Load controller (and associated other controllers as they come into play) as separate classes to the actual entity class (so $entity->save() calls a method from an EntitySave class, the code wouldn't be baked into the Entity class). Details are not fully worked out yet but that is the general direction.

This means I think we should be able to construct an entity object like this:

$entity = new Entity($type, $id); // $id likely optional so that mock entities can be created, potentially new ones.

This means I can do $node = new Entity('node', $nid); $node->id (or $node->id() or whatever), and that will work fine.

What occurred to me yesterday, is there may well be a way to reconcile this with multiple load, would look something like this:

* When you instantiate the class with an $id, the $id gets added to the EntityLoad controller - which maintains a list of $ids of that type that are in use.

* (two options), when you call $entity->load(), or just access a property that is missing (triggering __get() or similar), the $entity object calls back to the EntityLoad controller to fetch the actual loaded node. I am not yet tied to the magic method vs. the explicit ->load() method for this, the internal implementation in terms of this issue could be very similar.

* When ->load() (either directly called or via magic) asks the EntityLoad controller, it inspects the list of entity IDs that it has, vs. the ones that are already loaded. At this point, it is able to multiple load all of the nodes that weren't loaded already - so they're ready for later when requested. (we could also add a method or property to disable this behaviour, allow the default to be overridden or similar).

This would give us the following:

- for lists, multiple load works more or less the same as now, except you could just foreach over $nids, instantiate the class and load it - no get $nids, load, then foreach pattern. So it should be a simpler for people who just want to load nodes and do stuff with them. The front-loading would be encapsulated in the class logic rather than a pattern that has to be copy/pasted around.

- for non lists (like multiple blocks on a page with different nodes based on different relationships), if we build the context for the blocks, then go through and execute the block handlers in sequence (either building the actual render array or rendering the string, doesn't matter for this), multiple load will work for this case too, whereas it currently doesn't.

- For sites using ESI/big pipe or similar, processes that don't need to load entities won't do so just to generate cache keys or similar.

- There are advantages for doing things like mocking the context object too (i.e. a block callback never has to call $node = node_load($nid); it just acts on a class that was passed into it), so dependency injection, consistency etc. It keeps us closer to menu_get_object() and passing parameters vs. node_load(arg(1));

- This approach should very compatible with adding an LRU cache for entities, see #375494: Restrict the number of nodes held in the node_load() static cache and #1199866: Add an in-memory LRU cache.

Remaining tasks

No patch yet, we really need to get #1018602: Move entity system to a module in.

User interface changes

None.

API changes

Probably.

Comments

Crell’s picture

Subscribing, will try to comment soon. Overall in favor of entities-as-classes.

pwolanin’s picture

I don't think we would have

$entity = new Entity($type, $id); // $id likely optional so that mock entities can be created, potentially new ones.

Instead it would need to be a factory method (function or class method).

$entity = entity_load($type, $id);

or maybe:

$node = DrupalEntity::load($type, $id);

We'd have to think about creative use of __get() (as well as __isset())? Though I'm not sure if there would be a meaningful performance hit by having to copy all the properties from a loaded object into the stub object.

catch’s picture

We'll definitely need a factory. I'm assuming entities a classes but wasn't planning for this to be that issue.

fago’s picture

Instead it would need to be a factory method (function or class method).

Agreed.
I do think we want have per-entity-type classes like Node, Comment, .., which are useful for implementing all the entity-type specifics. See the WIP code at #1184944: Make entities classed objects, introduce CRUD support. This makes lazy-loading generic entity objects impossible, though I don't think this really is an issue.

* When ->load() (either directly called or via magic) asks the EntityLoad controller, it inspects the list of entity IDs that it has, vs. the ones that are already loaded. At this point, it is able to multiple load all of the nodes that weren't loaded already - so they're ready for later when requested. (we could also add a method or property to disable this behaviour, allow the default to be overridden or similar).

I'm not a big fan of this, because it makes it extremely hard to follow/control what happens. Some simple calls accessing a property could magically trigger massive load operations, even if the lazy-load is triggered by a simple read access of a property. That way it becomes harder for developers to know what's going on and to keep control.

I'd like to avoid magic getters or methods for the same reason - all the magic makes it hard for developers to see or track what happens. I've already gone that way with faces and the magic really just made things more complicated :( It doesn't help much if someone can simply call a method or access a property if no-one knows it's there in the first place; or if people read the code and don't actually understand what happens in the back.

- for lists, multiple load works more or less the same as now, except you could just foreach over $nids, instantiate the class and load it - no get $nids, load, then foreach pattern. So it should be a simpler for people who just want to load nodes and do stuff with them. The front-loading would be encapsulated in the class logic rather than a pattern that has to be copy/pasted around.

Actually, I'd prefer the pattern. A load-call really isn't hard, but it makes it obvious to me what happens. So in the end I'd argue the load call is simpler, because I don't need to be aware of any magic that happens in the back.

Still, imho having a single or multiple lazyload could be worthwhile as it could help to avoid some unnecessary entity-loads while leading to a more consistent and simpler API (no more Node vs NodeID function args or returned variables..).

Crell’s picture

I agree we probably do want to have a class-per-entity-type rather than one uber Entity class and that's it. However, that doesn't mean we can't still have a universal factory. The universal factory just needs to be told what type of entity it's dealing with along with the id(s). That's quite easy to do if we allow ourselves to mentally separate loading of entities from "being" an entity entirely; You don't get an entity from new Entity or new Node, ever, at all. You get it from a factory that does not begin with Entity:: or Node::. There's nothing wrong with that.

(As I said in a previous DrupalCon presentation, it is not the job of Pizza to make Pizza. It is the job of Pizza to BE Pizza.)

Also, in a previous project back in Drupal 5, which actually served as much of the inspiration for the Data API Design Sprint and the Fields/Entities system that resulted, I had the concept of a "load set". I was loading data objects from a 3rd party system into my own mini-ORM or mini-Entities system. By default, all objects were loaded with just an ID, say from a search result set. When you tried accessing data from one of them, it would lazy-load data. However, to avoid the Select N+1 problem it would lazy-load from *all* objects in that load set at once.

That is, if you got a list of 20 Artwork ids from a search query, you'd get back just IDs. The code would then create 20 Artwork objects, each one with just its ID, no other data, and a load set ID that linked all 20 of them together. No traffic here to load data from the remote system yet. Then when you first accessed data that would result in a network hit, it did the equivalent of node_load_multiple() for all 20 records and populated all of them at that point. Then all Artwork objects had full data, and the code could safely proceed in ignorance of that fact without triggering an avalanche of network traffic to the legacy system. (We later added caching to it as well.)

Perhaps we could explore a similar setup here. When you create entity objects, you create a set of linked skeleton objects. They then initialize and load en masse with entity_load_multiple() (or something similar to it) on first use. That gives you the double benefit of super-thin objects by default and multi-load to avoid tons of hits to the datastore.

catch’s picture

@Everyone - please ignore the code examples, I am not proposing entity class names or factory names here. If I was, the issue would have a completely different title. We should discuss that on #1184944: Make entities classed objects, introduce CRUD support or open a new one dedicated to that discussion.

This issue is only about allow entity objects that don't contain loaded data for consistency, some kind of lazy load (either magic or explicit via a->load() method) and multiple loading. It will affect the implementations and interfaces of the classes, but not what they're called or much else either.

@Crell that's more or less exactly what I'm proposing in the OP :) with the entity load controller (or somewhere in the implementation around that level) handling the load sets internally (and a 1-1 mapping of entity type to load set).

Having the load controller track the load sets internally gives us the following:
- different entities loaded via different code paths can be multiple loaded at the same time, no need to track sets externally. I had previously been mulling over tracking sets externally somehow but currently don't feel like there's a non-ugly way to do that.
- we can reclaim the D6 pattern of just grabbing a list of ids, foreaching over them and loading them - because internally it will be just as fast as multiple loading them up front.
- we can add an LRU cache to this mechanism very easily (minor caveat about when we want to load more entities than will fit in the cache, but that seems resolvable).
- contrib could replace that LRU cache with a weak references implementation - see https://wiki.php.net/rfc/weakreferences - which are specifically designed for this pattern (amongst others). It looks like this will be a PECL extension rather than in PHP 5.4. The general idea is that you have an index of ids, which allows you to pull an object from memory, but the garbage collector can remove the objects themselves (and the API could retrieve them again if they happened to be needed).

It feels to me like the main points of contention are whether to use magic or an explicit ->load() call.

Advantages of __get():
- we could potentially keep all entity storage actually inside the storage controllers, and just have the entity instances themselves pull values from there - no actual copying of values.
- some entity properties might be fields, and the fields might themselves be class instances, this potentially allows for different kinds of routing to happen.
- Keeping the entity instances extremely shallow probably makes weak references/LRU much easier - the actual content of the class would not differ whether the entity is loaded or not, it always calls back to the storage controller.
- code that has entity classes doesn't need to call ->load() on it (this would need to be done before accessing any property within the scope of a function I think). You can just do what you want with the entity instance and the hard work happens in the background

Disadvantages:
- as fago points out, adding magic to things like this can make code flow hard to follow
- __get() is much slower than directly accessing properties, although it's definitely a good trade-off if we keep multiple load, compared to a ->load() call it would probably be a bit slower but hard to tell.

fago’s picture

I agree we probably do want to have a class-per-entity-type rather than one uber Entity class and that's it. However, that doesn't mean we can't still have a universal factory.

Indeed. I was thinking about lazy-loading the entity-type too, which has totally nothing to do with that discussion. So let's forget that, sry.

Perhaps we could explore a similar setup here. When you create entity objects, you create a set of linked skeleton objects. They then initialize and load en masse with entity_load_multiple() (or something similar to it) on first use. That gives you the double benefit of super-thin objects by default and multi-load to avoid tons of hits to the datastore.

Just to make that clear, I'd be happy having this kind of lazy-loading. I'm just hesitant about having the controller magically loading multiple entities while a single entity has been requested, as well as other __get() magic.

@__get():

We can do a get() method which allows easy access to properties and triggers all kind of lazy-loading the same way. So we have the same advantages, but it's obvious and easy to track for developers what accessing the property could trigger. Also we could allow using $entity->load() and directly accessing plain entity properties as usual afterwards without triggering any magic at all.

Crell’s picture

Personally I think the entity system would be greatly improved if we got rid of arbitrary object properties on entity objects. Have fields, have Entity Properties (by whatever API), and don't allow/support anything else. That's going to be necessary for CMI anyway for deployment, I think. If we need to track Entity Metadata as well (think flag status, nodequeue, possibly OG membership, etc.), that should have first-class support rather than being punted as "well, just throw random crap on the object and hope it doesn't break". That would also help deployment.

That would then give us very clear, explicit places where lazy-loading could be triggered, as well as various other advantages. (I'd like to see title turned into a method, for instance, on all entities.)

Also, $node->lazyLoadMe() and __get() are not mutually exclusive. To wit:

class Entity {
  public function lazyLoadMe() {
    $this->controller->lazyLoad($this); // Controller handles the lazy loading, but needs context.
  }

  public function __get($var) {
    $this->lazyLoadMe();
    return $this->$var;
  }
}

That's probably how we'd want to implement it anyway, so the only question is whether lazyLoadMe() is public or protected. I think in this case it's fine to be public.

catch’s picture

Right we need to seriously think about first class support for things that are neither on the base table nor fields. A big issue with fields is the amount of overhead in terms of metadata to attach stuff (and there are issues around certain things not really making sense to render, yet you can't have a field without a formatter). I think we can do this as a separate task to tackling that though.

I agree we can do both methods, although if we actually copy properties into the Entity class then it could be harder to do LRU and similar was thinking more like:

public function __get($var) {
  // LoadController::get() lazy loads from storage if the entity isn't available via LoadController::load().
  return $this->loadController->get($var, $this);
}

public function load() {
  return $this->loadController->load($this);
}

So the actual Entity class stays shallow the whole time.

Crell’s picture

catch: So you're suggesting that all Field data actually live in the controller object, and the entity object is just a routing shell? I am intrigued but not yet convinced. That seems like it adds a lot of unnecessary indirection in common usage, since then all requests go through __get() rather than just the first. __get() has been going out of vogue in the PHP world from what I've heard, because it makes code harder to follow/debug/auto-complete. (Don't underestimate the importance of the last one.)

catch’s picture

I'm considering it as a possibility, not necessarily suggesting we do that yet.

Another way to do it would be something like this:

class Entity {
  public function __construct($type, $id = NULL) {
    // The load controller is prepared with a reference to the object.
    $this->loadController->setReference($this);
  }
  
  public function load() {
    // When actually loading the object, the load controller clears out its
    // reference. That way when this entity goes out of scope, the memory
    // will be freed.
    $this->loadController->load($this);
  }
}

The main thing is to leave it open so that it's possible to manage memory usage from loading entities in a reasonably intelligent way - either let the PHP garbage collector handle it or allow for an LRU cache to be inserted somewhere. Most page requests shouldn't have to worry about this, but it'd be good for migrations, anything that needs to batch process lots of entities etc.

catch’s picture

Hmm, actually #12 works the other way too. Keep a full reference to the object in the load controller, then if there's memory management needs, set that reference back to a skeleton object instead of the full one. So any way we do this I think we can allow for that particular issue, and it just comes down to which API we want for it all.

So Crell's example would work fine for this, we just need to allow the load controller to maintain an index of entities for the multiple loading (i.e. the __construct() part).

Which is good, 'cos part of why I opened this discussion was to make sure we'd be able to do this down the line, and it looks like that should be doable as long as we keep the fundamental principle of the Entity objects and the controllers being separate classes, which it seems everyone agrees on.

fago’s picture

Personally I think the entity system would be greatly improved if we got rid of arbitrary object properties on entity objects. Have fields, have Entity Properties (by whatever API), and don't allow/support anything else. That's going to be necessary for CMI anyway for deployment, I think. If we need to track Entity Metadata as well (think flag status, nodequeue, possibly OG membership, etc.), that should have first-class support rather than being punted as "well, just throw random crap on the object and hope it doesn't break". That would also help deployment.

Exactly, we need an "entity property API".

>catch: So you're suggesting that all Field data actually live in the controller object, and the entity object is just a routing shell?

As said, I don't like having $node->property involve lots of magic and I very much doubt it's good for performance. Once we lazy-loaded an entity, we should have as simple and fast access to the properties as possible.
Also we really need an object containing all the data, else stuff like entity form workflows are going to become even more complicated than they are now.

ad #12:
Registering entity objects in the controller makes sense to me, though we need to keep out entities that are currently edited, thus changed, but not yet saved.
However, most entity objects are going to be loaded via the controller anyway, except for the newly created (=saved) ones which we should register. Not sure whether we should just leave out un-serialized objects (-> form workflow). Maybe better this could be solved by creating a new revision before doing any changes (assuming that each edit generates a new revision anyway).

That's probably how we'd want to implement it anyway, so the only question is whether lazyLoadMe() is public or protected. I think in this case it's fine to be public.

Yes, at least it should be public. Give developers control. Restricting __get() magic to be triggered only for registered entity properties would certainly help to reduce unwanted side-effects. Still, I'm not convinced implementing __get() is a good idea.

Crell’s picture

Well, lazy loading that you have to explicitly initialize doesn't really buy you much. The developer then needs to know whether an object has been fully loaded or not, which defeats the purpose. So we're going to need some sort of automatic "OK, now I need to be loaded" trigger one way or another, and __get() is as good as any I suppose. I agree that we don't want to hit it for every property, though.

fago’s picture

>Well, lazy loading that you have to explicitly initialize doesn't really buy you much.

Sure. Still, it doesn't have to be magic, but could be triggered by an explicit function call, e.g. $entity->get().

We probably need a getter like this anyway, as we want to assist users getting values of various languages, while using a sensible default.

E.g. I think a getter that consistently works like

$terms = $node->get('field_tags', 'en');
echo $terms[0]->label();

for all properties including fields what be helpful. Whereas in that example both function calls could trigger lazy-loading.

casey’s picture

Using __get() to accomplish lazy field loading doesn't really have to mean a negative performance impact. In fact I think it will mean a positive one: __get() will be called only once, if we store the loaded field data inside the entity object during that __get() call.

There are lots of situations where we only need data of a couple of fields so we have a performance win for every field not loaded. An issue I see however is how to combine this with entity caching.

A required getter function like fago suggests in #16 is definitely going to be a performance hit; A function call is consuming at least double execution time as a property call. Fields are most of the time accessed multiple times during one request.

casey’s picture

A cool benefit of lazy loading would be that you can immediately use entity objects returned by EntityFieldQuery.

catch’s picture

With lazy field loading I'm talking about lazy loading /all/ the fields when one is accessed. The field cache is also per-entity so it's about the same with field and entity-level caching really.

However... I think we need to also consider allowing fields or instances to declare themselves as not loaded until specifically requested - http://drupal.org/project/field_suppress tries to deal with some of that. That's a bit of a different discussion though (and might be better handled with reference fields sometimes in the case of entities that have dozens or hundreds of instances - suggests a wider problem when that's the case).

casey’s picture

Good idea. I was just thinking about the same thing the other way around. Allow fields to be preloaded while all others are lazyloaded, but suppressing fields might be better.

catch’s picture

One thing I forgot to mention, the idea here is to lazy load everything about the entity that we don't already know about - so that includes the data from base tables as well as fields.

So with entitycache it would work the same as now - the only difference is the ->load() only happens when you access a property, not when you instantiate the object in the first place.

I think I agree with casey about __get() - it should be faster than having a method call every time, and this is one place where we could use some indirection - it will allow us to stop people accessing $node->field_name[$langcode][0]['foo'] finally as well as other goodies.

fago’s picture

I think I agree with casey about __get() - it should be faster than having a method call every time, and this is one place where we could use some indirection - it will allow us to stop people accessing $node->field_name[$langcode][0]['foo'] finally as well as other goodies.

So how would this work then, considering translatable properties?

catch’s picture

When accessing a field value, have it internally figure out which language to return the value for. Similar to field_get_values() now. We could separately offer some kind of getter allowing raw field values to be accessed, but I wouldn't want to add the ->get() overhead to every field and property on everything, when 99.9% of cases wouldn't be passing an explicit language.

fago’s picture

Yep, just going with the default language via __get() should do it + offer other languages via get()".

I've thought of calling get() only in case you are not sure whether the entity has been already loaded or not, but I see the beauty of not having to deal with that. As long as __get() is restricted to the list of defined properties it should work out.

yched’s picture

Hum, subscribe.

pwolanin’s picture

Status: Active » Postponed

Consensus at the code sprint was that we should postpone any consideration of this, especially gene the level of new WTFs that might be introduced and the unclear effect on performance.

catch’s picture

Status: Postponed » Active
tim.plunkett’s picture

The current issue summary makes a lot of guesses about how the classed entities will work, but that code is in now. So it needs an overhaul.

Berdir’s picture

We also discussed yesterday with fago that this is related to the Entity Property API as well, where we might want to have an EntityProperty class that supports lazy loading it's property iems. Another use case might be taxonomy_get_tree() and similar functions that might return a possibly large number of entities?

alexpott’s picture

Issue tags: -Configuration system

Sorting out tags - this is not a configuration system issue

mitokens’s picture

Version: 8.0.x-dev » 8.1.x-dev

Drupal 8.0.6 was released on April 6 and is the final bugfix release for the Drupal 8.0.x series. Drupal 8.0.x will not receive any further development aside from security fixes. Drupal 8.1.0-rc1 is now available and sites should prepare to update to 8.1.0.

Bug reports should be targeted against the 8.1.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.9 was released on September 7 and is the final bugfix release for the Drupal 8.1.x series. Drupal 8.1.x will not receive any further development aside from security fixes. Drupal 8.2.0-rc1 is now available and sites should prepare to upgrade to 8.2.0.

Bug reports should be targeted against the 8.2.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.6 was released on February 1, 2017 and is the final full bugfix release for the Drupal 8.2.x series. Drupal 8.2.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.3.0 on April 5, 2017. (Drupal 8.3.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.3.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.