Problem

Core and many contrib modules use the following pattern, both for configuration and for the results of hook implementations (usually registry/info hooks); a generic sketch of it follows the list:

- collect all information from the database (or hook implementations)
- cache it as a single cache entry
- load it whenever any part of that configuration is requested.
- static cache it since it may be requested multiple times.
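
For illustration, the pattern usually looks something like this (a generic sketch only; the example_get_info() / 'example_info' names are invented and do not refer to any specific core function):

function example_get_info($name = NULL) {
  // Static cache, since the data may be requested many times per request.
  $info = &drupal_static(__FUNCTION__);
  if (!isset($info)) {
    // Load the entire data set from a single cache entry if possible.
    if ($cache = cache_get('example_info')) {
      $info = $cache->data;
    }
    else {
      // Otherwise collect everything from hook implementations (or the
      // database), let other modules alter it, and cache it as one entry.
      $info = module_invoke_all('example_info');
      drupal_alter('example_info', $info);
      cache_set('example_info', $info);
    }
  }
  // Whether one item or the whole array is requested, everything was loaded.
  return isset($name) ? (isset($info[$name]) ? $info[$name] : NULL) : $info;
}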

Between core and contrib modules, this pattern (or something extremely similar) is used by the following systems (at least):

drupal_get_schema()

Called nearly every request in Drupal 6 (and Drupal 7 right up until a couple of days before 7.0 release) via node_load(). Can often reach 2-3mb once a site has contrib modules and/or fields. Also called every time drupal_write_record() is called.

theme registry

Called nearly every request in both Drupal 6 and Drupal 7. Often reaches 3-4mb on sites with lots of contrib modules enabled.

variables

Loaded every request, can reach 1mb or more, especially as sites mature, since there is still no reliable way to completely remove module variables on uninstall or to prevent cruft from accumulating at runtime.

system info

Loaded every request in Drupal 7 due to system_init() (to get styles and scripts data from .info files in the system table), this can reach 1mb even with just core modules enabled.

field info

Loaded every time a field is requested (nearly every request) - can reach 3-4mb on any moderately complex site.

Just these five subsystems can contribute 10mb or more of PHP memory usage (with some very large objects being requested from the database/memcache and unserialized), and they're not the only ones using this pattern. CCK in Drupal 6 uses it, Views does a similar thing for default views (which can all be loaded every request in some circumstances), and the same goes for CTools defaults; this is not an exhaustive list.

There are several things in common:

- 'omnipotent' APIs - for example _alter() hooks that act on the entire array (hook_schema_alter(), hook_theme_registry_alter())
- no way to predict usage up front.
- memory issues tend to only show themselves once sites start adding contrib modules and/or configuration - they get worse over the lifecycle of a site, although in some cases the 'base' cost is already relatively expensive. For this reason most of the issues have been found while profiling client sites.

The consequences of the current approaches:
* High overall memory usage (these objects are usually loaded into static or global variables).
* Cache objects can sometimes exceed max_allowed_packet on MySQL or the maximum size of memcache variables (both default to 1mb although memcache tends to compress large cache items, and most servers have the MySQL value set higher).
* Since these items are large, often expensive to build, and needed by most requests, they are prone to stampedes on high traffic sites.
* Some one-off attempts to fix them (e.g. in Views 2 around 12-18 months ago) have led to race conditions, and the code to fix the issues one-off tends to be both complex and ugly. #853864: views_get_default_view() - race conditions and memory usage describes one such race condition, and also has some ugly code (by catch, who is writing this summary, so I'm allowed to say that :p).

Rejected approaches

* No caching at all. The original focus of this issue was cache items exceeding max_allowed_packet. However, especially with the schema and theme registry, an uncached build is a significant operation in terms of memory and CPU usage - all .install (or .admin.inc and .pages.inc) files have to be loaded to pick up hook_schema() or preprocess implementations, and the CPU involved in building the large arrays is also significant. There was a good reason caching was introduced for these systems and simply removing it would significantly worsen the situation.

* Load individual items (either direct from the database for config, or direct from hook implementations for 'registries').

- Again with the schema and theme registries, this is not an option. hook_schema_alter() and hook_theme_registry_alter() expect the full data set to act upon. In the case of the theme registry, many dozens of theme hooks may be required per request, so individual requests for these would be significant in terms of I/O on the database or other cache systems - and we already have page execution time issues in Drupal 7/8 as well. Additionally, requesting individual cache items also means setting them individually; if this is done all at once during the same cache rebuild, it can result in dozens, hundreds or even thousands of database inserts for the process that has to rebuild the cache (currently the case for the views defaults cache in 6.x-2.x, I need to revisit that patch as well).

- It is also not really an option for configuration such as variables or field info - there are often 50-100 of each requested per page, which would again mean dozens or hundreds of round trips to the database to pull them out. This would end up similar to the path alias situation in Drupal 6, where the sheer number of queries plus latency causes performance issues in itself.

Proposal

The patch introduces a new class in bootstrap.inc, CacheArrayObject. This is an abstract class since systems that use it will need to override at least one method, but it incorporates many of the lessons learned tackling the above issues.

The primary goals of the issue are these (a simplified sketch of the resulting behaviour follows the list):

* Solve the issue in a generic way that can be easily backported to Drupal 7 (and to Drupal 6 Pressflow, since that requires PHP5; I already have Pressflow backports of this issue and others for Tag1 client(s) who are partially funding some of this work). This means only API additions, no API changes as such.

* Use a combination of lazy loading and cumulative caching to allow the run-time cache item (and by extension memory usage) to reflect the amount of information that is actually requested, rather than all the possible information available on the site. For the schema cache in particular, often only 2-20 tables have schema information requested, while the full schema may often include information for 100-300 database tables.

* Set caches once at the end of the request (or when the object goes out of scope) to avoid too many additional database writes.

* Avoid per-page caching so that the cost of rebuilds is minimized.

* Avoid write stampedes on cache items by default (so that different processes don't overwrite each others' partial data). This does not guarantee cache coherency by itself (nor does any caching in core currently), but helps us along the way towards this.

* Avoid read-locks on caches, since these can lead to hanging processes which can be as much of a problem as a stampede (especially for this kind of cache where the bottleneck is usually in PHP as opposed to slow queries).
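
To make those goals concrete, the behaviour of the class is roughly as follows (a simplified sketch only - the method names resolveCacheMiss(), persist() and set() come from the patch, but the property names and other details here are illustrative):

abstract class CacheArrayObject extends ArrayObject {
  protected $cid;
  protected $bin;
  // Offsets looked up during this request that should be written back.
  protected $keysToPersist = array();

  public function __construct($cid, $bin = 'cache') {
    $this->cid = $cid;
    $this->bin = $bin;
    // Start from whatever earlier requests have already cached.
    $data = ($cached = cache_get($this->cid, $this->bin)) ? $cached->data : array();
    parent::__construct($data);
  }

  public function offsetGet($offset) {
    // Lazy-load missing items on demand, and remember them so the run-time
    // cache item grows to match what the site actually uses.
    if (!parent::offsetExists($offset)) {
      parent::offsetSet($offset, $this->resolveCacheMiss($offset));
      $this->persist($offset);
    }
    return parent::offsetGet($offset);
  }

  protected function persist($offset) {
    // Flag an offset to be written to the persistent cache later.
    $this->keysToPersist[$offset] = TRUE;
  }

  // Build the value for a missing offset; extending classes must implement this.
  abstract protected function resolveCacheMiss($offset);

  protected function set($cid, $data, $bin, $lock = TRUE) {
    // Merge the newly collected items into the existing cache entry rather
    // than overwriting it, using the lock system to avoid write stampedes.
    if (!$lock || lock_acquire($cid)) {
      $existing = ($cached = cache_get($cid, $bin)) ? $cached->data : array();
      cache_set($cid, $data + $existing, $bin);
      if ($lock) {
        lock_release($cid);
      }
    }
  }

  public function __destruct() {
    // Write once, when the object goes out of scope at the end of the request.
    if ($data = array_intersect_key($this->getArrayCopy(), $this->keysToPersist)) {
      $this->set($this->cid, $data, $this->bin);
    }
  }
}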

The patch here includes an implementation for the schema cache. A quick summary of the approach taken for that (a rough sketch of the resulting code follows the list):

- build a run-time cache for the schema. This uses a separate cache key for GET requests (based on the idea that drupal_write_record() tends to be called on POST, so the cache for GET requests should be smaller).

- leave the single large cache item in place for now - this prevents a full schema rebuild each time the run-time schema needs to be added to.

- hide the ArrayObject behind the current procedural functions - calling code still gets a 'real' array.
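
Roughly, the procedural wrapper then ends up shaped like this (an illustrative sketch, not the exact patch code; SchemaCache is the CacheArrayObject subclass added by the patch, the run-time cache IDs shown here are made up, and drupal_get_complete_schema() is the helper holding the old full-build code, named later in the thread):

function drupal_get_schema($table = NULL, $rebuild = FALSE) {
  static $schema;
  if (isset($table) && !$rebuild) {
    if (!isset($schema)) {
      // Separate run-time cache entries for GET requests and everything else,
      // since drupal_write_record() is mostly hit on POST.
      $cid = 'schema:runtime:' . ($_SERVER['REQUEST_METHOD'] == 'GET' ? 'get' : 'other');
      $schema = new SchemaCache($cid, 'cache');
    }
    // Calling code still receives a 'real' array for the requested table;
    // reading the offset triggers the lazy load on a run-time cache miss.
    $definition = $schema[$table];
    return isset($definition) ? $definition : FALSE;
  }
  // Full schema requests and explicit rebuilds still go through the single
  // large cache item, so topping up the run-time cache never requires a
  // full rebuild of its own.
  return drupal_get_complete_schema($rebuild);
}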

There are three active issues that currently contain patches based on CacheArrayObject as added here. These are separate patches for two reasons: 1. They were all previously being worked on individually with one-off procedural fixes and have their own history. 2. All of those issues have slightly different performance implications, different trade-offs and usage patterns, so they deserve review in their own right. However the individual patches in those issues are relatively small in terms of code weight (especially if you skip over the CacheArrayObject class which is the same hunk in all three), so they give a good idea of how the pattern will look when applied elsewhere.

#1011614: Theme registry can grow too large for MySQL max_allowed_packet and memcache default slab size
#987768: [PP-1] Optimize variable caching
#1061924: system_list() memory usage

The next candidate (and probably most serious in terms of memory usage) would be #1040790: _field_info_collate_fields() memory usage, however I don't want to add to the list of patches depending on this one until this has had feedback from Dries.

@todos and points of discussion

Points of discussion

1. Whether there should be a new interface defined (in addition to ArrayAccess, which this already implements) - this was the initial approach and was returned to a couple of times - or the abstract class that is included here.

- in general interfaces are really nice to have since they document things nicely for calling code.
- in this specific case, the calling code is 99% of the time going to be in the subsystem using the API itself (i.e. drupal_get_schema() etc.), and it is only using methods that are already part of ArrayAccess or ArrayObject. New methods added here are for specific implementations - an abstract class suits this better since it is a contract between the implementer of the class and the base class, whereas an interface is a contract between the 'user' of the class and the implementation.
- for type hinting, we can use ArrayAccess.

There was a lot of back and forth on this, but in the end I think we reached a consensus, and in the process refined and improved the documentation a lot.

2. Huge arrays are not necessarily the best model for these APIs, we should consider a full refactoring.
This is definitely true for the variables cache (hence CMI), however there are no current, active refactoring proposals for the schema, theme or other APIs. Additionally we can't backport those kinds of changes to Drupal 7 - given that, for example, Field API is a new addition in Drupal 7 (and field usage is likely to be much more extensive than even CCK was in D6), I think it's important to try to stop the bleeding in D7. Since the changes to each individual subsystem are usually quite self-contained, this should not make any larger refactoring efforts more complicated - and they may well need to deal with the same performance issues as here.

3. We could just implement ArrayAccess directly. I believe this was proposed by sdboyer, who, after examining the patch more closely, is now a strong supporter of keeping this as extending ArrayObject, since we're using a lot of the capabilities that provides.

@todos
* While the patch here should have a very beneficial effect for run-time memory usage, it does not improve peak memory usage from cache misses. That's impossible to do without changing the 'omnipotent' hook_schema_alter() in terms of hook invocations - i.e. we may want to make hook_schema_alter() only act on one table at a time, so that we never examine the full array all at once (a hypothetical sketch of what that could look like follows this list).

However, for example, the variable caching and system info patches do reduce peak memory usage on cache misses, so the limitation is not inherent but case-by-case depending on where we apply this. I'm pretty confident we'll be able to reduce peak memory usage on cache misses in several places with minimal API changes in other parts of Drupal 8, and possibly backport some of this to D7 (i.e. it might be possible with the Field API caches).

* It would be good to allow for full dependency injection of the cache and lock systems so that this could have proper unit tests without any external dependencies, however this will require moving the lock system to a class, and finding a good pattern to inject dependencies into classes (we currently don't have a pattern for this).

* A nice feature would be to allow passing a callback (possibly anonymous function) instead of requiring extending classes to implement resolveCacheMiss(), again this requires coming up with a nice pattern that we don't currently have.

* Some places this could apply to (such as path caching) might benefit from having the specific implementations be pluggable, however that is something we can either add one-off, or try to come up with a generic factory.

Most of these are nice-to-haves that the patch enables, rather than specific deficiencies in the current approach, and should be relatively straightforward to add later in Drupal 8.
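
On the first @todo above, a purely hypothetical illustration of a per-table alter (the hook name and invocation are invented here, not part of any patch):

// A module altering a single table instead of the whole schema array.
function hook_schema_table_alter(&$table_schema, $table_name) {
  if ($table_name == 'node') {
    $table_schema['fields']['mymodule_flag'] = array(
      'type' => 'int',
      'not null' => TRUE,
      'default' => 0,
    );
  }
}
// The schema builder would then invoke it one table at a time, e.g.:
// drupal_alter('schema_table', $schema[$table], $table);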

Original report
If we're going to have Views in Core, that's a ton of information to shove in the schema. The MySQL packet size may be too small to handle caching that. This patch splits out the cache into one entry for each table.

Comment | File | Size | Author
#166 | cachearray-5.3-fully-compat.patch | 9.32 KB | catch
#155 | cachearray-5.3-fully-compat.patch | 9.32 KB | sdboyer
#155 | cachearray-5.3.4-only.patch | 8.25 KB | sdboyer
#152 | cachearray.patch | 8.25 KB | catch
#143 | cachearray.patch | 8.21 KB | catch
#129 | schema.patch | 7.86 KB | catch
#125 | schema.patch | 8.2 KB | catch
#108 | schema.patch | 8.2 KB | catch
#107 | schema.patch | 8.2 KB | catch
#107 | interdiff.txt | 1.47 KB | catch
#105 | drupal8.schema-cache.105.patch | 8.34 KB | sun
#98 | bootstrap.inc-7.2-alternative_schema_lazzy_loading_without_cache.patch | 5.54 KB | pounard
#96 | schema.patch | 8.34 KB | catch
#93 | schema.patch | 8.33 KB | catch
#91 | schema.patch | 8.04 KB | catch
#88 | schema.patch | 7.63 KB | catch
#81 | schema.patch | 7.13 KB | catch
#74 | schema.patch | 6.82 KB | catch
#72 | schema.patch | 6.85 KB | catch
#71 | schema.patch | 7.06 KB | catch
#71 | schema_cache_D6.patch | 7.12 KB | catch
#67 | schema.patch | 6.91 KB | catch
#64 | schema.patch | 15.02 KB | catch
#57 | 402896_drupal_get_schema_3.patch | 11.62 KB | andypost
#55 | 402896_drupal_get_schema.patch | 14.18 KB | catch
#54 | 402896_drupal_get_schema.patch | 0 bytes | catch
#52 | 402896_drupal_get_schema.patch | 14.09 KB | catch
#49 | 402896_schema_cache10.patch | 14.09 KB | andypost
#48 | 402896_schema_cache9.patch | 13.74 KB | andypost
#47 | 402896_schema_cache8.patch | 13.63 KB | andypost
#45 | 402896_schema_cache7.patch | 13.31 KB | andypost
#37 | 402896_schema_cache.patch | 4.66 KB | catch
#34 | 402896_schema_cache.patch | 7.04 KB | catch
#31 | 402896_schema_cache.patch | 7.23 KB | catch
#29 | 402896_schema_cache.patch | 7.91 KB | catch
#23 | 402896_schema_cache.patch | 6.93 KB | catch
#14 | 402896_schema_cache.patch | 6.12 KB | chx
#13 | 402896_pressflow_D6.patch | 2.14 KB | catch
#11 | schema_cache_402896_D6.patch | 2.57 KB | catch
#11 | Desk 1_024.png | 85.16 KB | catch
#11 | Desk 1_025.png | 73.77 KB | catch
#10 | 402896_schema_cache.patch | 4.52 KB | catch
#8 | 402896_schema_cache.patch | 4.5 KB | catch
#6 | schema.patch | 4.52 KB | catch
#1 | cache-schema-without-cache.patch | 4.92 KB | dmitrig01
(original report) | cache-schema-tables-individually.patch | 2.89 KB | dmitrig01

Comments

dmitrig01’s picture

what if we don't cache? let's compare benchmarks of the two and core

Status: Needs review » Needs work

The last submitted patch failed testing.

catch’s picture

Status: Needs work » Needs review

subscribing.

dmitrig01’s picture

can you tell i didn't test these?

catch’s picture

Status: Needs review » Needs work
catch’s picture

Assigned: dmitrig01 » catch
Status: Needs work » Needs review
Issue tags: -Views in Core +Performance, +memory
FileSize
4.52 KB

This is a great idea, shame the patch stalled.

While profiling a Drupal 6 site with 219 database tables, I found the schema cache is 2mb.

I did a quick test hacking in drupal_get_schema('system'); to dblog_init() in a basic D7 site with no contrib.

Memory usage of drupal_get_schema() (bytes):

Before: 806,632
After: 3,368

So even with a vanilla-ish install (this was a clean D6 upgrade) we're looking at saving 800k of memory.

We've had #910156: Remove 'description' keys from schema cache and #910190: Entire schema cache needlessly loaded on all pages, which help, but there are still plenty of situations where drupal_get_schema() gets called, both in core and contrib. There is also the original risk here of the schema entry being too big to actually cache - I'm not aware of that happening yet, but with sites that have a lot of field tables this is only going to get bigger. On memcache the limit for any individual cache entry is 1mb (gzipped), and looking at the memcache issue some people are running into it with the theme registry and some contrib modules; the theme registry is usually about three times the size of the schema cache.

Patch is untested except for the profiling.

Status: Needs review » Needs work

The last submitted patch, schema.patch, failed testing.

catch’s picture

Status: Needs work » Needs review
FileSize
4.5 KB

This one should install at least.

Status: Needs review » Needs work

The last submitted patch, 402896_schema_cache.patch, failed testing.

catch’s picture

Status: Needs work » Needs review
FileSize
4.52 KB

And another one.

catch’s picture

Title: Cache each table in the schema individually » drupal_get_schema() wastes 1-3mb of memory for nearly every request
Category: task » bug
Priority: Normal » Major
Issue tags: +Needs backport to D6
FileSize
73.77 KB
85.16 KB
2.57 KB

So I thought we no longer called drupal_get_schema() all the time in D7 after #910190: Entire schema cache needlessly loaded on all pages, but I was wrong on that, it's now called indirectly by menu_link_get_preferred() via drupal_schema_field_sql(), which is triggered by menu_set_active_trail() - which is even more requests than the older issue of node_load().

That makes this a major bug both for D6 and D7 since we're just squandering 1-3mb of memory for absolutely no reason whatsoever.

Attached before/after screenshots showing memory usage from drupal_get_schema(). On a completely vanilla Drupal 7 install, this patch saves around 1mb of memory, it will be much higher once contrib modules are installed and fields set up. It also saves around 10% of CPU usage on the front page with no content.

Also attaching a Drupal 6 backport. On the Drupal 6 site I'm profiling, this reduced around 2m of memory usage down to 30k for drupal_get_schema().

The standard install Drupal 7 site has 90 tables, the D6 site has 220 - so we're saving on average 10k of memory for every table in the database.

catch’s picture

We may also want to roll back #910190: Entire schema cache needlessly loaded on all pages once this is in, to remove the 'double caching' in entity info. That would only be a simplification (no longer having to work around this bug) - it's not causing any real problems, so I'm not including it here.

catch’s picture

FileSize
2.14 KB

Also in case it's helpful, the D6 patch didn't apply 100% cleanly to pressflow for some reason, needed to roll that patch anyway so here it is.

chx’s picture

FileSize
6.12 KB

Actually we should remove the calls from menu.inc -- and avoid the calls as much as we can, as they are not needed -- but we need to leave entity_get_info() as it is, because the schema there can be needed (too generic to know for sure).

catch’s picture

Discussing with chx in irc, the entity_get_info() optimization is worth it to avoid loading this information from cache altogether. Even if we fix the main memory issue here, it still doesn't do any harm to avoid loading the information where we can.

andypost’s picture

@catch Is it possible to write a test case? Could we check availability of schema parts at runtime?

webchick’s picture

subscribing.

carlos8f’s picture

subscribing

chx’s picture

If we want to test this, we need to measure the memory hit of first retrieving a very small schema and then retrieving the whole schema - the memory hit of the latter should be at least 10-20-30 times as much.

Crell’s picture

Status: Needs review » Needs work

I don't know how we could unit test this. Our current testing framework doesn't handle questions of memory usage, only whether or not something works. Since there are no API changes, there's nothing really to add to the testing framework. I'm definitely in favor of not loading the full schema cache if we don't need it, though, and catch's benchmarks above show exactly why.

I cannot speak to the menu changes as I don't know that logic well enough. The drupal_get_schema() part of #14 looks good to me overall, but if I'm reading it correctly it makes a subtle change that will rebuild the schema registry every time someone asks for the full thing rather than using the cached version (as the old code did). I don't know how often we request the full schema. I wonder if we'd be better off doing a full rebuild or loading each item in the cache and sticking them together. Or at that rate, since it's all cached anyway, perhaps we should be storing both the individual tables and the full array, and then if someone asks for the full schema we can return that from cache_get('schema-full') or something instead.

Let's do that and avoid the extra rebuild. Otherwise this gets a +1 from me.

catch’s picture

Status: Needs work » Needs review

I deliberately moved the caching of the full schema since I agree with the original point of this issue, which was to avoid hitting max_allowed_packet or the memcache 1mb slab size limit. I've heard of Drupal 6 databases out there with 300+ tables, this may get worse with per-field storage in D7, and there are also modules like Table Wizard now which could end up being used to subsume entire other databases into the Drupal schema.

On stitching them together, that's probably the most performant option overall, but it means we'll need to take a similar approach to the code in #853864: views_get_default_view() - race conditions and memory usage - store an 'index' cache entry alongside the individual ones, then handle the case where the index exists but the individual entries don't (and etc.) which means a lot of extra logic and locking - although the views issue is much the same problem with a very similar solution to what's needed so I'm glad I ran into that a week or so before this one ;)

We only call drupal_get_schema() without either the $table or $reset arguments once in core, in scripts/dump-database-d6.sh. I'd also checked drupalcontrib.org, where the only such call is http://drupalcontrib.org/api/drupal/contributions--ctools--includes--exp....

The full schema rebuild takes about 10mb of memory (on the Drupal 6 site I'm profiling). Another way to make this less expensive from a memory point of view would be to split schema definitions off into separate .schema files - so we're not loading files with 70+ update hooks just to get the schema. Since we explicitly don't allow use of hook_schema() in update functions now, this would also reduce memory usage on update.php - but that's not an option for Drupal 7.

So that leaves us three options:

1. Go with the patch more or less as is, live with the extra expense when the full schema is actually requested.
2. Add back the single entry for the whole schema (this was also an earlier approach in that Views issue) but risk ending up with an array too big to cache at some point.
3. Cache an index of all schema tables alongside the individual entries, and use this to stitch the whole array back together when the full schema is requested, most efficient for both the single table and full schema requests, but needs to handle potential race conditions and/or missing entries.

I'd go for either 1 or 3, moving this back to CNR in the hope of getting some more feedback.

And yeah we already have some tests for drupal_get_schema(), and functions like drupal_write_record() which rely heavily on it. We don't have anything automated to test for cpu or memory usage, which is why we have huge gaping issues like this lying around core all the time...

Crell’s picture

Status: Needs review » Needs work

Well blargh. In that case I would go for #1. The complexity of #3 is not worth it if we're calling drupal_get_schema() without a parameter that infrequently. We should add a note to the function's docblock then informing the user that a no-param call will *always* trigger a schema rebuild so people don't get surprised by it. Let's do that and get this in.

And I am very +1 on breaking up .install files better in Drupal 8.

catch’s picture

Status: Needs work » Needs review
FileSize
6.93 KB

Works for me, added the comment. I will try to ping merlinofchaos to get his opinion on the full schema caching, and maybe do a proper grep of contrib, but I think we're better off keeping things simple for now and sorting that out later if it turns out to be necessary.

catch’s picture

Oh and the menu changes here are fine, instead of hitting the schema to grab all columns, just so we can remove one from one table, we instead alias the one we want from the other table and use that instead. Much simpler approach for the same overall issue.

webchick’s picture

Interesting. If we fix this, is #296693: Restrict access to empty top level administration pages potentially back on the table?

catch’s picture

No the access checks from that issue don't have anything to do with the schema loading here.

Crell’s picture

Hm, I don't think the performance implication there is clear. It needs to be explicit "if you don't ask for a specific table, you'll burn through 10 MB of memory. Enjoy." Maybe something more like:

+++ includes/bootstrap.inc	30 Dec 2010 03:27:52 -0000
@@ -2636,51 +2636,62 @@ function ip_address() {
  * @param $table
- *   The name of the table. If not given, the schema of all tables is returned.
+ *   The name of the table. When this optional parameter is ommitted the full
+ *   database schema is returned. Only schemas for individual tables are
+ *   cached. ¶
  * @param $rebuild
- *   If true, the schema will be rebuilt instead of retrieved from the cache.
+ *   If true, the schema will be rebuilt and the cache refreshed.


"The name of the table. If not given, the full database schema is rebuilt and returned. That is, a NULL value here implies $rebuild = TRUE."

Which I suppose arguably puts into question whether we still need the $rebuild parameter, but let's not go there when we're at RC3. :-)

Powered by Dreditor.

catch’s picture

There's a difference though, in that $rebuild resets the cache entries for each table, whereas calling it without $rebuild doesn't, so I avoided using the word "rebuild" precisely for that reason. We don't want to go and do between 100-300 cache sets for no reason.

catch’s picture

FileSize
7.91 KB

Hmm, this discussion makes me think we should go for #2 for now, especially keeping a Drupal 6 backport in mind. We can then keep this issue open or start a new one to decide what to do about the full schema cache size.

Here's a patch which does that, only minor changes going this route, untested.

Status: Needs review » Needs work

The last submitted patch, 402896_schema_cache.patch, failed testing.

catch’s picture

Status: Needs work » Needs review
FileSize
7.23 KB

Status: Needs review » Needs work

The last submitted patch, 402896_schema_cache.patch, failed testing.

andypost’s picture

We should keep in mind that simpletest module needs ability to cleanup static cache without invoking hooks.

catch’s picture

Status: Needs work » Needs review
FileSize
7.04 KB

Didn't run full test suite but this fixed the installer for me. andypost mentioned cleaning up the code path a little in irc, which I think I agree with, but need to head out in a bit so posting this for the bot.

catch’s picture

On simpletest, that would likely help to speed up test runs - but I think we'd need a new argument (or abuse $table and set it to FALSE explicitly or something), and we'd also need to decide between just clearing the static and doing a full cache clear.

Status: Needs review » Needs work

The last submitted patch, 402896_schema_cache.patch, failed testing.

catch’s picture

Status: Needs work » Needs review
FileSize
4.66 KB

Grrr. OK thought about this some more and we don't really need to set the schema for 100+ tables when only a handful might be requested individually.

So....

New patch adds drupal_get_table_schema() (this is only an API addition, not a change, I can make it a private function if we absolutely have to) - this handles the cache_get()/cache_set() logic internally.

drupal_get_schema() uses this if $table is set; this is leading towards splitting it into two separate functions in D8, which we obviously can't consider now.

drupal_get_schema() itself barely changes, except we no longer bother to statically cache the entire schema, and when rebuilding, we clear all schema* cache entries immediately before setting it again to avoid race conditions.
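
Roughly, the shape of the new function (a simplified sketch based on the description above; the actual patch differs in details):

function drupal_get_table_schema($table, $rebuild = FALSE) {
  static $schema = array();
  if (!isset($schema[$table]) || $rebuild) {
    // Try the per-table cache entry first.
    if (!$rebuild && ($cached = cache_get("schema:$table"))) {
      $schema[$table] = $cached->data;
    }
    else {
      // Fall back to building the full schema, then cache just this table.
      $full_schema = drupal_get_schema(NULL, $rebuild);
      $schema[$table] = isset($full_schema[$table]) ? $full_schema[$table] : FALSE;
      cache_set("schema:$table", $schema[$table]);
    }
  }
  return $schema[$table];
}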

Patch looks a fair bit simpler now.

sun’s picture

Status: Needs review » Reviewed & tested by the community

Nice work. Looks good to me.

webchick’s picture

Status: Reviewed & tested by the community » Needs review
Issue tags: +Needs tests

I'd like to see tests of this new API function, as well as a couple more reviews.

Can we get an update in terms of where we are with RAM savings now with the latest version?

catch’s picture

HEAD:
Total Incl. MemUse (bytes): 7,911,840 bytes
Total Incl. PeakMemUse (bytes): 8,137,776 bytes
Number of Function Calls: 9,878

Patch:
Total Incl. MemUse (bytes): 6,907,232 bytes
Total Incl. PeakMemUse (bytes): 7,133,616 bytes
Number of Function Calls: 9,623

Patch with the menu hunk reverted (i.e. the improvement to drupal_get_schema() itself):
Total Incl. MemUse (bytes): 6,950,064 bytes
Total Incl. PeakMemUse (bytes): 7,176,432 bytes

So drupal_get_schema('menu_links'); would now use 43k of memory instead of (at least) 1mb.

The new function only does exactly the same as drupal_get_schema($table) does, which is called by drupal_write_record() thousands of times by existing tests, so I'm not sure it needs its own unique testing.

Also if it didn't work this would blow up:

class DatabaseTestCase extends DrupalWebTestCase {
  protected $profile = 'testing';

  function setUp() {
    parent::setUp('database_test');

    $schema['test'] = drupal_get_schema('test');
    $schema['test_people'] = drupal_get_schema('test_people');
    $schema['test_one_blob'] = drupal_get_schema('test_one_blob');
    $schema['test_two_blobs'] = drupal_get_schema('test_two_blobs');
    $schema['test_task'] = drupal_get_schema('test_task');

    $this->installTables($schema);

    $this->addSampleData();
  }
}
q0rban’s picture

subscribing

andypost’s picture

Trying to review and cleanup. I think we need to test hook_schema_alter()

andypost’s picture

Status: Needs review » Needs work
+++ includes/bootstrap.inc	30 Dec 2010 13:14:43 -0000
@@ -2630,6 +2630,38 @@ function ip_address() {
+function drupal_get_table_schema($table, $rebuild = FALSE) {
+  static $schema = array();

drupal_get_schema() with $rebuild has no power on this static cache

So the static cache should be in the old, stable core API function

Powered by Dreditor.

EDIT: profile_field_form_validate() user_account_form_validate() user_save() - better to fix with new API

andypost’s picture

+++ includes/bootstrap.inc	30 Dec 2010 13:14:43 -0000
@@ -2680,20 +2711,13 @@ function drupal_get_schema($table = NULL
+        // To avoid race conditions, clear the cache prior to setting it.
+        cache_clear_all('schema', 'cache', TRUE);
         cache_set('schema', $schema);
       }

First we need to save the new global schema cache, then cache_clear_all('schema:', 'cache', TRUE);

PS: Happy new year happens...

Powered by Dreditor.

PS: For D8 we need to pay more attention to drupal_get_schema_unprocessed() and drupal_get_schema() providing the module <=> schema (table) relation; we could unify this

andypost’s picture

Status: Needs work » Needs review
FileSize
13.31 KB

Let's try this patch, without tests :(

Fixed:
- touched all places where drupal_get_schema() is used.
- drupal_get_table_schema() uses the drupal_static() fast cache pattern
- added @return descriptions (needs some love)
- drupal_get_schema() cleans the statics of drupal_get_table_schema() and the per-table cache
- fixed a race condition with cache cleaning in module_enable()
- I suppose DrupalWebTestCase::tearDown() should not rebuild the schema (no need to set the cache)

Should we fix drupal_install_[un]schema() to clean cache?

@todo review drupal_get_schema_unprocessed() usage

Status: Needs review » Needs work

The last submitted patch, 402896_schema_cache7.patch, failed testing.

andypost’s picture

Status: Needs work » Needs review
FileSize
13.63 KB

Another round

- fixed the return value (FALSE) for drupal_get_schema() when there is nothing to return
- drupal_get_table_schema() stores FALSE results in the static so that non-existent tables don't trigger the database again in the same request
- fixed comments a bit
- module_enable() should clean the schema cache right after changing the schema

andypost’s picture

FileSize
13.74 KB

Test patch: do we really need the cache rebuild in install_configure_form()? Also fixed a tab and a comment.

andypost’s picture

FileSize
14.09 KB

So it seems that the full schema cache is not needed at install; cache tables are retrieved with drupal_get_schema_unprocessed() for now.

Summarizing:
- the most frequent usage is frontend drupal_get_schema() with $table set, so drupal_get_table_schema() is introduced
- next is cleaning the schema cache - a task for install, development and update
- we don't need the full schema cache result, except in DrupalWebTestCase::tearDown() to clean up the test's tables
- to avoid race conditions, module_enable() should clean the cache after drupal_install_schema()
- removed useless cache cleaning

Anonymous’s picture

+        drupal_static_reset('drupal_get_table_schema');

any reason not to just update the static with the new schema values?

+        // To avoid race conditions, set the whole schema cache prior to
+        // clearing it's parts.

this comment smells to me, but i'm not sure why just yet ;-)

Anonymous’s picture

Status: Needs review » Needs work

this patch still leaves us open to a race condition if process A calls drupal_get_schema('foo') at the same time process B calls drupal_get_schema(NULL, TRUE), where process B will be writing updated schema information for table 'foo'.

i'm ok if that's considered a crazy edge case, but the comment i highlighted in #50 seems to indicate we think we've got race-free code here.

catch’s picture

Status: Needs work » Needs review
FileSize
14.09 KB

Updated the comment; it now only claims to avoid caching stale data. Also added the drupal_static() stuffing rather than clearing it.

Otherwise only comment changes.

I'm not sure about the change to using drupal_get_table_schema($table); instead of drupal_get_schema($table); around core, it's not strictly necessary and I've been trying to keep the patch small so it doesn't look too API-changey, but it doesn't do any harm either.

The DrupalWebTestCase find is a good one, wonder if that'll help test run performance noticeably.

sun’s picture

Status: Needs review » Needs work
+++ includes/bootstrap.inc	3 Jan 2011 02:37:31 -0000
@@ -2630,6 +2630,49 @@ function ip_address() {
+ * @return processed table's schema array or FALSE.

@@ -2639,16 +2682,17 @@ function ip_address() {
+ * @return whole schema array when no $table, table schema for $table else FALSE.

Description should be on a separate line, and form a proper sentence.

+++ includes/bootstrap.inc	3 Jan 2011 02:37:31 -0000
@@ -2630,6 +2630,49 @@ function ip_address() {
+  else if (!$rebuild && $cached = cache_get("schema:$table")) {

elseif

+++ includes/bootstrap.inc	3 Jan 2011 02:37:31 -0000
@@ -2630,6 +2630,49 @@ function ip_address() {
+    $full_schema = drupal_get_schema(NULL, $rebuild);
+    if (isset($full_schema[$table])) {
+      $schema[$table] = $full_schema[$table];

Shouldn't we prepopulate the static $schema cache once we have to retrieve the full schema, so we don't re-retrieve it on subsequent cache misses?

+++ includes/bootstrap.inc	3 Jan 2011 02:37:31 -0000
@@ -2680,20 +2724,26 @@ function drupal_get_schema($table = NULL
+        // Refresh the static cache in drupal_get_table_schema() with the
+        // updated schema.
+        $static = &drupal_static('drupal_get_table_schema');
+        $static = $schema;

So it seems that this sets the static $schema variable to the full schema... -- do we really need that remote population? It looks like we could also take over the return value in drupal_get_table_schema()...?

Powered by Dreditor.

catch’s picture

Status: Needs work » Needs review
FileSize
0 bytes

The remote population was suggested by Justin in #51. I don't think there are many cases where this would happen, but if it did, it ought to save retrieving the entire schema again.

I'm not sure what you mean by taking over the return value?

Cleaned up the comments.

catch’s picture

No bytes in my patch :(

podarok’s picture

subscribe

andypost’s picture

This approach seems finished. But I want to call attention to the common scenarios (usage of this function).

The most frequent is entity_save() - we changed one big cache call (with unserialization) into a bunch of calls (minimum 4 and probably 20...) for cached table schemas. So we use less memory but make more I/O against the cache backend (DB by default); CPU usage seems the same.

OTOH, with fields in core some sites could have nearly 200-300 tables in the schema - that's around 1-3 MB of memory, but loading all .install files actually eats more memory.

About simpletest - we have high requirements for using it, so it's not a problem there, and it's most efficient to have the whole processed schema cached!

So I think the best approach is to cache the schema per entity... or fix the static implementation and race condition before release, and leave this issue active for D8 and backporting.

As maintainer of the cacherouter module I know that most cache backends have some trouble with clearing the cache by wildcard!

We should cache the processed schema so as not to invoke hook_schema_alter() on every retrieval, because each uncached call loads all .install files and executes drupal_alter(), so in some cases we could get a schema altered while not all modules were loaded, and therefore wrongly altered.

This patch:
- fixes race condition in module_enable()
- fixes function docs
- introduces "drupal_static_fast"
- fixes the menu link query (it seems quicker to add the field than to load the whole schema cache)
- removes an unused cache clear due to the discarded static implementation - this could give some performance gain on tests
- adds comments for usage of drupal_get_schema()

podarok’s picture

#57: 402896_drupal_get_schema_3.patch queued for re-testing.

catch’s picture

The most frequent is entity_save() - we changed one big cache call (with unserialization) into a bunch of calls (minimum 4 and probably 20...) for cached table schemas. So we use less memory but make more I/O against the cache backend (DB by default); CPU usage seems the same.

I think this is fine as a trade-off. Even sites that have thousands of nodes or comments posted per day are still going to have a much higher read rate, and despite the extra cache_get()s they will still get the memory saving.

OTOH, with fields in core some sites could have nearly 200-300 tables in the schema - that's around 1-3 MB of memory, but loading all .install files actually eats more memory.

This is true, although the patch shouldn't result in .install files being loaded any more frequently than now? I fully agree we need to sort .install files out though.

On wildcards, the db and memcache backends both have support for that (the memcache project added it in 1.5). It'd also be possible to avoid wildcards by looping over the schema array keys and clearing each item individually.

I'd prefer not to remove the actual changes to drupal_get_schema() caching in this issue - although I'm fine if you want to split the other changes off to a separate issue. Neither version of this patch is going to get into core in the next 24 hours (i.e. before 7.0 release), so we have some time to get it properly reviewed for 7.1.

moshe weitzman’s picture

I copied just the critical hunk from the latest patch (menu.inc) and posted to a new issue at catch's suggestion. See #1014096: Don't load schema on every page view. Hopefully that one can go in while we work on a broader solution here.

andypost’s picture

--- includes/module.inc	27 Nov 2010 20:41:38 -0000	1.209
+++ includes/module.inc	3 Jan 2011 17:24:52 -0000
@@ -421,14 +421,14 @@ function module_enable($module_list, $en
       _system_update_bootstrap_status();
       // Update the registry to include it.
       registry_update();
-      // Refresh the schema to include it.
-      drupal_get_schema(NULL, TRUE);
       // Clear entity cache.
       entity_info_cache_clear();
 
       // Now install the module if necessary.
       if (drupal_get_installed_schema_version($module, TRUE) == SCHEMA_UNINSTALLED) {
         drupal_install_schema($module);
+        // Refresh the schema to include it.
+        drupal_get_schema(NULL, TRUE);

I think this hunk should be committed before release because it fixes a race condition.
I can provide a patch within 3-4 hours, maybe someone can make it quicker...

Powered by Dreditor.

webchick’s picture

Nothing else is getting committed before release, as far as I'm concerned. So take your time.

nightowl77’s picture

subscribe

catch’s picture

Issue tags: +Needs backport to D7
FileSize
15.02 KB

Here's a new patch based on CacheArrayObject from #1011614: Theme registry can grow too large for MySQL max_allowed_packet and memcache default slab size.

This does not include the various fixes from andypost, most of these look good but they're not anything to do with schema memory usage, I think we need one issue for the caching, and one for cleaning up drupal_get_schema() usage around core.

This takes a different tack from before.

drupal_get_schema() loads a SchemaCache object if $table is passed in; otherwise the whole array is returned as before.

Instead of caching table by table, we build a runtime schema cache over time (similar to the registry issue), the cache is split between GET requests and everything else so that schemas used in drupal_write_record() are separated as much as possible. This will save some cache requests when saving complex objects with lots of drupal_write_record() calls.

It adds drupal_get_complete_schema() - this is the old drupal_get_schema() code except now it's just a helper. Could be a private function maybe.
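
The SchemaCache class itself then stays very small - something along these lines (an illustrative sketch, not the exact patch code):

class SchemaCache extends CacheArrayObject {
  protected function resolveCacheMiss($table) {
    // On a run-time cache miss, fall back to the complete schema. The base
    // class takes care of adding the result to the run-time cache item, so
    // subsequent requests get this table without a full build.
    $complete_schema = drupal_get_complete_schema();
    return isset($complete_schema[$table]) ? $complete_schema[$table] : NULL;
  }
}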

Working on this I am left wondering why on earth drupal_get_schema() is in bootstrap.inc, and for that matter why we have to have all the install/crud parts of schema API in common.inc, but that's not for this patch.

Ran node tests which all passed, let's see what the bot thinks.

Status: Needs review » Needs work

The last submitted patch, schema.patch, failed testing.

catch’s picture

Version: 7.x-dev » 8.x-dev
Category: bug » task

Now that #1014096: Don't load schema on every page view is in, we're only loading the full schema on every request if a contrib module triggers it, for writes, and in Drupal 6. That makes this a task now; leaving it at major though, since it's so easy for a contrib module or a core patch to get it back to square one.

Fails look like something silly in the patch so will have a look at fixing it up later.

catch’s picture

FileSize
6.91 KB

Not quite as silly a bug as you might think.

Here's why that patch failed:

- CacheArrayObject::__destruct() calls lock_acquire() / cache_get() etc. which need a database.
- drupal_get_schema() puts the CacheArrayObject() in drupal_static
- DrupalUnitTestCase::setUp() calls drupal_static_reset()
- when it calls drupal_static_reset(), there is neither the default database connection for the real site, nor the new simpletest database available.
- resetting the static takes the object out of scope, calls the destructor which expects a database, and everything explodes.

This patch just uses a normal static for now.

catch’s picture

Status: Needs work » Needs review

Marking CNR for the bot, I didn't run full test suite but that looked like most of the failures.

pillarsdotnet’s picture

omercioglu’s picture

catch’s picture

Issue tags: -Needs backport to D6
FileSize
7.12 KB
7.06 KB

Updated D8 patch with docs fixes and some minor cleanup.

Also a backport to Pressflow/D6. We can't actually use this version of the patch in D6 proper since it is PHP5-only, so removing the tag.

catch’s picture

FileSize
6.85 KB

Discussed in irc with Crell and chx. We had decided to put the interface back, but also on making the methods protected; we can't have both, so I've left the interface out, made the class abstract, and resolveCacheMiss() is now an abstract function to enforce implementing it in the extending class.

Ran a couple of tests locally but not everything.

Crell’s picture

Status: Needs review » Needs work

There is no actual interface definition here.

What we probably want is:

interface CacheObjectInterface extends ArrayAccess {
  // New stuff here.
}

abstract class CacheArrayObjectAbstract extends ArrayObject implements CacheObjectInterface { 
  // Implement everything except resolveCacheMiss().
}

class SchemaCacheObject extends CacheArrayObjectAbstract {
  public function resolveCacheMiss() { ... }
}

(The use of the Abstract suffix is not in our coding standards currently but is part of PSR-0.)

catch’s picture

Status: Needs work » Needs review
FileSize
6.82 KB

Docs cleanup from aspilicious.

drupal_get_schema() static cache was a bit wonky, refactored a bit. Only ran a couple of tests as usual.

pillarsdotnet’s picture

Status: Needs review » Needs work

Crell wrote in #73:

(The use of the Abstract suffix is not in our coding standards currently but is part of PSR-0.)

Being woefully uninformed, I spent about an hour trying to figure out what you meant by this statement. My best candidate so far is Drupal's position on PIG standards.

catch’s picture

Status: Needs work » Needs review

There is no interface because you cannot define protected methods in an interface.

All three methods in CacheArrayObject, that do not exist in ArrayObject are protected.

So it would be an empty interface.

This is what I already said in #72.

sun’s picture

I see no reason why these methods should be protected. Overall, recent addition of OO code in core has too many properties and methods protected all over the place for no good reason.

That said, I also don't understand why an abstract class additionally needs an interface - the class needs to define and document the abstract methods anyway already, and without implementing the abstract methods, it's impossible to extend the class.

Thus, the only remaining todo from my perspective would be to change the methods (and properties) to public. That's also the only solid way to write proper unit tests to verify the low-level behavior, instead of testing an irrelevant subsystem.

catch’s picture

I agree there is no reason to define an interface for an abstract class (that is already extending a class that itself implements well defined interfaces), especially since the class is expected to be extended - no-one implements the interface directly. That would seem to be the whole point of abstract classes.

There's no reason (including unit tests) to call the three protected methods on the class - they are internal only. chx, webchick and Crell all asked for these to be protected, frankly I don't care but I am not fighting OO wars in here - you can all do that between yourselves if you like and call me when you're done.

You cannot expose variables as public on an ArrayObject without them being accessible as part of the 'array', so that is out of the question regardless of people's likes and dislikes.

chx’s picture

Status: Needs review » Reviewed & tested by the community

the argument webchick had for protected is that ordinary users will know that they have nothing to do with these methods. and so yes protected is good. I have nitpicked the previous versions of the patch like no tomorrow and honestly, I am out of problems with this patch. It's good to go.

Crell’s picture

Status: Reviewed & tested by the community » Needs work

I was not in channel when you were discussing what methods should be protected, catch. I didn't weigh in on that front at all, other than going "Ooo..." when chx said he'd been advocating for one. :-)

As I discussed in channel, though, we *do* want an interface. Using only a base class means you can never, ever implement a cache object that does not extend from it. (Type hinting breaks otherwise.) Even if the majority case will probably just extend from the base class, we don't want to force that. All we care about is that "I can call this thingie and be sure that these methods are there". That's an interface. We do want to allow, say, an implementation that uses ArrayAccess but doesn't just wrap a PHP array internally like ArrayObject does. Suppose we want to wrap SplFixedArray instead for performance reasons. Or SplObjectStorage. Or, hell, a database result? Or something from a PECL extension? Or just wrap ArrayObject and compose it rather than extending it? PHP has LOTS of data collection systems.

Our coding standards specify interface-driven development for a reason. Just because the common implementation will extend from one class doesn't mean we should hard-code that all of them must.

As far as unit testing, the way to make things unit testable is not to make everything public. It's to inject anything that is going to get read from or written to rather than hard coding calls to it. That means we can easily inject mock objects instead that implement the same interface (not the same parent class!) and act as canaries. We can then test those mock objects to ensure that things behaved the way they should.

catch’s picture

Status: Needs work » Needs review
FileSize
7.13 KB

Sorry Crell, I took 'Ooo' as a sign of approval, didn't realise you missed the actual discussion about that bit...

In terms of the interface we are more or less back to http://drupal.org/node/1011614#comment-4537674 now.

I have left resolveCacheMiss as an abstract method in the CacheArrayObjectAbstract class - since there is no point in providing a default implementation.

It would be possible to have that in the interface and an empty function body in the class, but it is more of an implementation detail in the base class anyway, and we want to enforce that extending classes actually override that method (i.e. have PHP throw a fatal error rather than silently not resolve cache misses). I can be persuaded either way on this though.

One thing from irc that didn't make it into discussion here - Crell mentioned passing in a cache object (as returned from _cache_get_object($bin)) instead of the string $bin. I started work on converting to that, but for full dependency injection unit testing you would also need to be able to pass in a lock object and mock the locking framework too - but that is procedural in D7/8.

Potential followups for D8: (I would prefer not to try to tackle these in the initial patch, because this is currently our best chance of getting other major performance issues like #1040790: _field_info_collate_fields() memory usage fixed for D7).

- allow for full dependency injection
- allow a 'resolve cache miss' callback to be passed into the constructor so the class can be used as is (this is tricky since the constructors can take different arguments - PHP does not enforce that arguments are the same with __construct())
- add a factory and make each implementation (theme registry, schema) pluggable (this will also require a resolution to the __construct() issue since the factory will need to be able to handle arbitrary function parameters which gets ugly).

Crell’s picture

Status: Needs review » Needs work

After some more discussion with catch in IRC:

- All methods need to have their visibility specified explicitly, per coding standards. (Sorry, I should have caught that one before.)

+++ b/includes/bootstrap.inc
@@ -159,6 +159,160 @@ define('REGISTRY_WRITE_LOOKUP_CACHE', 2);
+ * Extend CacheArrayObject rather than implementing this interface directly.

Not true. Rather say, "In most cases it is better to extend CacheArrayObjectAbstract than implementing this interface directly."

+++ b/includes/bootstrap.inc
@@ -159,6 +159,160 @@ define('REGISTRY_WRITE_LOOKUP_CACHE', 2);
+  /**
+   * Add an offset and value to the ArrayObject cache.
+   *
+   * @param $offset
+   *   The array offset that was request.
+   */
+  function persist($offset);

catch and I talked this over at length in IRC. The comment here should be expanded. I would recommend:

"Flag an offset value to be written to the persistent cache.

If a value is assigned to a cache object with offsetSet(), it will not be written to the persistent cache unless it is flagged with this method."

I'm also tempted to say that persist() should take a key and a boolean, so you can turn off persisting for a key if necessary. I could go either way there, but it seems like a good idea for completeness.

We should probably also re-specify offsetSet() with an identical definition to ArrayAccess, but add the following comment:

"Note that values assigned with array access ($c['foo'] = 1) will not automatically be written to the persistent cache. To persist a new value, first assign it with array access and then use the persist() method to flag it for writing. In most cases that should not be necessary as the cache object is responsible for regenerating missing cache values on demand."

+++ b/includes/bootstrap.inc
@@ -159,6 +159,160 @@ define('REGISTRY_WRITE_LOOKUP_CACHE', 2);
+  /**
+   * Write to the persistent cache.
+   *
+   * @param $cid
+   *   The cache ID.
+   * @param $bin
+   *   The cache bin.
+   * @param $data
+   *   The data to write to the persistent cache.
+   * @param $lock
+   *   Whether to acquire a lock before writing to cache.
+   *   Defaults to TRUE.
+   */
+  function set($cid, $data, $bin, $lock = TRUE);

This should read "Immediately write a value to the persistent cache."

+++ b/includes/bootstrap.inc
@@ -159,6 +159,160 @@ define('REGISTRY_WRITE_LOOKUP_CACHE', 2);
+/**
+ * Extends ArrayObject to enable it to be used as a caching wrapper.
+ *
+ * This class should be extended by systems that need to cache large amounts
+ * of data and have it represented as an array to calling functions. These
+ * arrays can become very large, so ArrayObject is used to allow different
+ * strategies to be used for caching internally (lazy loading, building caches
+ * over time etc.). This can dramatically reduce the amount of data that needs
+ * to be loaded from cache backends on each request, and memory usage from
+ * static caches of that same data.
+ *
+ * Note that array_* functions do not work with ArrayObject.
+ *
+ * By default, the class accounts for caches where calling functions might
+ * request keys in the array that won't exist even after a cache rebuild. This
+ * prevents situations where a cache rebuild would be triggered over and over
+ * due to a 'missing' item. These cases are stored internally as a value of
+ * NULL. This means that the offsetGet() and offsetExists() methods
+ * must be overridden if caching an array where the top level values can
+ * legitimately be NULL, and where $object->offsetExists() needs to correctly
+ * return (equivalent to array_key_exists() vs. isset()). This should not
+ * be necessary in the majority of cases.
+ *
+ * Classes extending this class will need to override at least the
+ * resolveCacheMiss() method to have a working implementation.
+ *
+ * @see SchemaCache
+ */

Most of this docblock should be moved to the interface. The interface should describe how cache objects should be used. CacheArrayObjectAbstract should describe how this particular common implementation works, or any quirks it had relative to the interface.

+++ b/includes/bootstrap.inc
@@ -159,6 +159,160 @@ define('REGISTRY_WRITE_LOOKUP_CACHE', 2);
+   * @param $bin
+   *   The bin to cache the array.a

Stray "a".

Powered by Dreditor.

catch’s picture

Most of this looks fine but:

"Note that values assigned with array access ($c['foo'] = 1) will not automatically be written to the persistent cache. To persist a new value, first assign it with array access and then use the persist() method to flag it for writing. In most cases that should not be necessary as the cache object is responsible for regenerating missing cache values on demand."

I expect implementing classes to use these methods, but absolutely do not expect persist() to be called except from within the class. This is why they moved to protected methods on CacheArrayObjectAbstract() in the first place. The whole point of this patch is that the caching logic is encapsulated entirely in the class instead of split between multiple procedural functions, and procedural code (or other classes) treat it more or less as they do the current real arrays.

Thought we had straightened this issue out in irc but the comment suggests otherwise.

So once again, I would argue that the specific interface is superfluous, and any type hinting can be dealt with by just specifying ArrayAccess. All of this stuff is very closely tied to the base implementation, if you were not using that, you could easily go without the ->set() or ->persist() methods - so they should not be in the interface, which means no interface apart from ArrayAccess is necessary.

It is still 'interface driven development', it just happens there is a perfectly good interface defined in PHP itself that we're using.

This takes us back to http://drupal.org/node/402896#comment-4632780, I will add the other docs changes to that patch.

Crell’s picture

... Were we seriously talking past each other for an hour? :-(

I was at all times speaking about the general case interface, NOT any specific implementation. What were you talking about? If we weren't talking about a publicly exposed persist() and set(), what the heck were we talking about?

Bear in mind that the question of "WTF does offsetSet() actually do" is still valid even if you don't want to expose persist() and set(). That's something that should be documented and consistent.

catch’s picture

@Crell -
when you say this:

"Note that values assigned with array access ($c['foo'] = 1) will not automatically be written to the persistent cache. To persist a new value, first assign it with array access and then use the persist() method to flag it for writing. In most cases that should not be necessary as the cache object is responsible for regenerating missing cache values on demand."

You are talking about this kind of usage:

$schema = new SchemaCache();

$schema['foo'] = 'bar';
$schema->persist('foo');

Or at least that is how the comment reads - you even say "in most cases the cache object is responsible for regenerating missing cache values on demand", which is the same thing (except it should be 'in all cases'; the only optional thing is whether the values are written back to the cache system or not).

Calling those methods from outside code is the only reason to make them public, but we don't want them called from outside (and having them public is a requirement of them being in the interface). I have made this point repeatedly and I don't know how many different ways it can be said, but here's another go:

There is plenty of reason to have them as protected methods on the base class, since there are a good dozen specific implementations that will end up using the same basic pattern once this is applied around core. If you don't extend the base class, then you might as well extend CacheArrayObject, or just implement ArrayAccess directly since you are not getting anything for free except some public methods that have no reason to be public.

That is not the same as "I was at all times speaking about the general case interface, NOT any specific implementation".

Implementing an interface and extending a class - neither of these is at all the same as the use of that class by code that instantiates it, which is what public functions and interfaces are really for. Otherwise there would be no use case for abstract classes and methods at all (the accepted way to define methods that must be implemented or classes that must be extended, without having them public).

To make sure this is not just me talking crap (which it definitely could be), I checked the php.net docs for abstract classes, this comment sums it up nicely:

http://www.php.net/manual/en/language.oop5.abstract.php#103752

By their nature, although sometimes they may look quite similar, abstract classes and class interfaces serve very distinct purposes.

The interface of a class is meant as a tool for the "user" of that class. An interface is a public presentation for the class, and it should advertise, to anyone considering to use it, what methods and constants are available and accessible from the outside. So, as it name suggests, it always sits between the user and the class implementing it.

On the other hand, an abstract class is a tool aimed at helping the "implementor" of the classes that extend it. It is an infrastructure that can impose restrictions and guidelines about what the concrete classes should look like. From a class design perspective, abstract classes are more architecturally important than interfaces. In this case, the implementor sits between the abstract class and the concrete one, building the latter on top of the former.

So either we were talking past each other on whether the methods would be publicly accessible or not, or we were talking past each other about the relative roles of abstract classes vs. interfaces. Personally it feels like the latter to me, based on your references to the interface being for the 'implementor' of the class (not the user of it), and if that's the case then we need to clarify this distinction properly, ideally by adding to the existing documentation on Drupal.org.

This doesn't mean the methods don't need more docs, there was a good 30 minutes of that one hour discussion where we weren't talking past each other and most of the review in #82 stands - but it'd be great to get interface vs. abstract cleared up so I only have to maintain one version of the patch.

Crell’s picture

I think it was actually the former that was the confusion, as I thought you were saying that the code sample above (which is the use case I was talking about) was legit. If your intent is that it would never ever be done, then no, persist() doesn't need to be public.

I do mostly agree with the quote from PHP.net above. The only place I'd question it is that I'd place interfaces as more important than abstract classes, and treat abstract classes as an implementation convenience (albeit an often very useful one). In many cases, they could even be replaced by traits in PHP 5.4 (whenever that becomes available).

The question still remains, though, what offsetSet does. The following code is not something we can prevent:

$schema = new SchemaCache();
$schema['foo'] = 'bar';

What do we want to happen on the second line? (That's what I was trying to get at this morning.) The options I see are:

1) It is an explicit cache set, either calling set() immediately or flagging persist() (internally).
2) It is equivalent to drupal_static, in that it persists for the rest of the request but is lost after that.
3) It is equivalent to drupal_static, unless the caller calls something else to insist that it get written (e.g., calling persist().)
4) We decide that no one should ever even consider doing so and throw an exception.

I don't have a great deal of preference as to which of those we do, but it needs to be clearly and explicitly documented. An implementation that does not extend from ArrayObject will not be able to inherit its implicit option 2, so we should document what it SHOULD do in order to remain compatible.

What I thought we discussed in IRC was taking option 3, which is what I documented above. Option 3 would require a new interface. The others strictly speaking do not, unless we wanted to use it for more explicit documentation (eg, don't use the setter, it go boom) and future flexibility (eg, a standardized constructor if we are able to do that later). Personally I tend to err on the side of having the interface rather than not, but I defer to Drieschick on that.
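For reference, option 3 as used by a caller would look roughly like this (a hypothetical sketch - it assumes a public persist() and elides constructor details, which is exactly what is under discussion):

// Option 3: the assignment behaves like drupal_static() (request-only)...
$schema = new SchemaCache();
$schema['foo'] = array('fields' => array());

// ...unless the caller explicitly flags the key to be written back to the
// persistent cache at the end of the request.
$schema->persist('foo');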

I also agree with the "things for D8 only for later" list from #81.

catch’s picture

Hmm, you seemed confused about why there was a persist() method at all and what it does (hence the new docs, which are great) - i.e. why not override offsetSet() and add the logic there, rather than allowing implementing classes to differentiate between what amounts to static vs. persistent caching (which IMO is an implementation detail, not at all a requirement, and all of that belongs in the base class providing sensible defaults rather than in an interface). Either way, clearly we were talking past each other.

If we take your $foo['bar'] = 'baz' example: if you do $foo = new SchemaCache(); you will be making your very own instance of the class (because elsewhere it is statically cached and only accessed via a wrapper), setting this offset within your function scope, and it'll be discarded as soon as the object goes out of scope.

If you call drupal_get_schema() you never actually get returned a full SchemaCache object to play with - you either get the full array from drupal_get_complete_schema() (not the ArrayObject), or you get a single schema array for a table - the class is hidden behind those wrappers.

Eventually I would want to completely remove caching of the full schema and try to build the ArrayObject on demand only (if you want the full schema you'd have to really want it, and it'd be generated each time with at most static caching). Some caches like views defaults get too big for max_allowed_packet/memcache slab sizes, but that is an implementation detail again.

So with the patch as it stands, it'd be equivalent to none of the 1-4 above - and is exactly what happens if you do $foo = new ArrayObject(); To me this is a bit useless but does no harm either way. People could also do things like call __destruct() on the object themselves, which would be inadvisable, but there are limits.

The only thing I could possibly see being worth doing there would be having SchemaCache::offsetSet() itself throw an exception - but only on that particular implementation since classes should not have to override that method to prevent that behaviour. I am not at all convinced that is worth doing though. How many people call $schema = drupal_get_schema(); $schema['foo'] = 'bar'; and expect this to do something useful? You could say the exception behaviour should be documented in an interface, but I would disagree with having an interface that only exists to document what an implementing class may or may not want to do - there is currently no deviation from ArrayObject in terms of offsetSet() (i.e. no implementations override it at all).

Having said that, I've been thinking about what this could look like if the methods were public.

So say an implementation wants to support operating on the cache from outside:

$foo = new MyCacheArrayObject(); $foo['bar'] = 'baz'; $foo->persist('bar');

Setting the methods as protected in the abstract class and not having an interface neither encourages nor prevents this.

Extending classes can make all of these methods public - set(), persist() etc. - and/or add their own; when inheriting from an abstract class you may declare a method at the same or a less restrictive visibility than the parent, and that is the only requirement. So if a class wants to support it, it can - but that is not a decision we should impose either way. It could also inherit directly from CacheArrayObject() and do something else.
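For example (hypothetical subclass, purely to illustrate the visibility point):

class MyCacheArrayObject extends CacheArrayObjectAbstract {
  // PHP allows a child class to widen visibility from protected to public,
  // so an implementation that wants callers to flag writes themselves can
  // simply re-declare the method.
  public function persist($offset) {
    parent::persist($offset);
  }
}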

If it turned out that lots of modules have a use case for that (I can't think of one but it might come up), then that might be a case for adding an interface - however CacheArrayObjectAbstract would not implement that interface because it has protected methods, and neither would any of the 5-10 follow-up implementations I have planned, and we can add that interface later without changing anything here.

i.e. nothing stops contrib or core at a later date adding:

interface CacheArrayObjectDoWhatYouLikeInterface

Then

class MyCacheArrayObject extends CacheArrayObject implements CacheArrayObjectDoWhatYouLikeInterface

(MyCacheArrayObject might be a specific implementation or could be its own base/abstract class).

I tried to come up with an example where you would specifically not want to have an interface defined for type hinting beyond ArrayAccess (other than the protected methods issue) to explain why I think it's not a requirement, see if this makes sense to you:

Let's consider a Drupal6RetroSchemaCache implementation - one which reverts core back to the old behaviour of loading everything in one go. This is a silly name, but there is a valid use case for 4.6-era path alias caching on tiny sites, and they would have similar code.

That class would look like:

class Drupal6RetroSchemaCache extends CacheArrayObject {
  function __construct() {
    $schema = drupal_get_complete_schema();
    parent::__construct($schema);
  }
}

That is probably the most minimal implementation - this class does not need set(), persist() or any other methods because it is not doing any lazy loading, and it inherits directly from CacheArrayObject(). It will still pass the ArrayAccess type hinting though.

I'm not sure if this helps or not, but I would rather leave this open (including to adding a separate interface that documents particular use cases that expose public methods at a later date) than impose particular implementations on people now.

catch’s picture

Status: Needs work » Needs review
FileSize
7.63 KB

OK here's the docs cleanup, still no interface, I did not add anything for offsetSet() yet.

Review of the new docs would be great, it is mainly around persist() - I added the extra optional parameter too.

That leaves offsetSet(), let's recap the choices:

1. Add it to the interface and have CacheArrayObject implement that interface, and document these three possible behaviours:
- offsetSet doesn't do anything special (could throw an exception or just leave it as un-implemented)
- offsetSet works for the duration of the request, but needs an additional persist() to write back at the end.
- offsetSet works both for the request, and writing back to the cache.

I'm still not keen on adding an interface with one method that is already defined by the interface CacheArrayObject inherits from - especially since most classes won't override that method anyway - and then documenting three different ways to do something with it; that just seems like a lot of overhead for an edge case. We wouldn't actually be enforcing or recommending anything other than "ArrayObject has this method, you should figure out how it should behave and here's some options".

2. Document this in the SchemaCache class docblock, with or without the exception for the adventurous. Leave it at that for now.

3. Add an interface but only have SchemaCache implement it, documenting the specific behaviour of it not doing anything (or throwing an exception).

4. Since the behaviour of offsetSet() is no different to a standard ArrayObject, don't document it anywhere special, but this leaves the potential for people who aren't Crell to also get confused.

I don't really like any of these, but at this point we are only really talking about where to document it, so if one sounds particularly good to people (given we are probably not going to force one pattern on anyone), then I'll try to live with it.

Any different feedback on the current patch very welcome. Since we're down to (maybe) one change here, and it's not actionable yet, tentatively marking back to CNR for more feedback and another chance for the bot to review.

aspilicious’s picture

Status: Needs review » Needs work
+++ b/includes/bootstrap.inc
@@ -159,6 +159,163 @@ define('REGISTRY_WRITE_LOOKUP_CACHE', 2);
+ *
+ * @see SchemaCache ¶

That sneaky whitespace won't go away ;)


Crell’s picture

+++ b/includes/bootstrap.inc
@@ -159,6 +159,163 @@ define('REGISTRY_WRITE_LOOKUP_CACHE', 2);
+   *   be unflagged so that it will no written at the end of the request.

"it will NOT be written". Missing the 't' on not.

+++ b/includes/bootstrap.inc
@@ -2465,6 +2622,55 @@ function ip_address() {
+  public function __construct() {
+    // Cache by request method.
+    $this->cid = 'schema:runtime:' . $_SERVER['REQUEST_METHOD'] == 'GET';
+    parent::__construct($this->cid, 'cache');
+  }
+

You don't actually need to set $this->cid here. It will be set by the parent. I suppose it doesn't matter either way but just calling it $cid and passing that would be marginally cleaner. You could even inline it.

I think the only open question here is what offsetSet does and how to document it. At this point I don't care as much what it is as long as it's properly documented, so I defer to whatever other folks want to do with it.

catch’s picture

Status: Needs work » Needs review
FileSize
8.04 KB

Fixed the docs changes/whitespace/$this->cid.

I've added some docs for offsetSet() in the main docblock for CacheArrayObject for now, I'm open to ideas for elsewhere to document this.

Status: Needs review » Needs work

The last submitted patch, schema.patch, failed testing.

catch’s picture

Status: Needs work » Needs review
FileSize
8.33 KB

Some additional docs via chx in irc.

Re-posting the list of potential use cases for this:

- drupal_lookup_path() - system path cache and maybe parts of the current static caching.
- module_implements() cache (uses exactly this pattern except it's baked in)
- registry cache (similar)
- theme registry (already has an implementation at #1011614: Theme registry can grow too large for MySQL max_allowed_packet and memcache default slab size)
- variable cache.
- system_get_info()
- system_list()
- _field_info_collate_fields()

Some of these are more or less urgent performance issues at the moment, some would just be conversions for consistency at this point.

Status: Needs review » Needs work

The last submitted patch, schema.patch, failed testing.

catch’s picture

Status: Needs work » Needs review
FileSize
8.34 KB
chx’s picture

Status: Needs review » Reviewed & tested by the community

Summary: Several large arrays in core -- schema, theme registry, module implements, fields, instances etc. -- are cached. Currently we store and load everything on every page load, which is one of the reasons Drupal 7 requires so much RAM. This patch changes that by introducing an automated mechanism where only the items actually used on a page request are stored and, consequently, loaded back. This is accomplished by a nice PHP 5 class called ArrayObject which allows an object to behave as an array as well -- you can do $foo['bar']. So the caller function retrieves whatever item it needs either from cache or from the non-cached complete data, and for the latter it marks the need for caching. When the object is destructed -- which typically only happens at the end of the request -- the cache is written back. Very tidy magick.
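In code, the flow described above boils down to something like this (a condensed sketch, not the exact patch; the real class has extra handling for NULL placeholders, and cache_get()/cache_set()/lock_acquire() are the usual Drupal 7 APIs):

abstract class CacheArrayObject extends ArrayObject {
  protected $cid;
  protected $bin;
  // Offsets flagged for writing back to the persistent cache on destruct.
  protected $keysToPersist = array();

  public function __construct($cid, $bin) {
    $this->cid = $cid;
    $this->bin = $bin;
    // Seed the array with whatever this cache entry already contains.
    $cached = cache_get($cid, $bin);
    parent::__construct($cached ? $cached->data : array());
  }

  public function offsetGet($offset) {
    if (!parent::offsetExists($offset)) {
      // Not cached yet: build just this item and flag it for write-back.
      $this[$offset] = $this->resolveCacheMiss($offset);
      $this->persist($offset);
    }
    return parent::offsetGet($offset);
  }

  // Build a single missing item (e.g. one table's schema definition).
  abstract protected function resolveCacheMiss($offset);

  protected function persist($offset) {
    $this->keysToPersist[] = $offset;
  }

  public function __destruct() {
    if ($this->keysToPersist) {
      $data = array();
      foreach ($this->keysToPersist as $offset) {
        $data[$offset] = $this[$offset];
      }
      // Lock and merge with the existing entry so parallel requests only
      // ever add to the cache instead of clobbering each other.
      if (lock_acquire($this->cid)) {
        if ($cached = cache_get($this->cid, $this->bin)) {
          $data = $cached->data + $data;
        }
        cache_set($this->cid, $data, $this->bin);
        lock_release($this->cid);
      }
    }
  }
}

A SchemaCache subclass then only has to implement resolveCacheMiss(), typically by pulling one table out of drupal_get_complete_schema().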

We already do something like this but it's crude, ugly and insanely error prone:

_registry_check_code(REGISTRY_WRITE_LOOKUP_CACHE);
  drupal_cache_system_paths();
  module_implements_write_cache();

in drupal_page_footer does the writes for the named functions. Automating this process is good. Otherwise we need very ugly arguments or sneaking out data via drupal_static.

Most discussion in the issue was just bikeshedding architecture: these objects extend ArrayObject and the only public methods are the ones defined by it, so creating a separate interface is pointless, as an interface can only contain public methods. Also, these objects are only ever used by their relevant functions (i.e. drupal_get_schema() uses the SchemaCache object), so protected vs. public matters very little here, as no one will ever instantiate these directly.

We may want to revisit aspects of this in Drupal 8, for example this pattern may be useful for the configuration object as opposed to just our current large arrays. However there are currently no objects in Drupal 7 or 8 to apply this pattern to, and the small number of Drupal 8 followups from #81 should not impact this either way.

I already said some 20 comments ago this is good to go. Since then we have clarified more comments. It's really good to go :)

pounard’s picture

I think the per-table schema cache is a bit overrated. If I do multiple lookups to fetch the schema of many tables, I will make multiple cache backend queries.

Since the schema already is code, I don't see any need to cache it; in the PHP files it's already in the most optimized form we could have, and all we need is to be able to fetch it only when needed.

Here is an alternative patch that provides only one cache entry (the table listing only) and never does more than one cache query, but is still able to provide lazy table schema definitions without any static cache, just by including the right .install file on demand.

Note it's unfinished because it gets rid of hook_schema_alter(). But hook_schema_alter() is useless anyway because it doesn't alter the *_schema_unprocessed_* functions, so it creates a desync between the global cache and individual table retrievals.

This patch is only a PoC, which is another reason it's unfinished: if you use a foreach() on the object you will get a degraded table => module list instead of the full schema, but that seems quite simple to fix and wasn't the goal of the patch itself. FYI I did install a site using this patch; it worked gracefully, no watchdog errors.

EDIT: Sorry, I read an older version of the patch. Mine still produces a smaller cache, but it could cost more because of the function calls; needs testing.

pounard’s picture

And it passes tests ^^ Lucky shot!

catch’s picture

The patch here does not use one cache entry per table, there were patches perhaps 40 posts ago that did, but please at least read the issue before posting alternatives.

pounard’s picture

Yes sorry, edited my post. But my solution is still API oriented rather than cache oriented. The best fix would be to be able to directly ask the module for the right table without any cache at all, since that removes SQL queries entirely.
Ignore my posts here. I really don't like using caching because it makes (sometimes unneeded) remote calls, but in this case catch has been right from the start.

After a long IRC chat with catch, who really enlightened me, I have to admit that the incremental cache is a really good idea for keeping the API signature compatible with the current D7 core. I should have read the full 100 posts before discussing it; I was still discussing the state it was in some days ago and didn't see that it had changed so much. It's indeed the best choice for keeping the existing API signature compatible without the trouble of refactoring.

andypost’s picture

@pounard The schema is not always stored in code - did you remember hook_schema_alter() when posting your patch? That approach leads to an unpredictable amount of code execution when getting the schema.

sun’s picture

Status: Reviewed & tested by the community » Needs review
+++ b/includes/bootstrap.inc
@@ -159,6 +159,173 @@ define('REGISTRY_WRITE_LOOKUP_CACHE', 2);
+abstract class CacheArrayObject extends ArrayObject {
...
+  /**
+   * Constructor.
+   *
+   * @param $cid
+   *   The cid for the array being cached.
+   * @param $bin
+   *   The bin to cache the array.
+   */
+  public function __construct($cid, $bin) {
+    $this->cid = $cid;
+    $this->bin = $bin;
...
+  abstract protected function resolveCacheMiss($offset);

@@ -2465,6 +2632,54 @@ function ip_address() {
+class SchemaCache extends CacheArrayObject {
+
+  public function __construct() {
+    // Cache by request method.
+    parent::__construct('schema:runtime:' . $_SERVER['REQUEST_METHOD'] == 'GET', 'cache');

I don't really get why the constructor allows passing in $cid and $bin, when the class has to be extended for an actual implementation anyway.

If there need to be different implementations in the first place, why don't we make $cid and $bin properties abstract, too, so the implementation has to define them?

The existing SchemaCache implementation is a nice example -- it removes the arguments from the constructor. And I think that would be cleaner, no?

class SchemaCache ...

protected $bin = 'cache';

protected $cid;

public function __construct() {
  $this->cid = 'schema:runtime:' . (int) $_SERVER['REQUEST_METHOD'] == 'GET';
  parent::__construct();
}

Implementations not using a dynamic $cid can specify it directly and don't have to override __construct().


catch’s picture

I'm hoping that at some point, we'll be able to pass a "rebuild the cache" callback to the constructor, and then it wouldn't be necessary to override any methods to use this. Schema would be a decent candidate for this since it hardly deviates from the default. Passing a $cid and $bin would be necessary then.

There's also the dependency injection thing with the cache backend - if we wanted to do a proper unit test for this, we'd need to:
- pass a cache object in the constructor (_cache_get_object($bin))
- pass a lock object in the constructor (lock is procedural so this isn't currently possible).

That would let you write a test without hitting the database at all using mock cache and lock classes.
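Purely as an illustration of what that could look like later (not part of this patch; DrupalCacheInterface exists in D7, but the lock object and the mock classes named here are hypothetical):

abstract class CacheArrayObject extends ArrayObject {
  protected $cid;
  protected $cacheBackend;
  protected $lock;

  // _cache_get_object($bin) already returns a DrupalCacheInterface object;
  // a lock *object* does not exist yet, so $lock stands in for a future
  // abstraction of the procedural lock.inc API.
  public function __construct($cid, DrupalCacheInterface $cache_backend, $lock = NULL) {
    $this->cid = $cid;
    $this->cacheBackend = $cache_backend;
    $this->lock = $lock;
    $cached = $this->cacheBackend->get($this->cid);
    parent::__construct($cached ? $cached->data : array());
  }
}

// A unit test could then pass in-memory fakes and never touch the database:
// $schema = new SchemaCache('schema:runtime:1', new FakeMemoryCache(), new FakeLock());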

So it's not strictly necessary now (although it's convenient - previous versions of the patch were passing the $cid into the SchemaCache constructor, I preferred it this way though - like you say it's a bit prettier), but it will likely become necessary later on.

I'm probably 50/50 on the usefulness of the default constructor at the moment, but medium term we'll likely end up having those arguments back in some guise.

sun’s picture

Status: Needs review » Reviewed & tested by the community
FileSize
8.34 KB

Alright, re-uploading #96 for clarity then.

Crell’s picture

Status: Reviewed & tested by the community » Needs review
+++ b/includes/bootstrap.inc
@@ -159,6 +159,173 @@ define('REGISTRY_WRITE_LOOKUP_CACHE', 2);
+  /**
+   * An array of keys to add to the cache at the end of the request.
+   */
+  protected $add_keys = array();

Coding standards say this should be called $addKeys, not $add_keys. (Or maybe $keysToSave, since "add" implies a verb, aka a method.)

Yes, I'm being pedantic. :-)

+++ b/includes/bootstrap.inc
@@ -159,6 +159,173 @@ define('REGISTRY_WRITE_LOOKUP_CACHE', 2);
+  protected function persist($offset, $persist = TRUE) {
+    if ($persist) {
+      $this->add_keys[] = $offset;
+    }
+    else {
+      // Unflag the item for writing to the persistent cache.
+      $this->add_keys = array_diff($this->add_keys, array($offset));
+    }
+  }

Minor nitpick, this would be slightly cleaner as $this->addKeys[$offset] = $persist; Then it's a single line that also automagically handles someone adding the same key multiple times.

Then in the destructor, we can just do foreach(array_keys(array_filter($this->addKeys))).
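Roughly (a sketch only; $keysToSave is just one of the names floated above, and set() uses the signature quoted earlier in the issue):

protected function persist($offset, $persist = TRUE) {
  // TRUE flags the offset for write-back, FALSE unflags it; flagging the
  // same offset twice simply overwrites the previous flag.
  $this->keysToSave[$offset] = $persist;
}

public function __destruct() {
  $data = array();
  // Only offsets still flagged TRUE get written back.
  foreach (array_keys(array_filter($this->keysToSave)) as $offset) {
    $data[$offset] = $this[$offset];
  }
  if ($data) {
    $this->set($this->cid, $data, $this->bin);
  }
}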

That wouldn't block the issue, but if we're rerolling anyway it's a nice minor cleanup.


I'm happy with the documentation here now, and I'm very +1 on the approach overall. Nice work, catch!

catch’s picture

FileSize
1.47 KB
8.2 KB

Spoke with Crell about persist, with the new function signature the code can be a bit less verbose.

Marking back to CNR. Also it looks like during the re-rolls the $foo + bar (instead of array_merge()) was reverted, put that back in.

Attaching interdiff as well.

catch’s picture

FileSize
8.2 KB

I like keysToSave, that is the only change with this one. Agreed addKeys sounds too much like a method name.

Status: Needs review » Needs work
Issue tags: -Performance, -Needs tests, -memory, -Needs backport to D7

The last submitted patch, schema.patch, failed testing.

catch’s picture

Status: Needs work » Needs review
Issue tags: +Performance, +Needs tests, +memory, +Needs backport to D7

#108: schema.patch queued for re-testing.

Crell’s picture

Status: Needs review » Reviewed & tested by the community

I think I'm out of things to complain about. :-) Nice work, catch!

Peter Bowey’s picture

Refer #71

The enclosed *patch* @ http://drupal.org/files/issues/schema_cache_D6.patch
causes this error (using Pressflow 6.22)

Warning: mysqli_real_escape_string() expects parameter 1 to be mysqli, null given in db_escape_string()

I rolled the given patch out! Now I see no errors...

Crell’s picture

#112: This patch is for Drupal 8 and Drupal 7, and the current patch is the one attached to comment #108. What you're linking to is a very old version of it that will not and will never work against Drupal 6 / Pressflow 6.

Peter Bowey’s picture

The patch linked from #108 @ http://drupal.org/files/issues/schema_21.patch does not state whether it is for D6 - thoughts please! I assumed the 'latest' patch targeted D6.
Thoughts please! I targeted the 'latest' for D6.

This seems to be what is also done at http://bazaar.launchpad.net/~catch-drupal/pressflow/cachearrayobject/rev.... However, that Pressflow back-port has the same issue.

catch’s picture

@peter I'll update the Pressflow branch once this is in D8/7, however it can never be committed to D6 (php5 only let alone anything else) and may or may not get committed to Pressflow. Not at my computer at the moment but if there's somewhere to submit the report on launchpad that might be easier to track than on this issue.

Peter Bowey’s picture

Refer #115
Thanks @catch, I will submit a report to pressflow launchpad -> https://launchpad.net/~catch-drupal.
If I am inclined ':)' I may backport from this D7/8 patch...

webchick’s picture

This is one of those patches that could really use Dries's eyeballs on it. I've emailed him about it, and hopefully he can get to it this week. Won't be for 7.3, unfortunately, but probably next month!

catch’s picture

sdboyer’s picture

sun pointed me here, and after reading this and #1011614: Theme registry can grow too large for MySQL max_allowed_packet and memcache default slab size for an hour, I've mostly got just a +1 to offer. Keeps the interface a nice, simple array, but avoids bloating memory with unnecessary data. Awesome.

One question, mostly just a sanity check. With the way this is implemented, there are a bunch of leftover methods from ArrayObject in CacheArrayObject that I haven't seen discussed or mentioned. I'm not actually clear on where things came down on the "do we use an interface or just a base abstract class," nor am I interested in reopening the discussion, but I'd just like to make sure we're not concerned that leaving those methods available ends up leaving us open to weird, potentially destabilizing uses of the API. Some ones that jumped out at me:

Obviously there's the, "you do something stupid, it's on you" view on this. Like I said, just raising this to make sure our ducks are in a row. And that we don't open up some backdoor that's convenient in the short term, but in the long term causes headaches.

Dave Reid’s picture

+++ b/includes/bootstrap.inc
@@ -2465,6 +2628,54 @@ function ip_address() {
+    $schema = new schemaCache();

Should be 'SchemaCache' right?

sdboyer brings up a good point. If those potential functions coming from the base class are destructive, we should consider overriding them to do nothing in CacheArrayObject().


pounard’s picture

@#129 I think that, for this usage, you are probably right and the ArrayAccess interface should be enough. Plus, all of the useful ArrayObject methods have been overridden, which means it's almost useless overhead to use this class as a base class. Moreover, using an interface here would allow more flexibility for developing specializations.

catch’s picture

So right now the class is using ArrayObject storage, and several methods that aren't overridden - so it's not the case at all that all the useful methods are overridden - at least a couple aren't, and some that are are still used internally. If it switched to ArrayAccess, we'd have to have our own custom storage and maintain that instead. Probably not much else would change, though it's going to be a fair bit more verbose since we currently get quite a bit for free from that storage.

I am fine with leaving sdboyer's list not dealt with (I can think of use cases for exchangeArray() - I needed to do something similar with the path system list cache once - but not the rest). If people really wanted to override those methods to no-ops, then I would likely rather go the ArrayAccess route (unless implementation-wise it got messy), since it seems silly to override methods to no-op just to save implementing other methods.

Dave Reid is right in #120. No-one has set this back to CNR or CNW for the above, so I'm not going to either in the hope we can get some high level feedback from Dries soonish - couple of weeks since webchick pinged him and the bulk of the patch is going to be the same with or without the recent reviews incorporated. I'm unlikely to have much time this week, but if I do I might look at how much work just implementing ArrayAccess and using our own storage would be.

pounard’s picture

Thanks for the explanation.

sdboyer’s picture

So right now the class is using ArrayObject storage, and several methods that aren't overridden - so it's not the case at all that all the useful methods are overridden - at least a couple aren't, and some that are are still used internally. If it switched to ArrayAccess, we'd have to have our own custom storage and maintain that instead. Probably not much else would change though however it's going to be a fair bit more verbose since we get quite a bit for free using the storage.

ArrayAccess would be more verbose and less performant. Maybe negligibly for most of the cases we define in core, but if this is a pattern to be followed, we could start feeling it as a couple thousand array lookups per request have to pass through userland logic in order to snatch the right values. As much as I love SPL, I'm always wary about the number of user function calls it introduces. Which is why I tend to look at the way catch has constructed this around utilizing the built-in C logic for most of the storage as a victory here, even if it's at the cost of some architectural elegance.

catch’s picture

FileSize
8.2 KB

Updated for capitalization per Dave Reid. Since that's the only change and no one has marked this down from RTBC for anything that's been brought up since, I won't either.

Also added an issue summary.

Status: Reviewed & tested by the community » Needs work
Issue tags: -Performance, -Needs tests, -memory, -Needs backport to D7

The last submitted patch, schema.patch, failed testing.

pillarsdotnet’s picture

Status: Needs work » Needs review
Issue tags: +Performance, +Needs tests, +memory, +Needs backport to D7

#125: schema.patch queued for re-testing.

Anonymous’s picture

+    // Since this method merges with the existing cache entry if it exists,
+    // ensure that only one process can update the cache item at any one time.
+    // This ensures that different requests can't overwrite each others'
+    // partial version of the cache and should help to avoid stampedes.
+    // When a lock cannot be acquired, the cache will not be written by
+    // that request. To implement locking for cache misses, override
+    // __construct().

can we change this to something like:

+    // Lock around writing to the cache to help avoid stampedes.
+    // To implement locking for cache misses, override __construct().

this code really doesn't ensure that we keep the cache consistent, so let's not make people think it does.

the consistent cache stuff is a bigger issue, and is coupled with cache strategy, so we don't want to try one-size-fits-all.

otherwise, i'm happy with this as RTBC.

catch’s picture

FileSize
7.86 KB

That makes loads of sense - there's some stampede protection, but we can't guarantee atomicity just by virtue of using the class. Updated the comment.

Anonymous’s picture

coolio, back to RTBC.

Anonymous’s picture

Status: Needs review » Reviewed & tested by the community

ahem.

catch’s picture

Just a note I've been running a backport of this to Drupal 6 on a client site for a couple of weeks now with no side-effects.

There's also an issue summary now.

catch’s picture

Issue summary: View changes

Updated issue summary.

catch’s picture

Title: drupal_get_schema() wastes 1-3mb of memory for nearly every request » Introduce CacheArrayObject and use it for drupal_get_schema()

Updated the issue summary, also changing title.

crea’s picture

This is very smart, impressive stuff! Great job!

webchick’s picture

Assigned: catch » Dries

Putting this into Dries's queue to review (sorry, catch).

Dries’s picture

I spent a good 40 minutes reading the issue summary and various comments as well as the code. My initial reaction is that I'm comfortable with this approach. The only thing I wasn't 100% sure about was the API. The offset-stuff was a bit confusing at first, and we seemed to mix 'save' and 'persist'. I need to think about it more to see if we can streamline the internal API. All in all, I give this approach a thumbs up.

Crell’s picture

The offset*() methods are part of the ArrayAccess interface, which ArrayObject implements. Their signature and behavior are fixed by PHP. That's what allows $myobj['foo'] to work.

catch’s picture

The actual offset* method names are defined by the interface as Crell says.

The only other thing with offsetGet()/offsetExists() that's a bit tricky is the !== NULL check.

It would be possible to move that out of this class and not cover the case where non-existing keys are requested, but since drupal_get_schema() and the other likely call sites don't trigger a full cache rebuild when missing keys are requested, I wanted to keep that behaviour by default. Otherwise implementing classes would each need to re-implement that pattern individually to keep the behaviour, which wouldn't be much fun.
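As a concrete illustration, the default amounts to something like this in a SchemaCache-style implementation (sketch only, not the exact patch code):

protected function resolveCacheMiss($offset) {
  $complete_schema = drupal_get_complete_schema();
  // Store NULL for tables that genuinely don't exist, so the next request
  // for the same missing key is answered from the already-built data
  // instead of triggering another expensive rebuild.
  $value = isset($complete_schema[$offset]) ? $complete_schema[$offset] : NULL;
  $this[$offset] = $value;
  $this->persist($offset);
  return $value;
}

public function offsetExists($offset) {
  // A stored NULL placeholder means "looked up before, known to be missing",
  // so offsetExists() returns FALSE for it (isset() rather than
  // array_key_exists() semantics).
  return $this->offsetGet($offset) !== NULL;
}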

With persist vs. save, we could do:

- $this->keysToSave
+ $this->keysToPersist

That was a tricky property to name, so other suggestions also welcome.

Dries’s picture

I read the ArrayAccess documentation. I'm surprised that they picked those function names as the rest of the world calls this 'indices', not 'offsets'.

More importantly, I guess I'm not 100% sure why we decided to start from the ArrayAccess interface. It doesn't hurt and I'm comfortable with it, but I'm not sure I fully understand what we win.

catch’s picture

It gives us an api that can be retrofitted to pretty much any procedural API that has previously been relying on massive arrays held in memory to operate. And we're able to do that with often only a few lines of code changed outside the class override. As well as schema there is the theme registry, field info, system info, ctools defaults, views defaults, variables and many other places that have issues with stampedes, read locking and/or memory that desperately need backportable fixes in Drupal 7. A default install uses around 2mb more memory than D6 did, and this gap can grow dramatically once you add modules or define instances, due to the amount of metadata we're loading all the time. We've also increased the number of caches that need to be rebuilt globally before any requests can be served by a site, and this patch will allow some of them to be built as needed, reducing the window immediately after cache clears where all requests are held up or stampeding.

ArrayAccess means that code that was previously just doing isset() or accessing array values can continue to do so without breaking. fwiw I also quite like ArrayAccess in general - $foo['bar'] can be more natural than getters; I think WSCCI is planning to use that for the $context object as well.
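For instance, this is roughly how the schema wrapper ends up wired (a simplified reconstruction based on the description earlier in the issue, not the literal patch):

function drupal_get_schema($table = NULL, $rebuild = FALSE) {
  static $schema;

  if ($rebuild || !isset($table)) {
    // Full-schema requests still build and return the complete array.
    $schema = drupal_get_complete_schema($rebuild);
  }
  elseif (!isset($schema)) {
    // Per-table requests go through the lazy-loading object instead.
    $schema = new SchemaCache();
  }

  if (!isset($table)) {
    return $schema;
  }
  // isset() and $schema[$table] work the same on the object as they did on
  // the old array, thanks to ArrayAccess.
  return isset($schema[$table]) ? $schema[$table] : FALSE;
}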

On the other hand, in most cases where we're needing to apply this it might be a sign that the subsystem is in need of an overhaul in general. Chx mentioned trying to standardize theme hooks to the point where there's no hook_theme(), for example. So this means we may introduce this to lots of places in D8 then slowly go through and remove it again. Since the code changes for any one system are pretty small it should not make more radical refactoring any harder. I don't think the problem of dealing with data from multiple different sources that may be needed or not depending on site usage is going to go away though. There are also no patches in the queue that are refactoring these systems in D8 yet, so there shouldn't be any conflicts from doing this first and backporting it, then circling around later to revisit the individual APIs.

andypost’s picture

For type hinting, we can use ArrayAccess.

Also, ArrayObject as a base class brings new methods to CacheArrayObject which are useless for solving the task.

catch’s picture

It brings some methods that are useless, but there are some which are not used in this patch that are useful elsewhere.

#987768-99: [PP-1] Optimize variable caching makes use of exchangeArray()

#853864-53: views_get_default_view() - race conditions and memory usage adds a custom iterator.

Also the patch is using ArrayObject's own storage, rather than keeping track of a custom property itself for the actual cache.

So it would be possible to have a patch that only implements ArrayAccess rather than inheriting from ArrayObject - that was brought up before - but it means more userspace code, and means some useful parts of ArrayObject would not be available. You'd have to either add the CacheArrayObject version as a second base class, or systems would need to add back the missing features themselves (by either extending ArrayObject or re-implementing some of the methods in userspace).

catch’s picture

FileSize
8.21 KB

I ended up today not using CacheArrayObject in that views patch at all, and just using ArrayObject directly.

Since a few people have brought this up, I wanted to see how it'd look with just implementing ArrayAccess.

The answer is not a lot different, here's a patch that does this.

This one is not for commit, I tried not to introduce anything evil in it, but it probably needs review compared to the previous one.

tbh since the actual amount of code in CacheArray/CacheArrayObject is pretty small, just implementing ArrayAccess seems alright to me and it's no more verbose except for adding two missing methods that aren't used most of the time (assuming I didn't miss something stupid and break tests). So I'm fine either way with this.

Status: Reviewed & tested by the community » Needs work
Issue tags: -Performance, -Needs tests, -memory, -Needs backport to D7

The last submitted patch, cachearray.patch, failed testing.

andypost’s picture

Status: Needs work » Needs review
Issue tags: +Performance, +Needs tests, +memory, +Needs backport to D7

#143: cachearray.patch queued for re-testing.

andypost’s picture

The main differences are in storage (the ArrayAccess-based version implements its own protected $storage = array();) and in resolveCacheMiss(), which now returns the data when resolved - that seems saner (I suppose this was the bug in the ArrayObject implementation).

I think the name of the abstract class should be in a Drupal namespace to avoid collisions with third-party classes, which I run into a lot maintaining the cacherouter module.

catch’s picture

Yeah, looking at it again today I preferred having resolveCacheMiss() return the data as well as set it; it's not a massive detail but I'd be happy to apply that to the ArrayObject version as well.

So you want DrupalCacheArray for the abstract class name? Doesn't bother me, this is why we should adopt PSR-0 :)

And yeah the only other difference really is protected $storage vs. ArrayObject's own.

I tried to do an interdiff but it wasn't any clearer than just comparing the patches side by side.

Crell’s picture

We can easily call it DrupalCacheArray for now for backporting purposes, and then rename it to \Drupal\Cache\CacheArray later in D8. :-)

Dries’s picture

I'll let you go back and forth on this a bit longer, if needed. I'm conceptually OK with either approach, but I'm glad to see that we're trying to optimize this. If you want me to make a decision, let me know.

andypost’s picture

#148 +1 to DrupalCacheArray

I think ArrayAccess is lightweight enough to serve as the base. Also, all the sorting and iteration methods of ArrayObject are useless for a lightweight cache array implementation.

catch’s picture

Yeah, I'm fine with #149 as well. If a subsystem needs a similar pattern and the other goodies that ArrayObject provides, it could extend ArrayObject directly. The thing that is mainly changing my mind on this is that if you dump the object you get a lot less to worry about this way. Will re-roll in the next couple of hours.

catch’s picture

FileSize
8.25 KB

OK new patch with two changes:

Base class is now DrupalCacheArray.

I renamed $keysToSave to $keysToPersist for consistency with the persist() method.

catch’s picture

Title: Introduce CacheArrayObject and use it for drupal_get_schema() » Introduce DrupalCacheArray and use it for drupal_get_schema()
Status: Needs review » Reviewed & tested by the community

I'm going to move this back to Dries. #152 is implementing ArrayAccess directly, #129 extends ArrayObject (#129 could use some of the cleanup in #152 if we went with that, but I prefer #152 so not going to do that pre-emptively).

After working on #152 I personally favour that approach for the base class since it's a bit simpler - there's no difference in functionality and it's still possible to copy the pattern and use it with ArrayObject for something more advanced.

While not related to the schema implementation, using the same overall approach I was able to save 15-20mb of memory in #853864-53: views_get_default_view() - race conditions and memory usage on a D6 site with 200 default views (a spot check on another D6 client site and they had 150 defaults so this is not outlandish unfortunately). Once this approach has been applied to a few more subsystems I'm hopeful that runtime memory usage of sites could be reduced by 5-10mb in many cases, and it opens up the possibility to reduce peak memory usage on cache rebuilds as well, especially in Drupal 8 (the views patch is a decent example where that can be done).

catch’s picture

Here's another implementation, for the path alias whitelist, #1209226: Avoid slow query for path alias whitelists.

sdboyer’s picture

Status: Reviewed & tested by the community » Needs review
FileSize
8.25 KB
9.32 KB

tl;dr - We cannot use multidimensional array syntax externally on these objects; from outside code that's utilizing the ArrayAccess interface, it's only safe to directly access or write to the top-level array. The best we can do is document this, which is what my updated patch does. Setting this back to needs review to make sure we're OK with the limitation. A secondary patch is included for completeness, but it would mean Drupal 8 would require PHP 5.3.4 at minimum.

So we missed something, and I ran into this with dog today: errors happen when attempting to use multidimensional array syntax on the ArrayAccess-implementing object, e.g.:

$array = new ThingieImplementingArrayAccess();
$array['foo'] = 'bar'; // not a problem
$array['foo']['bar'] = 'baz'; // php goes whoopsie to the tune of: "Indirect modification of overloaded element of ThingieImplementingArrayAccess has no effect."

Turns out this is a longstanding PHP issue, and if you think about it, it makes sense - how could the simple nature of the offsetSet() and offsetGet() as specified by the ArrayAccess interface possibly accommodate a multilevel array request? Turns out what it does is return the array identified by the first key (so in the above example, the array identified by key foo) and then that gets written to in the calling code...but that's useless, because that top-level array is returned by value, not by reference, so the write is performed on the copy and the original datastructure within the object is unaffected.

There's a ton of bellyaching around the internetz on this, including a few php bugs, but the sorta good news is that the php folks finally relented and made a fix that relaxed prototype checking restrictions on returns (their discussion on this). They were even nice enough to explain this concisely in the notes section of the documentation for ArrayAccess::offsetGet(). All you have to do is specify your implementation as returning by reference, so public function &offsetGet($offset), and it'll work with no error message.

Bottom line is, it's now possible to allow conventional array syntax-based access to multidimensional arrays. Problem is...that only works post-5.3.4 (Lucid is only on 5.3.2, just for one example). Which is pretty much a non-starter. Try to do that prior to 5.3.4, and you get a PHP Fatal error: Declaration of [whatever]::offsetGet() must be compatible with that of ArrayAccess::offsetGet(). It's also impossible to create a reference to a value (any value, not just a nested one) prior to 5.3.4. Here's the >=5.3.4 version that works:

class SomeContainer implements ArrayAccess {
     protected $_data = array('foo' => array(1));
     public function &offsetGet($name) { // returns by ref
         $r = & $this->_data['foo'];
         return $r;
     }
     public function offsetSet($name, $value) {}
     public function offsetExists($name) {}
     public function offsetUnset($name) {}
}

$a = new SomeContainer();
var_dump($a);

/* Output:
object(SomeContainer)#1 (1) {
  ["_data":protected]=>
  array(1) {
    ["foo"]=>
    array(1) {
      [0]=>
      int(1)
    }
  }
}
 */

$b = &$a['foo'][0];
$b = 10;
var_dump($a);
/* Output:
object(SomeContainer)#1 (1) {
  ["_data":protected]=>
  array(1) {
    ["foo"]=>
    array(1) {
      [0]=>
      &int(10) <-- this has been updated and has the reference marker
    }
  }
}
*/

And here's the <5.3.4 version that will compile without error, but fails to create the reference and throws the good ol' "Indirect modification" error:

class SomeContainer implements ArrayAccess {
     protected $_data = array('foo' => array(1));
     public function offsetGet($name) { // returns by val
         $r = & $this->_data['foo'];
         return $r;
     }
     public function offsetSet($name, $value) {}
     public function offsetExists($name) {}
     public function offsetUnset($name) {}
}

$a = new SomeContainer();
var_dump($a);

/* Output:
object(SomeContainer)#1 (1) {
  ["_data":protected]=>
  array(1) {
    ["foo"]=>
    array(1) {
      [0]=>
      int(1)
    }
  }
}
 */

$b = &$a['foo'][0];
$b = 10;
var_dump($a);
/* Output:
object(SomeContainer)#1 (1) {
  ["_data":protected]=>
  array(1) {
    ["foo"]=>
    array(1) {
      [0]=>
      int(1) <-- still the original value even though we assigned $b to it by ref
    }
  }
}
*/

Also note that somehow, ArrayObject makes this all go away. This code works all the way back into 5.2, with no complaints and exactly how you'd expect:

$a = new ArrayObject(array());

$a['added']['fookey'] = 'foo'; // new multidim array is properly created, value is set
$b = &$a['added']['fookey']; // reference is successfully created
var_dump($a);

$b = 'bar'; // value of ['added']['fookey'] is updated to 'bar' within the object
var_dump($a);

$b = NULL; // value of ['added']['fookey'] is NULL
var_dump($a);

However, there's something magical in ArrayObject's implementation of offsetGet() because, even in >=5.3.4, as soon as I override the implementation in userspace, everything gets breaky. And weirdly breaky, too. Swap in this custom class:

class A extends ArrayObject {
  public function &offsetGet($offset) {
    if (!parent::offsetExists($offset)) {
      return;
    }   
    $var = &parent::offsetGet($offset);
    return $var;
  }
}

If someone could figure out how to make an overridden ArrayObject::offsetGet() work, then we might be able to have the best of both worlds without having to worry about the 5.3.4 requirement. As it is, though, ArrayObject is completely off the table if we ever care about allowing nested writes to occur. Of course, ALL this assumes we even *want* to let people treat these as anything more than an interface into a k/v store. Which I think is what it all really comes down to...

It seems like the central goal here is that these really are treated externally as flat k/v stores, even if the "v" is itself an array. We never directly externally access (or more importantly, modify) internal parts of that contained array (the operation those php docs refer to as an "indirect modification.") If that's the case, then we're fine, but we have to stick to that. So I've updated the patch with some additional documentation to reflect this restriction. However, I'm setting this back to needs review because this is a nontrivial limitation for what we're hoping can act as a core pattern, and I want to make sure we can accept the shortcomings. In the weird event a 5.3.4 minimum is ok, though, I've also attached a secondary patch that adds the small bits necessary to make all these problems disappear. Untested, though.

Status: Needs review » Needs work

The last submitted patch, cachearray-5.3.4-only.patch, failed testing.

sdboyer’s picture

rfay tells me that the testbots run 5.3.3, which means a compile-time error when they hit that code. Which is also further evidence that a 5.3.4 minimum is a non-starter.

catch’s picture

I've been aware of that limitation and it's no problem as used here. For D8 we can look at adding > 5.3.4 as a requirement if it's not already but I don't see the limitation as a big deal at all. I'm away from computer until the end of next week so would be great if this could be kept warm.

sdboyer’s picture

OK, glad you're apprised of the issue...we'll just call that exhaustive post "for the public record," then :)

catch’s picture

The extra docs look good, although maybe they should use @code? While it doesn't affect the implementation here, I do want to apply this to a lot of places, so it makes sense to try to anticipate where we could run into issues.

sdboyer’s picture

And, I don't think 5.3.4 is a good idea, even for D8. Lucid, which is Ubuntu's current LTS, only has 5.3.2. I'd rather just operate with the restricted capacity system. Or if someone can figure a way to wrangle ArrayObject...

catch’s picture

Hmm that's reasonable. I can't think of obvious places in core we'd run into this issue, at least nowhere so far, so hopefully it won't come up.

sdboyer’s picture

Cool. The only worrisome part is that it'll fail silently and weirdly if someone *does* try to do it. And all my tinkering has pretty much indicated to me that it's not possible to detect when a multidimensional operation is being performed from within the object, so we can't even add anything to make failure noisy.

Also, quick side note @Dries re: #139 - while the ArrayAccess documentation all refers to 'offset,' the docs for ArrayObject say index. Can also see it if you dump the reflection export of ArrayObject (trimmed to the relevant bits):

$ php --rc ArrayObject
Class [ <internal:SPL> <iterateable> class ArrayObject implements IteratorAggregate, Traversable, ArrayAccess, Serializable, Countable ] {
    Method [ <internal:SPL, prototype ArrayAccess> public method offsetExists ] {

      - Parameters [1] {
        Parameter #0 [ <required> $index ]
      }
    }

    Method [ <internal:SPL, prototype ArrayAccess> public method offsetGet ] {

      - Parameters [1] {
        Parameter #0 [ <required> $index ]
      }
    }

    Method [ <internal:SPL, prototype ArrayAccess> public method offsetSet ] {

      - Parameters [2] {
        Parameter #0 [ <required> $index ]
        Parameter #1 [ <required> $newval ]
      }
    }

    Method [ <internal:SPL, prototype ArrayAccess> public method offsetUnset ] {

      - Parameters [1] {
        Parameter #0 [ <required> $index ]
      }
    }
}
Crell’s picture

I'm also OK with just living with the limitation for the time being as long as it's documented, especially as, for the moment, we won't be directly accessing the array anyway from most module code. Nice sleuthing, sdboyer!

Anonymous’s picture

following all the awesome sleuthing, i'd just like to repeat - for D8, i'd really, really like to see us port schema to an object, and keep all the lazy loading stuff behind the public API. will try to get a proof of concept schema object up in a sandbox at drupalcon, so i don't repeat this any more times without code to point at...

catch’s picture

Status: Needs work » Reviewed & tested by the community
FileSize
9.32 KB

I'm re-uploading #155 verbatim. The extra comment is good; that's the only change, and this was CNW only because the bot environment runs 5.3.3 vs. the 5.3.4 needed for the other patch, which we've discounted.

catch’s picture

By the way while he didn't actually post on this issue, Berdir had the original idea of using ArrayAccess for this, see #1011614-10: Theme registry can grow too large for MySQL max_allowed_packet and memcache default slab size so should be credited in any commit.

Dries’s picture

Version: 8.x-dev » 7.x-dev

Committed #166 to 8.x -- that will enable us to make progress. Moving to 7.x for @webchick.

catch’s picture

Thanks!

webchick’s picture

Assigned: Dries » Unassigned

Since Dries has chimed in on this, removing the assigned to value.

webchick’s picture

I'm confused about which of the patches here we can apply to D7, which has to support PHP 5.2. Is #166 still ok? I wasn't sure if compat referred to 5.3.3 or 5.2 => 5.3 inclusive.

catch’s picture

#166 should be fine for 5.2.

webchick’s picture

Status: Reviewed & tested by the community » Fixed

Cool, thanks. Committed and pushed to 7.x.

Status: Fixed » Closed (fixed)
Issue tags: -Performance, -Needs tests, -memory, -Needs backport to D7

Automatically closed -- issue fixed for 2 weeks with no activity.

Anonymous’s picture

Issue summary: View changes

Updated issue summary.