Alternative to #2488538: Add SafeMarkup::remove() to free memory from marked strings when they're printed

Problem/Motivation

Steps to reproduce:

1. Install standard profile
2. Visit admin/user/permissions (profiling with xhprof or your profiler of choice)
3. Note that memory usage is similar to the following:

Total Incl. MemUse (bytes):	51,542,168 bytes
Total Incl. PeakMemUse (bytes):	55,488,944 bytes

4. Add a role and profile the permissions page again

memory usage goes up as follows:

Total Incl. MemUse (bytes):	52,504,248 bytes
Total Incl. PeakMemUse (bytes):	57,392,560 bytes

At three custom roles, memory usage had gone up as follows:

Total Incl. MemUse (bytes):	54,607,112 bytes
Total Incl. PeakMemUse (bytes):	61,356,968 bytes

Possible culprits:
SafeMarkup::set() - with a lot of checkboxes the array grows to mb.
#2488610: Use ModuleHander::getName() instead of rebuilding module data on permissions page
Html::cleanCssIdentifier() / Html::getId() / HtmlgetUniqueId() - combination of the strtr() and the static cache in HtmlgetUniqueId()

I think the culprit is SafeMarkup::set() in twig_render_template() - every checkbox is a template render. See attached xhprof screenshots.

Marking this critical for now because #2289201: [Meta] Make drupal install and run within reasonable php memory limits so we can reset the memory requirements to lower levels is setting the standard profile memory limit to 64mb, and it looks like five custom roles and/or additional permissions could easily send this over the top.

Proposed resolution

Hash string longer than 63 bytes in SafeMarkup::set() and related functions to constrain the possible memory usage of the string registry.

Remaining tasks

User interface changes

None

API changes

None

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

effulgentsia’s picture

Title: Moify SafeMarkup to use hashed strings from marked strings to minimize memory usage » Modify SafeMarkup to use hashed strings from marked strings to minimize memory usage

The SafeMarkup::set() at the very end of Renderer::doRender() gets called a lot, and for progressively larger strings, so the CPU cost of hashing might be a problem there. But #2503447: Use Twig_Markup to help with the SafeMarkup::$safeStrings memory load from Renderer::doRender() might remove that problematic usage.

pwolanin’s picture

Issue summary: View changes
Status: Needs work » Needs review
FileSize
3.19 KB

Quick first patch for the bot.

pwolanin’s picture

@effulgentsia - PHP internally has to hash the string to use it as an array key, so depending on that implementation it's not going to cause much added CPU to hash it beforehand.

http://stackoverflow.com/questions/2350361/how-is-the-php-array-implemen...

Status: Needs review » Needs work

The last submitted patch, 2: 2503445-1.patch, failed testing.

pwolanin’s picture

I'm stumped why that's failing - the verbose screen shows the expected text is there.

However, I see the test fail locally so it's something real that's not matching between the strings.

DuaelFr’s picture

I started investigating on this a few days ago.

I sent all the safe strings to an ES instance to do some statistics and it turned out that most of the strings where really shorter than a sha1 hash length. What's interesting though is that this totally changes for foreign languages that uses a lot of utf8 chars. The sha1 being 100% ascii, we could gain a lot of memory that way.

My tests didn't consider hashing only long strings because I thought that if we want to reduce the memory usage without increasing the CPU usage too much we should avoid the call to strlen() there and just hash everything.

If we choose going that way we should investigate on the best hash function to use regarding to the collision risk and the CPU usage. It seems that sha1 is heavier than md5 which is heavier than crc32.

My 2 cts

pwolanin’s picture

@DuaelFr - there is no need to use ascii output - the patch here uses the binary output to minimize the memory overhead.

A hash call is 10x or more slower than the strlen call, so I do not think always hashing is the right choice.

md5and sha1 must not be added to any new code and are 100% off the table and frankly the speed difference isn't enough to be worth debating (see prior benchmarks by alexpott).

sha512 is a reasonable choice here is terms of output length, and as a 64 bit algorithm is faster than sha256 on 64 bit hardware, which is the norm now.

DuaelFr’s picture

@pwolanin Sorry for the misunderstanding. I was not commenting the way you did this but just reporting my own (poor) experimentations.

According to your last message I'd say that we have to choose the hash algorithm carefully. The thing is that we don't need to have a very strong security on these hashes as, even if we have a collision with another string, the chances that this string can be unsafe are really low. So we need to be aware of both the hash length and the algorithm speed. Plus, if the best algorithm for our need has results shorter than 64B we could lower the threshold and hash more strings.

Even if I totally support your proposition I have to add that your argument about PHP storing keys as hashes is not really convincing. Even if you give PHP a hash as an array key, it's going to consider it as a string so it's going to rehash it anyway.

Finally, I agree with your explanation about the strlen call. That's right, keep it :)

pwolanin’s picture

@DuaelFr - the time hashing takes is proportional to the string length after some setup, so giving PHP shorter strings as the array keys will make some difference at least.

cilefen’s picture

This issue is a good idea.

joelpittet’s picture

@pwolanin looked for that failure, seemed to be around an error message being passed to drupal_get_message():

Deleted and replaced configuration entity "@name"

Which was stored in ConfigImporter::$errors which in the UI is somewhere grabbed and sent to the UI via drupal_set_message()/drupal_get_message() so session may be a factor?

And when it's output in the UI the quotes get escaped as:
Deleted and replaced configuration entity "config_test.dynamic.secondary"

Baffled as well here but maybe that helps?

pwolanin’s picture

@joelpittet - yes, I saw the same, which it perplexing since it seems like the hash of the same string doesn't match.

need to debug more, but I don't see how that's possible.

xjm’s picture

mr.baileys’s picture

Status: Needs work » Needs review
FileSize
2.39 KB

Batch processing stores the strings marked as safe for the next request. It fetches the list of safe strings via SafeMarkup::getAll() (note that after this patch, this list contains a mix of hashes and strings), and restores them in the next request using SafeMarkup::setMultiple(). With the current patch, SafeMarkup::setMultiple() is hashing all strings > 63 characters, including the hashes themselves (64 bytes), causing all hashed strings to no longer be considered safe.

So either we need to up the strlen limit a tiny bit (so hashes are not rehashed but saved in the array as-is), or we need to prepend the hashes so SafeMarkup::setMultiple() can differentiate between a safe string and the hash for a safe string.

Re-rolled and upped the limit for hashing to 64 which resolves the remaining test failure.

This might need a specific test in the SafeMarkup or Batch tests to ensure correct safe string behavior between different batch requests?

effulgentsia’s picture

The number of strings in the SafeMarkup static will be drastically reduced when #2557113: Make t() return a TranslationWrapper object to remove reliance on a static, unpredictable safe list lands. This issue might or might not make sense after that, I don't know.

dawehner’s picture

Yeah before continuing here we should a) test it with HEAD now b) test it with the translation wrapper and then see whether its still worth at all to even think about it.

pwolanin’s picture

There's some value I think to having the memory usage be basically linear in the number of string rather than potentially unknown depending on what's in the page.

dawehner’s picture

There's some value I think to having the memory usage be basically linear in the number of string rather than potentially unknown depending on what's in the page.

Well, the point is that we will remove basically all of SafeMarkup usages ... so we don't store the strings in the first place anymore.

pwolanin’s picture

Status: Needs review » Closed (won't fix)

yes, alexpott wants to totally remove this list, so maybe we should just close this issue?