file_scan_directory() is a function called (in D7) from drupal_system_listing() and other places.
drupal_system_listing() is called from _system_rebuild_module_data() and other places, usually during cache clear.

For my own projects, most of the milliseconds lost on file_scan_directory() come from _system_rebuild_module_data().

The functions have a number of performance and other issues. They are fragile and slow and overall make me sad.
I wonder why so much of contrib calls these functions, instead of using simple one-off custom implementations.

Flexibility is expensive

file_scan_directory() and drupal_system_listing() are designed for arbitrary regular expressions, and allow various options.

A lot of the calls to drupal_system_listing() / file_scan_directory(), including the main one from _system_rebuild_module_data(), do not need all the regex flexibility and all the options. A simpler and faster alternative implementation could be provided for these cases.

Inefficient file_scan_directory() implementation

file_scan_directory() contains code that is unnecessarily repeated in every recursive call:
- $options += .. (default options)
- $options['key'] = ...
- is_dir($dir) -> this was already checked before the recursive call.

It applies regular expressions before doing cheaper checks.
- if (!preg_match($options['nomask'], $filename) && $filename[0] != '.') { ..
The latter (with !==) is faster, so it should be first.

It uses != or == instead of !== or ===, which is not only bad but also slow.

The default $options['nomask'] = '/(\.\.?|CVS)$/' could be replaced with if ($filename[0] !== '.' && $filename !== 'CVS'), which would merge with the other $filename[0] !== '.'.

I also had the idea that we are calling file_stream_wrapper_uri_normalize() more often than necessary.

readdir() vs scandir() vs glob()

In my own experiments, for directories that contain only few positive matches and a lot of noise (e.g. when looking for *.module), a well-crafted glob() was much faster than readdir().
I also found scandir() to be slightly faster than opendir() + readdir(), but this also can have memory implications.

The trick is to encode as much as possible from the $mask regular expression into the glob pattern, and then do a preg_grep() to weed out remaining false positives.

DirectoryIterator performed worse than scandir() and readdir(), so I think we can discard it as an alternative.

Recursive call and array_merge() overhead

Every function call with its passing of parameters causes an tiny but measurable overhead, e.g. for the parameter copying.
Besides, the array_merge() can be costly.
Besides, If it was all within one function, we could unpack some of the variables in $options at the beginning, and thus reduce the number of array lookups.

Bugs / undesirable behavior

The default nomask '/(\.\.?|CVS)$/' is also poorly crafted. It catches 'XYZCVS' and 'foo.' instead of just 'CVS', '.' and '..', because it lacks an opening '^' symbol.

A "find usages" for file_scan_directory() and drupal_system_listing() (on a D7 project of your choice) shows that many calls (in contrib, and also core I think) call it with wrong or poorly-crafted regular expressions.
E.g. if the correct regex would be '/\.inc$/', what I find instead is '/.inc$/', '/.inc/', '/.*\.inc$/', '/.*.inc$/', '.inc' (without the delimiters), etc.

drupal_system_listing() checks for *.info files even when looking for *.tpl.php or other, when this really should only happen for *.module.

drupal_system_listing() checks for incompatible core version only in some cases, but not in others. E.g. if (in D7) sites/all/modules/foo/foo.module is D7, but sites/default/modules/foo/foo.module is D6, then it will choose the correct D7 version. But if it is vice versa, then it will not. Also, if two versions of the same module exist within sites/all/modules (e.g. contrib vs custom) then it does not care about the version at all.

Not very useful return format

The code makes a lot of effort to return an array of objects, keyed by a key of choice ('uri', 'name', 'filename').
And yet, in a lot of cases the calling code is not happy with these objects, and will rearrange and overwrite the keys.

E.g. when looking for *.tpl.php, the objects will contain $file->name = 'page.tpl', because pathinfo($filename, PATHINFO_FILENAME) only removes the '.php', not the '.tpl.php'. So the calling code needs to remove the remaining '.tpl' from the name.

Or, when looking for *.module, the calling code will overwrite $file->filename = $file->uri.

Solution

Improved file_scan_directory()

I started with a conservative optimization of file_scan_directory() which is mostly equivalent with the original, but cuts the time to almost 1/2. Patches will follow. It still uses recursion and readdir().

Alternative scanning algorithm with glob()

I did some experiments locally. The following algorithm proved to be the fastest I could come up with, for the purpose of *.module (or, possibly, *.tpl.php) discovery: https://gist.github.com/donquixote/898a573dab8aefd473c27c11cb4d5f14

With some glue code, this algorithm allowed to reduce the time for file_scan_directory() to around 1/5.

The algorithm can be extended to support $min_depth or multiple lookup directories.

It should be mentioned that the order of results is different than in file_scan_directory(). It is the job of the "glue code" to figure this out.

It should also be noted that while neither glob(*, GLOB_NOSORT) nor opendir() + readdir() sort the result in a predictable way, the resulting order is different. See http://stackoverflow.com/questions/36244511/different-order-of-results-w...

"Glue code" / BC

The challenge is how to let the existing functions use this algorithm, while still supporting all the flexibility in a backwards-compatible way. E.g. the results need to be further ordered, filtered, turned into objects.

The main idea is that the "glue code" only needs to deal with a limited number of results (e.g. all *.module files found), whereas the main scanning algorithm has to deal with all files in the directories it is scanning. So we want to move complexity out of the scanning code and into the glue code.

It requires to extract a glob pattern from a regular expression. And it has to process the result to get the object format.

Provide alternatives to calling code

An alternative would be to provide alternative implementations of drupal_system_listing(), and let let calling code use these instead. This way we can avoid translating regular expressions into glob patterns.

Drupal 7 vs Drupal 8

I would propose that we do this optimization in Drupal 7 only.

For Drupal 8, I would propose a different plan:
- An external composer library with the scanning algorithm and proper unit tests.
- Internal glue code.

Comments

donquixote created an issue.