http://symfony.com/doc/2.0/components/finder.html

Symfony2 provides a library that handles various file-related tasks, including:

  • getting the path of a file
  • retrieving file metadata
  • handling custom stream wrappers

We currently have mature file handling that is improving by the day with the effort to turn files into entities. Perhaps adopting this library will help that effort. Perhaps it won't. We should find out.

Comments

Crell’s picture

Issue tags:+symfony

I don't think Symfony's Finder system is really a replacement for File Entities. It works at a lower level, and is mostly about file system traversal and manipulation. They're not directly comparable.

Damien Tournoud’s picture

If I read the OP properly, the point is about replacing some of our lower-level helper functions (drupal_mkdir(), drupal_chmod(), drupal_dirname(), etc. and possibly some of the methods of DrupalStreamWrapperInterface). We might have to contribute some stuff back (I'm not sure Symfony has our streamwrapper-friendly chmod, for example).

Crell’s picture

Damien: Yes, I was clarifying the OP's statement that this relates to files becoming entities. I don't think it does. It's a good thing to consider all on its own. :-)

I don't think anyone would object to us contributing back upstream, but we should do it soon as I think 2.1 is supposed to hit stable well before Drupal 8 does.

brianV’s picture

I just read through the Finder component documentation, and I fail to see how it applies to our current file system handling.

It is, as the name implies, primarily built around finding files in a filesystem based on set criteria. You can specify:

1. Which directories to search for files, which to exclude, and whether to search recursively.
2. Whether the returned list contains files, directories, or both.
3. How the resulting list should be sorted.
4. Filters that match or exclude entries based on name, file size, date, or a few other criteria.

Since the paths to which we've saved files are stored in the database, we never really have the use case in which we are searching the drive (or appropriate stream) for files. This *could* make for some interesting additional functionality at some point, but doesn't appear to offer much to simplify existing file handling.

That is, it has no support for moving, renaming, copying, chmodding etc. files.

pounard’s picture

There are some features, such as drupal_scan_files() or whatever high-level functions, that could use such a high-level API. But even before looking at Symfony for this I'd first look at SPL, with classes such as DirectoryIterator or RecursiveDirectoryIterator. These are supposedly (judging by various benchmarks) much faster than glob(), readdir() and friends, and they can be combined with any kind of FilterIterator implementation, such as RegexIterator, which works on the SplFileInfo objects they return.

EDIT: After a short reading of the Finder component code: it uses all of that and seems to do some work on stream wrappers, so it's probably a good idea to use it.
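For illustration, the SPL approach described above might look roughly like the following. This is only a sketch: the function name spl_scan_sketch() and the way results are keyed are assumptions made for this example, not anything from core or from Finder.

```php
<?php
// Hypothetical helper: recursively find files whose path matches $mask,
// using only SPL classes.
function spl_scan_sketch(string $dir, string $mask): array {
  $flags = FilesystemIterator::SKIP_DOTS | FilesystemIterator::UNIX_PATHS;
  $directory = new RecursiveDirectoryIterator($dir, $flags);
  // Flatten the tree; the default mode yields files only, not directories.
  $flat = new RecursiveIteratorIterator($directory);
  // RegexIterator casts each SplFileInfo to a string (its full path) and
  // keeps only the entries that match the pattern.
  $matches = new RegexIterator($flat, $mask);

  $files = array();
  foreach ($matches as $file) {
    // Key by file name without the .module suffix (an arbitrary choice).
    $files[$file->getBasename('.module')] = $file->getPathname();
  }
  return $files;
}
```

Calling spl_scan_sketch('core/modules', '/\.module$/') would return an array of paths keyed by module name. Note that filtering here still happens after the full recursion, so excluded directories are traversed anyway.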

ardas’s picture

That's for sure!

Since we all want to move towards Symfony, we would like to see their Finder component inside Drupal... At the very least, the Library API module could use it to traverse the 'libraries' directory to gather libraries.

I can say that file read and write operations are among the slowest things in Drupal (right after the number of SELECTs and loading ALL modules at the bootstrap FULL stage).

sun’s picture

Just happened to have a chance to take a deeper look into Finder for some other issue.

I took the most common use-case we have in core (drupal_system_listing()), extracted the effective file_scan_directory() arguments of it, and compared that to Finder. The results are not in favor of Finder:

$ php bench.drupal.system-listing.php
ref: refs/heads/8.x
Peak memory before test: 2,912.03 KB
Iterations: 10

nothing:                      0 seconds
function no_op():             0 seconds
file_scan_directory():   2.6179 seconds
Finder:                 12.2466 seconds

Peak memory after test: 3,401.31 KB
Memory difference: +489.28 KB

Effective bench code:

$dir = 'core/modules';
$mask = '/^' . DRUPAL_PHP_FUNCTION_PATTERN . '\.module$/';

// Drupal core scan.
$files = file_scan_directory($dir, $mask, array(
  'key' => 'name',
  'min_depth' => 1,
  'nomask' => '/^(CVS|lib|templates|css|js)$/',
));

// Finder scan.
$finder = new Symfony\Component\Finder\Finder();
$finder
  ->files()
  ->depth('> 1')
  ->name($mask)
  ->exclude(array('lib', 'templates', 'js', 'css', 'config'))
  ->in($dir);
$files = array();
foreach ($finder as $file) {
  $file->uri = $file->getPathName();
  $file->filename = $file->getFileName();
  $file->name = pathinfo($file->filename, PATHINFO_FILENAME);
  $files[$file->name] = $file;
}
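The full bench script is not shown in this thread. A timing loop of the kind that produces output like the above could be sketched as follows; bench_sketch() is a hypothetical name and the output format only approximates the one shown.

```php
<?php
// Hypothetical timing helper: run $callback $iterations times and return
// the elapsed wall-clock time in seconds.
function bench_sketch(callable $callback, int $iterations = 10): float {
  $start = microtime(TRUE);
  for ($i = 0; $i < $iterations; $i++) {
    $callback();
  }
  return microtime(TRUE) - $start;
}

// Example: time an empty closure, similar to the "nothing" baseline row.
$seconds = bench_sketch(function () {});
printf("%-24s %.4f seconds\n", 'nothing:', $seconds);
printf("Peak memory: %s KB\n", number_format(memory_get_peak_usage() / 1024, 2));
```

Each candidate (file_scan_directory(), Finder, a plain SPL scan) would be wrapped in its own closure and passed to the same helper so that the numbers are comparable.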

It's also noteworthy that Finder is not really flexible/customizable. E.g., we'd typically pass the FilesystemIterator::UNIX_PATHS | FilesystemIterator::SKIP_DOTS flags to skip the . and .. entries and get platform-agnostic file paths, but Finder doesn't currently allow customizing the $flags.
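To make the effect of those flags concrete, here is a small sketch using plain SPL (not Finder). As far as the PHP manual describes it, SKIP_DOTS only drops the . and .. entries, so files such as .htaccess are still returned, and UNIX_PATHS makes paths use forward slashes on every platform. The function name list_names_sketch() is made up for this example.

```php
<?php
// Hypothetical helper: list the entry names directly inside $dir, using
// whatever FilesystemIterator flags the caller passes in.
function list_names_sketch(string $dir, int $flags): array {
  $names = array();
  // Iterating a RecursiveDirectoryIterator directly visits one level only.
  foreach (new RecursiveDirectoryIterator($dir, $flags) as $file) {
    $names[] = $file->getFilename();
  }
  sort($names, SORT_STRING);
  return $names;
}

// With both flags set, "." and ".." are omitted; dot-prefixed files remain.
$flags = FilesystemIterator::UNIX_PATHS | FilesystemIterator::SKIP_DOTS;
print_r(list_names_sketch(sys_get_temp_dir(), $flags));
```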

Crell’s picture

Is there anything we could legally push upstream to improve the Finder component to make it more compelling for us, and thus reduce the total amount of code in the world?

sun’s picture

Based on tonight's investigation, Finder would have to be completely rewritten from scratch in order to leverage RecursiveDirectoryIterator instead of DirectoryIterator, and likewise all filters would have to be re-implemented as RecursiveFilterIterator instead of FilterIterator.

Essentially, Finder is running into the same trap as a gazillion PHP code snippets I found on the net:

<?php
$iterator = new \RecursiveIteratorIterator(
  new Iterator\RecursiveDirectoryIterator($dir, $flags),
  \RecursiveIteratorIterator::SELF_FIRST
);

# ...which translates into:

$directory = new Iterator\RecursiveDirectoryIterator($dir, $flags);
// ^^ This is completely unfiltered *AND* recursive; i.e., all files, all directories.

// vv As soon as this is invoked, the total filesystem scan happens, unfiltered.
$iterator = new \RecursiveIteratorIterator($directory, \RecursiveIteratorIterator::SELF_FIRST);

// ...whereas Finder only starts to filter _here_.
?>

However, to perform filtering before recursing until the end of the world, the RecursiveDirectoryIterator has to be wrapped with a RecursiveFilterIterator, before the RecursiveIteratorIterator is invoked.

E.g., like this:

<?php
$directory = new RecursiveDirectoryIterator('core/modules', $flags);
$filter    = new SystemListRecursiveFilterIterator($directory, 'module', array('lib', 'config', 'js', 'css', 'templates'));
$iterator  = new RecursiveIteratorIterator($filter);

$files = array();
foreach ($iterator as $filename => $file) {
  $file->uri = $file->getPathName();
  $file->filename = $file->getFileName();
  $file->name = pathinfo($file->filename, PATHINFO_FILENAME);
  $files[$file->name] = $file;
}
?>
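The implementation of SystemListRecursiveFilterIterator is not shown in this thread. A hypothetical stand-in with the same constructor shape might look roughly like this; the class name PruningFilterSketch and its exact matching rules are assumptions. The point it illustrates is that accept() runs before the outer iterator descends, so a rejected directory is never opened.

```php
<?php
// Hypothetical stand-in for the filter named above: rejects excluded
// directories before recursion and keeps only files with one extension.
class PruningFilterSketch extends RecursiveFilterIterator {
  private $extension;
  private $exclude;

  public function __construct(RecursiveIterator $iterator, string $extension, array $exclude) {
    parent::__construct($iterator);
    $this->extension = $extension;
    $this->exclude = $exclude;
  }

  public function accept(): bool {
    $file = $this->current();
    if ($this->hasChildren()) {
      // A directory: returning FALSE here prunes the whole subtree.
      return !in_array($file->getFilename(), $this->exclude, TRUE);
    }
    return $file->getExtension() === $this->extension;
  }

  // The default getChildren() passes only the inner iterator to the
  // constructor, so forward the extra arguments explicitly.
  public function getChildren(): RecursiveFilterIterator {
    return new self($this->getInnerIterator()->getChildren(), $this->extension, $this->exclude);
  }
}

// Hypothetical wrapper wiring the three iterators together.
function filtered_scan_sketch(string $dir, string $extension, array $exclude): array {
  $flags = FilesystemIterator::SKIP_DOTS | FilesystemIterator::UNIX_PATHS;
  $directory = new RecursiveDirectoryIterator($dir, $flags);
  $filter = new PruningFilterSketch($directory, $extension, $exclude);
  $files = array();
  foreach (new RecursiveIteratorIterator($filter) as $file) {
    $files[$file->getBasename('.' . $extension)] = $file->getPathname();
  }
  return $files;
}
```

For example, filtered_scan_sketch('core/modules', 'module', array('lib', 'js')) would skip every lib and js directory entirely instead of scanning it and discarding the results afterwards.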

A pure RecursiveDirectoryIterator implementation with proper RecursiveFilterIterators cuts the total time roughly in half on my machine, but that is still 3 times slower than file_scan_directory().

See also: #1833732: Find a way to skip ./config directories without skipping core/modules/config directory in drupal_system_listing()

pounard’s picture

Don't forget the RegexIterator class too, which can be used for filtering by pattern. It should be tested in place of the SystemListRecursiveFilterIterator. And you should check the performance of SystemListRecursiveFilterIterator too.

Damien Tournoud’s picture

The point of using Symfony components is not and has never been performance. The amount of indirection everywhere in Symfony is going to slow down Drupal 8 by orders of magnitude. This is by design.

Scanning the filesystem being a relatively infrequent operation anyway, could we just decide we don't care?

pounard’s picture

I agree with Damien on this one; the Finder looks good for us. But even without the Finder we still need to consider using the SPL properly, which could drastically reduce the Drupal system listing code to the instantiation of three iterator objects and a simple foreach.

A patch written by chx is actually doing that for the bootstrap/kernel stuff; I don't remember which one exactly. EDIT: See #1831350: Break the circular dependency between bootstrap container and kernel and https://drupal.org/files/1831350_22.patch

sun’s picture

  1. As long as Finder uses iterators instead of recursive iterators, it cannot be considered for core. 5 times slower is not acceptable.

    I asked upstream whether there are any plans to convert it to recursive iterators. @fabpot didn't object to it, but someone would have to perform the conversion (which isn't particularly trivial). I'm also not sure whether the switch to recursive iterators would demand a changed iterator architecture in Finder; i.e., I think a lot of investigation and architectural design work is needed there. Given the time remaining until D8 feature freeze, it appears unlikely that we can 1) fix the library upstream, and 2) get it ready for core inclusion afterwards.

  2. Filesystem scans are not as rare as you might think. Any performance decrease there significantly slows down the installer, update.php, and, from my perspective most importantly, tests. Drupal also has to perform filesystem scans when all caches are empty: the slower the scan, the higher the chance of race conditions and of parallel requests getting (b)locked. The performance impact is measurable and visible on all fronts, both for users and developers.

    E.g., only just recently, we had to tweak the existing scan functions, so as to get the performance of unit tests back under control.

  3. I certainly know of RecursiveRegexIterator and I tested it in my early benchmarks; it performed very slowly. I think it only makes sense to use that filter when subclassing it or when using it within a stack of other filters.

pounard’s picture

Did you benchmark the GlobIterator?

I did use RecursiveWhatnotIterator and RegexIterator a lot, for parsing a huge volume of XML and HTML files (converting a static site to CMS content), and I never experienced any performance issues. Actually, in most cases, the runtime of those iterators was so small compared to whatever I had to do around them that it was insignificant to me.

I'd be curious to know under which conditions you tested those iterators: which filesystem, which kind of hard drive, and which OS.

From http://stackoverflow.com/questions/11652481/php-fast-recursive-directory... the post actually benchmarks 30 seconds for reading 60,000 files/subdirs. I guess this is a lot slower than the hardcore, more C-like version, but I don't think we're ever going to parse 60,000 files/subdirs in Drupal. I'd still like to see more benchmarks before saying this is not acceptable.

Scanning the filesystem during normal runtime is, in most cases, not acceptable. But when we're dealing with module discovery, for example, we don't really care about performance, because it's a purely administrative task that we are never supposed to do during normal runtime.

Even though Simpletest sounds like an edge case whose volume we're never going to encounter elsewhere in core, I'd be happy to use the slower iterators everywhere and make exceptions in cases such as Simpletest.

EDIT: And final note, trying to benchmark the iterators with different flags (for example not returning SplFileInfo objects, but just filename instead) might also change benchmark results.
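As a sketch of what "different flags" could mean here: FilesystemIterator::CURRENT_AS_PATHNAME makes the iterator yield plain path strings instead of SplFileInfo objects. Whether that changes the timings is exactly what would need to be measured; the function name scan_pathnames_sketch() is hypothetical.

```php
<?php
// Hypothetical helper: recursively collect file paths as plain strings,
// without creating an SplFileInfo object per entry.
function scan_pathnames_sketch(string $dir): array {
  $flags = FilesystemIterator::SKIP_DOTS
    | FilesystemIterator::UNIX_PATHS
    | FilesystemIterator::CURRENT_AS_PATHNAME;
  $iterator = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($dir, $flags)
  );
  $paths = array();
  foreach ($iterator as $path) {
    // $path is a string here because of CURRENT_AS_PATHNAME.
    $paths[] = $path;
  }
  sort($paths, SORT_STRING);
  return $paths;
}
```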

bzitzow’s picture

This is an interesting thread. Did the conversation continue elsewhere?