Problem/Motivation

In the event that the image toolkit is used on a large PDF file.

It spawns a ghostscript process in the background for every single page within the PDF during the 'identify' stage, which ends up needlessly using up loads of server resources, and can sometimes lead to zombie processes being left behind when a PHP timeout occurs as per #3184834: Broken images when consumed all free memory (in the case where 10s of PDFs are called by the toolkit at the same time).

This is especially the case when the identify process tries to load all the pages just to specify that the file is a PDF. This consistently resulted in a timeout.

This optimization issue was also highlighted in #2815483-13: Convert pdf to jpg

Steps to reproduce

1. Set ImageMagick as the default toolkit.
2. Attempt to call something along the lines of this with lots of pages:

// e.g. https://research.nhm.org/pdfs/10840/10840.pdf
// My specific example was a 150 page long PDF with tables and and lots of illustrations.
$image = \Drupal::service('image.factory')->get('public://my-very-large.pdf');
// This also doesn't work (and instead creates multiple PNG images for each page because the file type wasn't explicitly set)
image->save('public://my-very-large.pdf.png');

Proposed resolution

Provide a way to specify how many frames are processed by default for the 'identify' and 'convert' process for a specific file type.

In my specific case, I don't care to identify if there are more than 1 or 2 pages for PDFs within the toolkit, so we should be able to specify that default configuration for optimizations and timeout purposes.

This can now be achieved through two new properties on the file type configuration (identify_frames and convert_frames):

  PDF:
    mime_type: application/pdf
    enabled: true
    weight: 0
    # For efficiency reasons, limit identification to just the first 5 pages by default.
    identify_frames: '[0-4]'
    # Rarely needed, but limit all conversions to this file type to just the first 2 pages by default.
    # convert_frames: '[0-1]'

Remaining tasks

1. Provide MR
2. Provide working tests.

User interface changes

N/A

API changes

N/A

Data model changes

N/A

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

codebymikey created an issue. See original summary.

codebymikey’s picture

Issue summary: View changes
mondrake’s picture

Thanks! Please target 5.0.x. Yes, this would need a config toggle in the UI. Also would need a test and a fixture pdf test file.

codebymikey’s picture

Assigned: Unassigned » codebymikey

Makes sense, on further testing, enabling this for all extensions might not be desirable, as as per the failing tests, this can potentially break animated gifs (as only the first frame would then be processed).

Initially planned to keep it simple and implement this as a checkbox (and have it apply on default file types of our choosing e.g. PDF, PSD and TIFF), but it might be better to let the user specify the default image index range to use for each file type.

I think it might be easier to expand the existing Enable/disable image formats config to include an additional property called page_index or similar, that'd have the default index to apply to those file types if it's not been specified on the source:

TIFF:
  mime_type: image/tiff
  enabled: true
  # Process first two images.
  page_index: '1-2'
PDF:
  mime_type: application/pdf
  enabled: true
  # Process just the first page.
  page_index: '1'

That way, it's backwards compatible, and we don't have to introduce any new config properties. And we can have reasonable defaults for new installations.

mondrake’s picture

#5 sounds great

codebymikey’s picture

Issue summary: View changes
Status: Active » Needs work

Tests are still failing, looking into it.

codebymikey’s picture

Assigned: codebymikey » Unassigned

Tests are passing for me locally (on 11.x), but not on gitlab CI, I unfortunately don't have much more bandwidth to work on getting the tests to pass, so leaving it for now and will use the patch internally. Feel free to pick it up and amend if you're interested.

Whilst testing the source frame functionality, there were some interesting differences between the default behaviour between imagemagick and graphicsmagick when the +adjoin option is not specified and multiple frames are requested, so added a comment for it within the current test.

codebymikey’s picture

Version: 4.0.x-dev » 5.0.x-dev
Issue summary: View changes

mondrake changed the visibility of the branch 3557458-process-first-page to hidden.

mondrake’s picture

Test failures are legit - it looks like the parsing of the fixture file does not return the source format.

codebymikey’s picture

There's something off going on with CI, as the tests still pass locally, testing with https://github.com/ddev/ddev-drupal-contrib.

I even downloaded the entire artifact from the composer (next minor) pipeline, into a clean folder and retested locally, and I still couldn't get the failure:

imagemagick tests passing on a clean project

When you get the chance, can you confirm if you're able to get the same failures locally with something along the lines of:

ddev phpunit web/modules/custom/imagemagick/tests/src/Functional/EventSubscriberTest.php --filter 'testDefaultSourceFrames' --testdox --colors=always --fail-on-phpunit-deprecation --fail
-on-deprecation -c web/core

I've also pushed an update which ensures that the default identify frame configuration still takes effect in the event the user tries to identify a disabled image format (otherwise it still attempts the potential heavy processing).

mondrake’s picture

Looks like my local and GitLabCI are missing ghostscript: https://stackoverflow.com/questions/53569383/imagick-fails-to-execute-gs...

mondrake’s picture

Status: Needs work » Needs review
mondrake’s picture

Status: Needs review » Needs work

We may need to:

1) add some info in the README to guide people to install ghostscript and other packages (if need be) if they want to deal with multipage PDFs

2) consider setting (for new installations, no update path) identify_frames: [0] in the shipped configuration for the PDF format

codebymikey’s picture

Status: Needs work » Needs review

Added the PDF/postscript ghostscript requirements to the README, and set the default identify_frames config for PDFs.

Also added additional message contexts to ToolkitImagemagickFileMetadataTest, so that it's easier to track the exact file or operation was being ran when an assert failure occurred, otherwise it's a bit vague since it loops through multiple files - https://git.drupalcode.org/project/imagemagick/-/jobs/7493040#L549

  • mondrake committed 1f0b07e3 on 5.0.x authored by codebymikey
    feat: #3557458 Ensure that only the first page of images is processed by...
mondrake’s picture

Status: Needs review » Fixed

This is a good addition. Thanks!

Now that this issue is closed, review the contribution record.

As a contributor, attribute any organization that helped you, or if you volunteered your own time.

Maintainers, credit people who helped resolve this issue.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.