Add a plugin for Unstructured.io [#3519494]

Problem/Motivation

With AI tooling now widely available, it would be useful to improve text extraction from different file formats by using an AI-based service.

Proposed resolution

Add a new plugin to use the Unstructured.io service.

Remaining tasks

TBD

Comment	File	Size	Author
#6	test.pdf	1.02 MB	j-barnes

Issue fork search_api_attachments-3519494

Branch: 3519494-add-unstructured-io-plugin (MR !42)

Comments

Comment #1

16 April 2025 at 13:21

robertoperuzzo created an issue. See original summary.

Comment #2

adanielyan commented 27 May 2025 at 05:15

Thank you for this! I hope this will get merged to the dev branch soon.

Comment #3

25 June 2025 at 21:43

robertoperuzzo opened merge request !42

Comment #4

robertoperuzzo commented 26 June 2025 at 06:19

Hi @adanielyan and @izus, I was wondering if we could use the Batch API to test the extractors on settings submission, in order to avoid a timeout. During my tests using the Unstructured.io API locally, I encountered a timeout error a couple of times.

What do you think? Would it be compatible with the other kinds of extractors?
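To make the idea concrete, here is a minimal sketch of how a batch definition for testing extractors could be assembled. It is not taken from the merge request: the function names and the plugin IDs are hypothetical, and the array keys (`title`, `operations`) follow the structure that Drupal's `batch_set()` expects. The sketch only builds the array and simulates one operation, so it runs without Drupal.

```php
<?php

declare(strict_types=1);

/**
 * Batch operation callback (hypothetical name): tests a single extractor.
 *
 * Drupal's Batch API passes $context by reference as the last argument.
 */
function example_test_extractor(string $plugin_id, array &$context): void {
  // A real implementation would load the extractor plugin here and run it
  // against a sample file. This sketch only records that the step ran.
  $context['results'][$plugin_id] = 'tested';
  $context['message'] = "Testing extractor: $plugin_id";
}

/**
 * Builds a batch definition with one operation per extractor plugin ID.
 */
function example_build_extractor_test_batch(array $plugin_ids): array {
  $operations = [];
  foreach ($plugin_ids as $plugin_id) {
    // Each operation is a [callable, arguments] pair.
    $operations[] = ['example_test_extractor', [$plugin_id]];
  }

  return [
    'title' => 'Testing extractors',
    'operations' => $operations,
  ];
}
```

In a settings form submit handler, the returned array would be passed to `batch_set()`. Each operation then runs outside the form submission request, although a single slow API call inside one operation could still hit a timeout.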

Comment #5

robertoperuzzo commented 1 July 2025 at 21:32

Assigned:	robertoperuzzo	» Unassigned
Status:	Active	» Needs review

Comment #6

j-barnes commented 3 July 2025 at 17:29

Status	File	Size
new	test.pdf	1.02 MB

@robertoperuzzo – thanks for the work on this!

Our team was looking for a way to leverage search_api_attachments with Unstructured to clean up the text for our RAG search, and this is the perfect solution. We have tons of legacy content that hasn’t been OCR’d, so this works well for that.

I did run into a few issues and have a couple of wish-list items:

Chunking elements – the form values don’t persist after save (adding those fields to submitConfigurationForm() fixed it for us).

Chunking settings not applied – the options aren’t being included in the payload request, so they don’t appear to take effect yet.

Large files time out – anything over ~1 MB (we have one that’s 1.1 MB) hits a DelayedRequeueException loop. Increasing the Guzzle timeouts solved it locally; exposing these as configurable options would be great.

php
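  // Requires: use GuzzleHttp\RequestOptions;
  $options = [
    RequestOptions::HEADERS => [
      'Accept'               => 'application/json',
      // Falls back to an empty string if no key is available.
      'unstructured-api-key' => $api_key?->getKeyValue() ?? '',
    ],
    RequestOptions::MULTIPART => [
      // The file to extract text from.
      [
        'name'     => 'files',
        'contents' => $file_resource,
        'filename' => $file->getFilename(),
        'headers'  => [
          'Content-Type' => $file_mime_type,
        ],
      ],
      // Extraction strategy, hard-coded here for our non-OCR'd PDFs.
      [
        'name'     => 'strategy',
        'contents' => 'ocr_only',
      ],
    ],
    // Raised timeouts (in seconds) so that large files can finish processing.
    RequestOptions::TIMEOUT         => 300,
    RequestOptions::CONNECT_TIMEOUT => 30,
    RequestOptions::READ_TIMEOUT    => 300,
  ];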

Expose strategy choices – it would be awesome to have a dropdown for extraction strategy (e.g., ocr_only, hi_res, etc.) so we can pick the best option for non-OCR’d PDFs.
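The points above (chunking options missing from the payload, configurable timeouts, a selectable strategy) could be handled by building the request options from the plugin configuration. The following is only a sketch under stated assumptions: the helper name and the configuration keys are hypothetical, the `chunking_strategy` and `max_characters` field names should be checked against the Unstructured API documentation, and plain string keys stand in for the `GuzzleHttp\RequestOptions` constants so that the code runs without Guzzle installed.

```php
<?php

declare(strict_types=1);

/**
 * Builds Guzzle-style request options from configuration (hypothetical helper).
 *
 * @param array $config
 *   Plugin configuration; all key names here are assumptions.
 * @param array $file_part
 *   The multipart entry for the uploaded file.
 * @param string $api_key
 *   The Unstructured API key.
 */
function example_build_unstructured_options(array $config, array $file_part, string $api_key): array {
  // Defaults mirror the values used in the snippet above.
  $config += [
    'strategy' => 'ocr_only',
    'chunking_strategy' => '',
    'max_characters' => NULL,
    'timeout' => 300,
    'connect_timeout' => 30,
    'read_timeout' => 300,
  ];

  $multipart = [
    $file_part,
    ['name' => 'strategy', 'contents' => (string) $config['strategy']],
  ];

  // Send chunking fields only when a chunking strategy has been selected.
  if ($config['chunking_strategy'] !== '') {
    $multipart[] = [
      'name' => 'chunking_strategy',
      'contents' => (string) $config['chunking_strategy'],
    ];
    if ($config['max_characters'] !== NULL) {
      $multipart[] = [
        'name' => 'max_characters',
        'contents' => (string) $config['max_characters'],
      ];
    }
  }

  return [
    'headers' => [
      'Accept' => 'application/json',
      'unstructured-api-key' => $api_key,
    ],
    'multipart' => $multipart,
    // Timeouts are in seconds and come from configuration.
    'timeout' => (float) $config['timeout'],
    'connect_timeout' => (float) $config['connect_timeout'],
    'read_timeout' => (float) $config['read_timeout'],
  ];
}
```

With this shape, the values saved in submitConfigurationForm() would feed straight into the request, and a strategy dropdown would only need to write the `strategy` key.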

Anyway, great work on this. I hope we can get it merged soon. I'm attaching the PDF that we have been having issues with, in case you need it for testing.

Comment #7

j-barnes commented 3 July 2025 at 17:29

Status:	Needs review	» Needs work

Comment #8

robertoperuzzo commented 25 September 2025 at 07:47

Thank you, j-barnes, for your valuable feedback.
If I understand correctly, you have already implemented the fixes listed in your comment. If so, please add them to the MR or share a patch that we can apply.

Comment #9

j-barnes commented 8 October 2025 at 16:08

Hey @robertoperuzzo,

We ended up moving forward with the Solr extractor since it fit our workflow a bit better. I did make a few local changes that helped with timeouts, but they were more of a short-term workaround (increasing timeouts and such), so I didn’t think they’d add much value upstream. Really appreciate the work you’ve done here though.
