Problem/Motivation
I think that, with AI tools now widely available, it would be useful to use AI to improve text extraction from different file formats.
Proposed resolution
Add a new plugin to use the Unstructured.io service.
Remaining tasks
TBD
Issue fork search_api_attachments-3519494
Comments
Comment #2
adanielyan commented
Thank you for this! I hope this will get merged to the dev branch soon.
Comment #4
robertoperuzzo
Hi @adanielyan and @izus, I was wondering if we could use the Batch API to test the extractors on settings submission, in order to avoid a timeout. During my tests using the Unstructured.io API locally, I encountered a timeout error a couple of times.
What do you think? Would it be compatible with the other kinds of extractors?
Comment #5
robertoperuzzo
Comment #6
j-barnes commented
@robertoperuzzo – thanks for the work on this!
Our team was looking for a way to leverage search_api_attachments with Unstructured to clean up the text for our RAG search, and this is the perfect solution. We have tons of legacy content that hasn’t been OCR’d, so this works well for that.
I did run into a few issues and have a couple of wish-list items:
[…] submitConfigurationForm() fixed it for us).
[…] DelayedRequeueException loop. Increasing the Guzzle timeouts solved it locally; exposing these as configurable options would be great.
Anyway, great work on this. I hope we can get this merged in soon. I'm attaching a PDF that we have been having issues with, in case you need it for testing.
Comment #7
j-barnes commented
Comment #8
robertoperuzzo
Thank you, j-barnes, for your valuable feedback.
If I understand correctly, you have already implemented the fixes listed in your comment. If so, please add them to the MR or share a patch that can be applied.
Comment #9
j-barnes commented
Hey @robertoperuzzo,
We ended up moving forward with the Solr extractor since it fit our workflow a bit better. I did make a few local changes that helped with timeouts, but they were more of a short-term workaround (increasing timeouts and such), so I didn’t think they’d add much value upstream. Really appreciate the work you’ve done here though.