Problem/Motivation
The extractor modules overlap heavily, which is why we created the Document Loader: it loads any type of file into text (or Markdown/HTML).
Once plugins exist for Unstructured and any future document loaders, anyone can use the Document Loader to abstract the loading of documents.
In Automator, for example, you state that you want a document loader that converts DocX to MD. Unless you go into the advanced settings because you need a specific loader, you do not have to say which one is used, only that one is available. The same applies to a Tool API tool.
That way, we can move agents or automators into recipes that do not have to care which document loader is being used, only that one exists.
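As a rough illustration of the idea above, here is a minimal, language-agnostic sketch written in Python. It is not the module's actual plugin API: `DocumentLoader`, `LoaderRegistry`, `resolve()`, and the loader IDs are all hypothetical names chosen for this example, and the `load` callables are stand-ins that do no real conversion. It only shows a caller asking for a capability (DocX to MD) and optionally pinning a specific loader.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class DocumentLoader:
    """Hypothetical loader description; not the real plugin definition."""
    loader_id: str
    source_format: str
    target_formats: frozenset
    load: Callable[[bytes, str], str]


class LoaderRegistry:
    """Hypothetical registry that resolves loaders by capability."""

    def __init__(self) -> None:
        self._loaders: List[DocumentLoader] = []

    def register(self, loader: DocumentLoader) -> None:
        self._loaders.append(loader)

    def resolve(self, source_format: str, target_format: str,
                loader_id: Optional[str] = None) -> DocumentLoader:
        # Keep only loaders that can convert source_format to target_format.
        candidates = [
            loader for loader in self._loaders
            if loader.source_format == source_format
            and target_format in loader.target_formats
        ]
        # "Advanced" case: the caller pins one specific loader.
        if loader_id is not None:
            candidates = [c for c in candidates if c.loader_id == loader_id]
        if not candidates:
            raise LookupError(
                f"No document loader for {source_format} -> {target_format}"
            )
        # Default case: the first registered match is good enough.
        return candidates[0]


def fake_load(data: bytes, target_format: str) -> str:
    # Stand-in for a real conversion; only reports what it was asked to do.
    return f"converted {len(data)} bytes to {target_format}"


registry = LoaderRegistry()
registry.register(DocumentLoader(
    "unstructured", "docx", frozenset({"md", "html", "text"}), fake_load))
registry.register(DocumentLoader(
    "example_other", "docx", frozenset({"md"}), fake_load))

# A recipe or automator only states the capability it needs.
default_loader = registry.resolve("docx", "md")
# Advanced configuration may name a specific loader.
pinned_loader = registry.resolve("docx", "md", loader_id="example_other")
```

With this shape, a recipe depends only on the capability check in `resolve()`, so it keeps working whichever loader module happens to be installed.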
Proposed resolution
- Create the Document Loader Plugin
Remaining tasks
User interface changes
API changes
Data model changes
Issue fork unstructured-3582438
Comments
Comment #2
ahmad khader commented
Comment #3
marcus_johansson commented
Comment #4
arianraeesi commented
Comment #5
ahmad khader commented
We may need to implement #3577857: Allow Document Loader to define LLM inputs, or something similar, before we can integrate with the document loader.
Comment #7
ahmad khader commented
The text fields have been migrated to the document loader. However, tables and images still use our existing automation processes. To fully migrate all automation processes to the document loader, we need to implement support for these fields/outputs within the document loader.
Comment #8
ahmad khader commented
The ticket also depends on #3577857: Allow Document Loader to define LLM inputs, since we need to configure the option per loader so we can pass our configuration to the automators.
Comment #9
ahmad khader commented
Comment #10
marcus_johansson commented
Comment #11
marcus_johansson commented
Comment #12
ahmad khader commented
Comment #13
marcus_johansson commented
Reviewing.
Comment #14
marcus_johansson commented
Ready to merge from my side.