https://github.com/sajari/docconv is a simple Go binary that can extract text from a number of document types.

Comment  File               Size     Author
#10      2842299-10.patch   8.84 KB  benjy
#8       2842299-8.patch    8.11 KB  benjy
#4       2842299-4.patch    8.09 KB  benjy

Comments

benjy created an issue. See original summary.

izus’s picture

Hi,
I'm OK with it :)
If anyone wants to suggest a patch with a plugin for docconv, please feel free to do so.
Thanks

benjy’s picture

I've already written it, will tidy up and post it tomorrow.

I had to make some changes to the FilesExtractor base class, though, because filterForPropertyPath() no longer exists in the version of Search API I'm using.

benjy’s picture

Status: Active » Needs review
File status: new, size: 8.09 KB

Here's a first pass.

It looks like methods this module depended on were removed last November in #2795861: Remove all deprecated methods. I noticed automated testing is disabled for this module. It might be worth adding a few tests, which would catch breaks between this module and Search API much earlier.

benjy’s picture

Any chance of a review and getting this committed?

izus’s picture

Hi,
We need people to test here and mark it as RTBC if everything works well.
Also, adding a usage example of the new plugin to the README file might help more contributors try it.

benjy’s picture

File status: new, size: 8.11 KB

Here's a re-roll against HEAD.

izus’s picture

Status: Needs review » Needs work

I tried to apply it, but it seems to need a rebase.

We lack people to test and RTBC here, but given how long this issue has been open, and because it only adds a separate plugin, I'm OK to merge it without waiting for an RTBC. This will help move things forward and doesn't change anything in the existing code.

Please rebase it and I'll merge the new plugin.

Thanks for your contribution

benjy’s picture

Status: Needs work » Needs review
File status: new, size: 8.84 KB

Re-roll.

  • izus committed 768ee61 on 8.x-1.x authored by benjy
    Issue #2842299 by benjy, izus: Add support for docconv
    
izus’s picture

Status: Needs review » Fixed

Thanks.
This is now merged and will be part of the next release.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

andyg5000’s picture

Thanks for writing this. I was hoping to have a Go binary that I could include to do the extraction. It looks like this is just a wrapper for pdftotext, which has to be installed anyway and is already supported by this module.

Sorry to update an old issue, just providing my notes for others if they're looking to do the same. Let me know if I'm wrong on the above!