I need an OCR feature for attachments. For example, if a PDF file has no extractable text (because it actually contains only scanned images), we could apply an OCR fallback processor to extract text from these images.

Could you consider supporting such a feature in search_api_attachments?
Or do you think a separate module would be more appropriate?

Thanks for your feedback

Comments

flocondetoile created an issue. See original summary.

izus’s picture

Hi,

I'm not against having this feature in this module to begin with, so please go for it if you want to.

The first thing I'd prototype is whether there is a way to get the list of images in a PDF (to process them through OCR), then process the rest through an existing search_api_attachments extractor (Tika, ...), and finally merge the two contents to be indexed.

Hope this can be done.

Thanks

flocondetoile’s picture

Thanks for your feedback.

For reference
https://stackoverflow.com/questions/16564905/determine-if-pdf-file-has-searchable-text-in-php
https://stackoverflow.com/questions/32969930/how-to-detect-if-a-pdf-is-text-or-image

This could probably be done easily (I hope) if this feature lives in search_api_attachments.

Generally, if a PDF has no extractable text, it is because it is made of scanned documents, so it contains only images and not a single word.

We could check the text returned by the existing extractor (Tika, pdftotext, etc.), and if it has fewer than X words (configurable), perform OCR on the document. Why not also merge the extracted text (even if it has fewer than X words) with the OCR result? The main difficulty, I believe, will be the quality of the OCR results.
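The fallback described above can be sketched roughly as follows. This is only an illustration in Python, not code from the module: `extract_with_ocr_fallback`, `count_words`, the `extract` and `ocr` callables, and the default threshold of 20 words are all hypothetical names and values chosen for the example.

```python
import re
from typing import Callable


def count_words(text: str) -> int:
    """Count word-like tokens in the extracted text."""
    return len(re.findall(r"\w+", text))


def extract_with_ocr_fallback(
    path: str,
    extract: Callable[[str], str],  # hypothetical: wraps the existing extractor
    ocr: Callable[[str], str],      # hypothetical: wraps an OCR tool
    min_words: int = 20,            # the configurable "X words" threshold
) -> str:
    """Return extractor output, falling back to OCR when it is too short."""
    text = extract(path)
    if count_words(text) >= min_words:
        # Enough text was extracted, so OCR is skipped entirely.
        return text
    # Too little text: run OCR and merge it with whatever was extracted.
    ocr_text = ocr(path)
    parts = (text.strip(), ocr_text.strip())
    return "\n".join(part for part in parts if part)
```

Passing the extractor and the OCR step as callables keeps the threshold logic independent of which tools are actually used.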

This library seems to be a reference (at least as an open source project) for this job.
https://github.com/thiagoalessio/tesseract-ocr-for-php

mfb’s picture

Tika has built-in support for Tesseract now. E.g. on Debian or Ubuntu, just `apt install tesseract-ocr` alongside tika-server or tika-app.

So you can send a PDF with embedded images to Tika and have them parsed.

With tika-server, you send the PDF to a different Tika endpoint, the recursive metadata endpoint, and enable inline image extraction with a request header: http://localhost:9998/rmeta/text --header "X-Tika-PDFextractInlineImages: true"

So support for this endpoint would need to be added here in the search_api_attachments module.
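As a rough sketch of what such a request could look like from client code (in Python, for illustration only): the URL and header come from the comment above, while the use of PUT, the assumption that the response is a JSON array of metadata objects carrying their text under `X-TIKA:content`, and the function names are assumptions made for this example.

```python
import json
import urllib.request

# Endpoint taken from the comment above; adjust host and port as needed.
TIKA_RMETA_URL = "http://localhost:9998/rmeta/text"


def build_rmeta_request(pdf_bytes: bytes, url: str = TIKA_RMETA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika to also extract inline PDF images."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",  # assumption: the document is uploaded with PUT
        headers={"X-Tika-PDFextractInlineImages": "true"},
    )


def merge_rmeta_content(response_body: str) -> str:
    """Join the text of the container document and its embedded parts.

    Assumes a JSON array of metadata objects, each optionally holding
    its extracted text under the "X-TIKA:content" key.
    """
    parts = json.loads(response_body)
    texts = [(part.get("X-TIKA:content") or "").strip() for part in parts]
    return "\n".join(text for text in texts if text)


def extract_text(pdf_path: str) -> str:
    """Send a PDF to a running tika-server and return the merged text."""
    with open(pdf_path, "rb") as handle:
        request = build_rmeta_request(handle.read())
    with urllib.request.urlopen(request) as response:
        return merge_rmeta_content(response.read().decode("utf-8"))
```

`extract_text` needs a running tika-server with Tesseract installed; the two helper functions can be exercised without one.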

RavenWood23’s picture

I had the same problem. I went to a specialist who helped me extract text from the image.

AprilMathews’s picture

Have you tried apps that scan photos and documents? As far as I know, some of them have the function of recognizing and scanning text in photos. But the quality of the scanning is not very good.

izus’s picture

Status: Active » Closed (won't fix)

Also, Tika natively supports OCR starting from 2.0.0: https://tika.apache.org/2.0.0/index.html
Version 10.0.8 of this module supports a config path for Tika, letting people enable or disable OCR in Tika (performance++).

Solr itself ships with Apache Tika built-in

I don't think we'll implement anything special here apart from supporting a way to enable or disable the OCR feature of the plugins used, as we did with Tika.
