OCR feature [#3008151]

I need an OCR feature for attachments. For example, if text cannot be extracted from a PDF file (because it actually contains only scanned images), we could apply an OCR fallback processor to extract text from these images.

Could you consider supporting such a feature in search_api_attachments?
Or do you think a separate module would be more relevant?

Thanks for your feedback

Comments

Comment #1

21 October 2018 at 13:12

flocondetoile created an issue. See original summary.

Comment #2

izus commented 24 October 2018 at 09:57

Hi,

I'm not against having this feature in this module at first, so please go for it if you want to.

The first case I'd prototype (POC) would be to find out whether there is a way to get the list of images in a PDF (to process them through OCR), then process the rest through an existing search_api_attachments extractor (Tika...), and finally merge the two contents to be indexed.

Hope this can be realized.

Thanks

Comment #3

flocondetoile commented 24 October 2018 at 19:56

Thanks for your feedback.

For reference
https://stackoverflow.com/questions/16564905/determine-if-pdf-file-has-searchable-text-in-php
https://stackoverflow.com/questions/32969930/how-to-detect-if-a-pdf-is-text-or-image

This could probably be done easily (I hope) if this feature lives in search_api_attachments.

Generally, if a PDF has no extractable text, it is because it is based on scanned documents and so contains only images, without a single word.

We could check the text returned by the existing extractor (Tika, pdftotext, etc.), and if it contains fewer than X words (configurable), perform OCR on the document, and why not merge the extracted text (even if it has fewer than X words) with the OCR result. The main difficulty, I believe, will be the quality of the OCR results.
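The fallback described above could be sketched roughly as follows. This is only an illustration in Python, not module code: `extract_with_ocr_fallback`, `extract_text`, `run_ocr` and the default of 20 words are hypothetical names and values chosen for the example, and the two callables stand in for whatever extractor and OCR engine are actually used.

```python
from typing import Callable


def extract_with_ocr_fallback(
    path: str,
    extract_text: Callable[[str], str],
    run_ocr: Callable[[str], str],
    min_words: int = 20,  # the configurable "X words" threshold (assumed default)
) -> str:
    """Return the extractor's text, adding OCR text when too little was found."""
    extracted = extract_text(path).strip()

    # Enough words came back: treat the PDF as having a real text layer.
    if len(extracted.split()) >= min_words:
        return extracted

    # Too few words: probably a scanned document, so run OCR as a fallback.
    ocr_text = run_ocr(path).strip()

    # Keep the little text the extractor found and append the OCR result.
    return "\n".join(part for part in (extracted, ocr_text) if part)
```

OCR only runs when the threshold is not met, so documents that already have a text layer do not pay the extra processing cost.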

This library seems to be a reference (at least as an open source project) for this job.
https://github.com/thiagoalessio/tesseract-ocr-for-php

Comment #4

mfb commented 18 January 2019 at 22:11

Tika has built-in support for Tesseract now. E.g. on Debian or Ubuntu, just `apt install tesseract-ocr` alongside tika-server or tika-app.

So you can send a PDF with embedded images to Tika and have them parsed.

With tika-server, you send the PDF to a different Tika endpoint, the recursive metadata endpoint: `http://localhost:9998/rmeta/text`, with `--header "X-Tika-PDFextractInlineImages: true"`.
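As a rough sketch of what such a request could look like from client code (Python standard library only): the URL and header are the ones quoted above, while the function names, the use of `PUT`, the `Content-Type` header and the `X-TIKA:content` key read from the JSON response are assumptions based on how tika-server usually behaves and should be checked against the Tika version in use.

```python
import json
import urllib.request

# Endpoint quoted in this comment; assumes tika-server on its default port.
TIKA_RMETA_URL = "http://localhost:9998/rmeta/text"


def build_tika_ocr_request(pdf_bytes: bytes, url: str = TIKA_RMETA_URL) -> urllib.request.Request:
    """Build (but do not send) a request asking Tika to also parse inline images."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",  # assumption: tika-server accepts uploaded documents via PUT
        headers={
            "X-Tika-PDFextractInlineImages": "true",
            "Content-Type": "application/pdf",
        },
    )


def collect_text(rmeta_json: str) -> str:
    """Join the text of the container document and of each embedded image."""
    entries = json.loads(rmeta_json)  # assumption: a JSON list, one object per (embedded) document
    parts = [entry.get("X-TIKA:content") or "" for entry in entries]
    return "\n".join(part.strip() for part in parts if part.strip())
```

Sending it would then be something like `collect_text(urllib.request.urlopen(build_tika_ocr_request(data)).read().decode("utf-8"))`, which merges the regular text and the OCR'd image text into one string for indexing.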

So support for this would need to be added here, in the search_api_attachments module.

Comment #5

RavenWood23 commented 24 February 2023 at 04:43

I had the same problem. I went to a specialist who helped me extract text from the image.

Comment #6

AprilMathews commented 24 February 2023 at 12:12

Have you tried apps that scan photos and documents? As far as I know, some of them have the function of recognizing and scanning text in photos. But the quality of the scanning is not very good.

Comment #7

izus commented 27 March 2026 at 13:57

Status: Active » Closed (won't fix)

Also, Tika natively supports OCR starting from 2.0.0: https://tika.apache.org/2.0.0/index.html
Version 10.0.8 of this module supports a config path for Tika, letting people enable or disable OCR in Tika (performance++).

Solr itself ships with Apache Tika built in.

I don't think we'll implement anything special here, apart from supporting a way to enable/disable the OCR feature of the plugins used, as we did with Tika.

