Closed (won't fix)
Project:
Search API attachments
Version:
8.x-1.x-dev
Component:
Code
Priority:
Normal
Category:
Feature request
Assigned:
Unassigned
Reporter:
Created:
21 Oct 2018 at 13:12 UTC
Updated:
27 Mar 2026 at 13:57 UTC
Comments
Comment #2
izus commented
Hi,
I'm not against having this feature in this module in principle, so please go ahead if you want to.
The first case I'd prototype is finding out whether there is a way to get the list of images in a PDF (to process them through OCR), then process the rest through an existing search_api_attachments extractor (Tika, etc.), and finally merge the two results to be indexed.
I hope this can be done.
Thanks
Comment #3
flocondetoile
Thanks for your feedback.
For reference:
https://stackoverflow.com/questions/16564905/determine-if-pdf-file-has-searchable-text-in-php
https://stackoverflow.com/questions/32969930/how-to-detect-if-a-pdf-is-text-or-image
This could probably be done fairly easily (I hope) if the feature lives in search_api_attachments.
Generally, if a PDF has no extractable text, it is because it was built from scanned documents and so contains only images, without a single word of text.
We could check the text returned by the existing extractor (Tika, pdftotext, etc.) and, if it contains fewer than X words (configurable), run OCR on the document. We could also merge the extracted text (even if it is under X words) with the OCR result. The main difficulty, I believe, will be the quality of the OCR output.
This library seems to be a reference (at least among open source projects) for this job:
https://github.com/thiagoalessio/tesseract-ocr-for-php
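The word-count fallback described in this comment could be sketched roughly as follows. This is only an illustration in Python, not module code: `extract_with_ocr_fallback`, `extract_text`, `run_ocr` and the default threshold of 20 words are all hypothetical names and values chosen for the example, and the two callables stand in for whatever extractor and OCR tool are actually used.

```python
from typing import Callable


def extract_with_ocr_fallback(
    path: str,
    extract_text: Callable[[str], str],  # hypothetical: wraps the existing extractor
    run_ocr: Callable[[str], str],       # hypothetical: wraps an OCR tool
    min_words: int = 20,                 # assumed default; the "X words" setting
) -> str:
    """Return extracted text, falling back to OCR when too few words are found."""
    text = extract_text(path)
    # Enough words: trust the regular extractor and skip the costly OCR pass.
    if len(text.split()) >= min_words:
        return text
    # Too few words: the file is probably a scan, so run OCR and merge
    # whatever little text was extracted with the OCR result.
    ocr_text = run_ocr(path)
    parts = (text.strip(), ocr_text.strip())
    return "\n".join(part for part in parts if part)
```

Passing the extractor and the OCR step in as callables keeps the threshold logic independent of any particular tool, so the same check would apply whichever extractor is configured.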
Comment #4
mfb
Tika has built-in support for Tesseract now. E.g. on Debian or Ubuntu, just `apt install tesseract-ocr` alongside tika-server or tika-app.
So you can send a PDF with embedded images to Tika and have them parsed.
With tika-server, you send the PDF to a different Tika endpoint, the recursive metadata endpoint: `http://localhost:9998/rmeta/text --header "X-Tika-PDFextractInlineImages: true"`
So support would need to be added here in the search_api_attachments module.
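For illustration, the request described in this comment might look like the following Python sketch. It assumes a tika-server listening on `http://localhost:9998`, that the PDF is uploaded with PUT (as `curl -T` does), and that the endpoint returns a JSON array with one object per document or embedded resource, each carrying its text under `X-TIKA:content`. The function names are hypothetical and this is not how the module itself talks to Tika.

```python
import json
import urllib.request


def build_rmeta_request(pdf_bytes: bytes, base_url: str = "http://localhost:9998"):
    """Build a PUT request for the recursive metadata endpoint (text output)."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/rmeta/text",
        data=pdf_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/pdf",
            "Accept": "application/json",
            # Ask the PDF parser to extract inline images so they can be OCRed.
            "X-Tika-PDFextractInlineImages": "true",
        },
    )


def merge_rmeta_content(response_body) -> str:
    """Join the text of the container document and its embedded resources."""
    documents = json.loads(response_body)
    parts = [(doc.get("X-TIKA:content") or "").strip() for doc in documents]
    return "\n".join(part for part in parts if part)


def extract_pdf_text(pdf_bytes: bytes, base_url: str = "http://localhost:9998") -> str:
    """Send the PDF to a running tika-server and return the merged text."""
    with urllib.request.urlopen(build_rmeta_request(pdf_bytes, base_url)) as resp:
        return merge_rmeta_content(resp.read())
```

Only `extract_pdf_text` needs a running server; the other two functions can be exercised offline, which makes the header and the merge step easy to check.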
Comment #5
RavenWood23 commented
I had the same problem. I went to a specialist who helped me extract text from the image.
Comment #6
AprilMathews commented
Have you tried apps that scan photos and documents? As far as I know, some of them can recognize and scan text in photos, but the quality of the scanning is not very good.
Comment #7
izus commented
Also, Tika natively supports OCR starting from 2.0.0: https://tika.apache.org/2.0.0/index.html
Version 10.0.8 of this module supports a config path for Tika, letting people enable or disable OCR in Tika (performance++).
Solr itself ships with Apache Tika built in.
I don't think we'll implement anything special here apart from supporting a way to enable or disable the OCR feature of the plugins in use, as we did with Tika.