Hi,
Is there any way to use other Apache Tika options, such as -T (output plain text, main content only), when extracting text from PDFs? All my PDFs have a structured layout, so whenever a line break falls in the middle of a sentence, a query for a phrase that spans the break finds nothing.
Thanks
EDIT: See also issue #1839620: Ability to configure options to pass to CLI Java
Comments
Comment #1
stevecty commented
Comment #2
Nick_vh commented
Please show us some more options. We already use -t.
Comment #3
stevecty commented
For Tika 1.1, the available options when I run "java -jar tika-app.jar -?" are:
-? or --help Print this usage message
-v or --verbose Print debug level messages
-V or --version Print the Apache Tika version number
-g or --gui Start the Apache Tika GUI
-s or --server Start the Apache Tika server
-f or --fork Use Fork Mode for out-of-process extraction
-x or --xml Output XHTML content (default)
-h or --html Output HTML content
-t or --text Output plain text content
-T or --text-main Output plain text content (main content only)
-m or --metadata Output only metadata
-j or --json Output metadata in JSON
-y or --xmp Output metadata in XMP
-l or --language Output only language
-d or --detect Detect document type
-eX or --encoding=X Use output encoding X
-z or --extract Extract all attachements into current directory
-r or --pretty-print For XML and XHTML outputs, adds newlines and whitespace, for better readability
Thanks!
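To make the request concrete, here is a minimal sketch of how the output flag could be made selectable and how hard-wrapped lines could be joined before indexing. It is an illustration only, not code from any Drupal module: the function names (build_tika_command, join_wrapped_lines, extract_text) and the mode names are hypothetical, and only the flags come from the usage message above. It assumes tika-app accepts the document path as a trailing argument.

```python
import re
import subprocess

# Output-mode flags copied from the Tika 1.1 usage message quoted above.
# The mode names on the left are made up for this sketch.
OUTPUT_FLAGS = {
    "xml": "-x",
    "html": "-h",
    "text": "-t",
    "text-main": "-T",
}


def build_tika_command(jar_path, document_path, mode="text"):
    """Return the argument list for one tika-app invocation."""
    if mode not in OUTPUT_FLAGS:
        raise ValueError(f"unknown output mode: {mode!r}")
    return ["java", "-jar", jar_path, OUTPUT_FLAGS[mode], document_path]


def join_wrapped_lines(text):
    """Turn single line breaks into spaces; keep blank lines as paragraph breaks."""
    text = text.replace("\r\n", "\n")
    paragraphs = re.split(r"\n\s*\n", text)
    joined = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in joined if p)


def extract_text(jar_path, document_path, mode="text"):
    """Run Tika and return normalized text (needs Java and the Tika jar)."""
    result = subprocess.run(
        build_tika_command(jar_path, document_path, mode),
        capture_output=True,
        text=True,
        check=True,
    )
    return join_wrapped_lines(result.stdout)
```

Joining the lines after extraction is one possible workaround for the line-break problem regardless of which flag is used; whether -T alone improves results for structured PDFs would need to be tested against real documents.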
Comment #4
amontero commented
Linking the related issue and adding it to the original post: #1839620: Ability to configure options to pass to CLI Java