Hi,

Is there any way to use other Apache Tika options, such as -T (output plain text content, main content only), to extract text from PDFs? All my PDFs have a structured layout, and whenever a sentence is split by a line break, it cannot be matched by a query.

Thanks

EDIT: See also issue #1839620: Ability to configure options to pass to CLI Java

Comments

stevecty’s picture

Title: Add ability to use apache tina options. » Add ability to use apache tika options.

Nick_vh’s picture

Status: Active » Needs work

Could you list the other options you have in mind? We already use -t:

    'extractFormat' => 'text', // Matches the -t command for the tika CLI app.
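
To make the mapping concrete, here is a minimal sketch of how a format name could be translated into a Tika CLI flag. The function name `tika_format_flag` is hypothetical, and only the `text` value is confirmed by the setting above; the other names are assumptions borrowed from Tika's long option names, not values the module is known to accept.

```shell
# Hypothetical helper: translate a format name into the matching Tika CLI flag.
# Only 'text' is confirmed by the module setting; the other names are
# assumptions taken from Tika's long option names.
tika_format_flag() {
  case "$1" in
    text)      printf '%s\n' "-t" ;;  # plain text content
    text-main) printf '%s\n' "-T" ;;  # plain text, main content only
    xml)       printf '%s\n' "-x" ;;  # XHTML content (Tika's default)
    html)      printf '%s\n' "-h" ;;  # HTML content
    metadata)  printf '%s\n' "-m" ;;  # metadata only
    *)         printf 'unknown format: %s\n' "$1" >&2; return 1 ;;
  esac
}

tika_format_flag text-main   # prints -T
```

Rejecting unknown names, rather than passing arbitrary strings through, keeps user-supplied configuration from reaching the command line unchecked.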

stevecty’s picture

For Tika 1.1, the options listed when I run "java -jar tika-app.jar -?" are:

    -?  or --help          Print this usage message
    -v  or --verbose       Print debug level messages
    -V  or --version       Print the Apache Tika version number

    -g  or --gui           Start the Apache Tika GUI
    -s  or --server        Start the Apache Tika server
    -f  or --fork          Use Fork Mode for out-of-process extraction

    -x  or --xml           Output XHTML content (default)
    -h  or --html          Output HTML content
    -t  or --text          Output plain text content
    -T  or --text-main     Output plain text content (main content only)
    -m  or --metadata      Output only metadata
    -j  or --json          Output metadata in JSON
    -y  or --xmp           Output metadata in XMP
    -l  or --language      Output only language
    -d  or --detect        Detect document type
    -eX or --encoding=X    Use output encoding X
    -z  or --extract       Extract all attachements into current directory
    -r  or --pretty-print  For XML and XHTML outputs, adds newlines and whitespace, for better readability
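
For comparison, the two plain-text modes would be invoked as sketched below. The helper `tika_command` and the file name `sample.pdf` are hypothetical, and the jar path is an assumption. The helper only prints the command lines instead of running them, and whether -T actually helps with the line-break problem has not been tested here.

```shell
# Assumed jar location; override TIKA_JAR to match your install.
TIKA_JAR="${TIKA_JAR:-tika-app.jar}"

# Hypothetical helper: print (not run) the Tika command for a flag and a file.
tika_command() {
  printf 'java -jar %s %s "%s"\n' "$TIKA_JAR" "$1" "$2"
}

tika_command -t sample.pdf   # plain text content
tika_command -T sample.pdf   # plain text content, main content only
```

Running the two printed commands on the same PDF and diffing the output should show what -T drops compared with -t.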

Thanks!

amontero’s picture