Add option to index only first N bytes of extracted text [#2888827]

Arguably, files such as Word documents, XLS files, PPT files, and PDFs are by nature rich in useful and relevant words in their first portions. If we indexed only the first portion of the text of those documents, we would still end up with useful additions to the search index, such as:

Documents: title, author, tables of contents
Presentations: title, agenda, etc.
Spreadsheets or data files: column names

So, I'm wondering whether the options at /admin/config/search/search-api/index/[index-name]/processors could include an option to truncate the extracted text to a certain size. The truncated text is what would make it into the index and would also be cached locally.
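
To make the request concrete, here is a minimal sketch of the truncation step, written in Python purely for illustration. It is not the module's code: the function name `truncate_utf8` and the parameter `max_bytes` are hypothetical, and treating a non-positive limit as "keep nothing" is an assumption. The one detail worth showing is that cutting at a byte count can split a multi-byte UTF-8 character, so the partial trailing character is dropped.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of `text` whose UTF-8 encoding fits in `max_bytes`.

    Hypothetical helper for illustration only; names and the handling of
    non-positive limits are assumptions, not taken from the module.
    """
    if max_bytes <= 0:
        # Assumption: a non-positive limit keeps no text at all.
        return ""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        # Already within the limit, nothing to cut.
        return text
    # Slicing bytes may cut a multi-byte character in half; since the input
    # was valid UTF-8, errors="ignore" only drops that partial trailing sequence.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    extracted = "Résumé: title, author, table of contents, then many pages of body text"
    print(truncate_utf8(extracted, 20))
```

The truncated string is what would then be sent to the index and written to the local cache, so both stay bounded by the same limit.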

Issue #2888827 by izus: Add option to index only first N bytes of...

Comment #6

izus commented 13 July 2018 at 16:41

Status: Active » Fixed

I pushed the commit for this.

Comment #7

27 July 2018 at 16:44

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.
