Arguably, files such as Word documents, XLS spreadsheets, PPT presentations, and PDFs are by nature "rich" in useful and relevant words in their first portions. If we indexed only the first portion of the text of those documents, we would still end up with useful additions to the search index, such as:

  • Documents: title, author, tables of contents
  • Presentations: title, agenda, etc.
  • Spreadsheets or data files: column names

So, I'm wondering if the processor settings at /admin/config/search/search-api/index/[index-name]/processors could have an option to truncate the extracted text to a certain size. The truncated text is what would both make it into the index and be cached locally.

Comments

janusman created an issue.

sassafrass’s picture

+1

izus’s picture

Hi,
If you have worked on this, could you please provide a patch?
Thanks

izus’s picture

I worked on this and added a ->limitBytes method, which is mainly a wrapper around the mb_strcut() function.
The limit is configurable in the processor settings form.
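
For readers unfamiliar with mb_strcut(): it cuts a string by byte count but avoids splitting a multibyte character. The following is only a rough Python sketch of that idea, not the module's actual code. The function name limit_bytes and its parameters are made up for illustration, and UTF-8 input is assumed.

```python
def limit_bytes(text: str, max_bytes: int) -> str:
    """Return the longest prefix of `text` whose UTF-8 encoding fits in `max_bytes`.

    Hypothetical helper for illustration; it mimics the byte-based,
    character-safe cut that PHP's mb_strcut() performs from offset 0.
    """
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")

    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        # Already within the limit, nothing to cut.
        return text

    cut = max_bytes
    # UTF-8 continuation bytes look like 10xxxxxx. If the byte at the cut
    # position is one of them, the cut would land inside a character, so
    # move back to the start of that character.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1

    return encoded[:cut].decode("utf-8")


if __name__ == "__main__":
    extracted = "Résumé: quarterly report, table of contents, ..."
    print(limit_bytes(extracted, 10))
```

Cutting by bytes rather than by characters keeps the size of the indexed and cached text predictable, at the cost of the result sometimes being a byte or two shorter than the configured limit.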

  • izus committed a8255d2 on 8.x-1.x
    Issue #2888827 by izus: Add option to index only first N bytes of...

izus’s picture

Status: Active » Fixed

I pushed the commit for this.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.