Problem / Motivation

Currently, AI Content Chat only indexes text-based fields (text, text_long, text_with_summary, string). If a content entity has a file field holding, for example, a PDF, TXT, SRT, CSV, or DOC file, the contents of that file are ignored during indexing. Many sites store important information in uploaded documents, and that information should be searchable by the chatbot.

Proposed resolution

Add support for extracting and indexing text content from file fields. When a configured bundle has a file or media reference field, the indexer should read the file content and include it in the index.

Supported file types to start with:

  • Plain text (.txt) – read directly
  • SRT subtitles (.srt) – strip timestamps, index the text
  • CSV (.csv) – read as plain text
  • PDF (.pdf) – extract text content
  • DOCX (.docx) – extract text content
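
As an illustration of the "strip timestamps, index the text" step for SRT files, here is a minimal sketch. It is written in Python purely to show the idea and is not the module's implementation; the function name srt_to_text and the TIMING pattern are hypothetical. It assumes the usual SRT layout: cues separated by blank lines, each starting with an optional numeric index and a timing line.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,000".
TIMING = re.compile(
    r"^\d{2,}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2,}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_to_text(srt: str) -> str:
    """Return only the spoken text of an SRT document (hypothetical helper)."""
    # Normalise line endings and drop a leading byte-order mark, if any.
    cleaned = srt.replace("\r\n", "\n").lstrip("\ufeff").strip()
    text_lines = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", cleaned):
        lines = [line.strip() for line in block.split("\n") if line.strip()]
        # Drop the optional numeric cue index, then the timing line.
        if lines and lines[0].isdigit():
            lines = lines[1:]
        if lines and TIMING.match(lines[0]):
            lines = lines[1:]
        text_lines.extend(lines)
    return " ".join(text_lines)
```

Working cue by cue, rather than discarding every all-digit line, keeps a subtitle line such as "42" from being mistaken for a cue index.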

The settings form should list file fields alongside text fields when configuring content sources. The indexer should detect the file type and use the appropriate extraction method. Unsupported file types should be skipped gracefully.
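
The detect-and-dispatch behaviour described above could look roughly like the sketch below. It is Python for illustration only, not the module's code; EXTRACTORS, read_plain, read_docx, and extract_text are hypothetical names. It assumes the file type is detected from the file extension (a MIME-type check would also work) and covers only the formats that need no third-party library; PDF extraction is left out for that reason.

```python
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path
from typing import Callable, Dict, Optional

# WordprocessingML namespace used inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_plain(path: Path) -> str:
    # errors="replace" keeps indexing going on badly encoded uploads.
    return path.read_text(encoding="utf-8", errors="replace")


def read_docx(path: Path) -> str:
    # A .docx file is a ZIP archive; the body text lives in word/document.xml.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    # Simplification: nested paragraphs (e.g. in text boxes) are not de-duplicated.
    for para in root.iter(W_NS + "p"):
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


# Hypothetical registry: lowercase extension -> extraction function.
# Further types (e.g. .pdf via an external library) would register here.
EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".txt": read_plain,
    ".csv": read_plain,
    ".docx": read_docx,
}


def extract_text(path: Path) -> Optional[str]:
    """Return the extracted text, or None if the file should be skipped."""
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        return None  # Unsupported type: skip without raising.
    try:
        return extractor(path)
    except (OSError, zipfile.BadZipFile, KeyError, ET.ParseError):
        return None  # Missing, unreadable, or corrupt file: skip as well.
```

Returning None instead of raising lets the indexer carry on with the entity's remaining fields when one attachment is unsupported or damaged.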

Comments

solimanharkas created an issue.

solimanharkas commented:

Assigned: solimanharkas » Unassigned
Status: Active » Fixed

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.