Problem / Motivation
Currently, AI Content Chat only indexes text-based fields (text, text_long, text_with_summary, string). If a content entity has a file field holding, for example, a PDF, TXT, SRT, CSV, or DOC file, the content of that file is ignored during indexing. Many sites store important information in uploaded documents, and that information should be searchable by the chatbot.
Proposed resolution
Add support for extracting and indexing text content from file fields. When a configured bundle has a file or media reference field, the indexer should read the file content and include it in the index.
Supported file types to start with:
- Plain text (.txt) – read directly
- SRT subtitles (.srt) – strip timestamps, index the text
- CSV (.csv) – read as plain text
- PDF (.pdf) – extract text content
- DOCX (.docx) – extract text content
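As an illustration of the "strip timestamps, index the text" step for SRT files, here is a minimal sketch. It is written in Python only to show the logic and is not the module's code; the function name `srt_to_text` is hypothetical. It assumes the usual SRT layout: a numeric counter, a timing line, then one or more text lines, with blank lines between cues.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,000".
TIMING = re.compile(
    r"^\d{2,}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2,}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_to_text(srt: str) -> str:
    """Return only the subtitle text of an SRT document, one cue per line."""
    # Remove a leading byte-order mark and normalise Windows line endings.
    normalised = srt.lstrip("\ufeff").replace("\r\n", "\n").strip()
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalised):
        lines = [line.strip() for line in block.split("\n") if line.strip()]
        # Drop the numeric cue counter, if present.
        if lines and lines[0].isdigit():
            lines = lines[1:]
        # Drop the timing line, if present.
        if lines and TIMING.match(lines[0]):
            lines = lines[1:]
        if lines:
            cues.append(" ".join(lines))
    return "\n".join(cues)
```

Joining each cue's lines with a space keeps sentences that wrap across subtitle lines together, which is likely to suit text indexing better than preserving the on-screen line breaks.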
The settings form should list file fields alongside text fields when configuring content sources. The indexer should detect the file type and use the appropriate extraction method. Unsupported file types should be skipped gracefully.
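The "detect the file type, pick an extraction method, skip unsupported types" behaviour could be sketched as a lookup keyed by file extension. This is a language-neutral illustration in Python, not the module's implementation. The names `EXTRACTORS`, `read_plain_text`, and `extract_text` are hypothetical, and only the plain-text cases are wired up. PDF and DOCX extractors would be registered in the same table once an extraction library is chosen.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def read_plain_text(path: Path) -> str:
    """Read a file as UTF-8 text, replacing any undecodable bytes."""
    return path.read_text(encoding="utf-8", errors="replace")


# Hypothetical registry: lowercase file extension -> extraction function.
# Only the "read directly" types from the list above are included here.
EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".txt": read_plain_text,
    ".csv": read_plain_text,
}


def extract_text(path: Path) -> Optional[str]:
    """Return the indexable text for a file, or None if it should be skipped."""
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        # Unsupported type: skip without raising so indexing can continue.
        return None
    try:
        return extractor(path)
    except OSError:
        # Missing or unreadable file: treat the same as unsupported.
        return None
```

Returning `None` rather than raising lets the caller index the entity's remaining fields even when one attached file cannot be read.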
Issue fork ai_content_chat-3573456