Problem/Motivation

The postgres VDB provider creates a DB column in the vector table for every field in the Search API index, regardless of its indexing role. Fields configured as main_content or contextual_content send their data to the embedding — the corresponding columns are never populated. This creates unnecessary schema bloat that grows with every field added to the index.

Additionally, fields configured as attributes are stored directly on each chunk row. Since a single entity can produce multiple chunks, these values are duplicated identically across every chunk for the same entity.

Proposed resolution

1. Don't create columns for non-attribute fields by default. Fields with role main_content or contextual_content should not produce a DB column, since their data goes to the embedding vector, not the column. This should be overridable with a per-field opt-in (e.g., "also store as column") for cases where a field needs to be both embedded and stored for direct querying — for example, storing title as a column to support keyword LIKE matching alongside vector similarity search.

2. Normalize entity-level attributes into a separate table. Move attribute fields to a one-row-per-entity table joined by drupal_entity_id. The chunk table would retain only chunk-specific data: id, content, embedding, drupal_entity_id, drupal_long_id, server_id, index_id. This eliminates per-chunk duplication of entity metadata and simplifies metadata-only updates (one row to update instead of N).

3. Include an update hook that migrates existing data into the new schema without regenerating embeddings. The embeddings are already correct — requiring a full re-index just to reorganize metadata would trigger unnecessary embedding API calls for every indexed entity.

Remaining tasks

Working on a patch.

Data model changes

Before: Chunk table has columns for every indexed field, most never populated. Entity-level attributes duplicated across every chunk row.

After: Chunk table contains only chunk-specific data (id, content, embedding, entity reference). Entity-level attributes in a separate one-row-per-entity table. No columns for main_content/contextual_content fields unless explicitly opted in.

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

ezeedub created an issue. See original summary.

ezeedub’s picture

Status: Active » Needs review
ezeedub’s picture

Implementation overview

Schema normalization

The core change introduces a {collection}_entities table with one row per entity, holding drupal_entity_id (UNIQUE) plus any attribute columns. The chunks table retains only native fields: id, content, drupal_entity_id (FK to entities), drupal_long_id, server_id, index_id, and embedding. Relation tables for multi-value fields now reference entity_id on the entities table instead of chunk_id on chunks.

All queries (querySearch, vectorSearch) INNER JOIN the entities table. prepareFieldArrayForSql qualifies columns with the correct table using a CHUNK_NATIVE_FIELDS constant. Filter conditions in processConditionGroup route to the chunks or entities table as appropriate.

Column creation

hook_search_api_index_update only creates entity-table columns for fields with the attributes indexing option or fields explicitly opted in via "Store as attribute". Main content and contextual content fields no longer get columns — they feed embeddings only and were always NULL on the old schema.

Store as attribute UI

An optional "Store as attribute" setting is available on the postgres config page (/admin/config/ai/vdb_providers/postgres) under a collapsible section per index. It lists fields configured as Main Content or Contextual Content, letting site builders opt specific fields into direct queryability alongside vector search. Config is stored per-index at ai_vdb_provider_postgres.index.{index_id}. Columns are created immediately on save.

Migration (update_10001)

The update hook uses Drupal's batch API ($sandbox) with three phases per collection:

  1. Setup — creates the entities table, populates entity IDs, adds attribute columns
  2. Batch — copies attribute data from chunks to entities in batches of 5,000
  3. Cleanup — drops migrated columns and empty non-attribute columns from chunks, migrates relation tables from chunk_id to entity_id, adds foreign keys (in a transaction)

The migration checks ai_search.index.{index_id} config to determine which columns are actual attributes vs. empty main/contextual content columns that should just be dropped. Idempotent — skips collections where the entities table already exists.

Provider detection

All provider detection uses instanceof PostgresProvider instead of string-matching the 'postgres' plugin ID. This ensures the module works correctly with custom providers that extend PostgresProvider.

Code quality (rolled in)

  • SQL injection fix in prepareRelationQuery (field values now escaped)
  • Operator allowlist in processConditionGroup
  • Removed phantom third argument from recursive processConditionGroup call
  • Variable naming to snake_case per Drupal coding standards
  • Replaced {@inheritdoc} with proper docblocks
  • dropCollection now drops relation tables to prevent orphans

Feedback welcome

This is a first pass — happy to adjust the approach based on feedback. In particular, I'd welcome thoughts on the migration strategy, the "Store as attribute" UX placement, and whether the schema split feels right. Open to suggestions on anything else that could be improved.

m4olivei’s picture

Duplicate of #3547625: Only create columns for Filterable attributes fields, not sure which to move forward with perhaps the authors could collaborate on a single patch?

ezeedub’s picture

Thanks for flagging this — I hadn't seen #3547625 when I started. codebymikey identified the column-creation problem first and their MR !10 is a clean, targeted fix for that narrow slice.

This issue's MR goes further: it also normalises entity-level attributes into a separate {collection}_entities table (one row per entity instead of duplicated across every chunk), adds a "Store as attribute" opt-in for main/contextual content fields on the postgres config page, and includes a batch-API migration for existing installs. The narrow "don't create columns for non-attribute fields" fix is covered here as part of the larger change.

Happy to coordinate either way:

@codebymikey / @m4olivei, I'm interested in your thoughts on the schema split and migration approach here.