Problem/Motivation
The postgres VDB provider creates a DB column in the vector table for every field in the Search API index, regardless of its indexing role. Fields configured as main_content or contextual_content send their data to the embedding — the corresponding columns are never populated. This creates unnecessary schema bloat that grows with every field added to the index.
Additionally, fields configured as attributes are stored directly on each chunk row. Since a single entity can produce multiple chunks, these values are duplicated identically across every chunk for the same entity.
Proposed resolution
1. Don't create columns for non-attribute fields by default. Fields with role main_content or contextual_content should not produce a DB column, since their data goes to the embedding vector, not the column. This should be overridable with a per-field opt-in (e.g., "also store as column") for cases where a field needs to be both embedded and stored for direct querying — for example, storing title as a column to support keyword LIKE matching alongside vector similarity search.
2. Normalize entity-level attributes into a separate table. Move attribute fields to a one-row-per-entity table joined by drupal_entity_id. The chunk table would retain only chunk-specific data: id, content, embedding, drupal_entity_id, drupal_long_id, server_id, index_id. This eliminates per-chunk duplication of entity metadata and simplifies metadata-only updates (one row to update instead of N).
3. Include an update hook that migrates existing data into the new schema without regenerating embeddings. The embeddings are already correct — requiring a full re-index just to reorganize metadata would trigger unnecessary embedding API calls for every indexed entity.
Remaining tasks
Working on a patch.
Data model changes
Before: Chunk table has columns for every indexed field, most never populated. Entity-level attributes duplicated across every chunk row.
After: Chunk table contains only chunk-specific data (id, content, embedding, entity reference). Entity-level attributes in a separate one-row-per-entity table. No columns for main_content/contextual_content fields unless explicitly opted in.
Issue fork ai_vdb_provider_postgres-3576852
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #3
ezeedub commentedComment #4
ezeedub commentedImplementation overview
Schema normalization
The core change introduces a
{collection}_entitiestable with one row per entity, holdingdrupal_entity_id(UNIQUE) plus any attribute columns. The chunks table retains only native fields:id,content,drupal_entity_id(FK to entities),drupal_long_id,server_id,index_id, andembedding. Relation tables for multi-value fields now referenceentity_idon the entities table instead ofchunk_idon chunks.All queries (
querySearch,vectorSearch) INNER JOIN the entities table.prepareFieldArrayForSqlqualifies columns with the correct table using aCHUNK_NATIVE_FIELDSconstant. Filter conditions inprocessConditionGrouproute to the chunks or entities table as appropriate.Column creation
hook_search_api_index_updateonly creates entity-table columns for fields with theattributesindexing option or fields explicitly opted in via "Store as attribute". Main content and contextual content fields no longer get columns — they feed embeddings only and were always NULL on the old schema.Store as attribute UI
An optional "Store as attribute" setting is available on the postgres config page (
/admin/config/ai/vdb_providers/postgres) under a collapsible section per index. It lists fields configured as Main Content or Contextual Content, letting site builders opt specific fields into direct queryability alongside vector search. Config is stored per-index atai_vdb_provider_postgres.index.{index_id}. Columns are created immediately on save.Migration (update_10001)
The update hook uses Drupal's batch API (
$sandbox) with three phases per collection:chunk_idtoentity_id, adds foreign keys (in a transaction)The migration checks
ai_search.index.{index_id}config to determine which columns are actual attributes vs. empty main/contextual content columns that should just be dropped. Idempotent — skips collections where the entities table already exists.Provider detection
All provider detection uses
instanceof PostgresProviderinstead of string-matching the'postgres'plugin ID. This ensures the module works correctly with custom providers that extendPostgresProvider.Code quality (rolled in)
prepareRelationQuery(field values now escaped)processConditionGroupprocessConditionGroupcall{@inheritdoc}with proper docblocksdropCollectionnow drops relation tables to prevent orphansFeedback welcome
This is a first pass — happy to adjust the approach based on feedback. In particular, I'd welcome thoughts on the migration strategy, the "Store as attribute" UX placement, and whether the schema split feels right. Open to suggestions on anything else that could be improved.
Comment #5
m4oliveiDuplicate of #3547625: Only create columns for Filterable attributes fields, not sure which to move forward with perhaps the authors could collaborate on a single patch?
Comment #6
ezeedub commentedThanks for flagging this — I hadn't seen #3547625 when I started. codebymikey identified the column-creation problem first and their MR !10 is a clean, targeted fix for that narrow slice.
This issue's MR goes further: it also normalises entity-level attributes into a separate
{collection}_entitiestable (one row per entity instead of duplicated across every chunk), adds a "Store as attribute" opt-in for main/contextual content fields on the postgres config page, and includes a batch-API migration for existing installs. The narrow "don't create columns for non-attribute fields" fix is covered here as part of the larger change.Happy to coordinate either way:
@codebymikey / @m4olivei, I'm interested in your thoughts on the schema split and migration approach here.