This patch implements the node link tracker changes in the search indexer, as discussed in #145560.
- Instead of keeping track of linked words in search_index, we put this relation in a separate table, search_node_links. Then, whenever a node is indexed, we fetch all the links pointing to it from the database, and add their captions as normal body content.
In order to ensure all links are indexed properly, we touch the changed timestamp of any node that is linked to, to force it to be reindexed. This correctly deals with nodes that point to each other, because a re-indexing is only forced if new links have been discovered or old ones have been removed. In extreme cases, this can (temporarily) cause the indexed percentage to decrease or stagnate. However, it will always converge to 100% eventually.
The end result is behaviour that is near-identical to what we had before. The main difference is that, as a simplification, we no longer look at any inline markup inside links, since it is so rarely used. This has virtually no effect on real-life HTML content.
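To make the flow above concrete, here is a minimal sketch of the link tracking logic in Python rather than in the patch's own code. The dictionaries `nodes` and `node_links`, the `node/N` link format, and the function names are all hypothetical stand-ins for the node table and search_node_links. It is an illustration under those assumptions, not the code from the patch.

```python
import re
import time

# Hypothetical in-memory stand-ins for the node table and search_node_links.
nodes = {}       # nid -> {"body": str, "changed": float}
node_links = {}  # source nid -> {target nid: caption}

# Assumed link format for this sketch: <a href="node/123">caption</a>
LINK_RE = re.compile(r'<a\s+href="node/(\d+)"[^>]*>(.*?)</a>', re.I | re.S)
TAG_RE = re.compile(r"<[^>]+>")


def extract_links(body):
    """Return {target nid: caption}, dropping inline markup inside the link."""
    return {int(nid): TAG_RE.sub("", caption).strip()
            for nid, caption in LINK_RE.findall(body)}


def index_node(nid, now=None):
    """Index one node and return the text handed to the word indexer."""
    now = time.time() if now is None else now
    new_links = extract_links(nodes[nid]["body"])
    old_links = node_links.get(nid, {})

    # Touch only targets whose link was newly discovered or removed, so two
    # nodes that point at each other do not keep re-triggering each other.
    for target in set(new_links) ^ set(old_links):
        if target in nodes:
            nodes[target]["changed"] = now
    node_links[nid] = new_links

    # Captions of links pointing at this node count as normal body content.
    incoming = [links[nid] for _, links in sorted(node_links.items())
                if nid in links]
    return " ".join([TAG_RE.sub("", nodes[nid]["body"])] + incoming)
```

Indexing a node a second time with an unchanged set of links touches nothing, which is why the indexed percentage can dip temporarily but still converges.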
- This change simplifies the search tables a lot (especially search_index) and allows us to use primary (unique) keys for more columns.
- Since I changed the table structure anyway, I also took the liberty of removing some redundant keys. Any multi-column database index (a, b, c) automatically serves as an index for (a, b) and (a) too. So, with a primary key of (sid, type, word), having a separate (sid, type) index is not necessary. I verified this with EXPLAIN on MySQL, and checked both the MySQL and PgSQL docs (which confirm this).
- I also realized the (word) index on search_index is sub-optimal, since the search query always searches by type, as well as keyword. So, the index (type, word) should be used instead.
- Finally, as an improvement, I kept the link caption in the source text under a strong score penalty, rather than removing it entirely. This means the source page will still show up in queries for the link caption, but the target node will show up higher. It was a one-line change I've long wanted to make.
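For anyone who wants to see the index reasoning from the bullets above in action, here is a small sketch using SQLite as a stand-in. The column types, the `score` column, and the index name `search_index_type_word` are assumptions made for the example. SQLite's planner is not MySQL's or PgSQL's, so this only illustrates the leftmost-prefix rule and does not replace running EXPLAIN on the real database.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the description lines of SQLite's query plan for one statement."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE search_index (
        sid   INTEGER NOT NULL,
        type  TEXT    NOT NULL,
        word  TEXT    NOT NULL,
        score REAL,
        PRIMARY KEY (sid, type, word)
    );
    -- Replaces a plain (word) index: searches filter on type and word together.
    CREATE INDEX search_index_type_word ON search_index (type, word);
""")

# Lookup by (sid, type): served by the leftmost prefix of the primary key,
# so no separate (sid, type) index is needed.
by_item = query_plan(
    conn, "SELECT score FROM search_index WHERE sid = 1 AND type = 'node'")

# Lookup by (type, word): served by the dedicated two-column index.
by_word = query_plan(
    conn,
    "SELECT sid, score FROM search_index WHERE type = 'node' AND word = 'drupal'")

print(by_item)
print(by_word)
```

The first plan should name the automatic primary key index and the second should name `search_index_type_word`. The exact wording of the plan lines varies between SQLite versions.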
If someone wants to benchmark this, feel free to, but it's rather obvious this will improve things: a lot of complexity is moved out of the search process and replaced with simple logic in the indexer. The tables are much smaller, and there are fewer and stricter indexes. This means that the first search pass will be more effective as well.
Because the database changes are so invasive, a re-indexing is needed.
Testers: the search settings page is currently broken due to the Form API 3 changes, so the re-index button does not work. For now, you can manually go to admin/settings/search/wipe to wipe the index.
Note: search_update_1() calls search_install(). This unorthodox pattern is justified because we're wiping the search index completely. Should a future schema change occur, we wouldn't ever want to deal with the intermediate schema version.