Consider removing indexes (rtid) on {user_relationships} with low cardinality [#1187608]

Currently, there are the following indexes on the {user_relationships} table.

mysql> SHOW INDEXES FROM user_relationships;
*************************** 1. row ***************************
       Table: user_relationships
  Non_unique: 0
    Key_name: PRIMARY
Seq_in_index: 1
 Column_name: requester_id
   Collation: A
 Cardinality: 13768
    Sub_part: NULL
      Packed: NULL
        Null: 
  Index_type: BTREE
     Comment: 
*************************** 2. row ***************************
       Table: user_relationships
  Non_unique: 0
    Key_name: PRIMARY
Seq_in_index: 2
 Column_name: requestee_id
   Collation: A
 Cardinality: 234072
    Sub_part: NULL
      Packed: NULL
        Null: 
  Index_type: BTREE
     Comment: 
*************************** 3. row ***************************
       Table: user_relationships
  Non_unique: 0
    Key_name: PRIMARY
Seq_in_index: 3
 Column_name: rtid
   Collation: A
 Cardinality: 234072
    Sub_part: NULL
      Packed: NULL
        Null: 
  Index_type: BTREE
     Comment: 
*************************** 4. row ***************************
       Table: user_relationships
  Non_unique: 1
    Key_name: requester_id
Seq_in_index: 1
 Column_name: requester_id
   Collation: A
 Cardinality: 10177
    Sub_part: NULL
      Packed: NULL
        Null: 
  Index_type: BTREE
     Comment: 
*************************** 5. row ***************************
       Table: user_relationships
  Non_unique: 1
    Key_name: requestee_id
Seq_in_index: 1
 Column_name: requestee_id
   Collation: A
 Cardinality: 6502
    Sub_part: NULL
      Packed: NULL
        Null: 
  Index_type: BTREE
     Comment: 
*************************** 6. row ***************************
       Table: user_relationships
  Non_unique: 1
    Key_name: rtid
Seq_in_index: 1
 Column_name: rtid
   Collation: A
 Cardinality: 11
    Sub_part: NULL
      Packed: NULL
        Null: 
  Index_type: BTREE
     Comment: 
*************************** 7. row ***************************
       Table: user_relationships
  Non_unique: 1
    Key_name: rid
Seq_in_index: 1
 Column_name: rid
   Collation: A
 Cardinality: 234072
    Sub_part: NULL
      Packed: NULL
        Null: 
  Index_type: BTREE
     Comment: 
7 rows in set (0.01 sec)

As can be seen from these indexes, rtid has a very low cardinality. This means that any indexes (PRIMARY and rtid) can cause quite a performance hit. I think it's a fair assumption that most sites will have very few contact types compared to the number of relationships, but I'd love to hear about any cases to the contrary. On this site in particular, there is only one relationship type (it's using InnoDB so cardinality is an estimate), so removing the indexes on rtid provides a very significant performance improvement.

The following indexes are working pretty well for me in testing:

mysql> SHOW INDEXES FROM user_relationships;
*************************** 1. row ***************************
        Table: user_relationships
   Non_unique: 1
     Key_name: requester_id
 Seq_in_index: 1
  Column_name: requester_id
    Collation: A
  Cardinality: 14655
     Sub_part: NULL
       Packed: NULL
         Null: 
   Index_type: BTREE
      Comment: 
Index_comment: 
*************************** 2. row ***************************
        Table: user_relationships
   Non_unique: 1
     Key_name: requestee_id
 Seq_in_index: 1
  Column_name: requestee_id
    Collation: A
  Cardinality: 12212
     Sub_part: NULL
       Packed: NULL
         Null: 
   Index_type: BTREE
      Comment: 
Index_comment: 
*************************** 3. row ***************************
        Table: user_relationships
   Non_unique: 1
     Key_name: rid
 Seq_in_index: 1
  Column_name: rid
    Collation: A
  Cardinality: 219825
     Sub_part: NULL
       Packed: NULL
         Null: 
   Index_type: BTREE
      Comment: 
Index_comment: 
*************************** 4. row ***************************
        Table: user_relationships
   Non_unique: 1
     Key_name: relationship
 Seq_in_index: 1
  Column_name: requester_id
    Collation: A
  Cardinality: 10991
     Sub_part: NULL
       Packed: NULL
         Null: 
   Index_type: BTREE
      Comment: 
Index_comment: 
*************************** 5. row ***************************
        Table: user_relationships
   Non_unique: 1
     Key_name: relationship
 Seq_in_index: 2
  Column_name: requestee_id
    Collation: A
  Cardinality: 219825
     Sub_part: NULL
       Packed: NULL
         Null: 
   Index_type: BTREE
      Comment: 
Index_comment:

Removing the rtid index is easy. Unfortunately, removing the PRIMARY key would expose the table to duplicate rows. I'd really like to have a way to enforce the uniqueness of (requester_id, requestee_id, rtid) without the performance impact of the index, but I'm not sure how to do that.

As a rough benchmark, these index changes brought the average time loading /relationships/1 from 4.8s to 1.8s.

Comment	File	Size	Author
#2	1187608.2-drop_evil_indexes-6.x.patch	1.8 KB	deviantintegral
#2	1187608.2-drop_evil_indexes-7.x.patch	1.62 KB	deviantintegral

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Comment #1

mrf CreditAttribution: mrf commented 14 June 2011 at 00:21

I'd be shocked if anyone comes forward with more the 10 relationship types on a site, so I think you're right to make the assumption about cardinality.

Just a heads up that 6.x is mainly in maintenance mode and big changes are likely to go un-commited, but I'd love to see an exception for the db performance improvements you have been suggesting. I'm assuming similar gains could also apply to 7.x.

Comment #2

deviantintegral CreditAttribution: deviantintegral commented 15 June 2011 at 16:02

Version:	6.x-1.x-dev	» 7.x-1.x-dev
Status:	Active	» Needs review

File	Size
1187608.2-drop_evil_indexes-7.x.patch	1.62 KB
1187608.2-drop_evil_indexes-6.x.patch	1.8 KB

Here's a pair of patches.

Comment #3

Berdir

German

Switzerland

CreditAttribution: Berdir commented 18 June 2011 at 14:30

I'm not convinced that removing the primary key is an improvement.

Some EXPLAIN's would be nice, on a site with your amount of data (I currently only have a few test sites). Should be easy by clicking around in the relationship listings and using devel.module's query log.

Comment #4

deviantintegral CreditAttribution: deviantintegral commented 20 June 2011 at 13:58

Here's a query:

mysql> SET profiling = 1;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT COUNT(DISTINCT rid) AS count FROM user_relationships ur INNER JOIN user_relationship_types urt USING ( rtid ) WHERE (ur.requester_id = 1 OR ((ur.approved <> 1 OR ur.rtid IN (1)) AND ur.requestee_id = 1)) AND ur.approved = 1 AND ur.rtid = 1;

*************************** 1. row ***************************
count: 0
1 row in set (0.99 sec)

mysql> SHOW profiles;
*************************** 1. row ***************************
Query_ID: 1
Duration: 0.98712800
   Query: SELECT COUNT(DISTINCT rid) AS count FROM user_relationships ur INNER JOIN user_relationship_types urt USING ( rtid ) WHERE (ur.requester_id = 1 OR ((ur.approved <> 1 OR ur.rtid IN (1)) AND ur.requestee_id = 1)) AND ur.approved = 1 AND ur.rtid = 1
1 row in set (0.00 sec)

mysql> SET profiling = 0;
Query OK, 0 rows affected (0.00 sec)

Here's an EXPLAIN:

mysql> EXPLAIN SELECT COUNT(DISTINCT rid) AS count FROM user_relationships ur INNER JOIN user_relationship_types urt USING ( rtid ) WHERE (ur.requester_id = 1 OR ((ur.approved <> 1 OR ur.rtid IN (1)) AND ur.requestee_id = 1)) AND ur.approved = 1 AND ur.rtid = 1;
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: urt
         type: const
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: ur
         type: ref
possible_keys: PRIMARY,rtid,requester_id,requestee_id
          key: rtid
      key_len: 4
          ref: const
         rows: 109924
        Extra: Using where

The following queries show the difference in performance. The {user_relationships} table has around 220K rows.

mysql> ALTER TABLE user_relationships DROP INDEX rtid;
Query OK, 0 rows affected (0.09 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> SELECT COUNT(DISTINCT rid) AS count FROM user_relationships ur INNER JOIN user_relationship_types urt USING ( rtid ) WHERE (ur.requester_id = 1 OR ((ur.approved <> 1 OR ur.rtid IN (1)) AND ur.requestee_id = 1)) AND ur.approved = 1 AND ur.rtid = 1;
*************************** 1. row ***************************
count: 0
1 row in set (0.48 sec)

mysql> SHOW profiles;
*************************** 1. row ***************************
Query_ID: 1
Duration: 0.49973000
   Query: SELECT COUNT(DISTINCT rid) AS count FROM user_relationships ur INNER JOIN user_relationship_types urt USING ( rtid ) WHERE (ur.requester_id = 1 OR ((ur.approved <> 1 OR ur.rtid IN (1)) AND ur.requestee_id = 1)) AND ur.approved = 1 AND ur.rtid = 1
1 row in set (0.01 sec)

mysql> ALTER TABLE user_relationships DROP PRIMARY KEY;
Query OK, 219406 rows affected (15.23 sec)
Records: 219406  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE user_relationships ADD INDEX (requester_id, requestee_id);
Query OK, 0 rows affected (2.55 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> SELECT COUNT(DISTINCT rid) AS count FROM user_relationships ur INNER JOIN user_relationship_types urt USING ( rtid ) WHERE (ur.requester_id = 1 OR ((ur.approved <> 1 OR ur.rtid IN (1)) AND ur.requestee_id = 1)) AND ur.approved = 1 AND ur.rtid = 1;
*************************** 1. row ***************************
count: 0
1 row in set (0.01 sec)

mysql> SHOW PROFILES;
*************************** 1. row ***************************
Query_ID: 1
Duration: 0.00850100
   Query: SELECT COUNT(DISTINCT rid) AS count FROM user_relationships ur INNER JOIN user_relationship_types urt USING ( rtid ) WHERE (ur.requester_id = 1 OR ((ur.approved <> 1 OR ur.rtid IN (1)) AND ur.requestee_id = 1)) AND ur.approved = 1 AND ur.rtid = 1
1 row in set (0.00 sec)

Comment #5

Berdir

German

Switzerland

CreditAttribution: Berdir commented 7 July 2011 at 08:58

Would it be possible that you can send me the content of user_relationships table? I'd like to do my own tests with a database of your size. I don't think it contains any personal information...

Comment #6

giorgio79 CreditAttribution: giorgio79 commented 13 July 2011 at 20:01

http://www.lullabot.com/articles/slow-queries-check-cardinality-your-mys...

Comment #7

anavarre

French

🇪🇺

CreditAttribution: anavarre commented 14 July 2011 at 08:37

Comment #8

mdupont

French

Switzerland

CreditAttribution: mdupont commented 24 August 2011 at 08:10

If the primary key is removed, won't it allow duplicate entries? How about at least having a unique constraint on the requester_id, requestee_id, rtid tuple?

Comment #9

rjbrown99 CreditAttribution: rjbrown99 commented 22 September 2011 at 05:13

Subscribe, happy to test on a reasonably large 6.x site - few relationship types (two) but lots of users and established relationships.

Comment #10

deviantintegral CreditAttribution: deviantintegral commented 27 September 2011 at 14:40

@mdupont: Unfortuantely, adding the unique constraint creates an index, that the MySQL optimizer then uses. The only way around that is to use an index hint, which is MySQL only. I suppose we could check for the database type, and then add it, but I'm hesitant to add DB-specific code.

http://dev.mysql.com/doc/refman/5.1/en/index-hints.html

Comment #11

Berdir

German

Switzerland

CreditAttribution: Berdir commented 19 October 2011 at 23:27

Version:

7.x-1.x-dev

» 6.x-1.x-dev

Ok, finally found the time to play around with your data set!

This looks good. In fact, it looks like the performance actually gets *worse* when you have multiple types, I've tested it with 2 and 4 types and performance just got worse and worse.

Also important, removing the primary key makes updates/inserts *much* faster. To test multiple types, I've used the following query to distribute the rtid's randomly across the data set: "update user_relationships set rtid = (requestee_id % 4 + 1)". This took almost 2 minutes with the primary key and 2 seconds without ;)

Sure, this is not a relevant query in any way, but it still gives an idea of the performance an INSERT would have...

Avoiding duplicates needs to be handled anyway in code so the only thing we loose is the absolute proof that there are none.

Updated the patch (in the meantime, the module name changed, unrelated column names changed and we already added a 7003 update function... o.0). I did manually run the tests and verified the upgrade path.

Back to 6.x-1.x!