第25期：索引设计（索引的基数与可选择性）

2021-04-15发布 944浏览

这篇主要介绍 MySQL 索引的 Cardinality 值（基数）以及索引的可选择性。

索引基数值

索引基数的含义：

由索引中唯一值计算的一个预估值。索引基数值的准确程度直接影响到 MySQL 优化器基于此索引的查询计划是否准确高效。

与索引基数值最为密切的典型场景就是：一条 SQL 在某一时刻执行比较慢，其中较为可能的原因就是当前表记录更新频繁，这条 SQL 执行计划走的索引基数值没及时更新，优化器选择走备用索引或者走全表扫描，从而非最优执行计划，最终执行结果没有达到预期，总体查询时间较慢，这时可能得手工更新索引的基数值。

索引的可选择性：

索引的可选择性好与坏，和索引基数关系非常密切。基数值越高，索引的可选择性越好；相反，基数越低，索引的可选择性越差。优化器优先使用的索引一般选择性都不差，除非没得选，才会走选择性稍差点的索引或者走全表扫描。

影响索引基数值的相关指标：

表的 sample page 个数，也就是表样例数据页的个数，这个在之前表样例数据计算中详细讲过。
表的更新频率，一般来说，当 1/16 的数据页被更新过，就会自动更新索引基数。
索引列的数据分布程度，比如状态类，订单号，日期等。不同的数据分布，有不同的索引基数。
手动进行索引基数更新，比如 analyze table、show table status 等。

查看某个索引的基数值，有多种方式：

直接执行 show index from tablename。
查询数据字典表 information_schema.statstics。

举例

下面来举例说明索引基数在不同的数据分布场景下的变化以及对优化器的影响。

基础表结构如下：表 ytt_sample 有 7 个字段，5 个索引，其中主键的基数最大，可选择性最好，其他的索引要看数据的分布状况来定。

(localhost:mysqld.sock)|(ytt)>show create table ytt_sample\G
*************************** 1. row ***************************
       Table: ytt_sample
Create Table: CREATE TABLE `ytt_sample` (
  `id` int NOT NULL AUTO_INCREMENT,
  `r1` int DEFAULT NULL,
  `r2` int DEFAULT NULL,
  `r3` int DEFAULT NULL,
  `r4` int DEFAULT NULL,
  `r5` tinyint DEFAULT NULL,
  `r6` date DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_u1` (`r1`,`r2`,`r3`),
  KEY `idx_r4` (`r4`),
  KEY `idx_r5` (`r5`),
  KEY `idx_r6` (`r6`)
) ENGINE=InnoDB  mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

这张表有 100 行记录。


(localhost:mysqld.sock)|(ytt)>select count(*) from ytt_sample;
+----------+
| count(*) |
+----------+
|      100 |
+----------+
1 row in set (0.03 sec)

检索数据字典，查看当前表索引基数排名：

(localhost:mysqld.sock)|(ytt)>select table_schema,table_name,index_name,cardinality from information_schema.statistics where table_schema='ytt' and table_name = 'ytt_sample' order by cardinality desc;
+--------------+------------+------------+-------------+
| TABLE_SCHEMA | TABLE_NAME | INDEX_NAME | CARDINALITY |
+--------------+------------+------------+-------------+
| ytt          | ytt_sample | PRIMARY    |         100 |
| ytt          | ytt_sample | idx_u1     |          97 |
| ytt          | ytt_sample | idx_u1     |          91 |
| ytt          | ytt_sample | idx_r6     |          85 |
| ytt          | ytt_sample | idx_u1     |          29 |
| ytt          | ytt_sample | idx_r4     |          20 |
| ytt          | ytt_sample | idx_r5     |          10 |
+--------------+------------+------------+-------------+
7 rows in set (0.02 sec)

从以上结果可以看出，主键基数最高，极限接近于表记录数；联合索引 idx_u1 次之；索引 idx_r6 值为 85，也不差；比较差的为 idx_r4、idx_r5，分别为 20、10，其中 idx_r5 最差，仅仅为表记录数的 1/10。索引 idx_r5 类似于我们常说的状态类索引，由于所以基数很低，优化器一般不选择这个索引，一般不需要加，加了反而影响表的写性能。

联合索引 idx_u1 的基数是按照多个键值依次组合计算，分别为(r1),(r1,r2),(r1,r2,r3)，以下为计算方式：


(localhost:mysqld.sock)|(ytt)>select count(distinct r1) idx_u1 from ytt_sample
    -> union all
    -> select count(distinct r1,r2) idx_u1 from ytt_sample
    -> union all
    -> select count(distinct r1,r2,r3) idx_u1 from ytt_sample;
+--------+
| idx_u1 |
+--------+
|     29 |
|     91 |
|     97 |
+--------+
3 rows in set (0.00 sec)

可以看出，联合索引的基数是 97,前两个字段的基数分别为 29 和 91。

接下来，用索引 idx_r4、idx_r5 举个例子，以下 SQL 1：

# SQL 1
select * from ytt_sample where  r4 = 2 and r5 = 2;

以上 SQL 1 如果运行非常频繁，那就应该删掉两个单值索引 idx_r4、idx_r5，建立一个包含这两个字段的联合索引，不过在建立联合索引时，字段的顺序非常重要。

单独查询 r4=2 的结果，有 4 条记录。

(localhost:mysqld.sock)|(ytt)>select * from ytt_sample where  r4 = 2 ;
+-----+------+------+------+------+------+------------+
| id  | r1   | r2   | r3   | r4   | r5   | r6         |
+-----+------+------+------+------+------+------------+
|   1 |    8 |   25 |   10 |    2 |    2 | 2021-03-12 |
|  82 |    8 |   18 |    4 |    2 |    2 | 2021-04-30 |
|  85 |    9 |    7 |    9 |    2 |    9 | 2021-03-12 |
| 135 |   22 |    6 |   23 |    2 |    4 | 2020-12-20 |
+-----+------+------+------+------+------+------------+
4 rows in set (0.00 sec)

单独查询 r5=2 的结果，有 11 条记录。

(localhost:mysqld.sock)|(ytt)>select * from ytt_sample where  r5 = 2 ;
+-----+------+------+------+------+------+------------+
| id  | r1   | r2   | r3   | r4   | r5   | r6         |
+-----+------+------+------+------+------+------------+
|   1 |    8 |   25 |   10 |    2 |    2 | 2021-03-12 |
|  19 |   17 |   21 |   22 |    9 |    2 | 2020-10-11 |
|  36 |   26 |   27 |   28 |   11 |    2 | 2020-10-29 |
|  37 |   28 |   25 |   10 |   16 |    2 | 2020-10-21 |
|  41 |   23 |   16 |   12 |   11 |    2 | 2021-01-29 |
|  82 |    8 |   18 |    4 |    2 |    2 | 2021-04-30 |
| 131 |   26 |    9 |   27 |    9 |    2 | 2021-03-18 |
| 136 |    6 |   22 |    2 |    3 |    2 | 2020-08-20 |
| 143 |    8 |   25 |   10 |   19 |    2 | 2020-09-01 |
| 147 |   18 |   18 |    5 |   20 |    2 | 2020-10-10 |
| 156 |    8 |   13 |    8 |   17 |    2 | 2021-06-13 |
+-----+------+------+------+------+------+------------+
11 rows in set (0.00 sec)

所以这时顺序已经敲定了，数据最少的在前面，也就是字段 r4 在前面，联合索引应该这样：idx_u2(r4,r5)

此时最优的查询计划如下：

(localhost:mysqld.sock)|(ytt)>explain select * from ytt_sample where  r4 = 2 and r5 = 2\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ytt_sample
   partitions: NULL
         type: ref
possible_keys: idx_u2
          key: idx_u2
      key_len: 7
          ref: const,const
         rows: 2
     filtered: 100.00
        Extra: NULL
1 row in set, 1 warning (0.00 sec)

那是不是只要建立联合索引 idx_u2，就一劳永逸了呢？也不一定，如果后期字段 r4,r5 的数据分布有变化，可能索引 idx_u2 的顺序也得改。所以说索引的建立和数据的分布关系非常大，也就是和索引的基数关系非常大。

那再看两条基于字段 r6 的 SQL 语句：

# SQL 2
select * from ytt_sample where r6 between '2020-01-25' and '2021-03-12'

# SQL 3
select count(*) from ytt_sample where r6 between '2020-01-25' and '2021-03-12'

上面 SQL 2、SQL 3 两条 SQL 的过滤条件一样，都是一个范围。不同的是 SQL 2 打印符合过滤条件的记录，而 SQL 3 是打印符号过滤条件的记录条数。单从过滤条件来看，这两条 SQL 的执行计划应该一样，那分别看下两条 SQL 的执行计划：

# EXPLAIN SQL 2
(localhost:mysqld.sock)|(ytt)>explain select * from ytt_sample where r6 between '2020-01-25' and '2021-03-12'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ytt_sample
   partitions: NULL
         type: ALL
possible_keys: idx_r6
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 100
     filtered: 73.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

# EXPLAIN SQL 3
(localhost:mysqld.sock)|(ytt)>explain select count(*) from ytt_sample where r6 between '2020-01-25' and '2021-03-12'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ytt_sample
   partitions: NULL
         type: range
possible_keys: idx_r6
          key: idx_r6
      key_len: 4
          ref: NULL
         rows: 73
     filtered: 100.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)

从执行计划看，SQL 2 没走索引，走全表扫描；而 SQL 3 直接走索引取回记录数，避免了访问表数据。那为什么两条 SQL 的过滤条件一样，执行计划却不一样呢？

SQL 2 顺序全表扫描表数据的速度要比走索引再随机扫描表数据快很多，因为要打印的记录数有 73 条，快接近表记录数 100 条了。索引 idx_r6 的基数其实很高了，但是由于最终返回的记录数太多，MySQL 只能放弃走索引；而 SQL 3 由于只求符合过滤条件的记录数，直接从索引入口就可以计算出来结果。

那如果把 SQL 2 的过滤条件收缩下，再次查看执行计划：

(localhost:mysqld.sock)|(ytt)>explain select * from ytt_sample where r6 between '2020-12-25' and '2021-03-12'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ytt_sample
   partitions: NULL
         type: range
possible_keys: idx_r6
          key: idx_r6
      key_len: 4
          ref: NULL
         rows: 15
     filtered: 100.00
        Extra: Using index condition
1 row in set, 1 warning (0.00 sec)

从执行计划看，直接走了索引。

那之前的 SQL 2 过滤条件要是不变化，能否会用到索引呢？这个就与索引的基数值以及索引基于一定过滤条件的选择性好坏有很大的关系。比如由于某些业务变化，表 ytt_sample 字段 r6 的数据分布发生了变化（表行数变大，字段 r6 对应范围的数据收窄），数据变化过程此处省略。

那还是同样的过滤条件：where r6 between '2020-01-25' and '2021-03-12'。

再次看看 SQL 2 的执行计划，很完美的走了索引：

(localhost:mysqld.sock)|(ytt)>explain  select * from ytt_sample where r6 between '2020-01-25' and '2021-03-12'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ytt_sample
   partitions: NULL
         type: range
possible_keys: idx_r6
          key: idx_r6
      key_len: 4
          ref: NULL
         rows: 73
     filtered: 100.00
        Extra: Using index condition
1 row in set, 1 warning (0.00 sec)

(localhost:mysqld.sock)|(ytt)>

由于数据分布发生变化，索引基数值也发生了改变，基于同样过滤条件的索引选择性也从差变好，所以 SQL 2 直接走索引 idx_r6 返回数据。

再次查看此时的索引基数值：

(localhost:mysqld.sock)|(ytt)>select table_schema,table_name,index_name,cardinality from information_schema.statistics where table_schema='ytt' and table_name = 'ytt_sample' order by cardinality desc;
+--------------+------------+------------+-------------+
| TABLE_SCHEMA | TABLE_NAME | INDEX_NAME | CARDINALITY |
+--------------+------------+------------+-------------+
| ytt          | ytt_sample | PRIMARY    |         808 |
| ytt          | ytt_sample | idx_r6     |         503 |
| ytt          | ytt_sample | idx_u1     |          98 |
| ytt          | ytt_sample | idx_u1     |          92 |
| ytt          | ytt_sample | idx_u2     |          79 |
| ytt          | ytt_sample | idx_r1     |          29 |
| ytt          | ytt_sample | idx_u1     |          29 |
| ytt          | ytt_sample | idx_u2     |          20 |
+--------------+------------+------------+-------------+
8 rows in set (0.00 sec)

索引 idx_r6 的基数由之前的 85 提升到 500 多。所以即使同样的字段，同样的过滤条件，不同的索引基数值以及基于索引基数值的索引选择性高低的不同，也会让优化器选择不同的执行计划。

关键字：MySQL数据复制、MySQL数据恢复、MySQL数据监控

故障分析 | 奇怪！内存明明够用，MySQL 却出现了 OOM

数据传输 | DTLE 3.21.03.0 来啦！

2024-04-22 发布 634 浏览

爱可生携手OceanBase共建生态繁荣，2024开发者大会分享《ActionDB生态体系建设与实践经验》

2024-03-28 发布 782 浏览

36氪发布AIGC行业报告，爱可生荣登向量数据库领域代表企业榜单

2024-02-02 发布 1,039 浏览

爱可生荣获证券基金行业信息技术应用创新联盟年度优秀成员奖

2024-02-01 发布 1,330 浏览

爱可生获得OceanBase特别认可“客户第一”，2023年度重点项目交付合作再创佳绩

2024-01-31 发布 3,287 浏览

信通院云大所大数据和区块链部向爱可生致谢，共铸业内首个《向量数据库技术要求》