What are the disadvantages of the MySQL clustered index?

A clustered index is not a separate index type but a way of storing data. The details depend on the implementation, but InnoDB's clustered index actually stores the B-tree index and the data rows in the same structure.

When a table has a clustered index, its data rows are actually stored in the leaf pages of the index. "Clustered" means that data rows with adjacent key values are stored compactly together. Because the rows cannot be stored in two different places at once, a table can have only one clustered index. Because the storage engine is responsible for implementing indexes, not all storage engines support clustered indexes. The following mainly covers InnoDB, but the principles apply to any engine that supports clustered indexes.

The leaf pages contain all of a row's data, but the node (non-leaf) pages contain only the indexed columns, because the values in the node pages are taken from the index columns.

InnoDB clusters data by primary key. If no primary key is defined, InnoDB chooses the first unique index whose columns are all NOT NULL instead. If there is no such index, InnoDB implicitly defines a hidden 6-byte row ID and uses it as the clustered index. InnoDB clusters records only within the same page; pages containing adjacent key values may be far apart.
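The selection order described above can be sketched in a few lines of Python. This is only an illustration of the rule, not InnoDB's code: the `IndexDef` class, the function name, and the returned label for the hidden row ID are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IndexDef:
    """Hypothetical description of one index on a table."""
    name: str
    unique: bool
    nullable: bool  # True if any indexed column allows NULL


def choose_clustered_index(primary_key: Optional[str],
                           indexes: List[IndexDef]) -> str:
    """Return the name of the index the rows would be clustered on."""
    # 1. An explicit primary key always wins.
    if primary_key is not None:
        return primary_key
    # 2. Otherwise take the first unique index with no nullable columns,
    #    scanning in definition order.
    for idx in indexes:
        if idx.unique and not idx.nullable:
            return idx.name
    # 3. Otherwise fall back to the implicit row ID (label is illustrative).
    return "hidden 6-byte row ID"
```

For example, a table with no primary key, a nullable unique index, and a NOT NULL unique index would be clustered on the NOT NULL unique index.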

Note: A clustered primary key can help performance, but it can also cause serious performance problems, so think carefully about clustering, especially when converting a table's storage engine from InnoDB to another engine.

Clustering data has some important advantages:

A: Related data can be kept together. For example, when implementing a mailbox, you can cluster by user ID, so that all of a user's messages can be retrieved by reading only a few pages from disk. Without clustering, each message may require its own disk I/O.

B: Data access is faster. A clustered index holds both the index and the data in the same B-tree, so retrieving rows from a clustered index is usually faster than a comparable lookup in a nonclustered index.

C: Queries that use covering indexes can use the primary key values contained in the leaf nodes.

Disadvantages of clustered indexes:

A: Clustering gives the largest improvement for I/O-bound workloads. If the data fits entirely in memory, the order in which it is accessed matters much less, so the clustered index offers little advantage.

B: Insert speed depends heavily on insertion order. Inserting rows in primary key order is the fastest way to load data into an InnoDB table. If the data was not loaded in primary key order, it is best to reorganize the table with OPTIMIZE TABLE after the load completes.

C: Updating the clustered index columns is expensive, because it forces InnoDB to move each updated row to a new location.

D: Tables built on a clustered index are subject to page splits when new rows are inserted, or when a row's primary key is updated and the row must be moved. A page split happens when a row's key value dictates that the row must be placed into a page that is already full: the storage engine splits that page into two to accommodate the row. Page splits can cause a table to use more space on disk.

E: Clustered indexes can make full table scans slower, especially if rows are sparsely packed or stored non-sequentially because of page splits.

F: Secondary indexes can be larger than you might expect, because their leaf nodes contain the primary key columns of the rows they reference.

G: Secondary index access requires two index lookups instead of one

This is because a secondary index leaf node stores the row's primary key value rather than a pointer to the row's physical location. To find a row through a secondary index, the storage engine first finds the leaf node in the secondary index to obtain the primary key value, and then uses that value to find the row in the clustered index. This is double work: two B-tree lookups instead of one. In InnoDB, the adaptive hash index can reduce this repeated work.
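The two-step lookup can be illustrated with a toy model in Python. This is a sketch under simplifying assumptions, not how InnoDB is implemented: the `SortedIndex` class stands in for a B-tree using a sorted list with binary search, and the table, column names, and helper functions are hypothetical.

```python
import bisect


class SortedIndex:
    """Toy stand-in for a B-tree: sorted keys searched by bisection."""

    def __init__(self):
        self._keys = []
        self._values = []
        self.lookups = 0  # number of searches performed on this index

    def put(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)

    def get(self, key):
        self.lookups += 1
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None


clustered = SortedIndex()  # primary key -> full row (the table itself)
by_email = SortedIndex()   # secondary index: email -> primary key value


def insert_row(user_id, email, name):
    clustered.put(user_id, {"id": user_id, "email": email, "name": name})
    by_email.put(email, user_id)  # stores the primary key, not a row pointer


def find_by_email(email):
    pk = by_email.get(email)      # lookup 1: secondary index
    if pk is None:
        return None
    return clustered.get(pk)      # lookup 2: clustered index


insert_row(1, "a@example.com", "Ann")
insert_row(2, "b@example.com", "Bob")
row = find_by_email("b@example.com")
```

After the call to `find_by_email`, each index has been searched once, which is the double lookup described above. A lookup by primary key would touch only `clustered`.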

Comparison of the physical data layout of InnoDB and MyISAM:

MyISAM:

Rows are stored on disk in the order in which they were inserted. In MyISAM, the primary key index and the secondary indexes are structurally identical. The primary key index is simply a unique, non-nullable index named PRIMARY.

InnoDB:

Because InnoDB supports clustered indexes, it stores the same data very differently. The InnoDB clustered index contains the data of the entire table, not just the index. In InnoDB the clustered index is the table, so it does not need separate row storage the way MyISAM does. Each leaf node of the clustered index contains the primary key value, the transaction ID and rollback pointer used for transactions and MVCC, and the values of all remaining columns. If the primary key is defined on a column prefix, InnoDB stores the full column value along with the remaining columns.

Another difference from MyISAM is that InnoDB's secondary indexes are very different from its clustered index. The leaf nodes of an InnoDB secondary index store the primary key value instead of a row pointer, and use that value as the "pointer" to the row. This strategy reduces the work needed to maintain secondary indexes when rows move or when a data page splits. Using the primary key value as a pointer makes the secondary index larger, but in exchange InnoDB does not need to update that pointer in the secondary index when it moves a row.
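As a rough illustration of the space cost described above, the following Python sketch estimates how many bytes of primary key values end up in secondary indexes for different key types. It is back-of-the-envelope arithmetic only: it ignores per-record and per-page overhead, assumes a UUID stored as 36 single-byte characters or as 16 raw bytes, and the function name and row counts are hypothetical.

```python
import uuid

BIGINT_BYTES = 8                                 # size of a MySQL BIGINT
UUID_TEXT_BYTES = len(str(uuid.uuid4()))         # textual form, 1 byte per char
UUID_BINARY_BYTES = len(uuid.uuid4().bytes)      # raw 128-bit form


def pk_bytes_in_secondary_indexes(rows, secondary_indexes, pk_bytes):
    """Bytes spent on primary key copies across all secondary index entries."""
    # Every secondary index entry carries one copy of the primary key.
    return rows * secondary_indexes * pk_bytes


rows, n_indexes = 1_000_000, 3  # hypothetical table
bigint_cost = pk_bytes_in_secondary_indexes(rows, n_indexes, BIGINT_BYTES)
uuid_text_cost = pk_bytes_in_secondary_indexes(rows, n_indexes, UUID_TEXT_BYTES)
uuid_bin_cost = pk_bytes_in_secondary_indexes(rows, n_indexes, UUID_BINARY_BYTES)
```

Under these assumptions the textual UUID key costs 4.5 times as much space in the secondary indexes as a BIGINT key, before counting any fragmentation.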

Inserting rows in primary key order in InnoDB tables:

If you are using InnoDB and do not need any particular clustering, you can define a surrogate key as the primary key, that is, a key whose value is unrelated to the application's data. The simplest way is to use an AUTO_INCREMENT column, which ensures that rows are inserted in sequential order and gives better performance for joins that use the primary key.

Do not use a UUID as the clustered index key, or performance will suffer badly: it makes inserts into the clustered index completely random, so the data loses any clustering benefit. Inserting rows with a UUID primary key not only takes longer but also produces larger indexes. Part of this is because the primary key column is longer, and part is undoubtedly due to page splits and the resulting fragmentation.

By contrast, when primary key values are sequential, InnoDB stores each record immediately after the previous one. When the page reaches its maximum fill factor (InnoDB's default maximum fill factor is 15/16 of the page size, which leaves some room for later modifications), the next record is written to a new page. Once the data has been loaded this way, the primary key pages are packed nearly full of sequential records, which is the desired result (secondary index pages may look different, however).

With a UUID primary key, the key of a newly inserted row is not necessarily larger than the previous one, so InnoDB cannot simply place the new row at the end of the index. It has to find the appropriate place for the row, usually somewhere in the middle of the existing data, and make room for it. This causes a lot of extra work and results in a suboptimal data layout. The drawbacks of using a UUID as the primary key are:

A: The destination page might have been flushed to disk and evicted from the cache, or might never have been loaded into the cache, so InnoDB has to find it and read it from disk before it can insert the new row. This causes a lot of random I/O.

B: Because writes arrive out of order, InnoDB has to split pages frequently to make room for new rows. Page splits move a lot of data around, and a single insert modifies at least three pages instead of one.

C: Because of the frequent page splits, pages become sparsely and irregularly filled, so the final data is fragmented.
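The difference between sequential and random key order can be seen in a small simulation. The sketch below is a deliberately simplified model, not InnoDB's algorithm: pages hold a hypothetical 16 keys, appends at the end of the index start a new page once 15 of 16 slots are used (mirroring the 15/16 fill factor), and any other insert into a full page splits it in half. Random 128-bit integers stand in for UUIDs, and all names are made up for the example.

```python
import bisect
import random

PAGE_CAPACITY = 16   # hypothetical number of keys a page can hold
FILL_LIMIT = 15      # appends open a new page at 15/16 full


def load(keys):
    """Insert keys one by one; return (pages, number_of_page_splits)."""
    pages = [[]]      # pages kept in key order, each page a sorted list
    splits = 0
    for key in keys:
        # Find the page whose key range covers this key.
        firsts = [p[0] for p in pages[1:]]
        i = bisect.bisect_right(firsts, key)
        page = pages[i]
        is_append = i == len(pages) - 1 and (not page or key > page[-1])
        if is_append:
            if len(page) >= FILL_LIMIT:
                pages.append([key])   # new page, no existing rows move
            else:
                page.append(key)
            continue
        if len(page) >= PAGE_CAPACITY:
            # Page split: move the upper half of the rows to a new page.
            mid = len(page) // 2
            right = page[mid:]
            del page[mid:]
            pages.insert(i + 1, right)
            splits += 1
            if key >= right[0]:
                page = right
        bisect.insort(page, key)
    return pages, splits


def average_fill(pages):
    return sum(len(p) for p in pages) / (len(pages) * PAGE_CAPACITY)


rng = random.Random(42)
sequential_keys = list(range(1, 2001))                    # auto-increment style
random_keys = [rng.getrandbits(128) for _ in range(2000)]  # UUID-like

seq_pages, seq_splits = load(sequential_keys)
rand_pages, rand_splits = load(random_keys)
```

In this model the sequential load never splits a page and leaves every page except the last filled to 15/16, while the random load splits pages repeatedly and ends with more, emptier pages. The exact numbers depend on the assumed page size and are not meant to predict real InnoDB behavior.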

After loading such random values into a clustered index, you may need to run OPTIMIZE TABLE to rebuild the table and fill the pages optimally. With InnoDB, you should insert data in primary key order whenever possible, and use a clustering key whose value increases monotonically with each new row.

Note: When does a sequential primary key cause worse results?

For high-concurrency workloads, inserting in primary key order in InnoDB can cause noticeable contention. The upper end of the primary key becomes a hot spot: because all inserts happen there, concurrent inserts may contend for gap locks. Another hot spot may be the AUTO_INCREMENT locking mechanism. If you run into this problem, you may need to redesign the table or the application, or change the innodb_autoinc_lock_mode setting.
