In MySQL, both InnoDB and MyISAM use the B+ tree as their index structure (other index types such as hash indexes are not considered here). This article starts with the most common binary search tree and gradually explains the problems each kind of tree solves and the new problems it introduces, thereby explaining why MySQL chose the B+ tree as its index structure.
1. Binary Search Tree (BST): Unbalanced
A binary search tree (BST), also called a binary sorted tree, is a binary tree that satisfies one additional condition: for any node, no value in its left subtree is greater than the node's value, and no value in its right subtree is less than the node's value. The figure below shows a BST.
When fast search is required, storing data in a BST is a common choice, because the query time depends on the tree height and the average time complexity is O(log n). However, a BST may grow skewed and become unbalanced, as shown in the figure below. In the extreme case the BST degenerates into a linked list, and the time complexity degenerates to O(n).
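The degeneration described above can be reproduced with a small sketch. The names `Node`, `insert`, `height`, and `build` are illustrative only and not taken from any library; the sketch assumes a plain, unbalanced BST and compares the height after inserting keys in sorted order with the height after inserting the same keys in random order.

```python
import random


class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into an unbalanced BST (iteratively) and return the root."""
    if root is None:
        return Node(key)
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                cur.left = Node(key)
                break
            cur = cur.left
        else:
            if cur.right is None:
                cur.right = Node(key)
                break
            cur = cur.right
    return root


def height(root):
    """Number of levels in the tree, counted level by level."""
    levels = 0
    frontier = [root] if root is not None else []
    while frontier:
        levels += 1
        frontier = [c for n in frontier for c in (n.left, n.right) if c is not None]
    return levels


def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root


keys = list(range(1000))
sorted_height = height(build(keys))      # every node only has a right child

random.seed(0)                           # fixed seed so the run is repeatable
shuffled = keys[:]
random.shuffle(shuffled)
random_height = height(build(shuffled))  # much closer to log2(1000)

print(sorted_height, random_height)
```

With sorted input the tree is effectively a linked list of 1000 nodes, so a lookup may have to visit all of them; with shuffled input the height stays within a small multiple of log2(1000).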
In order to solve this problem, balanced binary trees were introduced.
2. Balanced Binary Tree (AVL): Rotation takes time
An AVL tree is a strictly balanced binary search tree: for every node, the heights of the left and right subtrees differ by at most 1. Search, insertion, and deletion are all O(log n) in both the average and worst cases. The key to maintaining balance is the rotation operation: an insertion or deletion may destroy the balance, and the tree must then be rebalanced through one or more rotations. An insertion needs at most one rebalancing step (a single rotation or a double rotation). A deletion, however, may unbalance any node on the path from the deleted node up to the root, so the number of rotations can be O(log n).
Because of these rotations, deletion in an AVL tree is comparatively expensive; when there are many deletions, the cost of maintaining strict balance may outweigh its benefit, so AVL trees are used less often in practice.
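To make the rotation operation concrete, here is a minimal sketch of a single right rotation, the fix for a node whose left subtree has become two levels taller than its right subtree. `AVLNode`, `node_height`, `balance_factor`, and `rotate_right` are hypothetical names chosen for this illustration; a full AVL tree would also need the mirror-image left rotation and the two double-rotation cases.

```python
def node_height(node):
    """Height of a subtree; an empty subtree has height 0."""
    return node.height if node is not None else 0


class AVLNode:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right
        self.height = 1 + max(node_height(left), node_height(right))


def update_height(node):
    node.height = 1 + max(node_height(node.left), node_height(node.right))


def balance_factor(node):
    """Left height minus right height; AVL requires -1, 0, or 1."""
    return node_height(node.left) - node_height(node.right)


def rotate_right(y):
    """Rotate the subtree rooted at y to the right and return the new root."""
    x = y.left
    y.left = x.right      # x's right subtree moves under y
    x.right = y           # y becomes x's right child
    update_height(y)      # y is now lower, so fix its height first
    update_height(x)
    return x


# A left-leaning chain 3 -> 2 -> 1 violates the AVL condition at the root.
root = AVLNode(3, left=AVLNode(2, left=AVLNode(1)))
before = balance_factor(root)   # 2: out of balance
root = rotate_right(root)
after = balance_factor(root)    # 0: balanced again, with 2 as the new root
```

A rotation only rewires a constant number of pointers, but a deletion may have to repeat this check and repair at every ancestor up to the root, which is where the O(log n) rotations come from.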
3. Red-black tree: The tree is too tall
Compared with AVL trees, red-black trees do not pursue strict balance but only a rough balance: the longest possible path from the root to a leaf is at most twice as long as the shortest possible path. From an implementation point of view, the defining feature of a red-black tree is that each node has one of two colors (red or black), and the coloring must satisfy specific rules (omitted here). The figure below shows an example of a red-black tree.
Compared with an AVL tree, a red-black tree is slightly slower to query, because it is less strictly balanced and can therefore be taller. However, deletion is considerably cheaper: thanks to the colors, an insertion or deletion needs only O(1) rotations (plus recoloring) to restore the rough balance, whereas an AVL deletion may need O(log n) rotations. Overall, the statistical performance of red-black trees is better than that of AVL trees, so in practice AVL trees are used relatively rarely while red-black trees are used very widely. For example, TreeMap in Java uses a red-black tree to store sorted key-value pairs, and HashMap in Java 8 combines linked lists with red-black trees to resolve hash collisions (a linked list when a bucket has few colliding nodes, a red-black tree when it has many).
When the data is in memory (as with TreeMap and HashMap above), the red-black tree performs very well. But when the data is on auxiliary storage such as disks (as with MySQL and other databases), the red-black tree is not a good fit, because it still grows too tall. When the data is on disk, disk I/O becomes the biggest performance bottleneck, and the design goal should be to minimize the number of I/Os; the taller the tree, the more I/Os are needed for inserts, deletes, updates, and lookups, which seriously hurts performance.
4. B-tree: Born for disk
The B-tree, also written B- tree (the "-" is a hyphen, not a minus sign), is a multi-way balanced search tree designed for auxiliary storage devices such as disks. Unlike a binary tree, each non-leaf node of a B-tree can have many subtrees. Therefore, for the same amount of data, a B-tree is much shorter than an AVL tree or a red-black tree (the B-tree is "short and fat"), and the number of disk I/Os is greatly reduced.
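The effect of the number of children per node on tree height can be estimated with a short sketch. It uses a deliberately simplified model, assumed here for illustration: every node has exactly `fanout` children, so a tree with `levels` levels can address `fanout ** levels` keys. The function name `levels_needed` is hypothetical.

```python
def levels_needed(num_keys, fanout):
    """Smallest number of levels such that fanout ** levels >= num_keys."""
    levels = 1
    capacity = fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels


n = 100_000_000
binary_levels = levels_needed(n, 2)       # two children per node
wide_levels = levels_needed(n, 1000)      # a thousand children per node

print(binary_levels, wide_levels)
```

If each level costs roughly one disk read, the difference between the two results is the difference in I/Os per lookup.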
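The most important concept in defining a B-tree is its order. Under the common definition, a B-tree of order m must satisfy the following conditions:
1), every node has at most m children;
2), every non-leaf node other than the root has at least ceil(m/2) children;
3), the root has at least 2 children unless it is itself a leaf;
4), a non-leaf node with k children stores k-1 keys (records);
5), all leaf nodes are on the same level.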
It can be seen that the definition of B-tree mainly limits the number of child nodes and records of non-leaf nodes.
The following picture is an example of a B-tree of order 3:
The advantage of the B-tree is not only its small height but also its use of the principle of locality. The principle of locality states that when a piece of data is used, nearby data is likely to be used shortly afterwards. A B-tree stores records with similar keys in the same node. When one of them is accessed, the database reads the entire node into the cache; if an adjacent record is accessed soon after, it can be read directly from the cache with no disk I/O. In other words, the B-tree achieves a higher cache hit rate.
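The benefit of grouping neighboring keys in one node can be illustrated with a toy model. This is only a sketch under strong assumptions: keys are non-negative integers, key `k` lives on page `k // keys_per_page`, and the cache holds just the most recently read page. The function `count_page_reads` is a hypothetical helper, not a database API.

```python
def count_page_reads(keys, keys_per_page):
    """Count simulated disk reads when keys are accessed in the given order."""
    cached_page = None
    reads = 0
    for k in keys:
        page = k // keys_per_page
        if page != cached_page:   # cache miss: the page must be read from disk
            reads += 1
            cached_page = page
    return reads


# Accessing 1000 adjacent keys in order:
grouped_reads = count_page_reads(range(1000), 100)  # 100 neighboring keys per page
single_reads = count_page_reads(range(1000), 1)     # one key per page
```

When 100 neighboring keys share a page, only the first access to each page misses the cache; with one key per page every access is a miss.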
The B-tree has some applications in databases; for example, MongoDB's indexes use a B-tree structure. However, many databases use the B+ tree, a variant of the B-tree.
5. B+ tree
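The B+ tree is also a multi-way balanced search tree. Its main differences from the B-tree are:
1), non-leaf nodes are pure index nodes: they store only keys and pointers to child nodes, while all data records are stored in the leaf nodes (so a key in a non-leaf node appears again in a leaf);
2), the leaf nodes are connected into a linked list by pointers.
Thus, the B+ tree has the following advantages over the B-tree:
1), fewer disk I/Os: because non-leaf nodes hold no data, each one can hold more keys, which further reduces the height of the tree;
2), more efficient range queries: after locating the first matching key, a query can follow the leaf linked list instead of going back up the tree.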
B+ trees also have a disadvantage: they take up more space because keys are repeated. However, compared with the performance advantages, this space cost is usually acceptable, so B+ trees are more widely used in databases than B-trees.
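The linked leaf level is what makes range queries cheap in a B+ tree. The sketch below is a simplified illustration, not InnoDB's actual layout: it models only the leaf level plus one list of separator keys standing in for the index levels, and the names `Leaf`, `build_leaves`, and `range_scan` are hypothetical. It assumes at least one leaf exists.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys):
        self.keys = keys    # sorted keys (the row data would be stored here too)
        self.next = None    # pointer to the right sibling leaf


def build_leaves(sorted_keys, leaf_capacity):
    """Pack sorted keys into leaves and link each leaf to its right sibling."""
    leaves = [Leaf(sorted_keys[i:i + leaf_capacity])
              for i in range(0, len(sorted_keys), leaf_capacity)]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves


def range_scan(leaves, separators, low, high):
    """Return all keys k with low <= k <= high.

    separators[i] is the smallest key of leaves[i + 1]; it plays the role
    of the keys held by the non-leaf index nodes.
    """
    leaf = leaves[bisect_right(separators, low)]  # one descent to the first leaf
    result = []
    while leaf is not None:                       # then walk the linked list
        for k in leaf.keys:
            if k > high:
                return result
            if k >= low:
                result.append(k)
        leaf = leaf.next
    return result


leaves = build_leaves(list(range(0, 40, 2)), leaf_capacity=4)
separators = [leaf.keys[0] for leaf in leaves[1:]]
found = range_scan(leaves, separators, 9, 21)
```

The index is consulted once to find the starting leaf; every further key in the range is reached by following `next` pointers, without returning to the upper levels.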
6. Feel the power of the B+ tree
As mentioned earlier, compared with binary trees such as the red-black tree, the biggest advantage of the B-tree/B+ tree is its smaller height. In fact, for InnoDB's B+ tree indexes, the tree is generally 2-4 levels high. Let's make some concrete estimates.
The height of the tree is determined by the order: the larger the order, the shorter the tree. The order in turn depends on how many records each node can store. In InnoDB each node is one page; the page size is 16KB by default, of which metadata (file management header, page header, and so on) accounts for only about 128 bytes, so most of the space is used to store data.
For non-leaf nodes, a record contains only the index key and a pointer to a node on the next level. Assume that each non-leaf page stores 1000 records, that is, each record takes up approximately 16 bytes; this assumption is reasonable when the index column is an integer or a short string. This is also the reason behind the common advice that indexed columns should not be too long: if the index column is too long, each node holds too few records, the tree grows too tall, the index becomes much less effective, and the index wastes more space.
For a 3-level B+ tree, the first level (the root node) has 1 page and can store 1000 records; the second level has 1000 pages and can store 1000*1000 records; the third level (the leaf nodes) has 1000*1000 pages, and each page can store 100 records, so it can store 1000*1000*100 records, that is, 100 million records. A balanced binary tree needs about 27 levels to store 100 million records, since 2^26 is only about 67 million.
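The estimate above can be checked with a few lines of arithmetic. The page size, the 128-byte overhead, the 16-byte index record, and the figures of 1000 records per non-leaf page and 100 per leaf page come from the text; the function names are hypothetical, and the 160-byte row size is an assumption picked only so that a leaf page holds roughly 100 rows.

```python
PAGE_SIZE = 16 * 1024     # InnoDB page size used in the text, in bytes
PAGE_OVERHEAD = 128       # approximate metadata per page, from the text


def records_per_page(record_size):
    """How many fixed-size records fit in the usable part of one page."""
    return (PAGE_SIZE - PAGE_OVERHEAD) // record_size


def bplus_capacity(levels, fanout, rows_per_leaf):
    """Rows addressable by a B+ tree with the given number of levels.

    Each of the (levels - 1) non-leaf levels multiplies the page count by
    fanout; each leaf page then holds rows_per_leaf rows.
    """
    return fanout ** (levels - 1) * rows_per_leaf


index_fanout = records_per_page(16)    # 16-byte key + pointer records
leaf_rows = records_per_page(160)      # assumed 160-byte rows (hypothetical)

# Using the rounded figures from the text: 1000 per index page, 100 per leaf.
capacity_3_levels = bplus_capacity(3, 1000, 100)
capacity_4_levels = bplus_capacity(4, 1000, 100)
```

The exact per-page counts come out slightly above the rounded 1000 and 100, so the text's figure of 100 million rows for three levels is, under these assumptions, a mildly conservative estimate.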
7. Summary
Finally, let us summarize the problems each kind of tree solves and the new problems it introduces:
1), Binary search tree (BST): solves the basic problem of keeping data sorted, but because balance cannot be guaranteed, it may degenerate into a linked list;
2), Balanced binary tree (AVL): solves the balance problem through rotation, but the rotation operations are too costly;
3), Red-black tree: by abandoning strict balance and introducing red and black nodes, it solves the problem of AVL's costly rotations; however, in scenarios such as disk storage, the tree is still too tall and requires too many I/Os;
4), B-tree: by replacing the binary tree with a multi-way balanced search tree, it solves the problem of the tree being too tall;
5), B+ tree: building on the B-tree, non-leaf nodes become pure index nodes that do not store data, which further reduces the height of the tree; in addition, the leaf nodes are connected into a linked list by pointers, which makes range queries more efficient.