The 2023 ACM SIGMOD/PODS International Conference on Data Management (SIGMOD 2023) will be held in Seattle, USA from June 18-23 local time. Recently, the conference announced the list of the best papers, with "Predicate Pushdown for Data Science Pipelines" from Microsoft Research and "Detecting Logic Bugs of Join Optimizations in DBMS" from Zhejiang University winning. This is the first time a mainland Chinese research team has won the best paper award at the conference since its inception in 1975. Among them, research from Zhejiang University proposed a novel method that can automatically discover logical vulnerabilities in database management systems such as MySQL, MariaDB, TiDB and PolarDB.
Over the past few decades, modern database management systems (DBMS) have continued to evolve to support a variety of new architectures, such as cloud platforms and HTAP , which requires increasingly sophisticated optimization of query evaluation. The query optimizer is considered to be one of the most complex and important components in a DBMS. Its function is to parse the input SQL query and then generate an efficient execution plan with the assistance of the built-in cost model. Errors in query optimizer implementation may lead to vulnerabilities (bugs), including crash vulnerabilities and logic holes. Crash vulnerabilities are easy to detect because a crash causes the system to stop immediately. However, logic vulnerabilities are easily overlooked because they can cause the DBMS to return erroneous result sets that are difficult to detect. This paper focuses on detecting these silent vulnerabilities.
There is an emerging approach to detecting logical vulnerabilities in DBMS, namely Pivoted Query Synthesis (PQS). The core idea of this method is to randomly select a pivot row (pivot row) from the table and then generate a query with this row as the result. If any synthesized query cannot return that data row, then a logic vulnerability has been detected. PQS is mainly used to support option queries in a single table, and 90% of its reported vulnerabilities only involve single-table queries. There is still a large research gap for multi-table queries that use different join algorithms and join structures (which are more error-prone than single-table queries).
The following figure shows two logical loopholes in connecting queries in MySQL. Both vulnerabilities can be detected using the new tools proposed in this article.
Figure 1: Example of logical vulnerability in join optimization in DBMS
Figure 1(a) shows a logic vulnerability in the hash join in MySQL 8.0.18. In this example, the first query returns the correct result set because it is executed using a block nested loop join. However, the second query using an inner hash join went wrong and returned an incorrectly empty result set. This is because the underlying hash join algorithm incorrectly assumes that 0 is not equal to −0.
The logic vulnerability in Figure 1 (b) originates from the semi-join processing in MySQL 8.0.28. In the first query, the nested loop inner join converts the data type varchar to bigint, resulting in the correct result set. When executing the second query using a hash semi-join, the data type varchar will be converted to double, resulting in a loss of data accuracy and errors in equivalent comparisons.
The difficulty of using query synthesis method for logical vulnerability detection of multi-table join queries is far more difficult than that of single-table queries. This involves two challenges:
In response to the above problems, researchers from Zhejiang University proposed a method called Transformed Query Synthesis (TQS). TQS is a new, universal and cost-effective tool for the task of detecting logical vulnerabilities in join optimization in DBMSs.
In response to the first challenge mentioned above, the solution proposed by researchers is DSG, which stands for Data-guided Schema and query Generation. ). Given a data set represented as a wide table, DSG can split the data set into multiple tables based on the detected paradigm. To speed up the discovery of vulnerabilities, DSG also injects some artificial noise data into the generated database. First, convert the database schema into a graph, where nodes are tables/columns and edges are relationships between nodes. DSG uses random walks on the pattern graph to select tables for the query, and then uses these tables to generate joins. For a specific join query involving multiple tables, we can easily find the truth result from the wide table. In this way, DSG can effectively generate (query, result) collections for database validation.
#In response to the second challenge mentioned above, the method designed by the researchers is KQE, which is Knowledge-guided Query space Exploration. The approach begins by extending the pattern graph into a plan-iterative graph, which represents the entire query generation space. Each join query is then represented as a subgraph. To score the generated query graph, KQE employs an embedding-based graph index that searches the already explored space for structurally similar query graphs. The random walk query generator is guided based on the coverage score to explore as much of the unknown query space as possible.
#In order to demonstrate the versatility and effectiveness of the method, the researchers evaluated TQS on four commonly used DBMS: MySQL, MariaDB, TiDB and PolarDB . After running for 24 hours, TQS successfully found 115 vulnerabilities, including 31 in MySQL, 30 in MariaDB, 31 in TiDB, and 23 in PolarDB. By analyzing the root causes, the types of these vulnerabilities can be summarized, among which there are 7 types of vulnerabilities in MySQL, 5 types in MariaDB, 5 types in TiDB, and 3 types in PolarDB. The researchers have submitted the discovered vulnerabilities to the corresponding communities and received positive feedback.
The following will describe the problem to be solved and the solution proposed by Zhejiang University in mathematical form.
There are two types of database vulnerabilities: crashes and logical vulnerabilities. Crash vulnerabilities arise from the execution of the operating system and DBMS. They can cause the DBMS to be forcibly terminated due to reasons such as insufficient resources such as memory or access to an invalid memory address. Therefore, crash vulnerabilities are easily discovered. Logic vulnerabilities are much harder to find because the database will still function normally and will process queries and return seemingly correct results (and most of the time they do, but in a few cases they may Reading wrong result set). These silent vulnerabilities are like invisible bombs and are more dangerous because they are difficult to detect and may affect the correctness of the application.
This paper introduces a query optimizer to detect logical vulnerabilities for multi-table join query problems. Researchers call these vulnerabilities join optimization bugs. Using the notation given in Table 1, the join optimization vulnerability detection problem can be formally defined as:
Definition: For each query qi in the query workload Q, let query optimize The server performs the join of qi through multiple actual plans and validates its result set using the ground truth . If , a connection optimization vulnerability has been found.
Table 1: Symbol Description Table
Figure 2 gives an architectural overview of TQS. Given a baseline data set and a target DBMS, TQS searches for possible logical vulnerabilities in the DBMS by generating queries based on the data set. TQS has two key components: data-guided schema and query generation (DSG) and knowledge-guided query space exploration (KQE)
Figure 2: TQS overview
DSG treats the input data set as a wide table, and in addition to the original tuple , DSG also deliberately synthesizes some tuples with error-prone values (such as null values or very long strings). For join queries, DSG creates a new schema for the wide table by splitting the wide table into multiple tables, ensuring that the tables conform to a functional dependency-based paradigm. DSG models the database schema as a graph and then generates logical/conceptual queries via random walks on the schema graph. A DSG will materialize a logical query into a physical execution plan and transform the query with different hints, allowing the DBMS to execute multiple different physical execution plans to search for vulnerabilities. For a join query, the ground truth result is obtained by mapping the join graph back to the wide table.
After completing the schema setup and data splitting, KQE expands the schema diagram into a planning iteration diagram. Each query is represented as a subgraph. KQE builds an embedding-based graph index for the embeddings of the query graph in history (i.e., within the already explored query space). Intuitively, the role of KQE is to ensure that the newly generated query graph is as far away from its nearest neighbors in the history as possible, i.e. this is to explore new query graphs rather than repeating existing query graphs. KQE does this by scoring the generated query graph based on structural similarity (to query graphs in history) while using an adaptive random walk method to generate queries. .
Algorithm 1 summarizes the core idea of TQS, where lines 2, 10, and 12 are DSG, and lines 4, 8, and 9 are KQE.
Given a dataset and a wide table sampled from , DSG will Table is split into multiple tables, and these tables form a database schema (line 2) that conforms to 3NF. The pattern can be viewed as a graph , where tables and columns are vertices and edges represent relationships between vertices. DSG uses a random walk on to generate a join expression of the query (line 10). In fact, join queries can be projected as subgraphs of . The DSG can easily retrieve the ground truth result for this query by mapping the subgraph back to the wide table (line 12).
KQE extends the pattern diagram into a planning iteration diagram (line 4). To avoid testing similar paths, KQE will build an embedding-based graph index
to index the embeddings of the existing query graph (line 9) . KQE updates the edge weight π of the planning iteration graph G based on the structural similarity between the current query graph and the existing query graph (line 8). KQE scores the next possible path, which guides the random walk generator, thereby preferring to explore the unknown query space.
For a query , TQS transforms the query through the prompt set to execute multiple different actual query plans (Section 11 OK). Finally, the result set of query
# is compared to the ground truth value (line 14). If they are inconsistent, then a join optimization vulnerability has been detected (line 15).
Please read the original paper for more detailed descriptions of DSG and KQE.
TQS successfully found some logical vulnerabilities in database management systems such as MySQL, MariaDB, TiDB and PolarDB. They are divided into 20 types, including 7 vulnerabilities in MySQL and 7 in MariaDB. There are 5 types of TiDB, 5 types of PolarDB, and 3 types of PolarDB, as shown in the table below.
Compared with other methods, the overall performance of TQS proposed by Zhejiang University is also quite outstanding, and it has achieved significant improvements in many indicators. Excellent results were obtained, and the effectiveness of each component was also tested through controlled variable experiments.
But researchers also said that TQS is currently focusing on equijoin queries. Nonetheless, the DSG and KQE ideas can also be extended to the case of non-equijoins. The only difficulty is how to generate and manage query truth results - in the case of non-equijoins, the size of these results increases exponentially. This aspect still needs further research in the future.
The above is the detailed content of Automatically discovered 100+ vulnerabilities in four major databases in one day, Zhejiang University research won the best paper in SIGMOD 2023. For more information, please follow other related articles on the PHP Chinese website!