The database system is the core of a management information system, and database-based online transaction processing (OLTP) and online analytical processing (OLAP) are among the most important computer applications in banks, enterprises, government departments and other organizations. Judging from the application profiles of most systems, query operations account for the largest share of database operations, and the SELECT statement on which queries are based is the most expensive of the SQL statements. Once data accumulates to a certain scale, for example when a bank's account table grows to millions or even tens of millions of records, a full table scan often takes tens of minutes or even hours; if a better strategy than a full table scan is adopted, the query time can often be reduced to a few minutes. This shows the importance of query optimization technology. While implementing application projects, the author found that many programmers who use front-end database development tools (such as PowerBuilder or Delphi) focus only on a polished user interface and pay no attention to the efficiency of their query statements, so the resulting application systems are inefficient and waste resources badly. Designing efficient and reasonable query statements is therefore very important. Based on application examples and combined with database theory, this article introduces the application of query optimization technology in real systems.
Analysis of the problem
Many programmers believe that query optimization is the job of the DBMS (database management system) and has little to do with the SQL statements they write. This is wrong: a good query plan can often improve program performance by dozens of times. A query is the set of SQL statements submitted by the user; the query plan is the set of statements generated after optimization. The DBMS processes a query as follows: after the lexical and syntax checks of the query statement are completed, the statement is submitted to the DBMS query optimizer; once the optimizer has performed algebraic optimization and access-path optimization, the precompilation module processes the statement and generates a query plan, which is then submitted to the system for execution at the appropriate time, and the results are finally returned to the user. Recent versions of actual database products (such as Oracle and Sybase) use cost-based optimization, which estimates the cost of different query plans from information in the system dictionary tables and then selects the better plan. Although current database products keep getting better at query optimization, the SQL statements submitted by the user remain the basis for the system's optimization; it is hard to imagine an inherently poor query plan becoming efficient after system optimization, so the quality of the statements users write is crucial. Setting aside the optimization performed by the system, the following sections focus on ways to improve the user's query plan.
Solving the problem
The following takes the relational database system Informix as an example to introduce methods to improve user query plans.
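Before adjusting a statement, it helps to see which plan the optimizer actually chose. In Informix this can be done with the SET EXPLAIN statement; the query below is only an illustration:
SET EXPLAIN ON   -- record the plan chosen by the optimizer (access paths, join order, cost estimate) in the sqexplain.out file
SELECT * FROM orders WHERE customer_num = 104
SET EXPLAIN OFF  -- stop recording plans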
1. Reasonable use of indexes
An index is an important data structure in a database, and its fundamental purpose is to improve query efficiency. Most database products now use the ISAM index structure first proposed by IBM. Indexes must be used appropriately; the usage principles are as follows:
● Create indexes on columns that are frequently joined but are not designated as foreign keys; for fields that are joined infrequently, index generation can be left to the optimizer.
● Create indexes on columns that are frequently sorted or grouped (that is, group by or order by operations).
● Create indexes on columns that have many distinct values and are frequently used in conditional expressions; do not create indexes on columns with few distinct values. For example, the "Gender" column of an employee table has only two values, "Male" and "Female", so there is no need to index it; an index there would not improve query efficiency and would seriously slow down updates.
● If multiple columns are to be sorted, a compound index can be created on those columns (example definitions are sketched after this list).
● Use system tools. For example, the Informix database has a tbcheck tool that can check suspicious indexes. On some database servers an index may become invalid, or its read efficiency may drop, because of frequent operations. If a query that uses an index slows down for no apparent reason, try the tbcheck tool to check the integrity of the index and repair it if necessary. In addition, after a table has had a large amount of its data updated, deleting the indexes and rebuilding them can improve query speed.
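As an illustration of the above principles, index definitions might look like the following (the table and column names are only examples):
CREATE INDEX idx_orders_cust ON orders(customer_num)        -- column frequently used in joins but not designated as a foreign key
CREATE INDEX idx_emp_dept_name ON employee(dept_num, name)  -- compound index on columns that are sorted or grouped together
-- no index on a two-valued column such as "Gender": it would not help queries and would slow down updates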
2. Avoid or simplify sorting
Repeated sorting of large tables should be avoided or simplified. When the optimizer can use an index to produce output in the proper order automatically, the sorting step is avoided. It cannot be avoided in the following situations:
● The index does not include one or more of the columns to be sorted;
● The order of the columns in the group by or order by clause differs from the order of the columns in the index;
● The sorted columns come from different tables.
To avoid unnecessary sorting, indexes must be built correctly and database tables merged where reasonable (although this may sometimes affect the normalization of the tables, the gain in efficiency is worth it). If sorting is unavoidable, try to simplify it, for example by narrowing the range of columns to be sorted.
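As a sketch (the table, index and column names are illustrative): when an index includes the columns of the order by clause in the same order, the optimizer can return rows in index order and skip the sort:
CREATE INDEX idx_cust_post_name ON cust(postcode, name)
SELECT name, postcode FROM cust
WHERE postcode > "98000"
ORDER BY postcode, name   -- same columns in the same order as the index, so no separate sorting step is needed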
3. Eliminate sequential access to large table row data
In nested queries, sequential access to tables can have a fatal impact on query efficiency. For example, with a sequential access strategy, a query nested three levels deep that examines 1,000 rows at each level ends up examining one billion rows. The main way to avoid this is to index the joined columns. For example, consider two tables: a student table (student number, name, age, ...) and a course selection table (student number, course number, grade). If the two tables are to be joined, an index should be created on the join field "student number".
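A sketch of this example, with the table and column names rendered in English purely for illustration:
CREATE INDEX idx_selection_student ON course_selection(student_num)   -- index the join column of the course selection table
SELECT student.name, course_selection.course_num, course_selection.grade
FROM student, course_selection
WHERE student.student_num = course_selection.student_num   -- the join can now use keyed lookups instead of rescanning course_selection for every student row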
A union can also be used to avoid sequential access. Even when indexes exist on all the columns being checked, some forms of the where clause force the optimizer to use sequential access. The following query forces a sequential scan of the orders table:
SELECT * FROM orders WHERE (customer_num=104 AND order_num>1001) OR order_num=1008
Although indexes exist on both customer_num and order_num, the optimizer still scans the entire table along a sequential access path for the statement above. Because the statement actually retrieves two separate sets of rows, it should be changed to the following:
SELECT * FROM orders WHERE customer_num=104 AND order_num>1001
UNION
SELECT * FROM orders WHERE order_num=1008
This way, the index path can be used to process the query.
4. Avoid correlated subqueries
If a column appears both in the outer query and in the subquery of the where clause, the subquery is likely to be re-executed every time the column's value in the outer query changes. The more levels of nesting a query has, the lower its efficiency, so subqueries should be avoided wherever possible. If a subquery is unavoidable, filter out as many rows as possible inside it.
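As an illustration using the customer and orders tables from the earlier examples (the column names are assumed for the sketch), the first statement below contains a correlated subquery that may be re-evaluated for each customer row, while the second retrieves the same customers with a join that the optimizer can usually handle more efficiently:
SELECT name FROM customer
WHERE EXISTS (SELECT * FROM orders
WHERE orders.customer_num = customer.customer_num
AND orders.order_num > 1001)
-- equivalent join form, which avoids re-running the inner query for every outer row
SELECT DISTINCT customer.name
FROM customer, orders
WHERE orders.customer_num = customer.customer_num
AND orders.order_num > 1001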
5. Avoid difficult regular expressions
The MATCHES and LIKE keywords support wildcard matching, technically called regular expressions, but this kind of matching is particularly time-consuming. For example: SELECT * FROM customer WHERE zipcode LIKE "98___"
Even if an index is created on the zipcode field, sequential scanning is still used in this case. If the statement is changed to SELECT * FROM customer WHERE zipcode > "98000", the index is used when the query is executed, which obviously improves the speed greatly.
Also, avoid non-initial substrings. For example, the statement SELECT * FROM customer WHERE zipcode[2,3] > "80" uses a non-initial substring in the where clause, so it will not use the index.
6. Use temporary tables to speed up queries
Sorting a subset of a table into a temporary table can sometimes speed up queries, because it avoids repeated sort operations and also simplifies the optimizer's work in other ways. For example:
SELECT cust.name, rcvbles.balance, ...other columns
FROM cust, rcvbles
WHERE cust.customer_id = rcvbles.customer_id
AND rcvbles.balance > 0
AND cust.postcode > "98000"
ORDER BY cust.name
If this query is to be executed many times rather than just once, all the customers with unpaid balances can be found once, placed in a temporary table, and sorted by customer name:
SELECT cust.name, rcvbles.balance, ...other columns
FROM cust, rcvbles
WHERE cust.customer_id = rcvbles.customer_id
AND rcvbles.balance > 0
ORDER BY cust.name
INTO TEMP cust_with_balance
Then query in the temporary table in the following way:
SELECT * FROM cust_with_balance
WHERE postcode > "98000"
The temporary table has fewer rows than the main table, and its physical order is already the required order, which reduces disk I/O; the query workload is therefore greatly reduced.
Note: the temporary table does not reflect modifications made to the main table after it is created. When data in the main table is modified frequently, be careful not to lose data.
7. Use sorting to replace non-sequential access
Non-sequential disk access is the slowest operation, showing up as back-and-forth movement of the disk access arm. SQL statements hide this situation, making it easy, when writing an application, to write queries that access a large number of non-sequential pages.
Sometimes, using the database's sorting capability in place of non-sequential access can improve query performance.
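A rough sketch of the idea, reusing the cust and rcvbles tables from section 6 (whether this pays off depends on the data and the server): if neither join column is indexed, selecting each table into a temporary table ordered by the join column lets the final join process both inputs in sorted order rather than jumping back and forth across disk pages:
SELECT customer_id, balance FROM rcvbles
WHERE balance > 0
ORDER BY customer_id INTO TEMP t_rcv
SELECT customer_id, name FROM cust
ORDER BY customer_id INTO TEMP t_cust
SELECT t_cust.name, t_rcv.balance
FROM t_cust, t_rcv
WHERE t_cust.customer_id = t_rcv.customer_id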