Optimization of high concurrent low cardinality multi-field arbitrary combination query
1. Question
First explain the "low cardinality multi-field arbitrary combination query" that appears in this title. something. This refers to queries that meet the following conditions:
1. The search conditions involve a combination of multiple field conditions
2. The combination of these fields is uncertain
3. The selection of each individual field The sex is not good
This type of query has many usage scenarios, such as the product display page of e-commerce. Users will enter various combinations of query conditions: category, supplier, brand, promotion, price, etc..., and in the end they often have to sort and paginate the results.
The troublesome aspects of this type of problem are:
1. There are a large number of records. If a full table scan is performed, the performance will be low and it cannot meet the requirements of high concurrent access.
2. The selectivity of any single field involved in the query conditions is very low, and the query efficiency problem cannot be solved through a single field index.
3. If you build an ordinary Btree multi-field index, hundreds or thousands of indexes may be built due to too many combinations of user input conditions, which is unrealistic and difficult to maintain.
2. Solution
There are 2 solutions that I think of for this kind of problem
2.1 bitmap index
The characteristic of bitmap is to store the key and all values are equal to The bitmap of the row set of this key. For combined queries involving multiple keys, you only need to perform an AND or operation on the bitmaps corresponding to these keys. Since the size of bitmap is very small and the efficiency of bit AND-OR operations is also very high, bitmap is very suitable for this type of query.
Bitmap index also has disadvantages. Updating a record will lock the entire table, which is not suitable for scenarios with many concurrent writes. Another problem is that among common relational databases, only Oracle seems to support bitmap indexes, and many times we want to use open source databases.
2.2 Inverted index
Inverted index is similar to bitmap. It stores the key and the row set whose value is equal to this key. The row set may be a list or a tree or Other storage forms. For combined queries of multiple keys, just perform set operations on the results of these keys.
Inverted index is generally used for full-text retrieval, but many systems also use it to support structured data search, such as Elasticsearch. Elasticsearch supports fast search of JSON documents, compound queries, sorting, aggregation, distributed deployment and many other great features. But considering the following factors, we prefer to find a solution in a relational database.
-There is no need to use the advanced features provided by search engines for fuzzy matching. In fact, we need exact matching or simple fuzzy matching.
-The amount of data is not large enough to require a distributed search cluster.
-The original data is already in the relational database, so you don’t want to worry about data synchronization.
-I have developed an application based on the interface of a relational database and do not want to start over.
-Having mastered the operation and maintenance management of relational databases, I don’t know how many pitfalls I have to go through in a brand new system.
- Considering the performance difference between Java and C, the performance of the built-in relational database solution may not be inferior to that of professional search engines.
3. PostgreSQL solution
If the scope of the solution is limited to open source relational databases, there may be only one answer, which is PostgreSQL's gin index.
PostgreSQL's gin index is an inverted index. It is not only used for full-text search but also for regular data types, such as int and varchar.
For multi-dimensional queries, we can build indexes like this:
1. Create a unique multi-field gin index for all low-cardinality fields involved in equivalent conditions
2. For equivalent queries with good selectivity or For the fields involved in the range query, build a btree index
Some students may have questions. It is also a multi-field index. Why does gin's multi-field index only need to build one, but btree's multi-field index does? Consider various query combinations and build several. This is because each field in the gin multi-field index is equivalent and there is no leading field. Therefore, as long as a unique gin multi-field index is built, all query combinations can be covered; while the btree multi-field index is different. If the query conditions do not contain the suoyi leading field, the index cannot be used.
Each key stored internally in the multi-field gin index is in the form of (column number, key datum), so different fields can be distinguished without confusion. The stored value is the set of ctids of all records matching key. This set is stored in the form of btree when the number of records is relatively large, and has been compressed, so the storage space occupied by the gin index is very small, only about one-twentieth of the equivalent btree index, which is also derived from another Improved performance.
For multiple fields involved in multi-dimensional queries, fields included in the multi-field gin index, the ctid set is merged by the gin index (taking the union or intersection), and then the obtained ctid set and other indexes are obtained The ctid collection is then merged with BitmapAnd or BitmapOr. The efficiency of merging ctid sets within the gin index is much higher than that of merging ctid sets between indexes, and the gin index optimizes low-cardinality fields better, so making full use of the characteristics of the gin index is better than building a separate btree index for each field and then using BitmapAnd Or BitmapOr merges the result set much more efficiently.
4. A real case
4.1 Original query
The following SQL is a simplified version of a real SQL in a system.
- SELECT CASE WHEN gpppur.GB_BEGINDATE <= '2016-02-29 14:36:00' AND gpppur.GB_ENDDATE > '2016-02-29 14: 36:00' THEN 1
- WHEN gpppur.PREVIEW_BEGINDT <= '2016-02-29 14:36:00' AND gpppur.PREVIEW_ENDDT > '2016-02-29 14:36: 00' THEN 2
- ELSE 3 END AS flag,
- gpppur.*
- FROM T_MPS_INFO gpppur
- WHERE gpppur.ATTRACT_TP = 0
- AND gpppur.COLUMN_ID = 1
- AND gpppur.FIELD2 = 1
- AND gpppur. STATUS = 1
- ORDER BY flag ASC,gpppur.PC_SORT_NUM ASC,gpppur.GB_BEGINDATE DESC
- LIMIT 0,45
Using MySQL database , the total data volume is 60w, in which a multi-field index of FIELD2 STATUS is built.
The value distribution of the four fields involved in the query conditions is as follows:
- postgres=# select ATTRACT_TP,count(*) from T_MPS_INFO group by ATTRACT_TP;
- attract_tp | count
- ------------ --------
- | 16196
- 6 | 251
- 2 | 50
- 1 | 3692
- 3 | 143
- 10 | 314
- 4 | 214
- 5 | 194333
- 9 | 326485
- 7 | 1029
- 0 | 6458
- (11 rows)
- postgres=# select COLUMN_ID,count(*) from T_MPS_INFO group by COLUMN_ID;
- column_id | count
- ------------ - -------
- | 2557
- 285 | 20
- 120 | 194
- 351 | 2
- 337 | 79
- 227 | 26
- 311 | 9
- 347 | 2
- 228 | 21
- 318 | 1
- 314 | 9
- 54 | 10
- 133 | 27
- 2147483647 | 1
- 336 | 1056
- 364 | 1
- 131 | 10
- 243 | 5
- 115 | 393
- 61 | 73
- 226 | 40
- 196 | 16
- 350 | 5
- 373 | 72
- 377 | 2
- 260 | 4
- 184 | 181
- 363 | 1
- 341 | 392
- 64 | 1
- 344 | 199271
- 235 | 17
- 294 | 755
- 352 | 3
- 368 | 1
- 225 | 1
- 199 | 8
- 374 | 2
- 248 | 8
- 84 | 1
- 362 | 1
- 361 | 331979
- 319 | 7
- 244 | 65
- 125 | 2
- 130 | 1
- 272 | 65
- 66 | 2
- 240 | 2
- 775 | 1
- 253 | 49
- 60 | 45
- 121 | 5
- 257 | 3
- 365 | 1
- 0 | 1
- 217 | 5
- 270 | 1
- 122 | 39
- 56 | 49
- 355 | 5
- 161 | 1
- 329 | 1
- 222 | 9
- 261 | 275
- 2 | 3816
- 57 | 19
- 307 | 4
- 310 | 8
- 97 | 37
- 202 | 20
- 203 | 3
- 85 | 1
- 375 | 641
- 58 | 98
- 1 | 6479
- 59 | 114
- 185 | 7
- 338 | 10
- 379 | 17
- (80 rows)
- postgres=# select FIELD2,count(*) from T_MPS_INFO group by FIELD2;
- field2 | count
- -------- --------
- | 2297
- 6 | 469
- 2 | 320
- 1 | 11452
- 3 | 286
- 10 | 394
- 4 | 291
- 5 | 200497
- 9 | 331979
- 0 | 2
- 7 | 1178
- (11 rows)
- postgres=# select STATUS,count(*) from T_MPS_INFO group by STATUS;
- status | count
- - ------- --------
- | 2297
- 0 | 15002
- 3 | 5
- 4 | 1
- 1 | 531829
- 2 | 31
- (6 rows)
Since the value distribution of these fields is extremely uneven, we construct the following Lua script to generate different select statements to simulate the load.
qx.lua:
- pathtest = string.match(test, "(.*/)") or ""
- dofile(pathtest .. "common.lua")
- function thread_init(thread_id)
- set_vars()
- end
- function event(thread_id)
- local ATTRACT_TP,COLUMN_ID,FIELD2,STATUS
- ATTRACT_TP = sb_rand_uniform(0, 10)
- COLUMN_ID = sb_rand_uniform(1, 100)
- FIELD2 = sb_rand_uniform(0, 10)
- STATUS = sb_rand_uniform(0, 4)
- rs = db_query("SELECT CASE WHEN gpppur.GB_BEGINDATE <= '2016-02-29 14:36:00' AND gpppur.GB_ENDDATE > '2016-02-29 14:36:00' THEN 1
- WHEN gpppur.PREVIEW_BEGINDT <= '2016-02-29 14:36:00' AND gpppur.PREVIEW_ENDDT > '2016-02-29 14:36:00' THEN 2
- ELSE 3 END AS flag,
- gpppur.*
- FROM T_MPS_INFO gpppur
- WHERE gpppur.ATTRACT_TP = "..ATTRACT_TP.."
- AND gpppur.COLUMN_ID = "..COLUMN_ID.."
- AND gpppur.FIELD2 = "..FIELD2.."
- AND gpppur.STATUS = "..STATUS.."
- ORDER BY flag ASC,gpppur.PC_SORT_NUM ASC,gpppur.GB_BEGINDATE DESC
- LIMIT 45")
- end
然后用sysbench进行压测,结果在32并发时测得的qps是64。
- [root@rh6375Gt20150507 ~]# sysbench --db-driver=mysql --test=/opt/sysbench-0.5/sysbench/tests/db/qx.lua --mysql-db=test --mysql-user=mysql --mysql-password=mysql --mysql-host=srdsdevapp69 --num-threads=32 --max-time=5 run
- sysbench 0.5: multi-threaded system evaluation benchmark
-
- Running the test with following options:
- Number of threads: 32
- Random number generator seed is 0 and will be ignored
-
-
- Threads started!
-
- OLTP test statistics:
- queries performed:
- read: 825
- write: 0
- other: 0
- total: 825
- transactions: 0 (0.00 per sec.)
- read/write requests: 825 (64.20 per sec.)
- other operations: 0 (0.00 per sec.)
- ignored errors: 0 (0.00 per sec.)
- reconnects: 0 (0.00 per sec.)
-
- General statistics:
- total time: 12.8496s
- total number of events: 825
- total time taken by event execution: 399.6003s
- response time:
- min: 1.01ms
- avg: 484.36ms
- max: 12602.74ms
- approx. 95 percentile: 222.79ms
-
- Threads fairness:
- events (avg/stddev): 25.7812/24.12
- execution time (avg/stddev): 12.4875/0.23
4.2 优化后的查询
对于上面那个特定的SQL虽然我们可以通过建一个包含所有等值查询条件中4个字段(ATTRACT_TP,COLUMN_ID,FIELD2,STATUS)的组合索引进行优化,但是需要说明的是,这条SQL只是各种查询组合产生的1000多种不同SQL中的一个,每个SQL涉及的查询字段的组合是不一样的,我们不可能为每种组合都单独建一个多字段索引。
所以我们想到了PostgreSQL的gin索引。为了使用PostgreSQL的gin索引,先把MySQL的表定义,索引和数据原封不动的迁移到PostgreSQL。
在添加gin索引前,先做了一个测试。另人惊讶的是,还没有开始进行优化,PostgreSQL测出的性能已经是MySQL的5倍(335/64=5)了。
- [root@rh6375Gt20150507 ~]# sysbench --db-driver=pgsql --test=/opt/sysbench-0.5/sysbench/tests/db/qx.lua --pgsql-db=postgres --pgsql-user=postgres --pgsql-password=postgres --pgsql-host=srdsdevapp69 --num-threads=32 --max-time=5 run
- sysbench 0.5: multi-threaded system evaluation benchmark
- Running the test with following options:
- Number of threads: 32
- Random number generator seed is 0 and will be ignored
- Threads
- OLTP test statistics:
- queries performed:
- read: 1948
- write: 0
- other: 0
- total: 1948
- transactions: 0 (0.00 per sec.)
- read/write requests: 1948 (335.52 per sec.)
- other operations: 0 (0.00 per sec.)
- ignored errors: 0 (0.00 per sec.)
- reconnects: 0 (0.00 per sec.)
- General statistics:
- total time: 5.8059s
- total number of events: 1948
- total time taken by event execution: 172.0538s
- response time:
- min: 0.90ms
- avg: 88.32ms
- max: 2885.69ms
- approx. 95 percentile: 80.01ms
- Threads fairness:
- events (avg/stddev): 60.8750/27.85
- execution time (avg/stddev): 5.3767/0.29
下一步,添加gin索引。
- postgres=# create extension btree_gin;
- CREATE EXTENSION
- postgres=# create index idx3 on t_mps_info using gin(attract_tp, column_id, field2, status);
- CREATE INDEX
再进行压测,测出的qps是5412,是MySQL的85倍(5412/64=85)。
- [root@rh6375Gt20150507 ~]# sysbench --db-driver=pgsql --test=/opt/sysbench-0.5/sysbench/tests/db/qx.lua --pgsql-db=postgres --pgsql-user=postgres --pgsql-password=postgres --pgsql-host=srdsdevapp69 --num-threads=32 --max-time=5 run
- sysbench 0.5: multi-threaded system evaluation benchmark
- Running the test with following options:
- Number of threads: 32
- Random number generator seed is 0 and will be ignored
- Threads
- OLTP test statistics:
- queries performed:
- read: 10000
- write: 0
- other: 0
- total: 10000
- transactions: 0 (0.00 per sec.)
- read/write requests: 10000 (5412.80 per sec.)
- other operations: 0 (0.00 per sec.)
- ignored errors: 0 (0.00 per sec.)
- reconnects: 0 (0.00 per sec.)
- General statistics:
- total time: 1.8475s
- total number of events: 10000
- total time taken by event execution: 58.2706s
- response time:
- min: 0.95ms
- avg: 5.83ms
- max: 68.36ms
- approx. 95 percentile: 9.42ms
- Threads fairness:
- events (avg/stddev): 312.5000/47.80
- execution time (avg/stddev): 1.8210/0.02
4.3 补充
作为对比,我们又在MySQL上添加了包含attract_tp, column_id, field2和status这4个字段的多字段索引,测出的qps是4000多,仍然不如PostgreSQL。可见业界广为流传的MySQL的简单查询性能优于PostgreSQL的说法不可信!(对于复杂查询PostgreSQL的性能大大优于MySQL应该是大家的共识。我例子中的SQL不能算是复杂查询吧?)
5. 总结
gin索引(还包括类似的gist,spgist索引)是PostgreSQL的一大特色,基于它可以挖掘出很多好玩的用法。对于本文提到的场景,有兴趣的同学可以把它和Oracle的bitmap索引以及基于搜索引擎(Elasticsearch,Solr等)的方案做个对比。另外,本人所知有限,如果有其它更好的方案,希望能让我知道。
http://www.bkjia.com/PHPjc/1108026.htmlwww.bkjia.comtruehttp: //www.bkjia.com/PHPjc/1108026.htmlTechArticleOptimization of high concurrency, low cardinality, multi-field queries in any combination 1. Question: First, explain the "low" that appears in this title What does “cardinality multi-field arbitrary combination query” refer to. This means satisfying...