mongodb - 百万数据aggregate group sum 统计超级耗时的问题，求解决方案

Question

1、文档结构示例 {代码...} 2、场景：user为'xiaoming'的文档有六七百万条 3、问题：怎么提升aggregate+group+sum速度 {代码...} 用上面这个来统计xiaoming带有wrong的文档数量，结果 {代码...} 耗时30s-40s。use...

PHP中文网 · Answer

The first thing to explain is that for OLAP-type operations, expectations should not be too high. After all, it is an operation of large amounts of data. IO alone far exceeds the usual OLTP operation, so it is unrealistic and meaningless to require the speed and concurrency of OLTP operations. But that doesn’t mean there’s no room for optimization.
Let’s start with the index. How long does it take to find 6 million {user: "xiaoming"} without indexing? Full table scan COLLSCAN to find 6 million pieces of data out of 7 million pieces of data, and finding 6 million pieces of data out of 100 million pieces of data are obviously two different concepts. If the index IXSCAN is hit, the difference will be much smaller and almost negligible. So you are wrong to say that {user: 1} this index has no effect. It may just be because the amount of data in the collection is too small to see the difference. By the way, it should be mentioned that to see whether there is a difference in efficiency, you should look at the execution plan, not the execution time, as the time is inaccurate. {user: "xiaoming"}需要多少时间？全表扫描COLLSCAN从700w条数据中找出600w条，跟从1亿条数据中找出600w条显然是两个概念。命中索引IXSCAN，这个差异就会小很多，几乎可以忽略。所以你说{user: 1}这个索引没有作用是不对的，可能只是因为集合数据量太少看不出差异而已。顺便应该提一下看效率是否有差异应该看执行计划，不要看执行时间，时间是不准确的。
在有user索引的前提下，结果仍然有600w条，剩下的部分是个regex，regex无法命中索引，所以不管有没有对info的索引都没有意义。在找到600w条数据之后还有一个对600w数据的filter操作。唯一对这个操作可能有帮助的只有全文索引Under the premise of user index, there are still 6 million results, and the remaining part is a regex. regex cannot hit the index, so no matter It doesn’t make sense whether there is an index to

. After finding 6 million pieces of data, there is a filter operation on 6 million pieces of data. The only thing that may be helpful for this operation is full-text index, but full-text index cannot completely replace regular expressions. You need to read the documentation for specific questions. When considering full-text indexing is feasible, a composite index can be established:

db.coll.createIndex({
  user: 1,
  info: "text"
});

The corresponding query should be changed to:

db.coll.aggregate([
  {$match:{user: 'xiaoming', $text: { $search: "wrong" }}},
  {$group:{_id:null, count:{$sum:1}}}
])

For an introduction to composite full-text index, please refer to here. There are still some limitations that need to be noted. After this optimization, it is expected that the time can be reduced to less than 20 seconds under the same hardware, which is still far from the 10 seconds you want. The reason was mentioned at the beginning, you cannot have such high expectations for OLAP. If you really have this need, you should start from the source and consider:

infoCount every time a
field is updated or inserted
or
Complete statistics every once in a while, cache the statistical results, and display them directly to the user during query

🎜

某草草 · Answer

I don’t know, but would it be better if it could be split into two matches? .

Similar to

aggregate([
    {$match:{user: 'xiaoming'}}，
    {$match:{info: /wrong/}},
    {$group:{_id:null, count:{$sum:1}}}
])

I think the main thing is to spend time regularly.
If there is an index, index user.

给我你的怀抱 · Answer

The real-time requirement is not high and can be counted regularly and cached