1. Example document structure
{
_id: xxxx,
user: 'xiaoming',
level: 5,
from: 'iPhone',
info: 'something wrong'
}
2. Scenario: there are six to seven million documents where user is 'xiaoming'
3. Problem: how to speed up this aggregate + $group + $sum count
aggregate([
{$match:{user: 'xiaoming', info:{$regex:'wrong'}}},
{$group:{_id:null, count:{$sum:1}}}
])
The pipeline above counts xiaoming's documents whose info contains 'wrong'. The result is
{"_id": null, "count": 2299999}
and it takes 30-40s. I have tried three indexes: user, info, and user+info; none of them improved the speed.
Searching Baidu and Google only turns up "counting with conditions is slow and has no solution".
How can I improve the efficiency? Is under 10s achievable?
The first thing to say is that for OLAP-type operations you should not expect too much. They work over large amounts of data, and the IO alone far exceeds a typical OLTP operation, so demanding OLTP-level speed and concurrency is unrealistic and meaningless. That does not mean there is no room for optimization, though.
Let's start with the index. How long does it take to find the 6 million documents matching {user: "xiaoming"} without an index? A full collection scan (COLLSCAN) that pulls 6 million documents out of 7 million is obviously a very different thing from pulling 6 million out of 100 million; with an index hit (IXSCAN) that difference shrinks to almost nothing. So it is wrong to say the {user: 1} index has no effect; most likely the collection is just too small for the difference to show. And a side note: to judge whether efficiency differs, look at the execution plan, not the execution time, because timings are unreliable.
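To compare plans rather than timings, explain() can be used. A minimal sketch (the collection name "logs" is a placeholder, not from the question):

```javascript
db.logs.find({ user: "xiaoming" }).explain("executionStats")
// Inspect winningPlan ("IXSCAN" vs "COLLSCAN") and
// executionStats.totalDocsExamined to see what the index actually does
```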
Even with the user index, there are still 6 million matching documents, and the remaining predicate is a regex. A non-anchored regex like this cannot use an index, so whether or not info is indexed makes no difference: after the 6 million documents are found, there is still a filter over all 6 million of them. The only thing that might help this step is a full-text index, but a full-text index cannot completely replace regular expressions; read the documentation for the specifics. Where a full-text index is feasible, a compound index combining user with a text index on info can be established.
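As a sketch of that compound-index idea (collection name assumed; note that $text matches on word boundaries, so it is not a drop-in replacement for an arbitrary substring regex):

```javascript
// Equality prefix + text index; queries using it must then include
// an equality match on user alongside $text
db.logs.createIndex({ user: 1, info: "text" })

db.logs.aggregate([
  { $match: { user: "xiaoming", $text: { $search: "wrong" } } },
  { $group: { _id: null, count: { $sum: 1 } } }
])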
Update a counter every time a matching document is inserted or updated.
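A minimal sketch of that counter idea, assuming all writes go through application code (collection and counter names here are hypothetical):

```javascript
function insertLog(doc) {
  db.logs.insertOne(doc)
  // Keep a pre-aggregated count in a side collection,
  // incremented whenever a matching document is written
  if (doc.user === "xiaoming" && /wrong/.test(doc.info)) {
    db.counters.updateOne(
      { _id: "xiaoming_wrong" },
      { $inc: { count: 1 } },
      { upsert: true }
    )
  }
}
```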
I don't know, but would it be better if it could be split into two $match stages?
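For reference, the split would look like this (collection name is a placeholder). It is unlikely to help: MongoDB's pipeline optimizer coalesces consecutive $match stages into one, so this is equivalent to the original single $match:

```javascript
db.logs.aggregate([
  { $match: { user: "xiaoming" } },          // index-friendly equality first
  { $match: { info: { $regex: "wrong" } } }, // regex filter second
  { $group: { _id: null, count: { $sum: 1 } } }
])
```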
Similar to the above: I think the main approach is to run the count on a schedule.
If you add an index, index user.
The real-time requirement is not high, so the count can be computed periodically and cached.
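One way to sketch the periodic-count-plus-cache approach (collection names are assumptions): run the expensive aggregation on a schedule and write the result to a small cache collection that reads can hit in milliseconds.

```javascript
// Run on a schedule (cron / scheduled task)
var res = db.logs.aggregate([
  { $match: { user: "xiaoming", info: { $regex: "wrong" } } },
  { $group: { _id: null, count: { $sum: 1 } } }
]).toArray()

db.cached_counts.updateOne(
  { _id: "xiaoming_wrong" },
  { $set: { count: res.length ? res[0].count : 0, updatedAt: new Date() } },
  { upsert: true }
)
// Readers then query db.cached_counts instead of re-running the aggregation
```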