What is a reasonable value for chunksize in MongoDB sharding?

Question

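As the title says. I have heard that with 20M, inserting 1 million records while sharding loses data; that with 63M, inserting 60 million records while sharding can also lose data; and that different MongoDB versions behave differently. Does anyone have experience with this value?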

高洛峰 · Answer

"Lost data" and chunksize are two unrelated things and have no direct logical connection. I don't know who would put these two things together and tell you. Since I don’t know what specific scenario the data loss is referring to, I will give you some answers that may be useful to you based on what I know.
It seems your real concern is data loss rather than chunksize itself. Besides, leaving chunksize at its default value usually causes no problems, so I will skip the chunksize question.
A few words on "losing data". For any database, losing data for no reason is intolerable. So when data is lost, it comes down to one of three causes:

1. Force majeure, such as a power outage, hardware damage, or network failure.
2. Configuration.
3. A serious bug in the software.

There is not much you can do to prevent point 1, but its impact should be minimized through replication with a replica set.

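Point 2: Make sure journaling is enabled (it is by default). Even then, the journal is only flushed to disk at short intervals, so a power outage or program crash can still lose roughly the last 30ms of writes (the exact window depends on the version and storage configuration). If the data is very important and even a 30ms loss is intolerable, turn on the j parameter: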
mongodb://ip:port/db?replicaSet=rs&j=1
(These parameters can also be set in code for an individual request; check the documentation of the driver you are using.)
This parameter makes the write block until it has been written to the journal on disk.
But is the data safe once it is on disk? Remember that this is a distributed environment, and data being safe on a single machine does not mean it is safe in the cluster. In an emergency, the write may have reached the journal but not yet been replicated to the other nodes of the replica set. If the primary happens to go down at exactly that moment, another node becomes the new primary through an election, and an interesting situation called rollback occurs. Read up on it if you are interested. Of course, replication is usually very fast, and rollback is very rare. If that still does not feel safe enough, there is a w parameter you can use:
mongodb://ip:port/db?replicaSet=rs&j=1&w=1
The w parameter makes the write block until the data has landed on the given number of nodes (w=1/2/3...n). Note that w=1 waits for the primary only; w=2 or higher also waits for secondaries.
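To make the rollback scenario concrete, here is a deliberately simplified toy model in Python. It is not how MongoDB replicates data internally: `Node`, `write`, `fail_over`, and `survives_failover` are hypothetical names made up for illustration, and real replication is asynchronous with a far more involved election. The sketch only shows why a write acknowledged by the primary alone can disappear after a failover, while a write that had to reach a secondary first survives.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy replica set member holding an ordered list of writes."""
    name: str
    log: list = field(default_factory=list)


def write(nodes, doc, w):
    """Store doc on the first w nodes (nodes[0] is the primary).

    Assumption: the write is acknowledged as soon as w nodes hold it.
    """
    for node in nodes[:w]:
        node.log.append(doc)


def fail_over(nodes):
    """The primary (nodes[0]) dies; the most up-to-date survivor takes over."""
    survivors = nodes[1:]
    return max(survivors, key=lambda n: len(n.log))


def survives_failover(w):
    """Return True if a write acknowledged at level w is on the new primary."""
    nodes = [Node("a"), Node("b"), Node("c")]
    doc = {"_id": 1}
    write(nodes, doc, w)
    new_primary = fail_over(nodes)
    return doc in new_primary.log


if __name__ == "__main__":
    for w in (1, 2):
        print(f"w={w}: write survives failover -> {survives_failover(w)}")
```

In this model, `survives_failover(1)` is false: the write existed only on the old primary, so it would be rolled back when that node rejoins. With `w=2` the write is already on a surviving node.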
Is that safe now? Sorry, if you are particularly unlucky (you really should buy a lottery ticket), the data was copied to more than one node, and then that whole set of nodes fails at the same time. That is why there is w=majority. If the cluster loses a majority of its nodes it becomes read-only, so no new data is written and there is nothing to roll back. Once everything is restored, your data is still there.
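The arithmetic behind w=majority can be sketched in a few lines of Python. The function names `majority` and `can_accept_writes` are hypothetical and exist only for this illustration; the sketch assumes every member is a voting member and ignores arbiters, priorities, and timeouts.

```python
def majority(voting_members: int) -> int:
    """Smallest number of members that is more than half of the set."""
    return voting_members // 2 + 1


def can_accept_writes(voting_members: int, reachable: int) -> bool:
    """A primary can only exist while a majority of members is reachable."""
    return reachable >= majority(voting_members)


if __name__ == "__main__":
    for members in (3, 4, 5):
        print(f"{members} members -> majority is {majority(members)}")
    # A 3-member set that has lost 2 members can no longer take writes.
    print(can_accept_writes(3, 1))
```

Because any two majorities of the same set share at least one member, a write acknowledged by a majority is always present on some node that takes part in electing the next primary, which is why such a write is not rolled back.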
Those are some of the situations in which data loss can occur. As you can imagine, the w and j settings improve data safety but also reduce write throughput considerably. This should really be a policy you tailor to your own tolerance for data loss, and it is not a bug.
Another thing that comes to mind is that I often encounter people who like to do this kind of thing in the community:

kill -9 mongod

If you ask me, that is simply too brutal. Why bring out a cannon to swat a mosquito right from the start? Data lost this way is, frankly, deserved. In fact,

kill mongod

is safe; if you use -9, that is on you.

As for point 3, bugs that caused data loss did occur during MongoDB's development. Versions 3.0.8 to 3.0.10 were hit hardest, so avoid them. That said, what software has never had problems during its development? 3.0.11 was released the same day the problem in 3.0.10 was discovered, which is a very fast fix.
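If you want to check a deployment against the version range mentioned above, a small helper along these lines could do it. This is only a sketch: the range is taken from this answer rather than from an official advisory, the names `parse_version`, `AFFECTED_RANGE`, and `is_affected` are hypothetical, and it assumes plain numeric versions such as "3.0.9" without suffixes like "-rc1".

```python
def parse_version(version: str) -> tuple:
    """Turn '3.0.10' into (3, 0, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Inclusive range quoted in this answer (an assumption, not an advisory).
AFFECTED_RANGE = ((3, 0, 8), (3, 0, 10))


def is_affected(version: str) -> bool:
    """Return True if version falls inside AFFECTED_RANGE."""
    low, high = AFFECTED_RANGE
    return low <= parse_version(version) <= high


if __name__ == "__main__":
    for v in ("3.0.7", "3.0.9", "3.0.10", "3.0.11"):
        print(v, is_affected(v))
```

Versions are compared as integer tuples rather than strings because a string comparison would sort "3.0.10" before "3.0.8".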

Okay, having said all that, I don't know whether any of it is useful to the questioner. Just a reminder: describe your problem as clearly as possible, otherwise others can only guess, as I did, at what problem you ran into and in what scenario. The most likely outcome is the old saying:

Garbage in, garbage out