python - How do I split 1.8 million MongoDB records?

Question

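I wrote a crawler that has collected about 1.8 million URLs. Now I need to split them evenly into several parts and save each part to a file. What is a good way to do this?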

黄舟 · Answer

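Use the tools that ship with MongoDB

MongoDB's bundled mongoexport tool can export a collection as JSON or CSV; the output file can use any extension, such as .txt.

The general form is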

mongoexport --db {dbname} --collection {collectionname} --out traffic.json

For example

Say the database is test and the collection is col, holding the following data

> db.col.find().pretty()
{
    "_id" : ObjectId("573173ce83358fa60470e0db"),
    "id" : 1,
    "name" : "adamweixuan"
}
{
    "_id" : ObjectId("573173e983358fa60470e0dc"),
    "id" : 2,
    "name" : "nicholas"
}
{ "_id" : ObjectId("573173f383358fa60470e0dd"), "id" : 3, "name" : "test" }
{
    "_id" : ObjectId("5731740383358fa60470e0de"),
    "id" : 4,
    "name" : "test001"
}
{
    "_id" : ObjectId("5731740a83358fa60470e0df"),
    "id" : 5,
    "name" : "test002"
}
{
    "_id" : ObjectId("5731741283358fa60470e0e0"),
    "id" : 6,
    "name" : "test003"
}

Now export it as three files of equal size.

# id no greater than 2
mongoexport --port 10510 -d test -c col -q '{"id": {"$lte": 2}}' --out ./names1.txt

# id greater than 2 and at most 4 (both bounds go in one condition; repeating the id key would keep only the last one)
mongoexport --port 10510 -d test -c col -q '{"id": {"$gt": 2, "$lte": 4}}' --out ./names2.txt

# id greater than 4
mongoexport --port 10510 -d test -c col -q '{"id": {"$gt": 4}}' --out ./names3.txt

You could write a script to automate this.
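Such a script could look like the sketch below. It is only an illustration: it assumes the documents carry a contiguous integer id field like the sample data, reuses the port and file names from the example above, and the helper names id_ranges and export_commands are made up here. It only builds and prints the mongoexport commands; it does not run them.

```python
import json
import shlex


def id_ranges(min_id, max_id, parts):
    """Split the inclusive range [min_id, max_id] into up to `parts`
    contiguous ranges whose sizes differ by at most one."""
    total = max_id - min_id + 1
    base, extra = divmod(total, parts)
    ranges = []
    low = min_id
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more parts requested than there are ids
        ranges.append((low, low + size - 1))
        low += size
    return ranges


def export_commands(db, collection, min_id, max_id, parts, port=10510):
    """Build one mongoexport argument list per id range."""
    commands = []
    for n, (low, high) in enumerate(id_ranges(min_id, max_id, parts), start=1):
        # json.dumps yields strict JSON with quoted keys for the -q option.
        query = json.dumps({"id": {"$gte": low, "$lte": high}})
        commands.append([
            "mongoexport", "--port", str(port),
            "-d", db, "-c", collection,
            "-q", query,
            "--out", f"./names{n}.txt",
        ])
    return commands


if __name__ == "__main__":
    # Print shell-quoted commands for the six-document sample, three files.
    for cmd in export_commands("test", "col", 1, 6, 3):
        print(shlex.join(cmd))
```

To run the exports instead of printing them, each argument list could be passed to subprocess.run(cmd, check=True). For the real collection, min_id and max_id would have to be looked up first, and the split is only even if the id values have no large gaps.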

Notes: -d specifies the database, -c the collection, and -q the query filter. The output is JSON by default, or CSV with --type=csv.
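If the documents have no sequential id to filter on, another option is to split by count in Python after reading the URLs. The following is a minimal sketch of that idea, not part of the mongoexport approach above; split_evenly and write_chunks are hypothetical helper names, and the URLs in the demo are placeholders. It assumes the full list fits in memory, which is usually the case for 1.8 million short strings.

```python
import os
import tempfile


def split_evenly(items, parts):
    """Return `parts` lists whose lengths differ by at most one."""
    items = list(items)
    base, extra = divmod(len(items), parts)
    chunks = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def write_chunks(chunks, directory, prefix="urls"):
    """Write each chunk to its own text file, one item per line."""
    paths = []
    for n, chunk in enumerate(chunks, start=1):
        path = os.path.join(directory, f"{prefix}_{n}.txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.writelines(f"{item}\n" for item in chunk)
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Placeholder URLs standing in for the crawled data.
    urls = [f"http://example.com/page/{i}" for i in range(10)]
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_chunks(split_evenly(urls, 3), tmp):
            with open(path, encoding="utf-8") as fh:
                print(os.path.basename(path), len(fh.readlines()))
```

With pymongo installed, the items could come from a cursor over the collection, for example a generator yielding one field from each document returned by find(); that part is left out here so the sketch runs without a database.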


ringa_lee · Answer

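With 1.8 million records you usually... don't need to split anything.
The principles for choosing a shard key are already explained in detail in the official documentation; it is worth a read if you want to learn more.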