python - How do I split 1.8 million MongoDB records?

Question

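I wrote a crawler that has collected roughly 1.8 million URLs. I now want to split them evenly into several parts and save each part to a file. What is a good way to do this?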

黄舟 · Answer

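Use the tools that ship with MongoDB

The mongoexport tool bundled with MongoDB can export a collection as JSON or CSV. The output file can have any extension, such as .txt.

The general form is: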

mongoexport --db {dbname} --collection {collectionname} --out traffic.json

For example

The database is test and the collection is col. It contains the following data:

> db.col.find().pretty()
{
    "_id" : ObjectId("573173ce83358fa60470e0db"),
    "id" : 1,
    "name" : "adamweixuan"
}
{
    "_id" : ObjectId("573173e983358fa60470e0dc"),
    "id" : 2,
    "name" : "nicholas"
}
{ "_id" : ObjectId("573173f383358fa60470e0dd"), "id" : 3, "name" : "test" }
{
    "_id" : ObjectId("5731740383358fa60470e0de"),
    "id" : 4,
    "name" : "test001"
}
{
    "_id" : ObjectId("5731740a83358fa60470e0df"),
    "id" : 5,
    "name" : "test002"
}
{
    "_id" : ObjectId("5731741283358fa60470e0e0"),
    "id" : 6,
    "name" : "test003"
}

Now export it as three files of equal size.

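# id less than or equal to 2
mongoexport --port 10510 -d test -c col -q '{"id": {"$lte": 2}}' --out ./names1.txt

# id greater than 2 and less than or equal to 4
# (both bounds go in one object; repeating the id key would drop the first condition)
mongoexport --port 10510 -d test -c col -q '{"id": {"$gt": 2, "$lte": 4}}' --out ./names2.txt

# id greater than 4
mongoexport --port 10510 -d test -c col -q '{"id": {"$gt": 4}}' --out ./names3.txt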

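You could write a script to automate this; give it a try.

Notes: -d specifies the database, -c specifies the collection, and -q is the query. The exported data can be JSON or CSV; the file extension (for example .txt) does not change the format.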
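The script suggested above could look something like the following. This is only a minimal sketch: it assumes every document has a numeric id field running from 1 to the total count with no gaps, as in the six-document example, and the function name build_export_commands and its parameters are made up for illustration. It only builds the mongoexport command lines; it does not run them.

```python
import json
import shlex


def build_export_commands(total, parts, db="test", collection="col",
                          port=10510, prefix="names"):
    """Return one mongoexport argument list per part.

    Assumes ids are the integers 1..total (hypothetical layout).
    """
    base, extra = divmod(total, parts)
    commands = []
    lower = 0  # exclusive lower bound of the current id range
    for i in range(parts):
        # The first `extra` parts take one more document each.
        size = base + (1 if i < extra else 0)
        upper = lower + size  # inclusive upper bound
        query = json.dumps({"id": {"$gt": lower, "$lte": upper}})
        commands.append([
            "mongoexport", "--port", str(port),
            "-d", db, "-c", collection,
            "-q", query,
            "--out", f"./{prefix}{i + 1}.txt",
        ])
        lower = upper
    return commands


if __name__ == "__main__":
    # Print the three commands for the six-document example.
    for cmd in build_export_commands(6, 3):
        print(shlex.join(cmd))
```

To actually run the exports, each argument list could be passed to subprocess.run(cmd, check=True). If the real collection has no contiguous id field, the ranges would have to be computed from whatever field is used to order the documents.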


ringa_lee · Answer

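Generally speaking, 1.8 million records do not need to be split.
The principles for choosing a shard key are already explained in detail in the official documentation; have a look there if you want to learn more.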