python - How do I split 1.8 million MongoDB records?
PHPz 2017-04-17 17:43:06

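I wrote a crawler and collected about 1.8 million URLs. Now I want to split them evenly into several parts and save each part to a file.
What is a good way to do this?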


All replies (2)
黄舟

Use the tools that ship with MongoDB.

mongoexport can export a collection as JSON or CSV; the output file itself can have any extension, such as .json, .csv, or .txt.

The format is:

mongoexport --db {dbname} --collection {collectionname} --out traffic.json

For example:

The database is test and the collection is col. It contains the following data:

> db.col.find().pretty()
{
    "_id" : ObjectId("573173ce83358fa60470e0db"),
    "id" : 1,
    "name" : "adamweixuan"
}
{
    "_id" : ObjectId("573173e983358fa60470e0dc"),
    "id" : 2,
    "name" : "nicholas"
}
{ "_id" : ObjectId("573173f383358fa60470e0dd"), "id" : 3, "name" : "test" }
{
    "_id" : ObjectId("5731740383358fa60470e0de"),
    "id" : 4,
    "name" : "test001"
}
{
    "_id" : ObjectId("5731740a83358fa60470e0df"),
    "id" : 5,
    "name" : "test002"
}
{
    "_id" : ObjectId("5731741283358fa60470e0e0"),
    "id" : 6,
    "name" : "test003"
}

Now export it as three equal files.

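# id not greater than 2
mongoexport --port 10510 -d test -c col -q '{"id": {"$lte": 2}}' --out ./names1.txt

# id greater than 2 and not greater than 4
# (both bounds go in one sub-document; repeating the id key would keep only the last condition)
mongoexport --port 10510 -d test -c col -q '{"id": {"$gt": 2, "$lte": 4}}' --out ./names2.txt

# id greater than 4
mongoexport --port 10510 -d test -c col -q '{"id": {"$gt": 4}}' --out ./names3.txt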

You can write a script to generate these commands and try it out.

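Note: -d specifies the database, -c specifies the collection, and -q specifies the query. The export format is JSON or CSV (selected with --type); the output file name can end in .json, .csv, or .txt.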
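The script suggested above could look roughly like the following. This is only a sketch under some assumptions: the id field holds integers running from 1 to a known maximum without large gaps (as in the sample data), and the function name build_export_commands and its parameters are made up for illustration. It only builds the command lines; nothing is executed.

```python
import json
import shlex


def build_export_commands(max_id, parts, db="test", collection="col", port=10510):
    """Build one mongoexport command per contiguous id range.

    Assumes ids are integers from 1 to max_id without large gaps,
    so equal-width id ranges hold roughly equal numbers of documents.
    """
    commands = []
    for i in range(parts):
        # Integer arithmetic spreads any remainder across the parts.
        low = max_id * i // parts
        high = max_id * (i + 1) // parts
        query = {"id": {"$gt": low, "$lte": high}}
        commands.append([
            "mongoexport",
            "--port", str(port),
            "-d", db,
            "-c", collection,
            "-q", json.dumps(query),
            "--out", f"./names{i + 1}.txt",
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_export_commands(max_id=6, parts=3):
        # Print a shell-safe command line; pass cmd to subprocess.run to execute it.
        print(shlex.join(cmd))
```

For the real data set, max_id would be the largest id in the collection and parts the number of files wanted. If the ids have gaps, the files will not be exactly equal in size.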

左手右手慢动作

Generally speaking, a data set of 1.8 million documents does not need to be sharded.
The official documentation explains the principles for choosing a shard key in detail; refer to it if you want to know more.
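If the goal is only to end up with several files of equal size, a data set this small can also be split on the client in a single pass. The sketch below is one possible approach, not something from the answers above: split_round_robin is a hypothetical helper that takes any iterable of URL strings and deals them out to the output files in turn. With a MongoDB driver, the iterable could be a cursor over the collection, which is an assumption about how the URLs are stored.

```python
from itertools import cycle
from pathlib import Path


def split_round_robin(urls, parts, out_dir, prefix="urls"):
    """Write each URL to one of `parts` files, cycling through the files.

    File sizes differ by at most one line, and the input is read only once,
    so it does not have to fit in memory.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"{prefix}_{i + 1}.txt" for i in range(parts)]
    files = [p.open("w", encoding="utf-8") for p in paths]
    try:
        # zip stops as soon as the URL iterable is exhausted.
        for url, fh in zip(urls, cycle(files)):
            fh.write(url + "\n")
    finally:
        for fh in files:
            fh.close()
    return paths
```

Round-robin distribution does not keep neighbouring URLs together; if the order matters, splitting by id range is the better fit.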
