mongodb去重
高洛峰
高洛峰 2017-05-02 09:18:56
0
2
615

现在的数据是使用爬虫抓取的。现在有些数据是重复的。
然后现在我想请教应该怎么做才能去重呢?
我想的是只要能查到相对应的name也是ok的
打个比方。我现在有个community_name字段。
我想查询一下,community_name重复次数超过1次的name列表
我应该怎么去查询。
谢谢。
文档格式:

{
    "_id" : ObjectId("5732e6f884e079abfa783703"),
    "buildings_num" : "4",
    "community_name" : "江和城",
    "address" : "新安江洋安新城,南临洋安大道、北临滨江路",
    "lat" : "29.511485",
    "building_year" : "2014年建成",
    "lng" : " 119.329673",
    "house_num" : 224,
    "id" : 84453,
    "category" : "建德商圈",
    "city" : "杭州",
    "lj_id" : "187467387072819",
    "area" : "建德",
    "average_price" : 8408,
    "property_cost" : "2 元/平米/月",
    "property_company" : "金管家",
    "volume_rate" : "1.98",
    "greening_rate" : 0.33,
    "developers" : "杭州和谐置业有限公司"
}
高洛峰
高洛峰

拥有18年软件开发和IT教学经验。曾任多家上市公司技术总监、架构师、项目经理、高级软件工程师等职务。 网络人气名人讲师,...

reply all(2)
刘奇

It seems you want to achieve something similar to RDBMS

SELECT community_name, COUNT(*)
FROM table
GROUP BY community_name
HAVING COUNT(*) > 1

I don’t know if I understand it correctly. If this is the case, the corresponding approach should be to use the aggregation framework.

db.coll.aggregate([
    {$group: {_id: "$community_name", count: {$sum: 1}}}, //统计community_name重复出现的次数
    {$match: {count: {$gt: 1}}} //从中找出重复多于1次的记录
]);

This query can get results faster with the following indexes:

db.coll.createIndex({community_name: 1});

But even so, this query will traverse all records, and the speed will not be too fast.
In fact, it is wasteful to count all the records every time. It is best to cache the results after getting the results. How to cache depends on how you want to use the collected data.
A better way is to make a judgment before inserting. If the same community_name already exists, record it, such as

db.community_name_stat.update({
    community_name: 'xxx'
}, {
    '$set': {
        count: {'$inc': 1}
    },
    '$setOnInsert': {
        community_name: 'xxx',
        count: 1
    }
}, {
    upsert: true
});

In this way, you can directly get a community_name_stat collection to get how many times each community_name_stat集合得到每个community_name appears. Of course, the final approach depends on your needs. MongoDB is a very flexible thing, which is one of the important features that distinguishes it from relational databases. Understanding its various functions and customizing a most cost-effective solution for your needs is one of the biggest challenges in using MongoDB.

phpcn_u1582

If you understand it correctly, you can use upsert directly: if the system already has a record with the same conditions, only update it, otherwise create a new record.

db.collection.update(query, update, {upsert: True, multi: <boolean>})

And you can also modify multiple records, if multi is set to true.

Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template