Duplicates can be removed via a Python script:

from pymongo import MongoClient

client = MongoClient()
db = client.dbname
collection = db.collectionname  # the collection to deduplicate

seen = {}
for doc in collection.find():
    key = doc['field']  # the field whose values should be unique
    if key in seen:  # dict.has_key() was removed in Python 3
        print('duplicate key %s' % key)
        collection.delete_one({'_id': doc['_id']})  # remove() is deprecated
    else:
        print('first record key %s' % key)
        seen[key] = 1
The idea is very simple: traverse the collection, record each key in a dict, and delete a document the second time its key is encountered.
But this way you cannot control which duplicate is deleted and which is retained; adjust the script to your scenario, for example by sorting the cursor first (see the sketch below).
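One way to control which duplicate survives is to sort the cursor before traversing. A minimal sketch, assuming a hypothetical created_at timestamp field on each document, so that the newest document per key is the one retained:

from pymongo import MongoClient, DESCENDING

collection = MongoClient().dbname.collectionname

seen = set()
# newest-first: the first document seen for each key (the newest) is kept,
# older ones are deleted
for doc in collection.find().sort('created_at', DESCENDING):
    key = doc['field']
    if key in seen:
        collection.delete_one({'_id': doc['_id']})
    else:
        seen.add(key)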
I have also encountered this situation. I don’t know how to solve it. Can you give me some advice?
When there are more than 100,000 documents, can a script still process them quickly? And how does the script cope with heavy concurrent writes?
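For collections of that size, one option is to let the server do the grouping instead of pulling every document into Python: an aggregation $group on the key collects the _ids per value, and the extras are deleted in batches. A minimal sketch under the same dbname/collectionname/field assumptions as above; note that it does not by itself protect against concurrent writers (for that, see the unique index below):

from pymongo import MongoClient

collection = MongoClient().dbname.collectionname

pipeline = [
    # group documents by the duplicated field, collecting all their _ids
    {'$group': {'_id': '$field',
                'ids': {'$push': '$_id'},
                'count': {'$sum': 1}}},
    # keep only the groups that actually contain duplicates
    {'$match': {'count': {'$gt': 1}}},
]
# allowDiskUse lets the $group stage spill to disk on large collections
for group in collection.aggregate(pipeline, allowDiskUse=True):
    # keep the first _id of each group, delete the rest in one call
    collection.delete_many({'_id': {'$in': group['ids'][1:]}})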
MongoDB 3.0 removed the dropDups option, so duplicate data can no longer be deleted that way when building a unique index.
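With dropDups gone, the usual replacement is to deduplicate first (with one of the scripts above) and then build a unique index, which makes the database itself reject duplicate inserts even under concurrency. A minimal sketch, again assuming the field is named field:

from pymongo import MongoClient, ASCENDING
from pymongo.errors import DuplicateKeyError

collection = MongoClient().dbname.collectionname

# fails with a duplicate-key error if duplicates still exist,
# so run a dedup pass first
collection.create_index([('field', ASCENDING)], unique=True)

# a concurrent insert of an existing key now raises DuplicateKeyError
try:
    collection.insert_one({'field': 'some-existing-value'})
except DuplicateKeyError:
    pass  # another writer got there first; handle as appropriate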
http://blog.chinaunix.net/xmlrpc.php?r=blog/article&id=4865696&uid=15795819