Because the hash is a 40-character hexadecimal string, I map the letters a-f to digits and then parse the whole string as one integer before storing it in a bit set. This saves memory, and each lookup should cost less than O(n) (effectively constant time).
My code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
SHIFT = 5    # use 5 on a 32-bit machine, 6 on a 64-bit machine
MASK = 0x1F  # use 0x1F on a 32-bit machine, 0x3F on a 64-bit machine


class BitBucket(object):
    def __init__(self):
        self._unique_key_count = 0  # number of distinct keys seen
        self._total_key_count = 0   # total number of keys added
        self._bit = {}
        self._map = {'a': '1', 'b': '2', 'c': '3', 'd': '4', 'e': '5', 'f': '6'}

    def set(self, value):
        """Set the bit for value; return 1 if it was already set, 0 otherwise."""
        value = self._translate(value)
        self._total_key_count += 1
        if not self._has_key(value):
            self._unique_key_count += 1
            key = value >> SHIFT
            self._bit[key] = self._bit.get(key, 0) | (1 << (value & MASK))
            return 0
        return 1

    def exist(self, value):
        value = self._translate(value)
        if self._has_key(value):
            return True
        return False

    def clear(self, value):
        value = self._translate(value)
        if self._has_key(value):
            self._unique_key_count -= 1
            self._total_key_count -= 1
            key = value >> SHIFT
            self._bit[key] = self._bit[key] & (~(1 << (value & MASK)))
            return True
        return False

    def get_total_count(self):
        return self._total_key_count

    def get_bit_count(self):
        return self._unique_key_count

    def _has_key(self, value):
        key = value >> SHIFT
        return self._bit.get(key, 0) & (1 << (value & MASK))

    def _translate(self, value):
        # Map the hex letters a-f to the digits 1-6 so the whole 40-character
        # digest can be parsed as one long integer.
        value = value.lower()
        return long(''.join([self._map.get(c, c) for c in value]))


if __name__ == '__main__':
    bitBucket = BitBucket()
    bitBucket.set("a" * 40)
    print bitBucket.exist("a" * 40)
    print bitBucket.exist("b" * 40)
    bitBucket.clear("a" * 40)

    import hashlib
    for i in range(1, 27):
        a = chr(i)
        sha1 = hashlib.sha1()
        sha1.update(a)
        bitBucket.set(sha1.hexdigest())

    print bitBucket.get_total_count()
    print bitBucket.get_bit_count()

    count = 0
    for i in range(1, 30):
        a = chr(i)
        sha1 = hashlib.sha1()
        sha1.update(a)
        if bitBucket.exist(sha1.hexdigest()):
            count += 1
    assert count == bitBucket.get_bit_count()
Alternatively, you could use a trie (dictionary tree); implementing it in C++ would further improve both speed and memory usage.
Assume the 5 million known hashes are stored as a dictionary source_dict and the hashes to be checked are in a list hash_list; then: result = [item for item in hash_list if item in source_dict]
source_dict must be loaded into memory first. If that uses too much memory, you can take source_dict.keys() to get the key list first, call it source_keys, and then: result = [item for item in hash_list if item in source_keys].
Since a dictionary lookup is O(1) and a list lookup is O(n), and the data here runs to 5 million entries, the first method is recommended.
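Below is a minimal sketch of that lookup approach, assuming source_dict already holds the 5 million hashes as keys and hash_list is the list to check; the helper name filter_known_hashes is only illustrative.

def filter_known_hashes(source_dict, hash_list):
    # Membership tests against a dict (or set) are O(1) on average,
    # so the whole scan costs O(len(hash_list)).
    return [item for item in hash_list if item in source_dict]

# If the values are not needed, a plain set of the keys gives the same
# lookup speed with less overhead:
# source_keys = set(source_dict)
# result = [item for item in hash_list if item in source_keys]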
Using a Bloom filter introduces a certain false-positive rate; whether that is acceptable depends on your project. If it is, this is the best choice.
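A rough sketch of such a Bloom filter, using hashlib to derive the bit positions; the size and number of hash functions below are illustrative, not tuned for your data.

import hashlib

class SimpleBloomFilter(object):
    """Minimal Bloom filter sketch: k bit positions per key derived from SHA-1."""

    def __init__(self, size_in_bits=50000000, num_hashes=7):
        self.size = size_in_bits          # number of bits in the filter
        self.num_hashes = num_hashes      # how many positions each key sets
        self.bits = bytearray(size_in_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by hashing the key with a salt.
        for i in range(self.num_hashes):
            digest = hashlib.sha1((str(i) + key).encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # May report false positives, but never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

As a rule of thumb, about 10 bits per key (50 million bits for 5 million keys) with around 7 hash functions gives a false-positive rate on the order of 1%.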
If that is not acceptable, use a trie instead; Marisa (the marisa-trie package) is recommended to save space.
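If you go the trie route, a sketch with the marisa-trie package might look like this (keys must be unicode strings; the sample keys are placeholders for the real 5 million hashes).

import marisa_trie

known_hashes = [u"a" * 40, u"b" * 40]     # placeholders for the real keys
trie = marisa_trie.Trie(known_hashes)     # static, very compact trie

print(u"a" * 40 in trie)                  # True
print(u"c" * 40 in trie)                  # False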
My first reaction is to use a tuple, but I am not sure how efficient it would be; you could give it a try. t should be a tuple of the unique hash keys.
Definitely go with a Bloom filter: it is simple to implement, uses little memory, and, most importantly, it is fast.
Java version
The method in the link below is for reference: https://github.com/qiwsir/algorithm/blob/master/same_element_in_list.md
You could also try solving it with MapReduce; see:
Implementing MapReduce with multiprocessing
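As a rough illustration of that idea (not the code from the linked article), the membership check can be split across a multiprocessing pool: each worker filters one chunk ("map") and the partial results are concatenated ("reduce"). The sketch assumes a Unix-like fork so the workers inherit known_hashes; all names are illustrative.

from multiprocessing import Pool

known_hashes = set()   # the 5 million known hashes would be loaded here

def check_chunk(chunk):
    # "map" step: keep only the hashes that are already known
    return [h for h in chunk if h in known_hashes]

def find_known(hash_list, workers=4, chunk_size=100000):
    chunks = [hash_list[i:i + chunk_size]
              for i in range(0, len(hash_list), chunk_size)]
    pool = Pool(workers)
    try:
        partial = pool.map(check_chunk, chunks)
    finally:
        pool.close()
        pool.join()
    # "reduce" step: flatten the per-chunk results
    return [h for part in partial for h in part]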
Use the bsddb module; although it is not part of the standard library, it is a common Python module.
Alternatively, when storing on disk, the object can also be pickled and used directly as a key.
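A small sketch of the bsddb idea (Python 2; the file name is made up): store each hash as a key on disk and test membership with has_key.

import bsddb

db = bsddb.hashopen('known_hashes.db', 'c')   # create the file if it does not exist
db['a' * 40] = ''                             # the value is unused; only the key matters

print(db.has_key('a' * 40))                   # True
print(db.has_key('b' * 40))                   # False

db.close()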
One thought: Python's object model means it will never be as memory-frugal as C; every str carries some extra per-object overhead.
Ultimately, what needs to be considered is the architecture; these days there is rarely any need to hand-roll the algorithm yourself.
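A quick way to see that overhead (exact numbers vary by Python version and platform):

import sys

digest = "a" * 40
print(len(digest))                      # 40 characters of payload
print(sys.getsizeof(digest))            # noticeably more than 40 bytes in memory
print(sys.getsizeof(int(digest, 16)))   # the same value as an integer is usually smaller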
If it is a 40-character hexadecimal hash (my guess is SHA-1), that is somewhat wasteful for only 5 million records.
In other words, instead of thinking about how to index a 40-character hexadecimal string, it is better to think about how to index a collection of 5 million strings.
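To make the point concrete (a sketch with a made-up record): a SHA-1 digest carries only 20 bytes of information, so storing the raw bytes instead of the 40-character hex string roughly halves the per-key payload.

import binascii
import hashlib

hex_digest = hashlib.sha1(b"some record").hexdigest()   # 40 hex characters
raw_digest = binascii.unhexlify(hex_digest)             # the same digest as 20 raw bytes

known = set()
known.add(raw_digest)

print(binascii.unhexlify(hex_digest) in known)          # True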