Algorithm - Checking in Python whether data already exists
大家讲道理 2017-04-17 13:17:05

The current implementation is a dict with 5 million entries, where each key is a 40-character hash.
The task is to check whether a given hash exists in it, but this approach uses too much memory.

I plan to try replacing it with a Bloom filter, but adding new data to one seems a bit awkward. Is there another, similar algorithm I could use?

==== Another way to describe it ====
Each time a hash comes in, look it up in the collection; if it is already there, stop; if not, add it to the collection and keep going.

The problem is: memory / efficiency.
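
As a baseline, the procedure described above amounts to roughly the following sketch with a plain Python set (seen_hashes and check_and_add are illustrative names, not from the original code); it is exactly this in-memory structure whose footprint is the problem.

seen_hashes = set()

def check_and_add(h):
    """Return True if the hash was already seen; otherwise record it and return False."""
    if h in seen_hashes:
        return True          # already present: stop here
    seen_hashes.add(h)       # not seen before: remember it and continue
    return False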


All replies (10)
Peter_Zhu

Because the hash is a 40-digit hexadecimal number, it can be converted into an integer and recorded as a single bit in a bitmap. This should save memory, and a lookup is effectively O(1).
My code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

SHIFT = 5    # 5 if the machine word is 32 bits; use 6 for 64-bit words
MASK = 0x1F  # 0x1F for 32-bit words; 0x3F for 64-bit words

class BitBucket(object):
    """Sparse bitmap: each dict entry holds one machine word of presence bits."""

    def __init__(self):
        self._unique_key_count = 0   # how many distinct keys are stored
        self._total_key_count = 0    # how many keys have been added, duplicates included
        self._bit = {}

    def set(self, value):
        """Set the bit for value; return 1 if it was already present, otherwise 0."""
        value = self._translate(value)
        self._total_key_count += 1

        if not self._has_key(value):
            self._unique_key_count += 1
            key = value >> SHIFT
            self._bit[key] = self._bit.get(key, 0) | (1 << (value & MASK))
            return 0
        return 1

    def exist(self, value):
        value = self._translate(value)
        return bool(self._has_key(value))

    def clear(self, value):
        value = self._translate(value)
        if self._has_key(value):
            self._unique_key_count -= 1
            self._total_key_count -= 1

            key = value >> SHIFT
            self._bit[key] = self._bit[key] & ~(1 << (value & MASK))
            return True
        return False

    def get_total_count(self):
        return self._total_key_count

    def get_bit_count(self):
        return self._unique_key_count

    def _has_key(self, value):
        key = value >> SHIFT
        return self._bit.get(key, 0) & (1 << (value & MASK))

    def _translate(self, value):
        # Interpret the 40-character hash as a hexadecimal integer.
        # (Mapping a-f to single decimal digits would make e.g. "a" and "1" collide.)
        return int(value.lower(), 16)

if __name__ == '__main__':
    bitBucket = BitBucket()
    bitBucket.set("a" * 40)
    print(bitBucket.exist("a" * 40))
    print(bitBucket.exist("b" * 40))

    bitBucket.clear("a" * 40)

    import hashlib

    for i in range(1, 27):
        sha1 = hashlib.sha1()
        sha1.update(chr(i).encode())
        bitBucket.set(sha1.hexdigest())

    print(bitBucket.get_total_count())
    print(bitBucket.get_bit_count())

    count = 0
    for i in range(1, 30):
        sha1 = hashlib.sha1()
        sha1.update(chr(i).encode())
        if bitBucket.exist(sha1.hexdigest()):
            count += 1

    assert count == bitBucket.get_bit_count()

Alternatively, you could use a trie (dictionary tree); implementing it in C++ would improve both speed and memory further.

Ty80

A Bloom filter introduces a certain false-positive rate; whether that is acceptable depends on your project. If it is, a Bloom filter is the best choice.
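
For illustration only (not Ty80's code), here is a minimal hand-rolled Bloom filter sketch; the bit count m, the number of hash functions k, and the class name are assumptions, and in practice a library such as pybloom would usually be used instead.

import hashlib

class SimpleBloomFilter(object):
    """Minimal Bloom filter sketch: m bits, k hash functions derived from SHA-1."""

    def __init__(self, m=2 ** 27, k=7):
        self.m = m      # number of bits in the filter (assumed size)
        self.k = k      # number of hash functions (assumed count)
        self.bits = 0   # a Python int doubles as an arbitrary-length bit array

    def _positions(self, item):
        # Derive k bit positions by hashing the item together with a counter.
        for i in range(self.k):
            digest = hashlib.sha1(("%d:%s" % (i, item)).encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = SimpleBloomFilter()
bf.add("a" * 40)
print("a" * 40 in bf)   # True
print("b" * 40 in bf)   # False with high probability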

If not, use a trie instead. Marisa is recommended because it saves space.
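
If the Marisa route is taken, usage with the marisa_trie Python package might look roughly like this (the sample data and file name are illustrative):

import marisa_trie

# Illustrative data; in practice this would be the 5 million hash strings.
hashes = [u"a" * 40, u"b" * 40]

# Build a static, space-efficient trie (keys must be unicode strings).
trie = marisa_trie.Trie(hashes)

print(u"a" * 40 in trie)   # True
print(u"c" * 40 in trie)   # False

# The trie can be saved to disk and loaded later instead of being rebuilt.
trie.save("hashes.marisa")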

刘奇

My first reaction was to use the dict's keys, but I don't know how efficient that is. Can you try it?

#!/usr/bin/env python3
data = {"a": 1, "b": 2, "c": 3, "d": 4, "a": 5, "c": 6}

t = data.keys()

t should then hold the unique hash keys.

Peter_Zhu

Definitely go with a Bloom filter: simple to implement, small memory footprint, and most importantly, efficient.
Java version

大家讲道理

The method in the link below is for reference: https://github.com/qiwsir/algorithm/blob/master/same_element_in_list.md

伊谢尔伦

Suppose the 5 million entries are in a dictionary source_dict and the hashes to check are in a list hash_list; then:
result = [item for item in hash_list if item in source_dict]

source_dict has to be loaded into memory first. If that uses too much memory, you can take the key list first with source_dict.keys(); call it source_keys, then:
result = [item for item in hash_list if item in source_keys]

Since a membership test on a dictionary is O(1) while on a list it is O(n), and there are 5 million entries here, the first method is recommended.

Peter_Zhu

You could try solving it with MapReduce; for reference, see:
Implementing MapReduce with multiprocessing
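
A rough sketch of that idea, splitting the membership checks over a process pool (names are illustrative and not taken from the linked article):

from multiprocessing import Pool

known_hashes = set()   # illustrative: would be loaded with the 5 million hashes

def exists(h):
    # "Map" step: check one hash against the known set.
    return h, h in known_hashes

if __name__ == "__main__":
    incoming = ["a" * 40, "b" * 40]                   # hashes to test (illustrative)
    with Pool(processes=4) as pool:
        results = dict(pool.map(exists, incoming))    # "reduce" step: collect results
    print(results)

Note that each worker process holds its own copy of the set, so this parallelizes the checks but does not by itself reduce memory.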

小葫芦

Use the bsddb module. Although it is no longer in the standard library, it is still a common Python module.

bucket = bsddb.btopen(None)

or

bucket = bsddb.hashopen(dbfilename)

When it is backed by disk, the object to be stored can also be pickled and used directly as the key.
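
For illustration, the same idea with the standard dbm module as a stand-in (bsddb itself is not shipped with Python 3); the file name is illustrative:

import dbm

# Open (or create) a disk-backed key/value store; keys and values are stored as bytes.
with dbm.open("hashes.db", "c") as bucket:
    h = "a" * 40
    if h in bucket:       # the membership test hits the disk, not RAM
        print("already seen")
    else:
        bucket[h] = "1"   # store the hash itself as the key, with a tiny marker value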

大家讲道理

My take: Python's object model means it will never be as memory-efficient as C; every str carries extra overhead.

  • If the data must stay in memory, consider Redis, which is a good choice both for the algorithm and for memory (a sketch follows below)
  • If the data can live on disk, bsddb should be a good choice

In the end this is an architecture question; these days there is rarely a need to hand-roll the algorithm yourself.
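
With Redis the whole check collapses into a set command; a sketch with the redis-py client (the connection details and the key name seen_hashes are illustrative):

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def check_and_add(h):
    # SADD returns the number of elements actually added; 0 means the hash was already there.
    return r.sadd("seen_hashes", h) == 0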

迷茫

If it is a 40-digit hexadecimal hash (my guess is SHA-1), that is rather wasteful for only 5 million records.

In other words, rather than indexing by the full 40-digit hexadecimal string, it is better to think about how to index a collection on the scale of 5 million strings.
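
One possible reading of that suggestion (my interpretation, not spelled out above): with only 5 million items, a much shorter key is already almost certainly unique, so a 64-bit integer prefix of each hash could be stored instead of the full 40-character string.

seen_prefixes = set()

def check_and_add(h):
    # Keep only the first 16 hex characters (64 bits) as an int;
    # with 5 million entries the chance of a prefix collision is tiny,
    # but like a Bloom filter this is a probabilistic shortcut.
    prefix = int(h[:16], 16)
    if prefix in seen_prefixes:
        return True
    seen_prefixes.add(prefix)
    return False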
