算法 - Python检测是否已有数据
大家讲道理
大家讲道理 2017-04-17 13:17:05
0
10
717

现在的实现是一个字典类型,拥有500万条数据,KEY是40位的Hash
做的是从里面确定某个Hash是否存在,但是这样的方法内存占用太多了

准备尝试bloomfilter替换但是感觉增加数据有点麻烦,是否有其他类似的算法可以用?

==== 另一种介绍 ===
每次拿到一个HASH在列表中寻找,如果有,则停止执行,如果没有,则将该HASH添加到列表,继续重复执行。

问题在:内存/效率

大家讲道理
大家讲道理

光阴似箭催人老,日月如移越少年。

reply all(10)
Peter_Zhu

Because the hash has 40 digits and is a hexadecimal number, I replaced the letters with numbers and then converted them into numbers to save. This should save memory and the efficiency should be lower than O(n).
My code:

#!/usr/bin/env python
#-*- coding:utf-8 -*-

SHIFT = 5  # 如果计算机为32位,SHIFT为5;如果计算机为64位,SHIFT为6
MASK = 0x1F  # 如果计算机为32位,MASK为0x1F;如果计算机为64位,MASK为0x3F

class BitBucket(object):
    def __init__(self):
        self._unique_key_count = 0   # 唯一的key有多少个
        self._total_key_count = 0    # 加入的key有多少个
        self._bit = {}
        self._map = {'a': '1', 'b': '2', 'c': '3', 'd': '4', 'e': '5', 'f':'6'}

    def set(self, value):
        """return last bit"""
        value = self._translate(value)
        self._total_key_count += 1

        if not self._has_key(value):
            self._unique_key_count += 1
            key = value >> SHIFT
            self._bit[key] = self._bit.get(key, 0) | (1 << (value & MASK))
            return 0
        return 1

    def exist(self, value):
        value = self._translate(value)
        if self._has_key(value):
            return True
        return False

    def clear(self, value):
        value = self._translate(value)
        if self._has_key(value):
            self._unique_key_count -= 1
            self._total_key_count -= 1

            key = value >> SHIFT
            self._bit[key] = self._bit[key] & (~(1 << (value & MASK)))
            return True
        return False

    def get_total_count(self):
        return self._total_key_count

    def get_bit_count(self):
        return self._unique_key_count

    def _has_key(self, value):
        key = value >> SHIFT
        return self._bit.get(key, 0) & (1 << (value & MASK))

    def _translate(self, value):
        value = value.lower()
        return long(''.join([self._map.get(c, c) for c in value]))

if __name__ == '__main__':
    bitBucket = BitBucket()
    bitBucket.set("a"*40)
    print bitBucket.exist("a" * 40)
    print bitBucket.exist("b" * 40)

    bitBucket.clear("a" * 40)

    import hashlib

    for i in range(1, 27):
        a = chr(i)
        sha1 = hashlib.sha1()
        sha1.update(a)
        bitBucket.set(sha1.hexdigest())

    print bitBucket.get_total_count() 
    print bitBucket.get_bit_count()

    count = 0
    for i in range(1, 30):
        a = chr(i)
        sha1 = hashlib.sha1()
        sha1.update(a)
        if bitBucket.exist(sha1.hexdigest()):
            count += 1

    assert count == bitBucket.get_bit_count()

Or you can consider using a dictionary tree to do it. It is best to use C++ to do it, but the efficiency and memory can be improved!

Ty80

If you use bloomfilter, it will introduce a certain error rate. It depends on whether your project can be accepted. If so, this is the best choice.

If that doesn’t work, just get a trie tree. Marisa is recommended to save space.

刘奇

The first reaction is to use tuples, but I don’t know how efficient it is. Can you try it?

#!/usr/bin/env python3
data = {"a":1, "b":2, "c":3, "d":4, "a":5, "c":6}

data.keys()

t should be a unique hash key tuple.

Peter_Zhu

Decisive bloom filter, simple to implement, small memory, and most importantly, high efficiency
Java version

大家讲道理

The method in the link below is for reference: https://github.com/qiwsir/algorithm/blob/master/same_element_in_list.md

伊谢尔伦

Assume that the data with a length of 5 million is a dictionary source_dict and what needs to be judged is a list hash_list, then:
result = [item for item in hash_list if item in source_dict]

source_dict must be loaded into memory first. If it occupies memory, you can first source_dict.keys() get the key list. Assuming it is source_keys, then:
result = [item for item in hash_list if item in source_keys].

Considering that the dictionary traversal speed is O(1), the list is O(n), and the amount of data here is 5 million, so method one is recommended.

Peter_Zhu

You can try to use MapReduce to solve it, please refer to:
Implementing MapReduce with multiprocessing

小葫芦

Use the bsddb module. Although it is not a standard library, it is still a common python module.

bucket = bsddb.btopen(None)

or

bucket = bsddb.hashopen(dbfilename)

When using a disk, the storage object can also be pickled and directly used as a key

大家讲道理

Idea: python’s object mechanism determines that python will definitely not save as much memory as C. Each str will occupy an extra part of the memory

  • If it must be stored in memory, consider redis, which is a good choice regardless of algorithm or memory
  • If it can be placed on disk, bsddb should be a good choice

In the final analysis, what needs to be considered is the architecture. In this era, there is almost no need to operate the algorithm yourself

迷茫

If it is a 40-digit hexadecimal hash (I guess it may be sha1), it is a bit wasteful for 5 million data.

In other words, instead of indexing a 40-digit hexadecimal string, it is better to consider how to index a 5 million-scale string.

Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template