Bloom filter is a magical data structure. This article will give you an in-depth understanding of Bloom filter and introduce the method of using Bloom filter in Redis.
The Bloom filter is a magical data structure,can Used to determine whether an element is in a collection. A very commonly used function is to remove duplicates. A common requirement among crawlers: There are thousands of target website URLs. How to determine whether a crawler has favored a certain URL? To put it simply, every time the crawler collects a URL, it can store the URL in the database. Every time a new URL comes over, it will go to the database to query whether it has been accessed before. [Related recommendations: Redis Video Tutorial]
select id from table where url = 'https://jaychen.cc'
But as the crawler crawls more and more URLs, the database must be accessed once before each request, and for this kind of string SQL query efficiency is not high. In addition to the database, using the set structure of Redis can also meet this requirement, and its performance is better than that of the database. But Redis also has a problem: it consumes too much memory. At this time, the Bloom filter appears very horizontally: let me answer this question.
Compared with databases and Redis, using Bloom filters can effectively avoid performance and memory usage problems.
The Bloom filter is essentially a bit array. A bit array means that each element of the array only occupies 1 bit. Each element can only be 0 or 1. In this way, applying for a bit array of 10000 elements only takes up 10000 / 8 = 1250 B of space. In addition to a bit array, the Bloom filter also has K hash functions. When an element is added to the Bloom filter, the following operations will be performed:
For example, assume that the Bloom filter has 3 hash functions: f1, f2, f3 and a bit array arr
. Now we need to insert https://jaychen.cc
into the Bloom filter:
When you want to determine whether a value is in the Bloom filter, perform a hash calculation on the element again. After getting the value, determine whether each element in the bit array is 1. If the values are all 1, then it means that this value is in the Bloom filter. If there is a value that is not 1, it means that the element is not in the Bloom filter.
If you can’t understand the text, please look at the explanation of the soul painter’s picture below
After reading the above explanation, you will definitely come up with a Problem: When more elements are inserted, the more positions in the bit array are set to 1. When an element is not in the Bloom filter, after hash calculation, the value obtained is queried in the bit array, and there is Perhaps these locations are also set to 1. Such an object that does not exist in the Bloom filter may also be misjudged as being in the Bloom filter. But if the Bloom filter determines that an element is not in the Bloom filter, then this value must not be in the Bloom filter. To put it simply:
The defect of this Bloom filter is put into the requirements of the crawler above. There may be some unvisited URLs that may be misjudged as visited, but if they are visited URLs, they must be It will not be mistakenly judged as not visited.
redis added the module function in version 4.0. Bloom filters can be added to redis in the form of modules. Therefore, if you use redis 4.0 or above, you can use the bloom filter in redis by loading module. But this is not the simplest way. You can use docker to experience bloom filters directly in redis.
> docker run -d -p 6379:6379 --name bloomfilter redislabs/rebloom > docker exec -it bloomfilter redis-cli
redis Bloom filter mainly has two commands:
bf.add
Add elements to the Bloom filter: bf. add urls https://jaychen.cc
. bf.exists
Determine whether an element is in the filter: bf.exists urls https://jaychen.cc
. As mentioned above, there are misjudgments in the Bloom filter. There are two values in redis that determine the accuracy of the Bloom filter:
error_rate
:允许布隆过滤器的错误率,这个值越低过滤器的位数组的大小越大,占用空间也就越大。initial_size
:布隆过滤器可以储存的元素个数,当实际存储的元素个数超过这个值之后,过滤器的准确率会下降。redis 中有一个命令可以来设置这两个值:
bf.reserve urls 0.01 100
三个参数的含义:
error_rate
的值。initial_size
的值。使用这个命令要注意一点:执行这个命令之前过滤器的名字应该不存在,如果执行之前就存在会报错:(error) ERR item exists
更多编程相关知识,请访问:编程入门!!
The above is the detailed content of What is a bloom filter? How to use it in Redis?. For more information, please follow other related articles on the PHP Chinese website!