Client: Does this key exist?
Server: Definitely does not exist / may exist (don’t know for sure)
The Bloom filter is a clever probabilistic data structure that supports efficient insertion and querying. When we want to check whether a key exists in some collection, a Bloom filter can quickly tell us that "this key definitely does not exist, or it may exist." Compared with traditional data structures such as List, Set, and Map, it is faster and takes up far less space, but its answers are probabilistic rather than exact.
Bloom filters are used only to test membership in a collection. The classic use case is to improve efficiency by avoiding expensive disk (or network) lookups for non-existent keys. A Bloom filter checks a key in O(k) time, where k is the number of hash functions. Since k is a small fixed constant, the cost does not grow with the number of stored elements, so testing for the non-existence of a key is very fast.
To improve access efficiency, we often put some data in a Redis cache. A query can then be served from the cache without reading the database, which effectively improves performance.
When querying data, first check whether the data is in the cache. If it is, return it directly from the cache.
If it is not, you need to read the data from the database and then put it into the cache. If a large number of requests miss the cache, they put heavy pressure on the database and can even bring it down. With a Bloom filter in front, a request for a key that does not exist can be rejected immediately, which protects both the cache and the database.
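The lookup flow described above can be sketched as follows. This is only an illustration under stated assumptions: plain dictionaries stand in for Redis and the database, and `TinyBloom`, `get`, and `db_reads` are hypothetical names invented for this example, not part of any Redis API.

```python
import hashlib


class TinyBloom:
    """Minimal Bloom filter, present only to make this sketch runnable."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive k positions by salting a single hash (an illustrative choice).
        for i in range(self.k):
            digest = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


database = {"user:1": "Alice", "user:2": "Bob"}  # stand-in for the real database
cache = {}                                        # stand-in for Redis
db_reads = 0                                      # counts how often the database is hit

bloom = TinyBloom()
for existing_key in database:
    bloom.add(existing_key)


def get(key):
    """Return the value for key, consulting the Bloom filter first."""
    global db_reads
    if not bloom.might_contain(key):
        return None            # definitely absent: skip both cache and database
    if key in cache:
        return cache[key]      # cache hit
    db_reads += 1
    value = database.get(key)  # may still be None if the filter gave a false positive
    if value is not None:
        cache[key] = value
    return value


print(get("user:1"), get("user:1"), get("user:999"), "database reads:", db_reads)
```

Note that the filter has to be kept in sync with the database: every key written to the database must also be added to the filter, otherwise valid keys would be rejected.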
HBase stores a very large amount of data. To determine whether a certain row key or column exists, it can use a Bloom filter to get a quick answer. There is a certain false positive rate, but when the filter reports that a key does not exist, that answer is always correct.
For checking whether an element exists, a HashMap is very efficient: by storing the values as HashMap keys, lookups run in O(1) time on average.
However, if the amount of data is very large (for example, hundreds of millions of entries), a HashMap consumes a very large amount of memory, and it may simply be impossible to load that much data into memory at once.
The Bloom filter is a bit array (a binary vector) of length m
Every element in this array is either 0 or 1
There are k hash functions, independent of each other. When a key is added, the result of each hash function is taken modulo the array length m, and the corresponding bit is set to 1 (blue cell)
Marking the cells for every key in this way yields the "Bloom filter"
Suppose a key is given. We apply the same k hash functions to it and get k values
Check whether the cells at all k positions are blue. If any one of them is not blue, the key definitely does not exist
If all of them are blue, the key may exist (a Bloom filter can produce false positives)
This happens because, when there are many input objects and the bit array is relatively small, most positions in the array end up painted blue. When a new key is then checked, all of its positions may happen to have been set to blue by other keys, and the key is mistakenly believed to be in the collection.
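The insert and query steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class name `BloomFilter` and its method names are chosen for this example, and the k hash functions are simulated by salting `hashlib.blake2b`, which is used only because it ships with Python. The filter is deliberately made too small so that false positives show up.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m = m                            # number of bits in the array
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # packed bit array, initially all 0

    def _positions(self, key):
        # Simulate k independent hash functions by prefixing the key with i,
        # then reduce each result modulo the array length m.
        for i in range(self.k):
            digest = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, key):
        # Set the bit at each of the k positions to 1 ("paint the cell blue").
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely not present"; True means "may be present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


# A deliberately undersized filter: 40 keys in only 64 bits.
bf = BloomFilter(m=64, k=3)
present = [f"key-{i}" for i in range(40)]
for key in present:
    bf.add(key)

# Every inserted key is reported as possibly present (no false negatives).
print(all(bf.might_contain(key) for key in present))

# Keys that were never inserted can still be reported as "may exist".
absent = [f"other-{i}" for i in range(1000)]
false_positives = sum(bf.might_contain(key) for key in absent)
print(f"{false_positives} of {len(absent)} absent keys were reported as 'may exist'")
```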
Traditional Bloom filters do not support deletion. However, a variant called the Counting Bloom filter does support element deletion, and it can also be used to test whether an element's count is strictly below a certain threshold. The article on the principle and implementation of the Counting Bloom Filter covers this in great detail and is worth reading.
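To make the idea concrete, here is a rough sketch of a Counting Bloom filter, in which each cell holds a small counter instead of a single bit. The class and method names are hypothetical, and the hashing scheme (a salted `hashlib.blake2b`) is just a convenient stand-in for k independent hash functions.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counts = [0] * m   # one counter per cell instead of one bit

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.counts[pos] += 1

    def might_contain(self, key):
        return all(self.counts[pos] > 0 for pos in self._positions(key))

    def remove(self, key):
        # Refuse to decrement for keys the filter says are absent, so that
        # counters never go negative. Removing a key that was never added but
        # happens to be a false positive would still corrupt the counters.
        if not self.might_contain(key):
            return False
        for pos in self._positions(key):
            self.counts[pos] -= 1
        return True


cbf = CountingBloomFilter(m=1024, k=3)
cbf.add("apple")
cbf.add("banana")
cbf.remove("apple")
print(cbf.might_contain("banana"))  # banana's counters are still positive
```

The price of deletion is memory: each cell now needs several bits for its counter rather than one.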
Obviously, if the Bloom filter is too small, all of its bits will soon be 1, and every query will return "may exist," which defeats the purpose of filtering. The longer the Bloom filter, the lower its false positive rate.
The number of hash functions also involves a trade-off. The more hash functions there are, the faster the bits of the Bloom filter are set to 1 and the less effective the filter becomes; but with too few, the false positive rate also rises.
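As can be seen from the above figure, increasing the number of hash functions k greatly reduces the false positive rate p at first. The benefit levels off near k = (m/n) ln 2, and beyond that point additional hash functions make p rise again.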
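The effect of k on p can also be checked numerically. The sketch below uses the standard approximation p ≈ (1 - e^(-kn/m))^k; the choice of n = 1,000,000 elements and m = 10n bits is an arbitrary assumption made only for illustration.

```python
import math


def false_positive_rate(m, k, n):
    """Approximate false positive rate after n insertions into m bits with k hashes."""
    return (1 - math.exp(-k * n / m)) ** k


n = 1_000_000   # assumed number of inserted elements
m = 10 * n      # assumed filter size: 10 bits per element

rates = {k: false_positive_rate(m, k, n) for k in range(1, 13)}
for k, p in rates.items():
    print(f"k={k:2d}  p={p:.5f}")

best_k = min(rates, key=rates.get)
print("lowest p at k =", best_k, "| (m/n) ln 2 =", round(m / n * math.log(2), 2))
```

With these numbers the rate falls quickly for the first few hash functions, bottoms out around k = 7, and then slowly climbs again, which is the trade-off described above.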
So how do we choose values of k and m that suit the business? Don’t worry: if we specify the tolerated false positive rate p and the expected number of elements n, both parameters can be calculated with the following formula:
$$ m = -\frac{n \ln p}{(\ln 2)^2}, \qquad k = \frac{m}{n} \ln 2 $$
Conversely, given the filter size m, the number of hash functions k, and the number of inserted elements n, the false positive rate p can be estimated as follows:
$$ p \approx \left(1 - e^{-kn/m}\right)^{k} $$
k is the number of hash functions, m is the Bloom filter length, n is the number of inserted elements, and p is the false positive rate.
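As a worked example, the standard sizing formulas m = -n ln p / (ln 2)^2 and k = (m/n) ln 2 can be turned into a small helper. The function names `bloom_parameters` and `false_positive_rate` are invented for this sketch, and rounding m up and k to the nearest integer is one reasonable convention rather than the only one.

```python
import math


def bloom_parameters(n, p):
    """Return (m, k) for n expected elements and target false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # bits in the filter
    k = max(1, round(m / n * math.log(2)))                # number of hash functions
    return m, k


def false_positive_rate(m, k, n):
    """Estimate p from m, k and n using p ≈ (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


m, k = bloom_parameters(n=1_000_000, p=0.01)
print("bits:", m, "| hash functions:", k)
print("size in MB:", round(m / 8 / 1_000_000, 2))
print("estimated p:", round(false_positive_rate(m, k, 1_000_000), 4))
```

For one million elements at a 1% false positive rate this comes to a little under 9.6 bits per element, or roughly 1.2 MB in total, regardless of how long the keys themselves are.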
As for how this formula is derived, I have published an article on Zhihu about it. Read it if you are interested; if not, just remember the formula above.
One more important point is worth mentioning. Since the whole purpose of using a Bloom filter is to search faster, we can't use a slow hash function, right? Cryptographic hash functions (e.g. SHA-1, MD5) are not a good choice for Bloom filters because they are relatively slow. Better choices are faster hash functions such as MurmurHash, the FNV family, Jenkins hashes, and HashMix.
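To show what such a non-cryptographic hash looks like, here is a sketch of 64-bit FNV-1a together with one common way of deriving k bit positions from just two hash values (often called double hashing). Deriving the second hash by appending a byte to the input is an assumption made for this example, and a pure-Python loop is of course far slower than a native implementation; the code only illustrates the structure.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


def bloom_positions(key: str, k: int, m: int) -> list:
    """Derive k positions in [0, m) from two hashes: (h1 + i * h2) mod m."""
    data = key.encode("utf-8")
    h1 = fnv1a_64(data)
    # Second hash: same function over a slightly different input (an
    # illustrative choice), forced to be odd so the step is never zero.
    h2 = fnv1a_64(data + b"\x01") | 1
    return [(h1 + i * h2) % m for i in range(k)]


print(hex(fnv1a_64(b"a")))
print(bloom_positions("hello", k=7, m=9_585_059))
```

Computing two hashes and combining them is cheaper than running k separate hash functions, which matters when k is 7 or more.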
As the given example showed, a Bloom filter can be used to warn a user who enters a weak password.
You can use bloom filters to prevent users from visiting malicious websites.
Instead of querying a SQL database to check whether a user with a specific email exists, you can first do a cheap lookup in a Bloom filter. If the filter says the email doesn't exist, great! If it says the email may exist, you still have to query the database to confirm. You can do the same thing for "username already taken" checks.
You can keep a Bloom filter based on the IP address of your website visitor to check whether the user of your website is a "returning user" or a "new user". A few false positives from “returning users” can’t hurt you, right?
You can also do spell checking by tracking dictionary words using Bloom filters.