This article mainly introduces the implementation of the Bloom Filter algorithm in PHP. This article directly gives the implementation code, and gives detailed comments in the code. Bloom Filter For algorithm introduction and other content, friends who need it can refer to it
?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 |
/*Bloom Filter algorithm de-filters.
Introducing the basic processing idea of Bloom Filter: apply for a batch of space to store 0 1 information, and then determine the corresponding position of the element based on a batch of hash functions. If the value of the corresponding position of each hash function is all 1, Indicates that this element exists. On the contrary, if it is 0, the value of the corresponding position is set to 1. Since different elements may have the same hash value, that is, the information of multiple elements may be stored in the same location, resulting in a certain misjudgment rate.
If the application space is too small, as the number of elements increases, there will be more and more 1s, and the chance of conflict between each element will increase, resulting in an increasing misjudgment rate. In addition, the selection and number of hash functions must be well balanced. Although multiple hash functions can provide accuracy of judgment, they will reduce the processing speed of the program, and the increase of hash functions requires more space. Store location information.
Application of Bloom-Filter. Bloom-Filter is generally used to determine whether an element exists in a large data set. For example, spam filters in mail servers. In the field of search engines, Bloom-Filter is most commonly used for URL filtering by web spiders. Web spiders usually have a URL list that saves the URLs of web pages to be downloaded and those that have been downloaded. When a web spider downloads a web page, it extracts the URL from the web page. After extracting the new URL, you need to determine whether the URL already exists in the list. At this time, the Bloom-Filter algorithm is the best choice. For example, a public email (email) provider like Yahoo, Hotmail and Gmai always needs to filter spam from spammers. One way is to record the email addresses from which spam is sent. Since senders are constantly registering new addresses, there are at least billions of spam addresses in the world, and storing them all would require a large number of network servers.
The Bloom filter was proposed by Barton Bloom in 1970. It's actually a long binary vector and a series of random mapping functions. We use the above example to illustrate the working principle.
Suppose we store 100 million email addresses. We first create a 1.6 billion binary bits (bits), which is a 200 million byte vector, and then set all 1.6 billion binary bits to zero. For each email address X, we use eight different random number generators (F1, F2, ..., F8) to generate eight message fingerprints (f1, f2, ..., f8). Then use a random number generator G to map these eight information fingerprints to eight natural numbers g1, g2, ..., g8 from 1 to 1.6 billion. Now we set all the binary bits in these eight positions to one. When we perform this processing on all 100 million email addresses. A bloom filter for these email addresses is built. (See image below) Now, let us see how to use Bloom filter to detect whether a suspicious email address Y is in the blacklist. We use the same eight random number generators (F1, F2, ..., F8) to generate eight information fingerprints s1, s2, ..., s8 for this address, and then map these eight fingerprints to Bloom filtering The eight binary bits of the device are t1, t2,...,t8. If Y is in the blacklist, obviously, the eight binary numbers corresponding to t1, t2,...,t8 must be one. In this way, we can accurately find any email address on the blacklist. Bloom filter will never miss any suspicious addresses in the blacklist. However, it has one drawback. That is, it has a very small possibility of determining that an email address that is not on the blacklist is on the blacklist, because it is possible that a good email address happens to correspond to eight binary bits that are all set to one. Fortunately, this possibility is very small. We call this the probability of misrecognition. In the above example, the probability of misidentification is less than 1 in 10,000. The advantage of Bloom filter is that it is fast and saves space. But there is a certain misrecognition rate. A common remedy is to create a small whitelist of email addresses that might otherwise be misidentified.
*/
//Use php program to describe the above algorithm
$set = array(1,2,3,4,5,6); // Determine whether 5 is in $set
$bloomFiter = array(0,0,0,0,0,0,0,0,0,0,0);
// Change the $bloomFiter median array to represent the set through some algorithm. Here we use a simple algorithm to change the corresponding value in the set to the position in bloom to 1
//The algorithm is as follows
foreach($set as $key){
$bloomFiter[$key] = 1 ; }
var_dump($bloomFiter) ;
//At this time $bloomFiter = array(1,1,1,1,1,1);
//Judge whether it is in the set
if($bloomFiter[9] ==1){ echo 'in set'; }else{ echo 'not in set' ; }
// The above is just a simple example. In fact, several hash algorithms are needed, but on the other hand, if the number of hash functions is small, then there will be more 0s in the bit array
class bloom_filter {
function __construct($hash_func_num=1, $space_group_num=1) { $max_length = pow(2, 25); $binary = pack('C', 0);
//1 byte occupies 8 bits $this->one_num = 8;
//Default 32m*1 $this->space_group_num = $space_group_num; $this->hash_space_assoc = array();
//Allocate space for($i=0; $i<$this->space_group_num; $i ){ $this->hash_space_assoc[$i] = str_repeat($binary, $max_length); }
$this->pow_array = array( 0 => 1, 1 => 2, 2 => 4, 3 => 8, 4 => 16, 5 => 32, 6 => 64, 7 => 128, ); $this->chr_array = array(); $this->ord_array = array(); for($i=0; $i<256; $i ){ $chr = chr($i); $this->chr_array[$i] = $chr; $this->ord_array[$chr] = $i; }
$this->hash_func_pos = array( 0 => array(0, 7, 1), 1 => array(7, 7, 1), 2 => array(14, 7, 1), 3 => array(21, 7, 1), 4 => array(28, 7, 1), 5 => array(33, 7, 1), 6 => array(17, 7, 1), );
$this->write_num = 0; $this->ext_num = 0;
if(!$hash_func_num){ $this->hash_func_num = count($this->hash_func_pos); } else{ $this->hash_func_num = $hash_func_num; } }
function add($key) { $hash_bit_set_num = 0; // Discrete key $hash_basic = sha1($key); //Intercept the first 4 digits, and then convert hexadecimal to decimal $hash_space = hexdec(substr($hash_basic, 0, 4)); //Module $hash_space = $hash_space % $this->space_group_num;
for($hash_i=0; $hash_i<$this->hash_func_num; $hash_i ){ $hash = hexdec(substr($hash_basic, $this->hash_func_pos[$hash_i][0], $this->hash_func_pos[$hash_i][1])); $bit_pos = $hash >> 3; $max = $this->ord_array[$this->hash_space_assoc[$hash_space][$bit_pos]]; $num = $hash - $bit_pos * $this->one_num; $bit_pos_value = ($max >> $num) & 0x01; if(!$bit_pos_value){ $max = $max | $this->pow_array[$num]; $this->hash_space_assoc[$hash_space][$bit_pos] = $this->chr_array[$max]; $this->write_num ; } else{ $hash_bit_set_num ; } } if($hash_bit_set_num == $this->hash_func_num){ $this->ext_num ; return true; } return false; }
function get_stat() { return array( 'ext_num' => $this->ext_num, 'write_num' => $this->write_num, ); } }
//test //Get 6 hash values, currently up to 7 $hash_func_num = 6;
//Allocate 1 storage space, each space is 32M. Theoretically, the larger the space, the lower the false positive rate. Pay attention to the usable memory limit in php.ini $space_group_num = 1;
$bf = new bloom_filter($hash_func_num, $space_group_num);
$list = array( 'http://test/1', 'http://test/2', 'http://test/3', 'http://test/4', 'http://test/5', 'http://test/6', 'http://test/1', 'http://test/2', ); foreach($list as $k => $v){
if($bf->add($v)){ echo $v, "n"; } } |