Implementing Bloom Filter algorithm in PHP_PHP tutorial-PHP Tutorial-php.cn

Implementing the Bloom Filter algorithm in PHP

This article mainly introduces the implementation of the Bloom Filter algorithm in PHP. This article directly gives the implementation code, and gives detailed comments in the code. Bloom Filter For algorithm introduction and other content, friends who need it can refer to it

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

/*Bloom Filter algorithm de-filters.

Introducing the basic processing idea of Bloom Filter: apply for a batch of space to store 0 1 information, and then determine the corresponding position of the element based on a batch of hash functions. If the value of the corresponding position of each hash function is all 1, Indicates that this element exists. On the contrary, if it is 0, the value of the corresponding position is set to 1. Since different elements may have the same hash value, that is, the information of multiple elements may be stored in the same location, resulting in a certain misjudgment rate.

If the application space is too small, as the number of elements increases, there will be more and more 1s, and the chance of conflict between each element will increase, resulting in an increasing misjudgment rate. In addition, the selection and number of hash functions must be well balanced. Although multiple hash functions can provide accuracy of judgment, they will reduce the processing speed of the program, and the increase of hash functions requires more space. Store location information.

Application of Bloom-Filter.

Bloom-Filter is generally used to determine whether an element exists in a large data set. For example, spam filters in mail servers. In the field of search engines, Bloom-Filter is most commonly used for URL filtering by web spiders. Web spiders usually have a URL list that saves the URLs of web pages to be downloaded and those that have been downloaded. When a web spider downloads a web page, it extracts the URL from the web page. After extracting the new URL, you need to determine whether the URL already exists in the list. At this time, the Bloom-Filter algorithm is the best choice.

For example, a public email (email) provider like Yahoo, Hotmail and Gmai always needs to filter spam from spammers. One way is to record the email addresses from which spam is sent. Since senders are constantly registering new addresses, there are at least billions of spam addresses in the world, and storing them all would require a large number of network servers.

The Bloom filter was proposed by Barton Bloom in 1970. It's actually a long binary vector and a series of random mapping functions. We use the above example to illustrate the working principle.

Suppose we store 100 million email addresses. We first create a 1.6 billion binary bits (bits), which is a 200 million byte vector, and then set all 1.6 billion binary bits to zero. For each email address X, we use eight different random number generators (F1, F2, ..., F8) to generate eight message fingerprints (f1, f2, ..., f8). Then use a random number generator G to map these eight information fingerprints to eight natural numbers g1, g2, ..., g8 from 1 to 1.6 billion. Now we set all the binary bits in these eight positions to one. When we perform this processing on all 100 million email addresses. A bloom filter for these email addresses is built. (See image below) Now, let us see how to use Bloom filter to detect whether a suspicious email address Y is in the blacklist. We use the same eight random number generators (F1, F2, ..., F8) to generate eight information fingerprints s1, s2, ..., s8 for this address, and then map these eight fingerprints to Bloom filtering The eight binary bits of the device are t1, t2,...,t8. If Y is in the blacklist, obviously, the eight binary numbers corresponding to t1, t2,...,t8 must be one. In this way, we can accurately find any email address on the blacklist.

Bloom filter will never miss any suspicious addresses in the blacklist. However, it has one drawback. That is, it has a very small possibility of determining that an email address that is not on the blacklist is on the blacklist, because it is possible that a good email address happens to correspond to eight binary bits that are all set to one. Fortunately, this possibility is very small. We call this the probability of misrecognition. In the above example, the probability of misidentification is less than 1 in 10,000.

The advantage of Bloom filter is that it is fast and saves space. But there is a certain misrecognition rate. A common remedy is to create a small whitelist of email addresses that might otherwise be misidentified.

//Use php program to describe the above algorithm

$set = array(1,2,3,4,5,6);

// Determine whether 5 is in $set

$bloomFiter = array(0,0,0,0,0,0,0,0,0,0,0);

// Change the $bloomFiter median array to represent the set through some algorithm. Here we use a simple algorithm to change the corresponding value in the set to the position in bloom to 1

//The algorithm is as follows

foreach($set as $key){

$bloomFiter[$key] = 1 ;

}

var_dump($bloomFiter) ;

//At this time $bloomFiter = array(1,1,1,1,1,1);

//Judge whether it is in the set

if($bloomFiter[9] ==1){

echo 'in set';

}else{

echo 'not in set' ;

}

// The above is just a simple example. In fact, several hash algorithms are needed, but on the other hand, if the number of hash functions is small, then there will be more 0s in the bit array

class bloom_filter {

function __construct($hash_func_num=1, $space_group_num=1) {

$max_length = pow(2, 25);

$binary = pack('C', 0);

//1 byte occupies 8 bits

$this->one_num = 8;

//Default 32m*1

$this->space_group_num = $space_group_num;

$this->hash_space_assoc = array();

//Allocate space

for($i=0; $i<$this->space_group_num; $i ){

$this->hash_space_assoc[$i] = str_repeat($binary, $max_length);

}

$this->pow_array = array(

0 => 1,

1 => 2,

2 => 4,

3 => 8,

4 => 16,

5 => 32,

6 => 64,

7 => 128,

);

$this->chr_array = array();

$this->ord_array = array();

for($i=0; $i<256; $i ){

$chr = chr($i);

$this->chr_array[$i] = $chr;

$this->ord_array[$chr] = $i;

}

$this->hash_func_pos = array(

0 => array(0, 7, 1),

1 => array(7, 7, 1),

2 => array(14, 7, 1),

3 => array(21, 7, 1),

4 => array(28, 7, 1),

5 => array(33, 7, 1),

6 => array(17, 7, 1),

);

$this->write_num = 0;

$this->ext_num = 0;

if(!$hash_func_num){

$this->hash_func_num = count($this->hash_func_pos);

}

else{

$this->hash_func_num = $hash_func_num;

}

function add($key) {

$hash_bit_set_num = 0;

// Discrete key

$hash_basic = sha1($key);

//Intercept the first 4 digits, and then convert hexadecimal to decimal

$hash_space = hexdec(substr($hash_basic, 0, 4));

//Module

$hash_space = $hash_space % $this->space_group_num;

for($hash_i=0; $hash_i<$this->hash_func_num; $hash_i ){

$hash = hexdec(substr($hash_basic, $this->hash_func_pos[$hash_i][0], $this->hash_func_pos[$hash_i][1]));

$bit_pos = $hash >> 3;

$max = $this->ord_array[$this->hash_space_assoc[$hash_space][$bit_pos]];

$num = $hash - $bit_pos * $this->one_num;

$bit_pos_value = ($max >> $num) & 0x01;

if(!$bit_pos_value){

$max = $max | $this->pow_array[$num];

$this->hash_space_assoc[$hash_space][$bit_pos] = $this->chr_array[$max];

$this->write_num ;

}

else{

$hash_bit_set_num ;

}

if($hash_bit_set_num == $this->hash_func_num){

$this->ext_num ;

return true;

}

return false;

}

function get_stat() {

return array(

'ext_num' => $this->ext_num,

'write_num' => $this->write_num,

);

}

//test

//Get 6 hash values, currently up to 7

$hash_func_num = 6;

//Allocate 1 storage space, each space is 32M. Theoretically, the larger the space, the lower the false positive rate. Pay attention to the usable memory limit in php.ini

$space_group_num = 1;

$bf = new bloom_filter($hash_func_num, $space_group_num);

$list = array(

'http://test/1',

'http://test/2',

'http://test/3',

'http://test/4',

'http://test/5',

'http://test/6',

'http://test/1',

'http://test/2',

);