Massive Data Processing Algorithms: Bloom Filter

WBOY

Jun 07, 2016 pm 04:13 PM

Algorithm Introduction

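The Bloom filter is named after Bloom, who first proposed it. Put simply, a Bloom filter answers one question: does a given element belong to a certain set? That answer is then used to filter data. You might think this is trivial: to decide whether an element is in a set, walk through the set and compare the elements one by one. That certainly works, but when you face massive data the cost in both space and time becomes prohibitive. A better approach is clearly needed, and the Bloom filter is a good one. Read on to see how it works.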

Bloom Filter

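Let us start with the traditional way of looking up an element. Say a pile of URL strings is kept in memory, and we are given a URL and asked whether it is in that collection. We would load the whole array into memory and compare one by one. Even if each URL took only a few bytes on average, a massive data set would still blow up memory. That is the space limitation. On top of that, element-by-element traversal is brute-force search: the search time grows linearly with the size of the collection, so once the data gets large the query time also becomes unacceptable.

The Bloom filter addresses both the time and the space problem. Take space first. The original data is stored as characters; here we use a single bit instead, so one element costs 1/8 of a byte. Whether a URL is 10 characters or 100 characters long, it is represented by one bit, so ideally the bits that stand for different URLs should not collide. Because we store bits, we need a hash mapping on the data to obtain its position, and then we set the bit at that position to 1 (all bits are 0 by default). In short, a Bloom filter consists of one very long bit array plus several random hash functions. You can picture the bit array as a long row of cells, each holding 0 or 1.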

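You can imagine this array being very long. Each cell takes exactly one bit, so 1 KB of space can already represent 1024 × 8 = 8192 bits, which is a huge saving in memory. Now a question arises: why did I say "several random hash functions" rather than "one hash function"? Because of hash collisions. Even the best hash function cannot guarantee that collisions never happen, so several hash functions are used. The membership test then becomes: an element is considered to be in the set only if the positions produced by all of the hash functions are set. This greatly improves the accuracy of the test. After hashing, each element marks several positions in the bit array, one per hash function.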
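The idea described so far can be condensed into a minimal sketch. The class name `TinyBloomFilter`, the seeds, and the example URLs are illustrative assumptions and do not come from the original program; the seeded polynomial hash merely stands in for "several different hash functions".

```java
import java.util.BitSet;

/** Minimal Bloom filter sketch: one BitSet plus k seeded string hashes. */
class TinyBloomFilter {
    // Hypothetical seeds; each seed yields one BKDR-style polynomial hash.
    private static final int[] SEEDS = {31, 131, 1313};
    private final int size;
    private final BitSet bits;

    TinyBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Map a string to a bit position in [0, size) for the given seed.
    private int position(String s, int seed) {
        long hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = hash * seed + s.charAt(i);
        }
        // floorMod keeps the index non-negative even if hash overflowed.
        return (int) Math.floorMod(hash, (long) size);
    }

    /** Set the k bits that belong to this element. */
    void add(String s) {
        for (int seed : SEEDS) {
            bits.set(position(s, seed));
        }
    }

    /** True means "possibly present"; false means "definitely absent". */
    boolean mightContain(String s) {
        for (int seed : SEEDS) {
            if (!bits.get(position(s, seed))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        TinyBloomFilter filter = new TinyBloomFilter(10000);
        filter.add("http://example.com/a");
        // An element that was added is always reported as present.
        System.out.println(filter.mightContain("http://example.com/a")); // true
        // An element that was never added is usually reported as absent.
        System.out.println(filter.mightContain("http://example.com/b"));
    }
}
```

Note that the filter never stores the URLs themselves, only the bits they map to, which is where the memory saving comes from.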

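Suppose the program uses 3 independent random hash functions. Each element is hashed 3 times and marks 3 positions. Let us estimate the chance that an element is misjudged. A misjudgment means that all 3 of its positions have already been set by other elements. Assume the bit array is 10,000 bits long, so a given hash lands on a given position with probability 1/10,000. The chance that one other element maps to exactly the same 3 positions, hash for hash, is 1/10,000 × 1/10,000 × 1/10,000 = 10^-12. Taking a far more pessimistic view, the chance that a single one of our positions is hit by any of another element's 3 hashes is about 1/10,000 + 1/10,000 + 1/10,000 = 0.0003.

These are rough figures for a single other element. In general the false-positive rate grows with the number n of inserted elements: for m bits and k hash functions it is approximately (1 - e^(-kn/m))^k. The conclusion still holds: as long as the bit array is large relative to the data, 3 hash functions already keep the false-positive rate low. A 4th or 5th hash function lowers it further only while the array stays sparsely filled, because every extra hash also sets extra bits.

The next question is what to use as the bit array. An int array? A char array? Neither. The answer is below.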
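To put numbers on the estimate, here is a small sketch that evaluates the standard approximation p ≈ (1 - e^(-kn/m))^k, which assumes independent, uniformly distributed hash functions. The class and method names are hypothetical, and the values of n are chosen only for illustration.

```java
/** Estimates the Bloom filter false-positive rate p = (1 - e^(-k*n/m))^k. */
class FalsePositiveEstimate {
    /**
     * @param m number of bits in the array
     * @param k number of hash functions
     * @param n number of elements already inserted
     */
    static double rate(int m, int k, int n) {
        // Expected fraction of bits that are set after n insertions.
        double fractionOfBitsSet = 1.0 - Math.exp(-(double) k * n / m);
        // A false positive needs all k probed bits to be set.
        return Math.pow(fractionOfBitsSet, k);
    }

    public static void main(String[] args) {
        int m = 10000; // bit-array length from the example in the text
        int k = 3;     // three hash functions, as in the text
        for (int n : new int[] {1, 100, 1000, 5000}) {
            System.out.printf("n=%d -> p=%.6g%n", n, rate(m, k, n));
        }
    }
}
```

With m = 10,000 and k = 3, inserting 1,000 elements already gives a false-positive rate of about 1.7%, so the array length has to be chosen with the expected data volume in mind.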

BitSet

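BitSet is a class in Java (java.util.BitSet); in C++ the standard library offers std::bitset for the same purpose. Why choose it over the int or char arrays mentioned earlier? An int array is clearly a poor fit: one int has 32 bits and takes 4 bytes, so using a whole int to store a 0 or a 1 saves no space at all. The next idea is a char[] array. In C a char takes one byte, while in Java a char is a 16-bit UTF-16 code unit and takes 2 bytes, so using char only halves the space compared with int and still does not achieve one bit per element. After some searching I found that Java has the built-in BitSet class, made specifically for bit storage. It also supports many bit-level operations, and it is indexed like an array, starting from 0. Readers who are not familiar with it can look up the documentation. An int array can in fact achieve the same thing, but you have to do the conversion yourself and treat each int as 32 bits. I wrote an earlier article on this, about storing big data with the bitmap method.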
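As a rough illustration of the "treat each int as 32 bits" approach, the sketch below implements set and get on top of an int[] and compares it with java.util.BitSet. The class name `IntBitmap` is hypothetical and is not taken from the earlier article mentioned above.

```java
import java.util.BitSet;

/** Bit storage on top of an int[]: each int holds 32 flags. */
class IntBitmap {
    private final int[] words;

    IntBitmap(int nbits) {
        // Round up so that nbits flags fit into whole 32-bit words.
        this.words = new int[(nbits + 31) >>> 5];
    }

    void set(int index) {
        // index >>> 5 selects the word, (index & 31) the bit inside it.
        words[index >>> 5] |= 1 << (index & 31);
    }

    boolean get(int index) {
        return (words[index >>> 5] & (1 << (index & 31))) != 0;
    }

    public static void main(String[] args) {
        IntBitmap manual = new IntBitmap(100);
        BitSet builtIn = new BitSet(100);
        manual.set(0);
        manual.set(63);
        builtIn.set(0);
        builtIn.set(63);
        // Both structures give the same answers.
        System.out.println(manual.get(63) + " " + builtIn.get(63)); // true true
        System.out.println(manual.get(64) + " " + builtIn.get(64)); // false false
    }
}
```

BitSet does this bookkeeping internally, which is why the implementation below simply uses it.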

Implementation

The algorithm is actually very simple. Here I simulate it with a small set of data.

Input data, input.txt:

mike
study
day
get
last
exam
think
fish
he

Next is the test data used for the queries, testInput.txt:

play
mike
study
day
get
Axis
last
exam
think
fish
he

These are just some words I put together at random.

The utility class for the algorithm, BloomFilterTool.java:

package BloomFilter;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/**
 * Bloom filter utility class
 * 
 * @author lyq
 * 
 */
public class BloomFilterTool {
	// Length of the bit array: 100,000 bits
	public static final int BIT_ARRAY_LENGTH = 100000;

	// Path of the source data file
	private String filePath;
	// Path of the test (query) data file
	private String testFilePath;
	// Bit array used for storage; each cell takes 1 bit
	private BitSet bitStore;
	// Source data
	private ArrayList<String> totalDatas;
	// Query data used for testing
	private ArrayList<String> queryDatas;

	public BloomFilterTool(String filePath, String testFilePath) {
		this.filePath = filePath;
		this.testFilePath = testFilePath;

		this.totalDatas = readDataFile(this.filePath);
		this.queryDatas = readDataFile(this.testFilePath);
	}

	/**
	 * Read space-separated words from a file
	 */
	public ArrayList<String> readDataFile(String path) {
		File file = new File(path);
		ArrayList<String> dataArray = new ArrayList<String>();

		// try-with-resources closes the reader even if reading fails
		try (BufferedReader in = new BufferedReader(new FileReader(file))) {
			String str;
			String[] tempArray;
			while ((str = in.readLine()) != null) {
				tempArray = str.split(" ");
				for (String word : tempArray) {
					dataArray.add(word);
				}
			}
		} catch (IOException e) {
			// Report the failure instead of silently discarding it
			e.printStackTrace();
		}

		return dataArray;
	}
	
	/**
	 * Get all query data
	 * @return the list of query words
	 */
	public ArrayList<String> getQueryDatas(){
		return this.queryDatas;
	}

	/**
	 * Store the source data in the bit array
	 */
	private void bitStoreData() {
		long hashcode = 0;
		bitStore = new BitSet(BIT_ARRAY_LENGTH);

		for (String word : totalDatas) {
			// Hash each word with 3 different functions to reduce the impact of collisions
			hashcode = BKDRHash(word);
			hashcode %= BIT_ARRAY_LENGTH;

			
			bitStore.set((int) hashcode, true);

			hashcode = SDBMHash(word);
			hashcode %= BIT_ARRAY_LENGTH;

			bitStore.set((int) hashcode, true);

			hashcode = DJBHash(word);
			hashcode %= BIT_ARRAY_LENGTH;

			bitStore.set((int) hashcode, true);
		}
	}

	/**
	 * Query the data: decide whether each query word exists in the source data
	 */
	public Map<String, Boolean> queryDatasByBF() {
		boolean isExist;
		long hashcode;
		int pos1;
		int pos2;
		int pos3;
		// Maps each query word to whether it was judged present
		Map<String, Boolean> word2exist = new HashMap<String, Boolean>();

		// Build the bit array from the source data before querying
		bitStoreData();
		for (String word : queryDatas) {
			hashcode = BKDRHash(word);
			pos1 = (int) (hashcode % BIT_ARRAY_LENGTH);

			hashcode = SDBMHash(word);
			pos2 = (int) (hashcode % BIT_ARRAY_LENGTH);

			hashcode = DJBHash(word);
			pos3 = (int) (hashcode % BIT_ARRAY_LENGTH);

			// The word counts as present only if all 3 hash positions are set
			isExist = bitStore.get(pos1) && bitStore.get(pos2) && bitStore.get(pos3);

			// Store the result in the map
			word2exist.put(word, isExist);
		}

		return word2exist;
	}

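	/**
	 * Query the data the ordinary way: compare against every element
	 */
	public Map<String, Boolean> queryDatasByNF() {
		boolean isExist = false;
		// Maps each query word to whether it was found
		Map<String, Boolean> word2exist = new HashMap<String, Boolean>();

		// Look up each word by linear traversal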
		for (String qWord : queryDatas) {
			isExist = false;
			for (String word : totalDatas) {
				if (qWord.equals(word)) {
					isExist = true;
					break;
				}
			}

			word2exist.put(qWord, isExist);
		}

		return word2exist;
	}

	/**
	 * BKDR string hash
	 * 
	 * @param str the string to hash
	 * @return a non-negative hash value
	 */
	private long BKDRHash(String str) {
		int seed = 31; /* 31 131 1313 13131 131313 etc.. */
		long hash = 0;

		for (int i = 0; i < str.length(); i++) {
			hash = (hash * seed) + (str.charAt(i));
		}

		// Math.abs(Long.MIN_VALUE) is still negative, so also clear the sign bit
		return Math.abs(hash) & Long.MAX_VALUE;
	}

	/**
	 * SDBM string hash
	 * 
	 * @param str the string to hash
	 * @return a non-negative hash value
	 */
	private long SDBMHash(String str) {
		long hash = 0;

		for (int i = 0; i < str.length(); i++) {
			hash = (str.charAt(i)) + (hash << 6) + (hash << 16) - hash;
		}

		// Math.abs(Long.MIN_VALUE) is still negative, so also clear the sign bit
		return Math.abs(hash) & Long.MAX_VALUE;
	}

	/**
	 * DJB string hash
	 * 
	 * @param str the string to hash
	 * @return a non-negative hash value
	 */
	private long DJBHash(String str) {
		long hash = 5381;

		for (int i = 0; i < str.length(); i++) {
			hash = ((hash << 5) + hash) + (str.charAt(i));
		}

		// Math.abs(Long.MIN_VALUE) is still negative, so also clear the sign bit
		return Math.abs(hash) & Long.MAX_VALUE;
	}

}

The scenario test class, Client.java:

package BloomFilter;

import java.text.MessageFormat;
import java.util.ArrayList;
import java.util.Map;

/**
 * Test class for the Bloom filter
 * 
 * @author lyq
 * 
 */
public class Client {
	public static void main(String[] args) {
		String filePath = "C:\\Users\\lyq\\Desktop\\icon\\input.txt";
		String testFilePath = "C:\\Users\\lyq\\Desktop\\icon\\testInput.txt";
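		// Total number of query words
		int totalCount;
		// Number of correct results
		int rightCount;
		long startTime = 0;
		long endTime = 0;
		// Results of the Bloom filter query
		Map<String, Boolean> bfMap;
		// Results of the ordinary (traversal) query
		Map<String, Boolean> nfMap;
		// All query data
		ArrayList<String> queryDatas;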

		BloomFilterTool tool = new BloomFilterTool(filePath, testFilePath);

		// Query the words with the Bloom filter
		startTime = System.currentTimeMillis();
		bfMap = tool.queryDatasByBF();
		endTime = System.currentTimeMillis();
		System.out.println("BloomFilter took " + (endTime - startTime) + "ms");

		// Query the words with ordinary traversal
		startTime = System.currentTimeMillis();
		nfMap = tool.queryDatasByNF();
		endTime = System.currentTimeMillis();
		System.out.println("Plain traversal query took " + (endTime - startTime) + "ms");

		boolean isExist;
		boolean isExist2;

		rightCount = 0;
		queryDatas = tool.getQueryDatas();
		totalCount = queryDatas.size();
		for (String qWord: queryDatas) {
			// Use the traversal result as the reference answer
			isExist = nfMap.get(qWord);
			isExist2 = bfMap.get(qWord);

			if (isExist == isExist2) {
				rightCount++;
			}else{
				System.out.println("Mispredicted word: " + qWord);
			}
		}
		System.out.println(MessageFormat.format(
				"Bloom Filter correct results: {0}, total queries: {1}, accuracy: {2}", rightCount,
				totalCount, 1.0 * rightCount / totalCount));
	}
}

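In the test class I compare the running time of the Bloom filter with that of ordinary traversal search. When the data set is small there is no visible difference, and the Bloom filter may even take longer, as in this test run: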

BloomFilter took 2ms
Plain traversal query took 0ms
Bloom Filter correct results: 11, total queries: 11, accuracy: 1

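Then I tested with more realistic data: I used a standard document as the source data and doubled the number of query words. Running the same program gave the following result: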

BloomFilter took 16ms
Plain traversal query took 47ms
Bloom Filter correct results: 2,743, total queries: 2,743, accuracy: 1

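This still falls short of a truly massive-data scenario, but the result is easy to understand. Brute-force search depends on the total size of the source data: each query costs O(n). A Bloom filter lookup is constant time: it only computes a fixed number of hashes, so each query costs O(1). Note that queryDatasByBF also builds the bit array before querying, so the Bloom filter timings above include that setup cost.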

Summary

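I ran into a few small problems while implementing the algorithm. The first was with the hash functions. I picked 3 string hash functions more or less at random and then found that they kept overflowing. Once a value overflows it becomes negative, and passing a negative index to BitSet throws an exception. In C this could be solved with unsigned int, but Java has no unsigned primitive types, so I simply took the absolute value of the hash. Strictly speaking Math.abs(Long.MIN_VALUE) is still negative, so clearing the sign bit or indexing with Math.floorMod is the safer choice.

One characteristic of the Bloom filter is that it may produce false positives but never false negatives. A false positive means an element that is not in the set is judged to be present, which can happen because of hash collisions. A false negative would mean an element that is in the set is judged to be absent, and that is impossible: if the element was added, its positions were set by the hash mapping, and once they are set a later lookup cannot miss them.

The algorithm has many applications. A typical one is filtering spam email addresses.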
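The overflow problem mentioned in the summary can be reproduced in a few lines. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, and the two helper methods are alternatives to taking the absolute value, not what the original program does.

```java
/** Shows why Math.abs alone is not a safe way to get a non-negative index. */
class NonNegativeIndex {
    // Option 1: clear the sign bit; the result is always in [0, Long.MAX_VALUE].
    static long clearSign(long hash) {
        return hash & Long.MAX_VALUE;
    }

    // Option 2: floorMod always returns a value in [0, size) for size > 0.
    static int index(long hash, int size) {
        return (int) Math.floorMod(hash, (long) size);
    }

    public static void main(String[] args) {
        long worstCase = Long.MIN_VALUE;
        // abs overflows for the most negative long and stays negative.
        System.out.println(Math.abs(worstCase) < 0); // true
        System.out.println(clearSign(worstCase));     // 0
        System.out.println(index(-1L, 100000));       // 99999
    }
}
```

Either option yields an index that BitSet accepts. What matters for the filter is that insertion and lookup use the same mapping.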
