Inverted Index Construction: The BSBI and SPIMI Algorithms
Algorithm Introduction
In information retrieval, building an index has long been an effective way to speed up search, but when a search engine faces massive amounts of data, scanning through all of it to find what you want is clearly not a good approach. This is why the inverted index was proposed. Before defining it, first consider the usual way of retrieving information. Suppose the raw data is split into documents: document 1 holds some content, document 2 holds some content, document 3 likewise. Given a keyword, to find the documents related to it, the natural idea is to compare the content of each document in turn, check whether it contains the keyword, return that document's index address if it does, and otherwise move on to the next document; this is essentially string matching. Obviously, when the data volume is huge, this approach does not work.

The original scheme can be understood as a mapping from index (document) to keyword, while an inverted index maps keyword to index positions: given a keyword, the inverted index immediately tells you where it occurs. Of course, this only describes the effect an inverted index achieves; there is more than one way to build one, and this article covers two of the better-known construction algorithms, BSBI and SPIMI.
How the Algorithms Work
Let us first walk through a concrete example of the general construction process, leaving the implementation details aside. Consider the following documents:
Doc1: Mike spoken English frequently at home. And he can write English every day.
Doc2: Mike plays football very well.
First, note that we only need the key information: modifiers and similar filler can be dropped, verb tenses should be reduced to their base forms, and pronouns referring to the same person can be merged away. The sentences above can therefore be simplified to:
Doc1: Mike spoken English home. write English.
Doc2: Mike play football.
Now build the inverted index. Since Mike appears in both document 1 and document 2, we get Mike: {1, 2}; the remaining words are handled the same way, and the final result is a mapping from each word to its index positions. With this process understood, we can introduce the two algorithms this article focuses on: BSBI (blocked sort-based indexing, which sorts on disk with an external sort) and SPIMI (single-pass in-memory indexing). In general the latter is more commonly used. A minimal sketch of the target mapping is shown below.
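As an aside, the following small runnable sketch (my own illustration, not part of the article's code; all names are made up) shows the kind of word-to-postings mapping that any construction algorithm ultimately produces:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.TreeSet;

    public class TinyInvertedIndex {
        public static void main(String[] args) {
            // The simplified documents from the example above.
            List<String[]> docs = new ArrayList<>();
            docs.add(new String[] { "mike", "spoken", "english", "home", "write", "english" }); // doc 1
            docs.add(new String[] { "mike", "play", "football" });                              // doc 2

            // word -> sorted set of document ids that contain it
            Map<String, TreeSet<Integer>> inverted = new TreeMap<>();
            int docId = 1;
            for (String[] doc : docs) {
                for (String word : doc) {
                    inverted.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
                }
                docId++;
            }
            // Prints e.g. "mike -> [1, 2]", matching Mike: {1, 2} above.
            inverted.forEach((word, postings) -> System.out.println(word + " -> " + postings));
        }
    }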
BSBI
The main steps of this algorithm are as follows:
1. Map the words in the documents to IDs; a hash function can be used to construct the mapping.
2. Split the collection into blocks of equal size.
3. Sort each block by its termID-docID pairs.
4. Merge the sorted blocks and write the result out to disk.
5. Repeat until the entire collection has been processed.
The algorithm uses read buffers and a write buffer; how large to make them is up to you, and the implementation later in this article sets them explicitly. As for the in-buffer sorting algorithm, a well-performing algorithm such as quicksort is generally recommended, but for convenience the code below simply uses a quadratic selection-style sort that was easier for me to write; that choice is also up to you. A compact sketch of the block-sort-and-merge idea follows.
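Before the full implementation below, here is a minimal sketch of the BSBI idea under simplifying assumptions: blocks live in memory rather than as on-disk runs, term IDs come from a small dictionary instead of a hash, and the merge is a plain two-way merge. All class and variable names here are illustrative:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class BsbiSketch {
        public static void main(String[] args) {
            String[][] docs = {
                { "mike", "spoken", "english", "home", "write", "english" }, // doc 0
                { "mike", "play", "football" }                               // doc 1
            };

            // Step 1: map each word to a term id.
            Map<String, Integer> termIds = new HashMap<>();
            List<int[]> pairs = new ArrayList<>(); // (termId, docId) pairs
            for (int docId = 0; docId < docs.length; docId++) {
                for (String w : docs[docId]) {
                    Integer tid = termIds.get(w);
                    if (tid == null) {
                        tid = termIds.size();
                        termIds.put(w, tid);
                    }
                    pairs.add(new int[] { tid, docId });
                }
            }

            // Steps 2-3: split the pairs into fixed-size blocks and sort each block.
            int blockSize = 4;
            List<List<int[]>> blocks = new ArrayList<>();
            for (int i = 0; i < pairs.size(); i += blockSize) {
                List<int[]> block = new ArrayList<>(
                        pairs.subList(i, Math.min(i + blockSize, pairs.size())));
                block.sort((a, b) -> a[0] != b[0] ? a[0] - b[0] : a[1] - b[1]);
                blocks.add(block); // real BSBI would write each sorted block to disk here
            }

            // Steps 4-5: repeatedly merge sorted blocks until one sorted run remains.
            while (blocks.size() > 1) {
                blocks.add(merge(blocks.remove(0), blocks.remove(0)));
            }
            for (int[] p : blocks.get(0)) {
                System.out.println("termId " + p[0] + " -> doc " + p[1]);
            }
        }

        // Standard two-way merge of two runs already sorted by (termId, docId).
        private static List<int[]> merge(List<int[]> a, List<int[]> b) {
            List<int[]> out = new ArrayList<>();
            int i = 0, j = 0;
            while (i < a.size() && j < b.size()) {
                int[] x = a.get(i), y = b.get(j);
                boolean takeA = x[0] < y[0] || (x[0] == y[0] && x[1] <= y[1]);
                out.add(takeA ? a.get(i++) : b.get(j++));
            }
            while (i < a.size()) out.add(a.get(i++));
            while (j < b.size()) out.add(b.get(j++));
            return out;
        }
    }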
SPIMI
Next, the SPIMI algorithm, i.e. single-pass in-memory indexing. Compared with BSBI, its immediately obvious difference is that it needs no term-to-ID conversion: it keeps the direct association from word to postings. Another notable feature, at least in the simplified form implemented here, is that it does not sort; it builds the index in the order terms are encountered. The main steps are (see the sketch after this list):
1. Build an independent inverted index for each block.
2. Merge all of the independent inverted indexes, and you are done.
For convenience I simplified the implementation and do the entire construction in memory as a single block; readers should keep that in mind. SPIMI is comparatively simple, so no diagram is given here.
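Here is a correspondingly compact sketch of the single-block, in-memory variant of SPIMI described above (illustrative names only; a full SPIMI would flush each block's dictionary to disk when memory fills and merge the block files at the end):

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class SpimiSketch {
        public static void main(String[] args) {
            String[][] docs = {
                { "mike", "spoken", "english", "home", "write", "english" }, // doc 1
                { "mike", "play", "football" }                               // doc 2
            };

            // One in-memory "block": word -> posting list, kept in first-seen order
            // (no sorting, matching the simplification in this article).
            Map<String, List<Integer>> block = new LinkedHashMap<>();
            int docId = 1;
            for (String[] doc : docs) {
                for (String word : doc) {
                    List<Integer> postings = block.computeIfAbsent(word, k -> new ArrayList<>());
                    // Avoid duplicate docIds when a word repeats within one document.
                    if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                        postings.add(docId);
                    }
                }
                docId++;
            }
            // A real SPIMI would write this block to disk here and start a new block
            // when memory runs out, then merge all block files in one final pass.
            block.forEach((word, postings) -> System.out.println(word + " " + postings));
        }
    }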
Code Implementation
First, the input documents. I used two identical documents; I honestly could not come up with better test data.
doc1.txt:
Mike studyed English hardly yesterday He got the 100 at the last exam He thinks English is very interesting
doc2.txt:
Mike studyed English hardly yesterday He got the 100 at the last exam He thinks English is very interesting
The document preprocessing class PreTreatTool:

    package InvertedIndex;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.FileOutputStream;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.PrintStream;
    import java.util.ArrayList;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * Document preprocessing utility class.
     *
     * @author lyq
     */
    public class PreTreatTool {
        // Filter words that carry no real meaning (a token stop-word list)
        public static String[] FILTER_WORDS = new String[] { "at", "At", "The", "the", "is", "very" };

        // File paths of the input documents
        private ArrayList<String> docFilePaths;
        // Output paths of the extracted effective-word files
        private ArrayList<String> effectWordPaths;

        public PreTreatTool(ArrayList<String> docFilePaths) {
            this.docFilePaths = docFilePaths;
        }

        /**
         * Returns the effective-word file paths.
         */
        public ArrayList<String> getEFWPaths() {
            return this.effectWordPaths;
        }

        /**
         * Reads the words of a single file.
         *
         * @param filePath single file path
         */
        private ArrayList<String> readDataFile(String filePath) {
            File file = new File(filePath);
            ArrayList<String[]> dataArray = new ArrayList<String[]>();
            ArrayList<String> words = new ArrayList<>();

            try {
                BufferedReader in = new BufferedReader(new FileReader(file));
                String str;
                String[] tempArray;
                while ((str = in.readLine()) != null) {
                    tempArray = str.split(" ");
                    dataArray.add(tempArray);
                }
                in.close();
            } catch (IOException e) {
                e.printStackTrace();
            }

            // Split each line into words and collect them into one list
            for (String[] array : dataArray) {
                for (String word : array) {
                    words.add(word);
                }
            }
            return words;
        }

        /**
         * Preprocesses the words of all documents.
         */
        public void preTreatWords() {
            String baseOutputPath = "";
            int endPos = 0;
            ArrayList<String> tempWords = null;
            effectWordPaths = new ArrayList<>();

            for (String filePath : docFilePaths) {
                tempWords = readDataFile(filePath);
                filterWords(tempWords, true);

                // Derive the output path from the input path
                endPos = filePath.lastIndexOf(".");
                baseOutputPath = filePath.substring(0, endPos);
                writeOutOperation(tempWords, baseOutputPath + "-efword.txt");
                effectWordPaths.add(baseOutputPath + "-efword.txt");
            }
        }

        /**
         * Filters the words of a document.
         *
         * @param words       words to process
         * @param canRepeated whether effective words may repeat
         */
        private void filterWords(ArrayList<String> words, boolean canRepeated) {
            boolean isFilterWord;
            // Matches adjective/adverb forms
            Pattern adjPattern;
            // Matches past-tense verb forms
            Pattern formerPattern;
            // Matches numbers
            Pattern numberPattern;
            Matcher adjMatcher;
            Matcher formerMatcher;
            Matcher numberMatcher;
            ArrayList<String> deleteWords = new ArrayList<>();

            adjPattern = Pattern.compile(".*(ly$|ful$|ing$)");
            formerPattern = Pattern.compile(".*ed$");
            // \\. matches a literal decimal point
            numberPattern = Pattern.compile("[0-9]+(\\.[0-9]+)?");

            String w;
            for (int i = 0; i < words.size(); i++) {
                w = words.get(i);
                isFilterWord = false;

                for (String fw : FILTER_WORDS) {
                    if (fw.equals(w)) {
                        deleteWords.add(w);
                        isFilterWord = true;
                        break;
                    }
                }
                if (isFilterWord) {
                    continue;
                }

                adjMatcher = adjPattern.matcher(w);
                formerMatcher = formerPattern.matcher(w);
                numberMatcher = numberPattern.matcher(w);

                // Lower-case the word uniformly
                w = w.toLowerCase();

                // Filter out adjective/adverb forms and pure numbers
                if (adjMatcher.matches() || numberMatcher.matches()) {
                    deleteWords.add(w);
                } else if (formerMatcher.matches()) {
                    // An "ed" ending marks a tense change; strip the 2-character
                    // suffix to restore the base verb form
                    w = w.substring(0, w.length() - 2);
                }
                words.set(i, w);
            }

            // Remove the filtered words
            words.removeAll(deleteWords);
            deleteWords.clear();

            String s1;
            String s2;
            // Deduplicate the remaining words
            for (int i = 0; i < words.size() - 1; i++) {
                s1 = words.get(i);
                for (int j = i + 1; j < words.size(); j++) {
                    s2 = words.get(j);
                    // The same word occurs again later; mark it and stop scanning
                    if (s1.equals(s2)) {
                        deleteWords.add(s1);
                        break;
                    }
                }
            }
            // removeAll() deletes every occurrence of each duplicated word,
            // addAll() then adds it back exactly once
            words.removeAll(deleteWords);
            words.addAll(deleteWords);
        }

        /**
         * Writes the given words out to a file on disk.
         *
         * @param buffer   data currently in the write buffer
         * @param filePath output path
         */
        private void writeOutOperation(ArrayList<String> buffer, String filePath) {
            StringBuilder strBuilder = new StringBuilder();
            // Assemble the buffered words into one string and write it to the file
            for (String word : buffer) {
                strBuilder.append(word);
                strBuilder.append("\n");
            }

            try {
                File file = new File(filePath);
                PrintStream ps = new PrintStream(new FileOutputStream(file));
                ps.print(strBuilder.toString());
                ps.close(); // close the stream so the output is flushed
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
        }
    }
The document class:

    package InvertedIndex;

    import java.util.ArrayList;

    /**
     * Document class.
     *
     * @author lyq
     */
    public class Document {
        // Unique identifier of the document
        int docId;
        // File path of the document
        String filePath;
        // Effective words contained in the document
        ArrayList<String> effectWords;

        public Document(ArrayList<String> effectWords, String filePath) {
            this.effectWords = effectWords;
            this.filePath = filePath;
        }

        public Document(ArrayList<String> effectWords, String filePath, int docId) {
            this(effectWords, filePath);
            this.docId = docId;
        }
    }
The BSBI tool class:

    package InvertedIndex;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.FileOutputStream;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.PrintStream;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Map;

    /**
     * BSBI: disk-based external-sort index construction.
     *
     * @author lyq
     */
    public class BSBITool {
        // Globally unique document ID
        public static int DOC_ID = 0;

        // Size of the read buffers
        private int readBufferSize;
        // Size of the write buffer
        private int writeBufferSize;
        // Paths of the effective-word files to read
        private ArrayList<String> effectiveWordFiles;
        // Output path of the inverted index file
        private String outputFilePath;
        // Read buffer 1
        private String[][] readBuffer1;
        // Read buffer 2
        private String[][] readBuffer2;
        // Write buffer
        private String[][] writeBuffer;
        // Mapping from hash codes back to effective words
        private Map<String, String> code2word;

        public BSBITool(ArrayList<String> effectiveWordFiles, int readBufferSize,
                int writeBufferSize) {
            this.effectiveWordFiles = effectiveWordFiles;
            this.readBufferSize = readBufferSize;
            this.writeBufferSize = writeBufferSize;
            initBuffers();
        }

        /**
         * Initializes the buffers.
         */
        private void initBuffers() {
            readBuffer1 = new String[readBufferSize][2];
            readBuffer2 = new String[readBufferSize][2];
            writeBuffer = new String[writeBufferSize][2];
        }

        /**
         * Reads the effective words from a file and replaces each with a code.
         *
         * @param filePath file to read
         * @return the resulting document
         */
        private Document readEffectWords(String filePath) {
            long hashcode = 0;
            String w;
            Document document;
            code2word = new HashMap<String, String>();
            ArrayList<String> words;

            words = readDataFile(filePath);

            for (int i = 0; i < words.size(); i++) {
                w = words.get(i);
                hashcode = BKDRHash(w);
                hashcode = hashcode % 10000;
                // The hash code modulo 10000 stands in for the effective word
                code2word.put(hashcode + "", w);
                w = hashcode + "";
                words.set(i, w);
            }

            document = new Document(words, filePath, DOC_ID);
            DOC_ID++;

            return document;
        }

        /**
         * Hashes a string (BKDR hash).
         *
         * @param str the string to hash
         */
        private long BKDRHash(String str) {
            int seed = 31; /* 31 131 1313 13131 131313 etc.. */
            long hash = 0;
            int i = 0;

            for (i = 0; i < str.length(); i++) {
                hash = (hash * seed) + (str.charAt(i));
            }
            return hash;
        }

        /**
         * Outputs the inverted index files for the given effective-word files.
         */
        public void outputInvertedFiles() {
            int index = 0;
            String baseFilePath = "";
            outputFilePath = "";
            Document doc;
            ArrayList<String> tempPaths;
            ArrayList<String[]> invertedData1;
            ArrayList<String[]> invertedData2;

            tempPaths = new ArrayList<>();
            for (String filePath : effectiveWordFiles) {
                doc = readEffectWords(filePath);
                writeOutFile(doc);

                index = doc.filePath.lastIndexOf(".");
                baseFilePath = doc.filePath.substring(0, index);
                writeOutOperation(writeBuffer, baseFilePath + "-temp.txt");
                tempPaths.add(baseFilePath + "-temp.txt");
            }

            outputFilePath = baseFilePath + "-bsbi-inverted.txt";
            // Merge the intermediate inverted files into one final file
            for (int i = 1; i < tempPaths.size(); i++) {
                if (i == 1) {
                    invertedData1 = readInvertedFile(tempPaths.get(0));
                } else {
                    invertedData1 = readInvertedFile(outputFilePath);
                }
                invertedData2 = readInvertedFile(tempPaths.get(i));

                mergeInvertedData(invertedData1, invertedData2, false, outputFilePath);
                writeOutOperation(writeBuffer, outputFilePath, false);
            }
        }

        /**
         * Writes a document's sorted code-docId pairs to its temp file.
         *
         * @param doc document to process
         */
        private void writeOutFile(Document doc) {
            // Whether the read buffers still need sorting
            boolean ifSort = true;
            int index = 0;
            String baseFilePath;
            String[] temp;
            ArrayList<String> tempWords = (ArrayList<String>) doc.effectWords.clone();
            ArrayList<String[]> invertedData1;
            ArrayList<String[]> invertedData2;

            invertedData1 = new ArrayList<>();
            invertedData2 = new ArrayList<>();
            // Split the document's data into 2 equal halves, one per read buffer
            for (int i = 0; i < tempWords.size() / 2; i++) {
                temp = new String[2];
                temp[0] = tempWords.get(i);
                temp[1] = doc.docId + "";
                invertedData1.add(temp);

                temp = new String[2];
                temp[0] = tempWords.get(i + tempWords.size() / 2);
                temp[1] = doc.docId + "";
                invertedData2.add(temp);
            }

            // If the count is odd, add the last element to the second half
            if (tempWords.size() % 2 == 1) {
                temp = new String[2];
                temp[0] = tempWords.get(tempWords.size() - 1);
                temp[1] = doc.docId + "";
                invertedData2.add(temp);
            }

            index = doc.filePath.lastIndexOf(".");
            baseFilePath = doc.filePath.substring(0, index);
            mergeInvertedData(invertedData1, invertedData2, ifSort,
                    baseFilePath + "-temp.txt");
        }

        /**
         * Merges the two read buffers into the write buffer (two-way merge).
         *
         * @param outputPath output path for the write buffer
         */
        private void mergeWordBuffers(String outputPath) {
            int i = 0;
            int j = 0;
            int num1 = 0;
            int num2 = 0;
            // Write-buffer index
            int writeIndex = 0;

            // Check the bounds before dereferencing the buffer slots
            while (i < readBufferSize && j < readBufferSize
                    && readBuffer1[i][0] != null && readBuffer2[j][0] != null) {
                num1 = Integer.parseInt(readBuffer1[i][0]);
                num2 = Integer.parseInt(readBuffer2[j][0]);

                // Whichever buffer holds the smaller code goes first
                if (num1 < num2) {
                    writeBuffer[writeIndex][0] = num1 + "";
                    writeBuffer[writeIndex][1] = readBuffer1[i][1];
                    i++;
                } else if (num2 < num1) {
                    writeBuffer[writeIndex][0] = num2 + "";
                    // Take the posting from buffer 2, which holds the smaller code
                    writeBuffer[writeIndex][1] = readBuffer2[j][1];
                    j++;
                } else if (num1 == num2) {
                    // Equal codes mean the same effective word: merge the postings
                    writeBuffer[writeIndex][0] = num1 + "";
                    writeBuffer[writeIndex][1] = readBuffer1[i][1] + ":" + readBuffer2[j][1];
                    i++;
                    j++;
                }

                // Advance the write pointer
                writeIndex++;
                // Flush the write buffer to file once it is full
                if (writeIndex >= writeBufferSize) {
                    writeOutOperation(writeBuffer, outputPath);
                    // Clear the flushed entries so they are not written twice
                    writeBuffer = new String[writeBufferSize][2];
                    writeIndex = 0;
                }
            }

            if (i >= readBufferSize || readBuffer1[i][0] == null) {
                writeRemainReadBuffer(readBuffer2, j, outputPath);
            }
            if (j >= readBufferSize || readBuffer2[j][0] == null) {
                // Continue from position i, where buffer 1 stopped
                writeRemainReadBuffer(readBuffer1, i, outputPath);
            }
        }

        /**
         * Writes data out to a disk file, appending if the file already exists.
         *
         * @param buffer   data currently in the write buffer
         * @param filePath output path
         */
        private void writeOutOperation(String[][] buffer, String filePath) {
            String word;
            StringBuilder strBuilder = new StringBuilder();

            // Assemble the buffered entries into a string and write it to the file
            for (String[] array : buffer) {
                if (array[0] == null) {
                    continue;
                }
                word = array[0];
                strBuilder.append(word);
                strBuilder.append(" ");
                strBuilder.append(array[1]);
                strBuilder.append("\n");
            }

            try {
                File file = new File(filePath);
                // Open in append mode so successive buffer flushes accumulate
                PrintStream ps = new PrintStream(new FileOutputStream(file, true));
                ps.print(strBuilder.toString());
                ps.close();
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
        }

        /**
         * Writes data out to a disk file, appending if the file already exists,
         * decoding the word codes if requested.
         *
         * @param buffer   data currently in the write buffer
         * @param filePath output path
         * @param isCoded  whether to keep the coded form in the output
         */
        private void writeOutOperation(String[][] buffer, String filePath,
                boolean isCoded) {
            String word;
            StringBuilder strBuilder = new StringBuilder();

            // Assemble the buffered entries into a string and write it to the file
            for (String[] array : buffer) {
                if (array[0] == null) {
                    continue;
                }
                if (!isCoded) {
                    word = code2word.get(array[0]);
                } else {
                    word = array[0];
                }
                strBuilder.append(word);
                strBuilder.append(" ");
                strBuilder.append(array[1]);
                strBuilder.append("\n");
            }

            try {
                File file = new File(filePath);
                PrintStream ps = new PrintStream(new FileOutputStream(file, true));
                ps.print(strBuilder.toString());
                ps.close();
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
        }

        /**
         * Moves what remains in a read buffer into the write buffer.
         *
         * @param remainBuffer   the read buffer with remaining data
         * @param currentReadPos current read position
         * @param outputPath     output path for the write buffer
         */
        private void writeRemainReadBuffer(String[][] remainBuffer, int currentReadPos,
                String outputPath) {
            // Check the bound before dereferencing the buffer slot
            while (currentReadPos < readBufferSize
                    && remainBuffer[currentReadPos][0] != null) {
                removeRBToWB(remainBuffer[currentReadPos]);
                currentReadPos++;

                // Flush the write buffer to file once it is full
                if (writeBuffer[writeBufferSize - 1][0] != null) {
                    writeOutOperation(writeBuffer, outputPath);
                    // Clear the flushed entries so they are not written twice
                    writeBuffer = new String[writeBufferSize][2];
                }
            }
        }

        /**
         * Inserts a remaining read-buffer record into the write buffer,
         * keeping the write buffer sorted (insertion sort).
         *
         * @param record record to insert
         */
        private void removeRBToWB(String[] record) {
            int insertIndex = 0;
            int endIndex = 0;
            long num1;
            long num2;
            long code = Long.parseLong(record[0]);

            // If the write buffer is still empty, insert directly
            if (writeBuffer[0][0] == null) {
                writeBuffer[0] = record;
                return;
            }

            // Find the insert position
            for (int i = 0; i < writeBufferSize - 1; i++) {
                if (writeBuffer[i][0] == null) {
                    endIndex = i;
                    break;
                }

                num1 = Long.parseLong(writeBuffer[i][0]);
                if (writeBuffer[i + 1][0] == null) {
                    if (code > num1) {
                        endIndex = i + 1;
                        insertIndex = i + 1;
                    }
                } else {
                    num2 = Long.parseLong(writeBuffer[i + 1][0]);
                    if (code > num1 && code < num2) {
                        insertIndex = i + 1;
                    }
                }
            }

            // Shift the records after the insert position one slot back
            for (int i = endIndex; i > insertIndex; i--) {
                writeBuffer[i] = writeBuffer[i - 1];
            }
            writeBuffer[insertIndex] = record;
        }

        /**
         * Merges two inverted-index data sets.
         *
         * @param invertedData1 inverted index file data 1
         * @param invertedData2 inverted index file data 2
         * @param ifSort        whether the buffered data needs sorting first
         * @param outputPath    output path of the merged inverted index
         */
        private void mergeInvertedData(ArrayList<String[]> invertedData1,
                ArrayList<String[]> invertedData2, boolean ifSort, String outputPath) {
            int rIndex1 = 0;
            int rIndex2 = 0;

            // Reinitialize the buffers
            initBuffers();
            while (invertedData1.size() > 0 && invertedData2.size() > 0) {
                readBuffer1[rIndex1][0] = invertedData1.get(0)[0];
                readBuffer1[rIndex1][1] = invertedData1.get(0)[1];
                readBuffer2[rIndex2][0] = invertedData2.get(0)[0];
                readBuffer2[rIndex2][1] = invertedData2.get(0)[1];

                invertedData1.remove(0);
                invertedData2.remove(0);
                rIndex1++;
                rIndex2++;

                if (rIndex1 == readBufferSize) {
                    if (ifSort) {
                        wordBufferSort(readBuffer1);
                        wordBufferSort(readBuffer2);
                    }
                    mergeWordBuffers(outputPath);
                    initBuffers();
                    // Restart filling the cleared read buffers from the front
                    rIndex1 = 0;
                    rIndex2 = 0;
                }
            }

            if (ifSort) {
                wordBufferSort(readBuffer1);
                wordBufferSort(readBuffer2);
            }
            mergeWordBuffers(outputPath);

            readBuffer1 = new String[readBufferSize][2];
            readBuffer2 = new String[readBufferSize][2];
            if (invertedData1.size() == 0 && invertedData2.size() > 0) {
                readRemainDataToRB(invertedData2, outputPath);
            } else if (invertedData1.size() > 0 && invertedData2.size() == 0) {
                readRemainDataToRB(invertedData1, outputPath);
            }
        }

        /**
         * Reads remaining effective-word data into a read buffer.
         *
         * @param remainData remaining data
         * @param outputPath output file path
         */
        private void readRemainDataToRB(ArrayList<String[]> remainData,
                String outputPath) {
            int rIndex = 0;

            while (remainData.size() > 0) {
                readBuffer1[rIndex][0] = remainData.get(0)[0];
                readBuffer1[rIndex][1] = remainData.get(0)[1];
                remainData.remove(0);
                rIndex++;

                // The read buffer is full: sort it and move it to the write buffer
                if (readBuffer1[readBufferSize - 1][0] != null) {
                    wordBufferSort(readBuffer1);
                    writeRemainReadBuffer(readBuffer1, 0, outputPath);
                    initBuffers();
                    // Restart filling the cleared read buffer from the front
                    rIndex = 0;
                }
            }

            wordBufferSort(readBuffer1);
            writeRemainReadBuffer(readBuffer1, 0, outputPath);
        }

        /**
         * Sorts buffer entries by their numeric code (selection sort).
         *
         * @param buffer buffer to sort
         */
        private void wordBufferSort(String[][] buffer) {
            String[] temp;
            int k = 0;
            long num1 = 0;
            long num2 = 0;

            for (int i = 0; i < buffer.length - 1; i++) {
                // The buffer may not be full
                if (buffer[i][0] == null) {
                    continue;
                }

                k = i;
                for (int j = i + 1; j < buffer.length; j++) {
                    // The buffer may not be full
                    if (buffer[j][0] == null) {
                        continue;
                    }

                    // Compare the numeric codes of the two entries
                    num1 = Long.parseLong(buffer[k][0]);
                    num2 = Long.parseLong(buffer[j][0]);
                    if (num2 < num1) {
                        k = j;
                    }
                }

                if (k != i) {
                    temp = buffer[k];
                    buffer[k] = buffer[i];
                    buffer[i] = temp;
                }
            }
        }

        /**
         * Reads inverted-index data from a file.
         *
         * @param filePath single file path
         */
        private ArrayList<String[]> readInvertedFile(String filePath) {
            File file = new File(filePath);
            ArrayList<String[]> dataArray = new ArrayList<String[]>();

            try {
                BufferedReader in = new BufferedReader(new FileReader(file));
                String str;
                String[] tempArray;
                while ((str = in.readLine()) != null) {
                    tempArray = str.split(" ");
                    dataArray.add(tempArray);
                }
                in.close();
            } catch (IOException e) {
                e.printStackTrace();
            }

            return dataArray;
        }

        /**
         * Reads the words of a single file.
         *
         * @param filePath single file path
         */
        private ArrayList<String> readDataFile(String filePath) {
            File file = new File(filePath);
            ArrayList<String[]> dataArray = new ArrayList<String[]>();
            ArrayList<String> words = new ArrayList<>();

            try {
                BufferedReader in = new BufferedReader(new FileReader(file));
                String str;
                String[] tempArray;
                while ((str = in.readLine()) != null) {
                    tempArray = str.split(" ");
                    dataArray.add(tempArray);
                }
                in.close();
            } catch (IOException e) {
                e.printStackTrace();
            }

            // Split each line into words and collect the non-empty ones
            for (String[] array : dataArray) {
                for (String word : array) {
                    if (!word.equals("")) {
                        words.add(word);
                    }
                }
            }
            return words;
        }
    }
The SPIMI tool class:

    package InvertedIndex;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.FileOutputStream;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.PrintStream;
    import java.util.ArrayList;

    /**
     * SPIMI: single-pass in-memory index construction.
     *
     * @author lyq
     */
    public class SPIMITool {
        // Output path of the inverted index file
        private String outputFilePath;
        // Paths of the effective-word files to read
        private ArrayList<String> effectiveWordFiles;
        // In-memory buffer; more space can be added if it is not enough
        private ArrayList<String[]> buffers;

        public SPIMITool(ArrayList<String> effectiveWordFiles) {
            this.effectiveWordFiles = effectiveWordFiles;
        }

        /**
         * Reads the words of a single file.
         *
         * @param filePath single file path
         */
        private ArrayList<String> readDataFile(String filePath) {
            File file = new File(filePath);
            ArrayList<String[]> dataArray = new ArrayList<String[]>();
            ArrayList<String> words = new ArrayList<>();

            try {
                BufferedReader in = new BufferedReader(new FileReader(file));
                String str;
                String[] tempArray;
                while ((str = in.readLine()) != null) {
                    tempArray = str.split(" ");
                    dataArray.add(tempArray);
                }
                in.close();
            } catch (IOException e) {
                e.printStackTrace();
            }

            // Split each line into words and collect them into one list
            for (String[] array : dataArray) {
                for (String word : array) {
                    words.add(word);
                }
            }
            return words;
        }

        /**
         * Builds the inverted index file from the given documents.
         *
         * @param docs document collection
         */
        private void writeInvertedIndex(ArrayList<Document> docs) {
            ArrayList<String> datas;
            String[] recordData;

            buffers = new ArrayList<>();
            for (Document tempDoc : docs) {
                datas = tempDoc.effectWords;

                for (String word : datas) {
                    recordData = new String[2];
                    recordData[0] = word;
                    recordData[1] = tempDoc.docId + "";
                    addRecordToBuffer(recordData);
                }
            }

            // Finally write the data out to disk
            writeOutOperation(buffers, outputFilePath);
        }

        /**
         * Adds a new record to the in-memory buffer; if the word already
         * exists, the docId is appended to its posting list.
         *
         * @param insertedData record to insert
         */
        private void addRecordToBuffer(String[] insertedData) {
            boolean isContained = false;
            String wordName;

            wordName = insertedData[0];
            for (String[] array : buffers) {
                if (array[0].equals(wordName)) {
                    isContained = true;
                    // Append to the posting list, separated by ":"
                    array[1] += ":" + insertedData[1];
                    break;
                }
            }

            // Not contained yet: it is a new word, so add it directly
            if (!isContained) {
                buffers.add(insertedData);
            }
        }

        /**
         * Writes data out to a disk file.
         *
         * @param buffer   data currently in the write buffer
         * @param filePath output path
         */
        private void writeOutOperation(ArrayList<String[]> buffer, String filePath) {
            StringBuilder strBuilder = new StringBuilder();

            // Assemble the buffered entries into a string and write it to the file
            for (String[] array : buffer) {
                strBuilder.append(array[0]);
                strBuilder.append(" ");
                strBuilder.append(array[1]);
                strBuilder.append("\n");
            }

            try {
                File file = new File(filePath);
                PrintStream ps = new PrintStream(new FileOutputStream(file));
                ps.print(strBuilder.toString());
                ps.close(); // close the stream so the output is flushed
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
        }

        /**
         * Builds the inverted index file.
         */
        public void createInvertedIndexFile() {
            int docId = 1;
            String baseFilePath;
            String fileName;
            String p;
            int index1 = 0;
            int index2 = 0;
            Document tempDoc;
            ArrayList<String> words;
            ArrayList<Document> docs;

            outputFilePath = "spimi";
            docs = new ArrayList<>();

            p = effectiveWordFiles.get(0);
            // Extract the directory part of the first path
            index1 = p.lastIndexOf("\\");
            baseFilePath = p.substring(0, index1 + 1);
            outputFilePath = baseFilePath + "spimi";

            for (String path : effectiveWordFiles) {
                // Read the document's effective words
                words = readDataFile(path);
                tempDoc = new Document(words, path, docId);
                docId++;
                docs.add(tempDoc);

                // Extract the file name without the extension
                index1 = path.lastIndexOf("\\");
                index2 = path.lastIndexOf(".");
                fileName = path.substring(index1 + 1, index2);
                outputFilePath += "-" + fileName;
            }
            outputFilePath += ".txt";

            // Build the inverted index file from the documents
            writeInvertedIndex(docs);
        }
    }
The test class:

    package InvertedIndex;

    import java.util.ArrayList;

    /**
     * Inverted index test class.
     *
     * @author lyq
     */
    public class Client {
        public static void main(String[] args) {
            // Sizes of the read and write buffers
            int readBufferSize;
            int writeBufferSize;
            String baseFilePath;
            PreTreatTool preTool;
            // BSBI: disk-based external-sort index construction
            BSBITool bTool;
            // SPIMI: single-pass in-memory index construction
            SPIMITool sTool;
            // Effective-word file paths
            ArrayList<String> efwFilePaths;
            ArrayList<String> docFilePaths;

            readBufferSize = 10;
            writeBufferSize = 20;
            baseFilePath = "C:\\Users\\lyq\\Desktop\\icon\\";
            docFilePaths = new ArrayList<>();
            docFilePaths.add(baseFilePath + "doc1.txt");
            docFilePaths.add(baseFilePath + "doc2.txt");

            // Document preprocessing tool
            preTool = new PreTreatTool(docFilePaths);
            preTool.preTreatWords();

            // Get the effective-word file paths produced by preprocessing
            efwFilePaths = preTool.getEFWPaths();
            bTool = new BSBITool(efwFilePaths, readBufferSize, writeBufferSize);
            bTool.outputInvertedFiles();

            sTool = new SPIMITool(efwFilePaths);
            sTool.createInvertedIndexFile();
        }
    }
To make things more realistic, both algorithms write their output to files.
First, the effective-word files doc1-efword.txt and doc2-efword.txt produced by the preprocessing class (the two input documents are identical, so both files contain):
mike
study
yesterday
got
last
exam
thinks
english
he
Next are the intermediate files produced by the BSBI algorithm, in which each word has been mapped to its code; you probably cannot tell from these numbers which words they actually represent:
1426 0
1542 0
2540 0
3056 0
3325 0
4326 0
4897 0
6329 0
7327 0

1426 1
1542 1
2540 1
3056 1
3325 1
4326 1
4897 1
6329 1
7327 1
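These codes come from BKDRHash modulo 10000 in readEffectWords. As a quick check of the arithmetic: for the word he, with seed 31 and the char codes h=104 and e=101, the hash is 104 * 31 + 101 = 3325, which is exactly one of the codes above. Likewise for mike (m=109, i=105, k=107, e=101): ((109 * 31 + 105) * 31 + 107) * 31 + 101 = 3351542, and 3351542 % 10000 = 1542, the code paired with both docId 0 and docId 1.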
Finally, the inverted index files themselves. The merged BSBI result (document IDs in BSBI start from 0):

yesterday 0:1
mike 0:1
got 0:1
english 0:1
he 0:1
last 0:1
thinks 0:1
study 0:1
exam 0:1

And the SPIMI result (document IDs in SPIMI start from 1):

mike 1:2
study 1:2
yesterday 1:2
got 1:2
last 1:2
exam 1:2
thinks 1:2
english 1:2
he 1:2
Summary
While implementing these algorithms I clearly underestimated their difficulty, especially BSBI: when the read and write buffers operate on the data there are many cases to handle, such as flushing the write buffer to disk when it fills up and merging a full read buffer onward with merge sort. There are simply too many of these checks, and because I did not anticipate them early on, the code's readability suffered; in the end I just made the buffers large enough to get the whole flow working. So please treat this implementation as an aid to understanding rather than something to use in practice, and the same goes for the SPIMI code; there are still many shortcomings. One more point: in the document preprocessing I only filter tokens symbolically. Real text filtering is far more complex than what I wrote here, covering modifiers, tense changes, adverbs, and so on, and it sometimes even requires semantic-analysis techniques; take the idea rather than the details.
