
How to implement relevance scoring function for full-text search in JavaScript_Basic knowledge

WBOY
Release: 2016-05-16 15:53:13

Full-text search, unlike most other problems in the field of machine learning, is one that web programmers encounter regularly in their daily work. A customer asks you to put a search box somewhere, and you write a SQL query along the lines of WHERE title LIKE %:query% to implement it. At first this works fine, until one day the customer comes to you and says, "The search is broken!"

Of course, nothing is actually "broken"; the search just isn't returning what the customer wants. Ordinary users don't know how to do exact matching, so the results they get are of poor quality. To solve the problem, you decide to use full-text search. After some tedious reading, you enable MySQL's FULLTEXT index and switch to more advanced query syntax, such as "MATCH() ... AGAINST()".


Okay, problem solved, done and dusted! And this works fine while the dataset is small.

But when you have more and more data, you will find that your database is getting slower and slower. MySQL is not a very easy-to-use full-text search tool. So you decide to use ElasticSearch, refactor your code, and deploy a Lucene-powered full-text search cluster. You will find that it works very well, fast and accurately.

At this point you can’t help but wonder: Why is Lucene so awesome?

This article (mainly introducing TF-IDF, Okapi BM-25 and ordinary relevance scoring) and the next article (mainly introducing indexing) will tell you the basic concepts behind full-text search.

Relevance

For each search query, we can easily define a "relevance score" for each document. When users search, we can then sort by relevance score instead of by document creation time. This way, the most relevant document is ranked first, no matter how long ago it was created (though sometimes creation time matters too).

There are many, many ways to calculate relevance, but we will start with the simplest, statistics-based method. This method requires no understanding of the language itself; it determines a "relevance score" purely by counting word usage, matching, and weighting terms according to how common they are across the documents.

This algorithm does not care whether the word is a noun or a verb, nor does it care about the meaning of the word. The only thing it cares about is which words are common and which are rare. If a search query contains both common words and rare words, you'd better give the document containing the rare words a higher score and lower the weight of the common words.

This algorithm is called Okapi BM25. It builds on two basic concepts: term frequency ("TF") and inverse document frequency ("IDF"). Put together, they are called "TF-IDF", a statistical measure of how important a term is within a document.

TF-IDF

Term Frequency, or "TF" for short, is a very simple metric: the number of times a specific word appears in a document. You can divide this value by the total number of words in the document to get a score. For example, if a document contains 100 words and the word 'the' appears 8 times, then the TF of 'the' is 8, or 8/100, or 8% (depending on how you want to express it).
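As a minimal sketch (this helper is mine, not part of the article's BM25 code), term frequency is just a count divided by the document length:

```javascript
// Term frequency: occurrences of a term divided by total terms in the document.
function termFrequency(term, tokens) {
  var count = 0;
  for (var i = 0; i < tokens.length; i++) {
    if (tokens[i] === term) { count++; }
  }
  return count / tokens.length;
}

// A toy 10-word document in which 'the' appears twice: TF('the') = 2/10 = 0.2
var tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'with', 'a', 'red', 'hat'];
termFrequency('the', tokens); // 0.2
```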

Inverse Document Frequency, or "IDF" for short, is a bit more complicated: the rarer a word, the higher this value. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of the quotient. The rarer the word, the higher the "IDF" it will produce.
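The plain version of that formula (before the BM25 refinements that appear later in the code) can be sketched like this; the function name is mine:

```javascript
// Plain IDF: logarithm of (total documents / documents containing the term).
function inverseDocumentFrequency(totalDocs, docsWithTerm) {
  return Math.log(totalDocs / docsWithTerm);
}

// In a 1000-document corpus, a term found in only 10 documents scores far
// higher than a term found in 900 of them.
var rare = inverseDocumentFrequency(1000, 10);    // ~4.6
var common = inverseDocumentFrequency(1000, 900); // ~0.1
```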

If you multiply these two numbers together (TF*IDF), you will get the weight of a word in the document. "Weight" is defined as: How rare is the word and how often does it appear in the document?

You can use this concept for search queries on documents. For each keyword in the query, calculate their TF-IDF score and add them together. The document with the highest score is the document that best matches the query statement.
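A rough sketch of that scoring loop, using made-up lookup tables in place of real corpus statistics:

```javascript
// Score a query against one document by summing TF * IDF per query term.
// `tf` and `idf` are plain objects standing in for precomputed statistics.
function scoreDocument(queryTerms, tf, idf) {
  var score = 0;
  for (var i = 0; i < queryTerms.length; i++) {
    var t = queryTerms[i];
    score += (tf[t] || 0) * (idf[t] || 0); // terms missing from either table add 0
  }
  return score;
}

var tf  = { baseball: 0.05, the: 0.08 };
var idf = { baseball: 4.6,  the: 0.01 };
// The rare term dominates the score even though 'the' appears more often.
var score = scoreDocument(['the', 'baseball'], tf, idf);
```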

How cool!

Okapi BM25

The algorithm above is usable, but far from perfect. It gives us a statistically based relevance score, and we can improve it further.

Okapi BM25 is considered one of the most advanced ranking algorithms to date (or so say the folks behind ElasticSearch). On top of TF-IDF, Okapi BM25 adds two tunable parameters, k1 and b, which control "term frequency saturation" and "field-length normalization" respectively. What do those mean?

To understand "term frequency saturation" intuitively, imagine two articles of similar length discussing baseball. Also, assume that the rest of the corpus doesn't have much baseball-related content, so the word "baseball" has a high IDF: it is rare and important. Both articles devote considerable space to baseball, but one uses the word "baseball" more than the other. Should their scores really differ by much in this case? Since both documents discuss baseball at length, it hardly matters whether the word "baseball" appears 40 times or 80 times. In fact, 30 occurrences ought to be the cap!


This is "term frequency saturation". The naive TF-IDF algorithm has no concept of saturation, so a document with "baseball" appearing 80 times gets twice the score of one where it appears 40 times. Sometimes that is what we want; sometimes it isn't.

Okapi BM25's k1 parameter controls how quickly this saturation sets in. Its value generally falls between 1.2 and 2.0; the lower the value, the faster saturation occurs (meaning the two documents above would receive nearly the same score, because both contain plenty of the word "baseball").
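The saturation effect is easy to see in isolation. This helper (mine, not from the article) evaluates just the TF component of BM25 with length normalization switched off:

```javascript
// BM25's TF component, ignoring length normalization (i.e. b = 0):
//   tf * (k1 + 1) / (tf + k1)
// As the raw count grows, the result flattens toward an asymptote of k1 + 1.
function saturatedTf(count, k1) {
  return (count * (k1 + 1)) / (count + k1);
}

var k1 = 1.2;
saturatedTf(40, k1); // ~2.14
saturatedTf(80, k1); // ~2.17 -- doubling the count barely moves the score
```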

Field-length normalization scales a document's term counts relative to the average document length across the corpus. For single-field collections (such as ours), it puts documents of different lengths on the same footing for comparison. It makes even more sense for two-field collections (such as "title" and "body"), where it puts the title and body fields on the same footing as well. Field-length normalization is controlled by b, whose value lies between 0 and 1: 1 means full normalization, 0 means none.
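The normalization factor itself is a one-liner; this function name is mine, but the formula is the one that appears in the search() denominator below:

```javascript
// BM25's length-normalization factor: 1 - b + b * (docLength / avgLength).
// It scales the k1 term in the score's denominator, discounting long documents.
function lengthNorm(docLength, avgLength, b) {
  return 1 - b + b * (docLength / avgLength);
}

var b = 0.75;
lengthNorm(100, 100, b); // 1.0  -- average-length doc: no adjustment
lengthNorm(200, 100, b); // 1.75 -- twice the average: term counts discounted
lengthNorm(200, 100, 0); // 1.0  -- b = 0 disables normalization entirely
```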

Algorithm

You can read the Okapi BM25 formula on its Wikipedia page. Now that we know what each term in it means, the formula should be easy to understand, so let's skip it and go straight to the code:

// Note: stemmer() (a Porter stemmer) and stopStems (a stemmed stop-word list)
// are external dependencies, referenced at the end of this article.
BM25.Tokenize = function(text) {
  text = text
    .toLowerCase()
    .replace(/\W/g, ' ')
    .replace(/\s+/g, ' ')
    .trim()
    .split(' ')
    .map(function(a) { return stemmer(a); });
 
  // Filter out stopStems
  var out = [];
  for (var i = 0, len = text.length; i < len; i++) {
    if (stopStems.indexOf(text[i]) === -1) {
      out.push(text[i]);
    }
  }
 
  return out;
};

We define a simple static method, Tokenize(), to parse a string into an array of tokens. Along the way, we lowercase all tokens (to reduce entropy), run the Porter stemmer to reduce entropy further while improving match quality ("walking" and "walk" become the same token), and filter out stop words (very common words) to cut entropy some more. I've been deep in these concepts for a while, so bear with me if I over-explain this section.
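To try the tokenizer in isolation, you can stub out its two external dependencies: an identity "stemmer" and a tiny stop-word list (the real code uses a full Porter stemmer and a much longer list, both linked at the end of the article). The stubs below are mine, for demonstration only:

```javascript
// Stubs standing in for the external Porter stemmer and stop-word list.
var stemmer = function(w) { return w; };
var stopStems = ['the', 'a', 'of', 'and'];

var BM25 = {};
BM25.Tokenize = function(text) {
  text = text
    .toLowerCase()
    .replace(/\W/g, ' ')   // strip punctuation
    .replace(/\s+/g, ' ')  // collapse whitespace
    .trim()
    .split(' ')
    .map(function(a) { return stemmer(a); });

  // Filter out stop words
  var out = [];
  for (var i = 0, len = text.length; i < len; i++) {
    if (stopStems.indexOf(text[i]) === -1) {
      out.push(text[i]);
    }
  }
  return out;
};

BM25.Tokenize('The QUICK, brown fox!'); // -> ['quick', 'brown', 'fox']
```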

BM25.prototype.addDocument = function(doc) {
  if (typeof doc.id === 'undefined') { throw new Error('ID is a required property of documents.'); };
  if (typeof doc.body === 'undefined') { throw new Error('Body is a required property of documents.'); };
 
  // Raw tokenized list of words
  var tokens = BM25.Tokenize(doc.body);
 
  // Will hold unique terms and their counts and frequencies
  var _terms = {};
 
  // docObj will eventually be added to the documents database
  var docObj = {id: doc.id, tokens: tokens, body: doc.body};
 
  // Count number of terms
  docObj.termCount = tokens.length;
 
  // Increment totalDocuments
  this.totalDocuments++;
 
  // Readjust averageDocumentLength
  this.totalDocumentTermLength += docObj.termCount;
  this.averageDocumentLength = this.totalDocumentTermLength / this.totalDocuments;
 
  // Calculate term frequency
  // First get terms count
  for (var i = 0, len = tokens.length; i < len; i++) {
    var term = tokens[i];
    if (!_terms[term]) { 
      _terms[term] = {
        count: 0,
        freq: 0
      }; 
    };
    _terms[term].count++;
  }
 
  // Then re-loop to calculate term frequency.
  // We'll also update inverse document frequencies here.
  var keys = Object.keys(_terms);
  for (var i = 0, len = keys.length; i < len; i++) {
    var term = keys[i];
    // Term Frequency for this document.
    _terms[term].freq = _terms[term].count / docObj.termCount;
 
    // Inverse Document Frequency initialization
    if (!this.terms[term]) {
      this.terms[term] = {
        n: 0, // Number of docs this term appears in, uniquely
        idf: 0
      };
    }
 
    this.terms[term].n++;
  };
 
  // Calculate inverse document frequencies
  // This is SLOWish so if you want to index a big batch of documents,
  // comment this out and run it once at the end of your addDocuments run
  // If you're only indexing a document or two at a time you can leave this in.
  // this.updateIdf();
 
  // Add docObj to docs db
  docObj.terms = _terms;
  this.documents[docObj.id] = docObj;
};

This is where the addDocument() method works its magic. We essentially build and maintain two parallel data structures: this.documents and this.terms.

this.documents is a database storing all documents. For each document it keeps the original body text, the document's length, and a list of every token along with its count and frequency. With this structure we can answer questions like the following easily and quickly (yes, very fast: a single hash-table lookup in O(1) time): how many times does the word 'walk' appear in document #3?

We also maintain a second data structure, this.terms, which covers every word in the corpus. Through it we can answer, again in O(1) time: how many documents does the word 'walk' appear in, and which ones?

Finally, we record the length of each document and record the average length of documents across the entire corpus.
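The constructor itself isn't shown in the excerpts above. Based on the fields the methods read and write, a minimal version might look like this (the parameter defaults are common choices within the ranges discussed earlier, not values taken from the article):

```javascript
// A minimal constructor sketch -- just the state that addDocument(),
// updateIdf() and search() depend on. Not from the original source.
function BM25() {
  this.documents = {};              // id -> { id, tokens, body, termCount, terms }
  this.terms = {};                  // term -> { n, idf }
  this.totalDocuments = 0;
  this.totalDocumentTermLength = 0; // running sum of all document lengths
  this.averageDocumentLength = 0;
  this.k1 = 1.2;                    // term-frequency saturation (assumed default)
  this.b = 0.75;                    // field-length normalization (assumed default)
}
```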


Note that in the code above, idf is initialized to 0 and the updateIdf() method is commented out. That's because this method runs quite slowly and only needs to run once, after indexing. Since running it once is enough, there's no reason to run it 5,000 times. Commenting it out during a large batch of indexing and running it once at the end saves a lot of time. Here is the code for that function:

BM25.prototype.updateIdf = function() {
  var keys = Object.keys(this.terms);
  for (var i = 0, len = keys.length; i < len; i++) {
    var term = keys[i];
    var num = (this.totalDocuments - this.terms[term].n + 0.5);
    var denom = (this.terms[term].n + 0.5);
    this.terms[term].idf = Math.max(Math.log10(num / denom), 0.01);
  }
};

This is a very simple function, but since it has to traverse every word in the entire corpus and update each one's value, it is a bit slow. It implements the standard formula for inverse document frequency (which you can find on Wikipedia): divide the total number of documents by the number of documents containing the term, then take the logarithm of the quotient (the code uses the BM25 variant, which adds 0.5 smoothing to both counts). I also tweaked it slightly so the return value is always greater than zero.

BM25.prototype.search = function(query) {
 
  var queryTerms = BM25.Tokenize(query);
  var results = [];
 
  // Look at each document in turn. There are better ways to do this with inverted indices.
  var keys = Object.keys(this.documents);
  for (var j = 0, nDocs = keys.length; j < nDocs; j++) {
    var id = keys[j];
    // The relevance score for a document is the sum of a tf-idf-like
    // calculation for each query term.
    this.documents[id]._score = 0;
 
    // Calculate the score for each query term
    for (var i = 0, len = queryTerms.length; i < len; i++) {
      var queryTerm = queryTerms[i];
 
      // We've never seen this term before so IDF will be 0.
      // Means we can skip the whole term, it adds nothing to the score
      // and isn't in any document.
      if (typeof this.terms[queryTerm] === 'undefined') {
        continue;
      }
 
      // This term isn't in the document, so the TF portion is 0 and this
      // term contributes nothing to the search score.
      if (typeof this.documents[id].terms[queryTerm] === 'undefined') {
        continue;
      }
 
      // The term is in the document, let's go.
      // The whole term is :
      // IDF * (TF * (k1 + 1)) / (TF + k1 * (1 - b + b * docLength / avgDocLength))
 
      // IDF is pre-calculated for the whole docset.
      var idf = this.terms[queryTerm].idf;
      // Numerator of the TF portion.
      var num = this.documents[id].terms[queryTerm].count * (this.k1 + 1);
      // Denominator of the TF portion.
      var denom = this.documents[id].terms[queryTerm].count 
        + (this.k1 * (1 - this.b + (this.b * this.documents[id].termCount / this.averageDocumentLength)));
 
      // Add this query term to the score
      this.documents[id]._score += idf * num / denom;
    }
 
    if (!isNaN(this.documents[id]._score) && this.documents[id]._score > 0) {
      results.push(this.documents[id]);
    }
  }
 
  results.sort(function(a, b) { return b._score - a._score; });
  return results.slice(0, 10);
};

Finally, the search() method traverses all documents and gives the BM25 score of each document, and then sorts them in descending order. Of course, it would be unwise to go through every document in the corpus during the search. This issue is addressed in Part Two (Inverted Indexes and Performance).


The code above is well commented, and the gist is this: compute the BM25 score of each document against each query term. The IDF of each term is pre-calculated, so it only needs to be looked up; term frequencies are likewise pre-computed as part of each document's properties. After that, only a bit of simple arithmetic remains. Finally, we attach a temporary _score variable to each document, sort by score in descending order, and return the top 10 results.
Examples, source code, notes and next steps

There are many ways to optimize the above example, we will introduce them in the second part of "Full Text Search", please stay tuned. I hope I can finish it in a few weeks. Below is a list of things that will be mentioned next time:

  • Inverted indexes and fast search
  • Quick index
  • Better search results

For this demonstration, I wrote a small Wikipedia crawler that grabbed the first paragraph of a fairly large number (85,000) of Wikipedia articles. Since indexing all 85K paragraphs takes about 90 seconds on my computer, I've cut that number in half; I don't want you to waste your laptop's battery on a simple full-text-search demo.

Because indexing is a heavy, CPU-blocking operation, I implemented it as a web worker, so indexing runs on a background thread. You can find the full source code here, along with references to the stemming algorithm and my stop-word list. As for licensing: as always, free for educational purposes, but not for any commercial use.

Finally, the demo. Once indexing completes, try searching for random things and phrases that Wikipedia would know about. Note that only 40,000 paragraphs are indexed, so you may need to try a few different topics.

Source: php.cn