
Build a Search Engine with Node.js and Elasticsearch


This article was peer reviewed by Mark Brown, Vildan Softic and Moritz Kröger. Thanks to all of SitePoint's peer reviewers for making SitePoint content the best it can be!

Elasticsearch is an open source search engine that is gaining popularity thanks to its high performance and distributed architecture. This article explores its key features and walks you through using it to create a Node.js search engine.

Key Points

  • Elasticsearch is a high-performance, distributed search engine built on Apache Lucene, used mainly for indexing and searching data in real time.
  • The system is schema-free: it automatically detects data structures and types, and it supports a wide range of operations through a JSON-based RESTful API over HTTP.
  • Elasticsearch installs easily on all major operating systems, either from a ZIP file or via a package manager such as Homebrew, and requires a Java runtime environment to run.
  • The official Elasticsearch module for Node.js makes it straightforward to integrate Elasticsearch into Node.js applications for efficient data indexing and querying.
  • Key Elasticsearch concepts include indices, types, and search, which enable complex queries, filters, and aggregations for refining and analyzing data.
  • Elasticsearch supports near-real-time search: newly indexed data becomes searchable almost immediately, improving the responsiveness of applications that rely on up-to-date information.
  • This tutorial provides practical examples and code snippets for setting up Elasticsearch with Node.js, performing various types of searches, and using advanced features such as aggregations and suggestions.

Introduction to Elasticsearch

Elasticsearch is built on top of Apache Lucene, a high-performance text search engine library. Although Elasticsearch can store and retrieve data, its main purpose is not to serve as a database, but as a search engine (server) whose main goal is to index, search, and provide real-time statistics on data.

Elasticsearch has a distributed architecture that allows it to scale horizontally by adding more nodes and taking advantage of additional hardware. It can scale to thousands of nodes processing petabytes of data. Horizontal scaling also means high availability: the data can be rebalanced if any nodes fail.

Data becomes available for search as soon as it is imported. Elasticsearch is schema-free, stores data in JSON documents, and automatically detects the data structure and its types.

Elasticsearch is also completely API driven: almost any operation can be done through a simple RESTful API using JSON over HTTP. It provides client libraries for almost every programming language, including Node.js. In this tutorial we will use the official client library.

Elasticsearch is very flexible in terms of hardware and software requirements. While the recommended production environment is 64GB of RAM and as many CPU cores as possible, you can still run it on a resource-constrained system and get decent performance (assuming your dataset isn't very large). To follow the examples in this article, a system with 2GB of memory and a single CPU core is enough.

You can run Elasticsearch on all major operating systems (Linux, Mac OS, and Windows). To do this, you need to install the latest version of the Java runtime environment (see the "Installing Elasticsearch" section). To follow the examples in this article, you also need to install Node.js (any version after v0.11.0 will do), as well as npm.

Elasticsearch terminology

Elasticsearch uses its own terminology, which is different in some cases from a typical database system. Here is a list of commonly used terms and their meanings in Elasticsearch.

Index: This term has two meanings in the Elasticsearch context. The first is the operation of adding data. When data is added, the text is broken down into tokens (such as words) and every token is indexed. An index, however, also refers to the place where all of the indexed data is stored. Basically, when you import data, it is indexed into an index, and whenever you want to perform any operation on that data, you need to specify its index name.

Type: Elasticsearch provides a more detailed classification of documents in the index, called types. Each document in the index should also have a type. For example, we can define a library index and then index multiple types of data (such as articles, books, reports, and presentations) into it. Since indexes have almost fixed overhead, it is recommended to use fewer indexes and more types instead of more indexes and fewer types.

Search: This term means what you might expect. You can search data across different indexes and types. Elasticsearch provides many types of search queries, such as term, phrase, range, and fuzzy queries, and even queries for geographic data.

Filter: Elasticsearch allows you to filter search results based on different criteria, to further narrow down the results. If you add a new search query to a set of documents, it might change the order of the results based on relevance, but if you add the same query as a filter, the order remains unchanged.

Aggregation: These provide different types of statistics for aggregated data, such as minimum, maximum, average, sum, histogram, and more.

Suggestions: Elasticsearch provides different types of suggestions for input text. These suggestions can be term- or phrase-based, or even completion suggestions.

Installing Elasticsearch

Elasticsearch is available under the Apache 2 license; it can be downloaded, used and modified for free. Before installing it, you need to make sure that the Java Runtime Environment (JRE) is installed on your computer. Elasticsearch is written in Java and depends on Java libraries to run. To check if Java is installed on your system, you can type the following in the command line.

<code>java -version</code>

The latest stable version of Java is recommended (1.8 at the time of writing). You can find a guide on installing Java on your system here.

Next, to download the latest version of Elasticsearch (2.4.0 at the time of writing), visit the download page and download the ZIP file. Elasticsearch doesn't require installation: the single ZIP file contains the complete set of files needed to run the program on all of the supported operating systems. Unzip the downloaded file and you're done! There are several other ways of getting Elasticsearch running, such as the TAR file or packages for different Linux distributions (see here).

If you are running Mac OS X and have Homebrew installed, you can use brew install elasticsearch to install Elasticsearch. Homebrew will automatically add the executable to your path and install the required services. It also helps you update your application with a single command: brew upgrade elasticsearch.

To run Elasticsearch on Windows, run bin\elasticsearch.bat from the unzipped directory. For all other operating systems, run ./bin/elasticsearch from the terminal. At this point, it should be up and running on your system.

As I mentioned earlier, almost everything you can do with Elasticsearch can be done via the REST API. Elasticsearch uses port 9200 by default. To make sure you run it correctly, visit http://localhost:9200/ in your browser, which should show some basic information about the instance you are running.

For more information on installation and troubleshooting, you can access the documentation.

Graphical User Interfaces

Elasticsearch provides almost all features through the REST API and does not come with a graphical user interface (GUI). While I've covered how to do all the necessary operations through the API and Node.js, there are some GUI tools that provide visual information about indexes and data, and even some advanced analytics.

Kibana, developed by the same company, provides a real-time summary of the data, as well as some custom visualization and analysis options. Kibana is free and has detailed documentation.

The community has also developed other tools, including elasticsearch-head, the Elasticsearch GUI, and even a Chrome extension called ElasticSearch Toolbox. These tools can help you browse indexes and data in your browser, and even try different searches and aggregate queries. All of these tools provide walkthroughs for installation and use.

Set up the Node.js environment

Elasticsearch provides an official module for Node.js called elasticsearch. First, you need to add the module to your project folder and save the dependencies for future use.

<code>npm install elasticsearch --save</code>

You can then import the module in the script as follows:

<code>const elasticsearch = require('elasticsearch');</code>

Finally, you need to set up a client that handles communication with Elasticsearch. In this example, I assume you are running Elasticsearch on your local machine with an IP address of 127.0.0.1 and a port of 9200 (default setting).

<code>const esClient = new elasticsearch.Client({
  host: '127.0.0.1:9200',
  log: 'error'
});</code>

The log option ensures that all errors are logged. For the rest of this article, I will use the same esClient object to communicate with Elasticsearch. The complete documentation for the Node.js module is provided here.

Note: All of the source code for this tutorial is available on GitHub. The easiest way to follow along is to clone the repository to your PC and run the examples from there:

<code>git clone https://github.com/sitepoint-editors/node-elasticsearch-tutorial.git
cd node-elasticsearch-tutorial
npm install</code>

Import data

In this tutorial, I will use a dataset of academic articles with randomly generated content. The data is provided in JSON format, and the dataset contains 1,000 articles. To give you an idea of what the data looks like, one item from the dataset is shown below.

<code>{
    "_id": "57508457f482c3a68c0a8ab3",
    "title": "Nostrud anim proident cillum non.",
    "journal": "qui ea",
    "volume": 54,
    "number": 11,
    "pages": "109-117",
    "year": 2014,
    "authors": [
      {
        "firstname": "Allyson",
        "lastname": "Ellison",
        "institution": "Ronbert",
        "email": "Allyson@Ronbert.tv"
      },
      ...
    ],
    "abstract": "Do occaecat reprehenderit dolore ...",
    "link": "http://mollit.us/57508457f482c3a68c0a8ab3.pdf",
    "keywords": [
      "sunt",
      "fugiat",
      ...
    ],
    "body": "removed to save space"
  }</code>

The field names are self-explanatory. The only thing to note is that the body field is not displayed here, since it contains a complete, randomly generated article (of 100 to 200 paragraphs). You can find the complete dataset here.

While Elasticsearch provides methods for indexing, updating, and deleting single data points, we will use Elasticsearch's bulk method to import the data, which is more efficient for performing operations on large datasets:

<code>// index.js

const fs = require('fs');
const elasticsearch = require('elasticsearch');

const esClient = new elasticsearch.Client({
  host: '127.0.0.1:9200',
  log: 'error'
});

const bulkIndex = function bulkIndex(index, type, data) {
  let bulkBody = [];

  data.forEach(item => {
    // the action description: index this document under the given index/type/ID
    bulkBody.push({
      index: {
        _index: index,
        _type: type,
        _id: item.id
      }
    });

    // the document itself
    bulkBody.push(item);
  });

  esClient.bulk({body: bulkBody})
  .then(response => {
    let errorCount = 0;
    response.items.forEach(item => {
      if (item.index && item.index.error) {
        console.log(++errorCount, item.index.error);
      }
    });
    console.log(
      `Successfully indexed ${data.length - errorCount} out of ${data.length} items`
    );
  })
  .catch(console.error);
};

const test = function test() {
  const articlesRaw = fs.readFileSync('data.json');
  const articles = JSON.parse(articlesRaw.toString());
  bulkIndex('library', 'article', articles);
};

test();</code>

Here, we call the bulkIndex function, passing it library as the index name, article as the type, and the JSON data we want indexed. The bulkIndex function in turn calls the bulk method on the esClient object. This method takes an object with a body property as an argument. The value supplied to the body property is an array with two entries for each operation. In the first entry, the type of the operation is specified as a JSON object. Within this object, the index property determines the operation to be performed (indexing a document in this case), along with the index name, type name, and the document ID. The next entry corresponds to the document itself.

Note that in the future you can add other types of documents (such as books or reports) to the same index in this way. We could also assign unique IDs to each document, but this is optional: if you don't provide one, Elasticsearch will assign a unique, randomly generated ID to each document for you.

Assuming you have cloned the repository, you can now import the data into Elasticsearch by executing the following command from the project root:

<code>node index.js</code>

Check whether the data is indexed correctly

A major feature of Elasticsearch is its near-real-time search. This means that once the documents are indexed, they are available for searching within a second (see here). Once the data is indexed, you can check the index information by running indices.js (linked to the source code):

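Below is a minimal sketch of what indices.js can look like, reusing the same client setup as before:

<code>// indices.js

const elasticsearch = require('elasticsearch');

const esClient = new elasticsearch.Client({
  host: '127.0.0.1:9200',
  log: 'error'
});

// list all indices, with a header row (v: true) showing
// health, document count, and size on disk
esClient.cat.indices({v: true})
  .then(console.log)
  .catch(err => console.error(`Error connecting to the es client: ${err}`));</code>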

The methods on the client's cat object provide different kinds of information about the currently running instance. The indices method lists all indexes, along with their health status, their number of documents, and their size on disk. The v option adds a header row to the cat method's response.

When you run the above snippet, you will notice that it outputs a color code indicating the health status of your cluster. Red indicates that something is wrong with the cluster and it is not running. Yellow means the cluster is running but there is a warning, and green means everything is fine. Most likely (depending on your setup) you will get a yellow status when running on your local machine. This is because the default settings create each index with replica shards, and with only a single instance running locally the replicas cannot be assigned. While you should always aim for green status in production, for the purposes of this tutorial you can continue to use Elasticsearch in the yellow state.

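On a local setup with the 1,000 test articles indexed, the output should look something like the following (the exact sizes will differ):

<code>health status index   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   library   5   1       1000            0     40.8mb         40.8mb</code>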

Dynamic and custom mapping

As I mentioned earlier, Elasticsearch is schema-free. This means that you don't have to define the structure of your data before importing it (similar to defining tables in a SQL database); Elasticsearch detects it for you automatically. However, despite being described as schema-free, the data structure is subject to some restrictions.

Elasticsearch calls the structure of the data a mapping. If no mapping exists, when the data is indexed Elasticsearch looks at each field of the JSON data and automatically defines the mapping based on its type. If a mapping entry already exists for a field, it makes sure that new data being added follows the same format. Otherwise, it throws an error.

For example, if {"key1": 12} has already been indexed, Elasticsearch automatically maps field key1 as long. Now, if you try to index {"key1": "value1", "key2": "value2"}, it throws an error, since it expects the type of field key1 to be long. At the same time, the object {"key1": 13, "key2": "value2"} would be indexed without a problem, and key2 of type string would be added to the mapping.
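If you're curious about the mapping Elasticsearch has inferred for our articles, you can retrieve it through the client. Here is a small sketch using the indices.getMapping method of the official module:

<code>// print the mapping that was automatically generated for the library index
esClient.indices.getMapping({index: 'library', type: 'article'})
  .then(mapping => console.log(JSON.stringify(mapping, null, 2)))
  .catch(console.error);</code>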

Mappings are beyond the scope of this article, and in most cases the automatic mapping works well. I recommend looking at the Elasticsearch documentation, which provides an in-depth discussion of mappings.

Build a search engine

Once the data is indexed, we are ready to implement the search engine. Elasticsearch provides an intuitive, full search query structure called Query DSL, which is based on JSON, for defining queries. There are many types of search queries available, but in this article we will look at several of the more common ones. The complete documentation of the Query DSL can be found here.

Remember, I have provided a code link for each example shown here. Once the environment is set up and the test data is indexed, you can clone the repository and run any examples on your machine. To do this, just run node filename.js from the command line.

Return all documents in one or more indexes

To perform a search, we will use various search methods provided by the client. The easiest query is match_all, which returns all documents in one or more indexes. The following example shows how we get all stored documents in the index (link to source code).

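A minimal sketch of such a search follows. It reuses the esClient object from earlier; the small search helper and the way the results are printed are illustrative, but the body contains exactly a match_all query:

<code>const search = function search(index, body) {
  return esClient.search({index: index, body: body});
};

const test = function test() {
  let body = {
    size: 20,
    from: 0,
    query: {
      match_all: {}
    }
  };

  search('library', body)
  .then(results => {
    console.log(`found ${results.hits.total} items in ${results.took}ms`);
    results.hits.hits.forEach(
      (hit, index) => console.log(`${body.from + ++index} - ${hit._source.title}`)
    );
  })
  .catch(console.error);
};

test();</code>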

The main search query is contained within the query object. As we will see later, we can add different types of search queries to this object. For each query, we add a key with the query type (match_all in this example), whose value is an object containing the search options. There are no options in this example, since we want to return all of the documents in the index.

In addition to the query object, the search body can also contain other optional properties, including size and from. The size attribute determines the number of documents to be included in the response. If this value does not exist, ten documents are returned by default. The from property determines the starting index of the returned document. This is useful for pagination.

Understand the search API response

If you log the response of the search API (results in the example above), it might initially look overwhelming, because it contains a lot of information.

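Here is a trimmed-down sketch of the shape of that response (the values are illustrative):

<code>{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1000,
    "max_score": 1,
    "hits": [
      {
        "_index": "library",
        "_type": "article",
        "_id": "57508457f482c3a68c0a8ab3",
        "_score": 1,
        "_source": {
          "title": "Nostrud anim proident cillum non.",
          ...
        }
      },
      ...
    ]
  }
}</code>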

At the highest level, the response includes a took property for the number of milliseconds it took to find the results; timed_out, which is true only if no results were found in the maximum allowed time; _shards, which provides information about the status of the different nodes (if deployed as a cluster of nodes); and hits, which contains the search results.

In the hits property, we have an object with the following properties:

  • total—indicates the total number of matched items
  • max_score—the maximum score of the found items
  • hits—an array containing the found items. Within each document in the hits array, we have the index, type, document ID, score, and the document itself (within the _source element).

It might seem complex, but the good news is that once you implement a method for extracting the results, you will always get the results in the same format, regardless of your search query.

Also note that one of the advantages of Elasticsearch is that it automatically assigns a score to each matched document. This score is used to quantify the document's relevance, and by default, results are returned ordered by decreasing score. In a case where we retrieve all documents with match_all, the scores are meaningless and all of them are calculated as 1.0.

Matching documents that contain a specific value in a field

Now, let's look at some more interesting examples. To match documents that contain a specific value in a field, we can use the match query. A simple search body with a match query is shown below (link to the source code).

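A sketch of such a search body, with placeholder search terms:

<code>{
  query: {
    match: {
      title: {
        query: 'search terms go here'
      }
    }
  }
}</code>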

As I mentioned earlier, we first add an entry to the query object containing the search type, which is match in this example. Inside the search type object, we identify the document field to be searched (title here). Inside that goes the search-related data, including the query property. I hope that after testing the example above, you start to become amazed at the speed of the searches.

The above search query returns documents whose title field matches any word in the query property. We can set a minimum number of matched terms as follows.

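A sketch requiring at least three matching terms:

<code>{
  query: {
    match: {
      title: {
        query: 'search terms go here',
        minimum_should_match: 3
      }
    }
  }
}</code>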

This query matches a document when at least three of the specified words are in its title. If fewer than three words are in the query, all of them must appear in the title for the document to match. Another useful feature to add to search queries is fuzziness. This is useful when the user makes a typo in the query, since fuzzy matching will find closely spelled terms. For strings, the fuzziness value is based on the maximum permitted Levenshtein distance for each term. Below is an example with fuzziness.

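A sketch with fuzziness added; a fuzziness of 2 tolerates up to two single-character edits per term:

<code>{
  query: {
    match: {
      title: {
        query: 'search tems go here',  // note the typo in "tems"
        fuzziness: 2
      }
    }
  }
}</code>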

Search in multiple fields

If you want to search in multiple fields, the multi_match search type can be used. It's similar to match, except that instead of having the field as a key in the search query object, we add a fields key, which is an array of the fields to be searched. Here, we search within the title, authors.firstname, and authors.lastname fields. (Link to the source code)

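A sketch of such a query:

<code>{
  query: {
    multi_match: {
      query: 'search terms go here',
      fields: ['title', 'authors.firstname', 'authors.lastname']
    }
  }
}</code>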

The multi_match query supports other search properties such as minimum_should_match and fuzziness. Elasticsearch supports wildcards (such as *) for matching multiple fields, so we can shorten the above example to ['title', 'authors.*name'].

Match the complete phrase

Elasticsearch can also match a phrase exactly as entered, without matching at the term level. This query is an extension to the regular match query, called match_phrase. Below is an example of a match_phrase. (Link to the source code)

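A sketch of a match_phrase query body:

<code>{
  query: {
    match_phrase: {
      title: {
        query: 'search phrase goes here'
      }
    }
  }
}</code>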

Combining multiple queries

So far, in the examples, we have used only a single query per request. Elasticsearch, however, allows you to combine multiple queries. The most common compound query is bool. The bool query accepts four types of keys: must, should, must_not, and filter. As their names imply, documents in the results must match the queries within must, must not match the queries within must_not, and will get a higher score if they match the queries within should. Each of the mentioned elements can receive multiple queries, in the form of an array of queries. Below, we use the bool query along with a new query type called query_string. This allows you to write more advanced queries using keywords such as AND and OR. The complete documentation of the query_string syntax can be found here. In addition, we use a range query (documentation here), which allows us to restrict a field to a given range. (Link to the source code)
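Below is a sketch of such a compound query, with term1, term2, and term3 standing in for real search terms (the exact options in the repository's example may differ):

<code>{
  query: {
    bool: {
      must: [
        {
          query_string: {
            // the first name must contain term1, or the last name must contain term2,
            // and the title must contain term3
            query: '(authors.firstname:term1 OR authors.lastname:term2) AND (title:term3)'
          }
        }
      ],
      must_not: [
        {
          // exclude articles published between 2011 and 2013
          range: {
            year: {
              gte: 2011,
              lte: 2013
            }
          }
        }
      ],
      should: [
        {
          // matching this phrase is optional, but boosts the score
          match_phrase: {
            body: {
              query: 'search phrase goes here'
            }
          }
        }
      ]
    }
  }
}</code>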

In the above example, the query returns documents where the author's first name contains term1 or their last name contains term2, and their title contains term3, and they were not published in 2011, 2012, or 2013. Also, documents that contain the given phrase in their body receive a higher score and are shown at the top of the results (since the match query is in the should clause).

Filters, Aggregations, and Suggestions

In addition to its advanced search capabilities, Elasticsearch offers other features as well. Here, we look at three of the more common ones.

Filters

Often, you might want to refine your search results based on specific criteria. Elasticsearch provides this functionality through filters. In our articles data, imagine that your search returned several articles, of which you want to select only the articles published in five specific years. You can simply filter out everything that doesn't match your criteria from the search results, without changing the order of the results.

The difference between a filter and the same query in the must clause of a bool query is that a filter doesn't affect the search scores, while a must query does. When search results are returned and the user filters on some specific criteria, they don't want the original order of results to be changed; instead, they only want irrelevant documents removed from the results. Filters follow the same format as searches, but more often, they are defined on fields with definitive values, rather than strings of text. Elasticsearch recommends adding filters through the filter clause of the bool compound search query.

Continuing with the example above, imagine that we want to limit the results of our search to articles published between 2011 and 2015. To do so, we only need to add a range query to the filter section of the original search query. This will remove any unmatched documents from the results. Below is an example of a filtered query. (Link to the source code)

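A sketch of the filtered query:

<code>{
  query: {
    bool: {
      must: [
        {
          match: {
            title: 'search terms go here'
          }
        }
      ],
      filter: [
        {
          // only keep articles published between 2011 and 2015;
          // this does not affect the relevance scores
          range: {
            year: {
              gte: 2011,
              lte: 2015
            }
          }
        }
      ]
    }
  }
}</code>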

Aggregations

The aggregations framework provides various aggregated data and statistics based on a search query. The two main types of aggregations are metric and bucketing: metric aggregations keep track of and compute metrics over a set of documents, while bucketing aggregations build buckets, where each bucket is associated with a key and a document criterion. Examples of metric aggregations are average, minimum, maximum, sum, and value count. Examples of bucketing aggregations are range, date range, histogram, and terms. An in-depth explanation of the aggregators can be found here.

Aggregations are placed within an aggregations object, which itself is placed directly in the body of the search object. Within the aggregations object, each key is a name assigned to an aggregator by the user. The aggregator type and its options should be placed as the value of that key. Below, we look at two different aggregators: one metric and one bucket. As a metric aggregator, we try to find the minimum year value in the dataset (the oldest article), and the bucket aggregator tries to find the number of times each keyword has appeared. (Link to the source code)

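A sketch of such a search body (the size: 0 line, which suppresses the regular search hits, is optional):

<code>{
  size: 0,  // we only need the aggregation results, not the hits themselves
  aggregations: {
    min_year: {
      min: {field: 'year'}
    },
    keywords: {
      terms: {field: 'keywords'}
    }
  }
}</code>
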
In the above example, we named the metric aggregator min_year (this name can be anything). It is an aggregation of type min over the year field. The bucket aggregator is named keywords, and is an aggregation of type terms over the keywords field. The results of the aggregations are contained within the aggregations element in the response, and at a deeper level, they contain each defined aggregator (min_year and keywords here) along with its results. Below is a partial response from this example.

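A sketch of the relevant part of the response (the values are illustrative):

<code>{
  ...
  "aggregations": {
    "min_year": {
      "value": 1970
    },
    "keywords": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {"key": "sunt", "doc_count": 141},
        {"key": "fugiat", "doc_count": 132},
        ...
      ]
    }
  }
}</code>
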
By default, a maximum of 10 buckets is returned in the response. You can add a size key next to the field in the request to determine the maximum number of buckets returned. If you want to receive all of the buckets, set this value to 0.

Suggestions

Elasticsearch has multiple types of suggesters that provide replacement or completion suggestions for entered terms (documentation here). We will look at term and phrase suggesters here. The term suggester provides suggestions (if any) for each term in the entered text, while the phrase suggester looks at the entered text as a whole phrase (as opposed to breaking it down into terms) and provides other phrase suggestions (if any). To use the suggestions API, we need to call the suggest method on the Node.js client. Below is an example of a term suggester. (Link to the source code)

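A sketch of such a request, with placeholder input text:

<code>esClient.suggest({
  index: 'library',
  body: {
    text: 'text goes here',
    titleSuggester: {
      term: {
        field: 'title',
        size: 5
      }
    }
  }
})
.then(response => console.log(response.titleSuggester))
.catch(console.error);</code>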

In the request body, consistent with all of the other client methods, we have an index field that determines the index for the search. In the body property, we add the text for which we are seeking suggestions, and (as with the aggregation objects) we give each suggester a name (titleSuggester in this case). Its value determines the type and the options of the suggester. In this case, we are using a term suggester for the title field, and limiting the maximum number of suggestions per term to five (size: 5).

The response of the suggest API contains one key for every suggester you requested, which is an array with the same size as the number of terms in the text field. For each object inside that array, there is an options array whose objects contain the suggestions in their text fields. Below is a partial response to the above request.

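A sketch of the response shape (the values are illustrative):

<code>{
  "took": 2,
  "titleSuggester": [
    {
      "text": "text",
      "offset": 0,
      "length": 4,
      "options": [
        {
          "text": "test",
          "score": 0.75,
          "freq": 48
        },
        ...
      ]
    },
    ...
  ]
}</code>
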
To get suggestions for a phrase, we can follow the same format as above and just replace the suggester type with phrase. In the following example, the response follows the same format as explained above. (Link to the source code)

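A sketch of a phrase suggester request; the suggester name (bodySuggester) and the choice of the body field are assumptions for illustration:

<code>esClient.suggest({
  index: 'library',
  body: {
    text: 'phrase goes here',
    bodySuggester: {
      phrase: {
        field: 'body'
      }
    }
  }
})
.then(response => console.log(response.bodySuggester))
.catch(console.error);</code>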

Further reading

Elasticsearch provides a wide range of features, well beyond the scope of a single article. In this article I have tried to explain its features from a high level and refer you to the proper resources for further study. Elasticsearch is very reliable and has excellent performance (something I hope you noticed while running the examples). This, along with growing community support, has increased Elasticsearch adoption in industry, especially among companies that deal with real-time or big data.

After going over the examples provided here, I highly recommend reading the documentation. It provides two main sources: a reference covering Elasticsearch and its features, and a guide that focuses more on implementation, use cases, and best practices. You can also find detailed documentation for the Node.js client here.

Are you already using Elasticsearch? What is your experience? Or are you planning to give it a try after reading this? Please let me know in the comments below.

Frequently Asked Questions (FAQs) about Building a Search Engine with Node.js and Elasticsearch

What is the difference between Elasticsearch and other search engines (such as Elasticlunr or Minisearch)?

Elasticsearch is a powerful, distributed, open source search and analytics engine. It is designed to handle large amounts of data and provide real-time search results. Elasticlunr and Minisearch, on the other hand, are lightweight client-side search libraries for JavaScript. They are designed for smaller datasets and are often used in browser-based applications. While Elasticsearch offers more advanced features such as distributed search, data analytics, and machine learning capabilities, Elasticlunr and Minisearch are simpler to use for basic search functionality.

How to implement Elasticsearch in Node.js application?

Implementing Elasticsearch in Node.js application involves several steps. First, you need to install the Elasticsearch package using npm. Then you need to create an Elasticsearch client instance and connect it to your Elasticsearch server. After that, you can use the client's methods to perform various actions, such as indexing documents, searching for data, and managing your Elasticsearch cluster.

Can I build a search engine in my browser using Elasticsearch?

While Elasticsearch is mainly designed for server-side applications, it can be used in browser-based applications with the help of a Node.js server. The server can act as a proxy between the browser and the Elasticsearch server, handling all of the search requests and responses. However, for simpler use cases, a client-side search library like Elasticlunr or Minisearch might be a better choice.

How does Elasticsearch compare to other npm search packages?

Elasticsearch is one of the most popular search packages on npm, thanks to its powerful features and scalability. It provides a comprehensive set of APIs for indexing, searching and analyzing data. However, it is more complex and resource-intensive than other npm search packages. If you are working on a small project, or you need simple search functionality, other npm packages such as search-index or js-search may be more suitable.

How to build a simple in-browser search engine using JavaScript?

Building a simple in-browser search engine with JavaScript involves creating an index of your data, implementing a search function, and displaying the search results. You can simplify this process by using a JavaScript search library like Elasticlunr or Minisearch. These libraries provide easy-to-use APIs for indexing and searching data, and can be used directly in the browser without requiring a server.

What are the advantages of using Elasticsearch for search in my app?

Elasticsearch provides many advantages for implementing search capabilities in your application. It provides real-time search results, which means that once the document is indexed, it becomes searchable. It also supports complex search queries, allowing you to search data based on multiple criteria. Additionally, Elasticsearch is highly scalable and can handle large amounts of data without affecting performance.

How does Elasticsearch handle data indexing?

Elasticsearch uses a data structure called an inverted index for data indexing, which allows it to quickly find documents that match a search query. When a document is indexed, Elasticsearch analyzes its content, creates a list of unique words, and stores them in the inverted index along with information about their positions in the document.

Can I use Elasticsearch for data analysis?

Yes, Elasticsearch is not only a search engine, but also a powerful data analysis tool. It supports aggregation, allowing you to aggregate and analyze your data in a variety of ways. You can use Elasticsearch to perform complex data analysis tasks such as calculating mean, sum or count, finding minimum or maximum values, grouping data based on specific conditions, and more.

Is Elasticsearch suitable for big data applications?

Yes, Elasticsearch is designed to handle big data applications. It is a distributed system, which means it can scale horizontally by adding more nodes to the cluster. This allows it to process large amounts of data and deliver fast search results even under heavy loads. Additionally, Elasticsearch supports sharding and replication, which further enhances its scalability and reliability.

How to optimize the performance of Elasticsearch applications?

There are several ways to optimize the performance of your Elasticsearch application. First, you should correctly configure the Elasticsearch cluster, including the number of nodes, shards, and replicas. Second, you should optimize the indexing process by using batch indexing, disabling refreshes, and using the correct analyzer. Finally, you should optimize your search queries by using filters instead of queries as much as possible, avoiding heavy aggregations, and using the "explain" API to understand how the query is executed.
