What is Elasticsearch? Where can Elasticsearch be used?

零下一度
Release: 2017-06-23 16:10:36
  • Elasticsearch Version: 5.4

  • Elasticsearch Quick Start Part 1: Getting Started with Elasticsearch

  • Elasticsearch Quick Start Part 2: Elasticsearch and Kibana installation

  • Elasticsearch Quick Start Part 3: Elasticsearch index and document operations

  • Elasticsearch Quick Start Part 4: Elasticsearch document query

Elasticsearch is a highly scalable open source full-text search and analysis engine. It can store, search and analyze large-scale data quickly and in near real-time. It is generally used as the underlying engine/technology to provide strong support for applications with complex search functions and requirements.

Elasticsearch fits use cases such as these:

  1. Suppose you run an online store and want customers to be able to search the products you sell. In this case, you can use Elasticsearch to store your entire product catalog and inventory, and to provide search and autocomplete suggestions.

  2. Suppose you want to collect log or transaction data and analyze and mine it for trends, statistics, summaries or anomalies. In this case, you can use Logstash (part of the Elasticsearch/Logstash/Kibana stack) to collect, aggregate and parse your data, and then have Logstash feed this data into Elasticsearch. Once the data is in Elasticsearch, you can search and aggregate whatever information interests you.

  3. Suppose you run a price alert platform that lets price-savvy customers specify a rule such as "I am interested in buying a specific electronic gadget, and I want to be notified if any seller's price drops below $x within the next month". In this case, you can push the sellers' prices into Elasticsearch, use its reverse search (percolator) capability to match price changes against the customers' stored queries, and notify each customer once a match is found.

  4. Suppose you have analytics (business intelligence) needs and want to quickly investigate, analyze, visualize and ask ad-hoc questions of large amounts of data (think millions or billions of records). In this case, you can store the data in Elasticsearch and then use Kibana (part of the Elasticsearch stack) to build custom dashboards that visualize the aspects of the data that matter to you. In addition, you can use Elasticsearch aggregations to run complex business intelligence queries against the data.
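
The "reverse search" idea in the price alert case can be illustrated with a toy sketch: instead of running a query against stored documents, the rules are stored and each incoming price is matched against them. This is plain Python and does not call any Elasticsearch API; the PriceAlert class, the match_alerts function and the sample data are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceAlert:
    """A stored customer rule: notify when `product` is offered below `max_price`."""
    customer: str
    product: str
    max_price: float


def match_alerts(alerts, product, price):
    """Return the customers whose stored rule matches the incoming price."""
    return [a.customer for a in alerts
            if a.product == product and price < a.max_price]


# Hypothetical stored rules (the "queries" that are kept on file).
alerts = [
    PriceAlert("alice", "gadget", 100.0),
    PriceAlert("bob", "gadget", 80.0),
    PriceAlert("carol", "phone", 300.0),
]

# A seller publishes a new price; find everyone who should be notified.
print(match_alerts(alerts, "gadget", 90.0))  # ['alice']
```

In a real deployment the stored rules would be Elasticsearch queries rather than Python objects, but the direction of matching is the same.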

For the rest of this tutorial, I will guide you through getting Elasticsearch up and running and show you some basic operations, such as indexing, searching and modifying data. By the end of this tutorial, you will have a deeper understanding of what Elasticsearch is and how it works. Hopefully you'll be inspired to use it both to build sophisticated search applications and to discover useful things in your data.

Basic Concepts

A few concepts are core to Elasticsearch. Understanding them from the outset will make everything that follows much easier to learn.

Near Real Time (NRT)

Elasticsearch is a near real-time search platform. This means there is only a slight delay (usually 1 second) from the time a document is indexed to the time it becomes searchable.
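
The delay between indexing and searchability can be pictured with a toy model, sketched below under the assumption that newly indexed documents sit in a buffer until a "refresh" makes them visible. The ToyIndex class is hypothetical and only mimics the behaviour; in Elasticsearch the refresh happens periodically (every second by default) rather than by hand.

```python
class ToyIndex:
    """Toy model of near-real-time search: writes are buffered until a refresh."""

    def __init__(self):
        self._buffer = []      # indexed but not yet searchable
        self._searchable = []  # visible to searches

    def index(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        # Elasticsearch does this periodically; here it is triggered by hand.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, word):
        return [d for d in self._searchable if word in d.split()]


idx = ToyIndex()
idx.index("hello world")
before = idx.search("hello")  # nothing yet: the document has not been refreshed
idx.refresh()
after = idx.search("hello")   # now the document is visible
print(before, after)          # [] ['hello world']
```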

Cluster

A cluster is a collection of one or more nodes (servers) that together hold all of your data and provide indexing and search capabilities across all nodes. A cluster is identified by a unique name, which defaults to "elasticsearch". The name matters because a node can belong to only one cluster, and it joins that cluster by name.

Do not reuse the same cluster name in different environments, otherwise a node may join the wrong cluster. For example, you can use the cluster names logging-dev, logging-stage and logging-prod for the development, staging and production environments respectively.

Note that a cluster with only a single node is perfectly valid. You can also run multiple independent clusters, each with its own unique cluster name.

Node

A node is a single server that is part of the cluster, stores data, and participates in the cluster's indexing and search. Like a cluster, a node is identified by a unique name. The default name is a random UUID (Universally Unique IDentifier) that is assigned to the node at startup. You can also set a custom node name if you don't want the default. Names matter to administrators because they let you identify which servers in your network correspond to which nodes in the cluster.

A node joins a specific cluster through its configured cluster name. By default, every node is set up to join a cluster named elasticsearch, so if you start several nodes on a network where they can all reach each other, they will automatically form and join a single cluster named elasticsearch.
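
As a minimal sketch of the naming advice above, the helper below renders the two relevant settings, cluster.name and node.name, as they would appear in elasticsearch.yml. The render_config function and the node name node-1 are hypothetical; the logging-<environment> pattern follows the example names used in this article.

```python
def render_config(environment, node_name):
    """Render the two naming settings for elasticsearch.yml as text."""
    cluster_name = f"logging-{environment}"  # e.g. logging-dev, logging-prod
    return f"cluster.name: {cluster_name}\nnode.name: {node_name}\n"


# Prints the lines one would place in the production node's elasticsearch.yml.
print(render_config("prod", "node-1"), end="")
```

Generating the cluster name from the environment makes it harder for a development node to join the production cluster by accident.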

Index

An index is a collection of documents that share somewhat similar characteristics. For example, you might have one index for customer data, another for a product catalog, and another for order data. An index is identified by a name (which must be all lowercase), and that name is used when indexing, searching, updating, and deleting the documents in it. Within a single cluster, you can define as many indexes as you need.

Type

An index can define one or more types. A type is a logical category or partition of an index, and its exact meaning is up to you. Typically, a type is defined for documents that share a common set of fields. For example, a blogging platform might store all of its data in a single index and define one type for user data, one for blog data, and one for comment data.

Document

A document is the basic unit of information that can be indexed. For example, you can have one document for a single customer, another for a single product, and another for a single order. Documents are expressed in JSON. An index/type can hold as many documents as you like. Note that although a document physically resides in an index, it must be indexed into (assigned to) a type inside that index.
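
To make the index/type/document relationship concrete, here is a small sketch that builds the REST request for indexing one JSON document under /<index>/<type>/<id>, the addressing scheme used by Elasticsearch 5.x. It only constructs the request and sends nothing; the index_request helper is hypothetical, and the customer/external names and sample document are illustrative values.

```python
import json


def index_request(index, doc_type, doc_id, document):
    """Build the method, path and JSON body for indexing one document."""
    if index != index.lower():
        # Index names must be all lowercase.
        raise ValueError("index names must be lowercase")
    path = f"/{index}/{doc_type}/{doc_id}"
    return "PUT", path, json.dumps(document, sort_keys=True)


method, path, body = index_request("customer", "external", 1, {"name": "John Doe"})
print(method, path, body)  # PUT /customer/external/1 {"name": "John Doe"}
```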

Shards & replicas

An index may store massive amounts of data, which may exceed the hard disk capacity of a single node. For example, an index stores 1 billion documents and occupies 1 TB of hard disk space. The hard disk of a single node may not be enough to store such a large amount of data. Even if it can be stored, it may slow down the server's processing speed of search requests.

To solve this problem, Elasticsearch lets you subdivide an index into multiple pieces called shards. When creating an index, you simply define the number of shards you want. Each shard is itself a fully functional, independent "index" that can be hosted on any node in the cluster.

Sharding is important for two main reasons:

  • It allows you to split/scale your content volume horizontally

  • It allows you to distribute operations to shards on multiple nodes in parallel, thereby improving performance or throughput.

How shards are distributed, and how their documents are aggregated back into the response to a search request, is managed entirely by Elasticsearch and is transparent to the user.
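
Although shard distribution is handled for you, the basic routing rule is easy to sketch: a document goes to shard number hash(routing) % number_of_primary_shards, where the routing value is the document ID by default. The code below is an illustration only; zlib.crc32 is a stand-in for the hash function Elasticsearch actually uses, and pick_shard is a hypothetical helper.

```python
import zlib


def pick_shard(routing, number_of_primary_shards):
    """Map a routing value (the document ID by default) to a shard number."""
    # zlib.crc32 is a stand-in; Elasticsearch uses its own hash function.
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


doc_ids = ["1", "2", "3", "4", "5", "6"]
placement = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}
print(placement)
```

Because the shard number depends on the number of primary shards, changing that number would send existing document IDs to different shards, which is one way to see why the shard count is fixed once an index has been created.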

In a network or cloud environment where failures can occur at any time, a failover mechanism is very useful and highly recommended in case a shard or node goes offline or disappears. To this end, Elasticsearch lets you make one or more copies of an index's shards, called replica shards, or simply replicas.

Replicas are important for two main reasons:

  • It provides high availability if a shard/node fails. For this reason, it is important to note that a replica is never allocated on the same node as the original/primary shard it was copied from.

  • It allows you to scale search volume/throughput since searches can be performed in parallel on all replicas.

In summary, each index can be split into multiple shards. Each index can also be replicated zero times (meaning no replicas) or more. Once replicated, an index has primary shards (the original shards that were copied from) and replica shards (copies of the primary shards). The number of shards and replicas can be defined per index when the index is created. After the index is created, you can change the number of replicas dynamically at any time, but you cannot change the number of shards afterwards.
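
The per-index shard and replica settings described above can be sketched as REST request bodies. The snippet only builds and prints the JSON and sends nothing; number_of_shards and number_of_replicas are the Elasticsearch setting names, while the helper functions and the products index name are hypothetical.

```python
import json


def create_index_body(shards, replicas):
    """Body for creating an index with explicit shard and replica counts."""
    return {"settings": {"number_of_shards": shards,
                         "number_of_replicas": replicas}}


def update_replicas_body(replicas):
    """Body for changing the replica count of an existing index."""
    return {"index": {"number_of_replicas": replicas}}


# `products` is a hypothetical index name.
print("PUT /products", json.dumps(create_index_body(5, 1)))
print("PUT /products/_settings", json.dumps(update_replicas_body(2)))
```

Only the replica count appears in the update body, reflecting that the number of primary shards is fixed at creation time.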

By default, each index in Elasticsearch is allocated 5 primary shards and 1 replica. This means that if your cluster has at least two nodes, your index will have 5 primary shards and 5 replica shards (one complete replica), for a total of 10 shards.
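
The shard arithmetic for the defaults can be checked with a short sketch. It assumes only the rule stated earlier, that a replica is never placed on the same node as its primary, and ignores every other allocation constraint; both function names are hypothetical.

```python
def total_shards(primaries, replicas):
    """Primary shards plus all of their replica copies."""
    return primaries * (1 + replicas)


def assignable_shards(primaries, replicas, nodes):
    """Upper bound on shards that can be placed when copies of one shard
    must live on different nodes."""
    copies_per_shard = min(1 + replicas, nodes)
    return primaries * copies_per_shard


print(total_shards(5, 1))          # 10
print(assignable_shards(5, 1, 1))  # 5  (single node: replicas stay unassigned)
print(assignable_shards(5, 1, 2))  # 10 (two nodes: every copy has a home)
```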

Each Elasticsearch shard is a Lucene index. A single Lucene index can hold only a limited number of documents: as of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes with the _cat/shards API.
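
The document limit quoted above follows directly from Java's Integer.MAX_VALUE (2^31 - 1). As a rough sketch, the snippet below recomputes it and derives a lower bound on the number of primary shards needed for a given document count. The min_primary_shards helper is hypothetical, and the result is only a hard floor implied by the limit, not a sizing recommendation.

```python
import math

JAVA_INTEGER_MAX_VALUE = 2**31 - 1
MAX_DOCS_PER_SHARD = JAVA_INTEGER_MAX_VALUE - 128  # limit per Lucene index


def min_primary_shards(document_count):
    """Smallest shard count that keeps every shard under the Lucene limit,
    assuming documents are spread evenly."""
    return max(1, math.ceil(document_count / MAX_DOCS_PER_SHARD))


print(MAX_DOCS_PER_SHARD)             # 2147483519
print(min_primary_shards(10**9))      # 1
print(min_primary_shards(5 * 10**9))  # 3
```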

Summary

1. Why not use a relational database for search? Search implemented on top of a database performs very poorly, and the database cannot tokenize text, so word-level (segmented) search is not possible.

2. What are full-text search, the inverted index and Lucene? Others have already summarized these well; please refer to [Teaching you step-by-step full-text retrieval] A preliminary exploration of Apache Lucene.

3. Characteristics of Elasticsearch

  • It can run as a distributed cluster and process massive amounts of data in near real time;

  • It is simple to use out of the box. If the data volume is modest, operating it is not complicated;

  • It offers features that relational databases lack, such as full-text search, synonym handling, relevance ranking, complex data analysis, and near real-time processing of massive data;

  • It is built on Lucene, hides Lucene's complexity, and exposes a simple, easy-to-use RESTful API and Java API.

4. Core concepts of Elasticsearch

  • Cluster: A cluster contains multiple nodes. Which cluster a node belongs to is determined by configuration (the default cluster name is elasticsearch).

  • Node: A node in the cluster. By default, a node automatically joins the cluster named "elasticsearch". Each running Elasticsearch service is one node, so a machine that starts two Elasticsearch services hosts two nodes.

  • Index: Roughly equivalent to a database in MySQL. It contains a set of documents with a similar structure.

  • Type: Roughly equivalent to a table in MySQL. It is a logical classification of the data within an index.

  • Document: Roughly equivalent to a row in a MySQL table. It is the smallest unit of data in Elasticsearch.

  • Shard: A single machine cannot store an unlimited amount of data, so Elasticsearch can split the data of an index into multiple shards and distribute them across multiple servers.

  • Replica: A copy of a shard, kept so that data survives if a node goes down and a shard is lost. Because a replica is never placed on the same node as its primary shard, the minimum high-availability configuration is 2 servers.
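
Point 2 of this summary mentions the inverted index, the data structure that makes word-level search fast. The toy sketch below shows the idea in plain Python: each word maps to the set of documents that contain it, and a multi-word query intersects those sets. It is a simplified illustration with hypothetical function names and sample data, not how Lucene is implemented.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {1: "Quick brown fox", 2: "Lazy brown dog", 3: "Quick red fox"}
index = build_inverted_index(docs)
print(sorted(search(index, "quick fox")))  # [1, 3]
print(sorted(search(index, "brown")))      # [1, 2]
```

A lookup touches only the entries for the query words instead of scanning every row, which is the key difference from a substring search in a relational database.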

source:php.cn