A highly scalable, open source full-text retrieval and analysis engine that can quickly store and search in quasi-real-time , analyze massive amounts of data.
Full-text retrieval means that the computer indexing program builds an index for each word by scanning each word in the article, indicating the number and location of the word in the article. When the user queries, the retrieval program will build an index based on the pre-established Search the index and feed the search results back to the user's search method. This process is similar to the process of looking up a word through the search word list in a dictionary. Full-text search of data in search engine database
Mysql only has the term dictionary layer, which is stored on disk in a b-tree sorting manner Up. Retrieving a term requires several random access disk operations. Lucene adds a term index to the term dictionary to speed up retrieval. The term index is cached in memory in the form of a tree. After finding the block location of the corresponding term dictionary from the term index, we then go to the disk to find the term, which greatly reduces the number of random accesses to the disk. In addition: the term index is saved in the form of FST (finite state transducers) in memory, which is very memory-saving. The term dictionary is saved on the disk in blocks. A block is compressed using a common prefix. For example, if all words start with Ab, Ab can be omitted. In this way, term dictionary can save more disk space than b-tree.
We use MySQL data storage, use MySQL's transaction features to maintain data consistency, and use ElasticSearch for data collection and query. At this time, es and the database The synchronization scheme is particularly important.
Process
First add the product into the database. After adding the product successfully, the product will be entered into ES. If the entry into ES fails, the failed product ID will be put into the redis cache queue, and the failed product ID will be added. Enter the log file (if redis hangs, you can get the abnormal product ID from the log and then enter it into ES). The task task refreshes the redis cache queue every second. If the product ID is obtained from the cache queue, it is retrieved from the database based on the product ID. Get the product data and then enter it into ES.
Use
logstash-input-jdbc plug-in to synchronize the database, installation and configuration: create a .conf file, configure the database to be synchronized and the .sql sql statement for execution, and finally put A jdbc driver is placed in this folder to connect to the mysql database
[Related learning recommendations: mysql tutorial]
elasticsearch data duplication and incremental synchronization
In the default configuration, the value of tracking_column is @timestamp , the _id value that exists in elasticsearch is the time when logstash is stored in elasticsearch. The main function of this value is similar to the primary key of mysql. It is unique, but our timestamp is actually changing all the time, so we use the select statement to query every time All data will be stored in elasticsearch, resulting in data duplication.
Solution
In the table to be queried, find the primary key or auto-increment field, and set it to the value of _id, because the _id value is unique, so when there is When the _id is repeated, the data will not be repeated
Frequent data synchronization affects the performance of the mysql database
The mysql statement we write to the jdbc.sql file is hard-coded, so every time There are many databases that are queried that no longer need to be queried, especially every time select * from table;, which puts a lot of pressure on the mysql database
Solution:
(1 ) According to business needs, the scheduled synchronization time can be modified appropriately. I have relatively high real-time requirements here, so I set a 10-minute schedule => “*/10 * * * *”
(2) Set the mysql query range To prevent a large number of queries from dragging down the database, set select * from WHERE autoid > :sql_last_value;
elasticsearch storage capacity continues to increase
elasticsearch For data security, after receiving the data, set select * from WHERE autoid > :sql_last_value; , first write the data to the memory and translog, and then create an index and write it to the disk. In this way, even if there is a sudden power outage and restart, it can be restored through translog. However, since there is a lot of duplicate data in each query, these The duplicate data was not written into the elasticsearch index, so it was accumulated, causing the elasticsearch capacity to continue to increase.
Solution:
The query official website says that it will be refreshed regularly and old logs will be automatically cleaned up, so no processing is required.
Incremental synchronization and mysql range query result in the inability to synchronize to the previous data when the mysql database is modified.
Solve the problem of small-scale query of mysql every time and solve the problem of database pressure, but it leads to the problem of being unable to synchronize the modification of old data
Solution:
Can be based on the business If your database is frequently modified, you can only do full updates. However, indexing by scanning the database with high frequency and wide range is worse than not doing indexing at all (because indexing is also costly), we do indexing. Mainly for situations where the amount of data is large, is not modified frequently, and consumes database performance. I have less data modifications here, and the modifications are generally recent data, because during synchronization, I slightly adjusted the mysql range
php composer install composer require elasticsearch/elasticsearch
Introduce the es file autoload.php file, set the IP address
Create index, index corresponds to the database in relational data (hereinafter referred to as MySQL), rather than corresponding to the index in MySQL
Having a database is not enough, you still need to create a table. The same is true for ES. The type in ES corresponds to the table in MySQL. type is not defined separately, but is defined together with the field, which is defined in the body; of course, ik word segmentation can also be used in the body field;
Use EsClient->search () to implement search;
Synonyms and The use of synonyms
[Related learning recommendations: php programming (video)]
Download the ik version package of es
Create the ik directory in the plugins under the es directory, and unzip all the files in the downloaded ik zip package Go in.
Go into the config directory of es, edit elasticsearch.yml, and add index.analysis.analyzer.default.type: "ik" in the blank space.
Pinyin word segmenter configuration: Use the compiled one: elasticsearch-analysis-pinyin-1.3.0
In the plugins directory of elasticsearch, create a new analysis-pinyin folder, unzip the compressed package, and put the jar package inside to the analysis-pinyin folder.
Configure the filter of the Pinyin tokenizer in elasticsearch.yml
Configure the filter of the synonym tokenizer in elasticsearch.yml
Configure the synonym database and create a new sysnonym.txt in the config directory of elasticsearch.
Configuring the tokenizer for ik pinying synonyms, mainly including the name, type, components of token segmentation, and processing of the segmented dimensions: Pinyin and synonyms are used here
ES By adding the html tag field to the field data after the query when querying, the document can be formatted by color or font when displayed on the web interface. , is to modify the highlighted field in highlight. This part contains the text fragment matched by the name attribute, and is encapsulated with HTML tags
The data in Elasticsearch Stored in shards, when performing a search, each shard is searched independently, and the data is then integrated and returned.
The general query process is
1) The client request is sent to a certain node
2) The node forwards it to each shard and queries the first 10 items on each shard
3) The results are returned to the node, the data is integrated, and the first 10 items are extracted
4) Returned to the requesting client
When we query the data from items 10 to 20, there are two ways, including deep paging (from-size) and snapshot paging (scroll);
Depth paging (from-size)
from defines the offset value of the target data, and size defines the number of events currently returned. The default from is 0 and size is 10, which means that all queries only return the first 10 pieces of data by default. Query the first 20 pieces of data, then truncate the first 10 pieces, and only return 10-20 pieces of data. The first 10 queries were wasted. The further back the paging is, the lower the execution efficiency will be. The larger the paging offset value, the longer it will take to execute the paging query
Compared to the paging of from and size, use scroll can simulate a traditional data cursor and record the position of the currently read document information. The usage of this paging is not to query data in real time, but to query a large amount of data (or even all data) at one time. Because this scroll is equivalent to maintaining a snapshot information of the current index segment, this snapshot information is the snapshot when you execute this scroll query. Any new indexed data after this query will not be queried in this snapshot. But compared with from and size, it does not query all the data and eliminate the unnecessary parts, but records a reading position to ensure that the next reading can continue quickly.
process:
column!
The above is the detailed content of The use of ES in MySQL and PHP. For more information, please follow other related articles on the PHP Chinese website!