
MySQL full-text search functions

伊谢尔伦
Release: 2016-11-23 11:56:31

Syntax:

 MATCH (col1,col2,...) AGAINST (expr [IN BOOLEAN MODE | WITH QUERY EXPANSION])

MySQL supports full-text indexing and searching. A full-text index in MySQL is an index of type FULLTEXT. In the MySQL version described here, FULLTEXT indexes can be used only with MyISAM tables (InnoDB gained FULLTEXT support later, in MySQL 5.6); they can be created from CHAR, VARCHAR, or TEXT columns as part of a CREATE TABLE statement, or added later using ALTER TABLE or CREATE INDEX. For large data sets, it is much faster to load your data into a table that has no FULLTEXT index and then create the index than to load data into a table that already has a FULLTEXT index.

Full-text searching is performed with the MATCH() function:

mysql> CREATE TABLE articles (
    ->   id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    ->   title VARCHAR(200),
    ->   body TEXT,
    ->   FULLTEXT (title,body)
    -> );
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO articles (title,body) VALUES
    -> ('MySQL Tutorial','DBMS stands for DataBase ...'),
    -> ('How To Use MySQL Well','After you went through a ...'),
    -> ('Optimizing MySQL','In this tutorial we will show ...'),
    -> ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
    -> ('MySQL vs. YourSQL','In the following database comparison ...'),
    -> ('MySQL Security','When configured properly, MySQL ...');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0
mysql> SELECT * FROM articles
    -> WHERE MATCH (title,body) AGAINST ('database');
+----+-------------------+------------------------------------------+
| id | title             | body                                     |
+----+-------------------+------------------------------------------+
|  5 | MySQL vs. YourSQL | In the following database comparison ... |
|  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
+----+-------------------+------------------------------------------+
2 rows in set (0.00 sec)

The MATCH() function performs a natural language search for a string against a text collection. A collection is a set of one or more columns included in a FULLTEXT index. The search string is given as the argument to AGAINST(). For each row in the table, MATCH() returns a relevance value, that is, a similarity measure between the search string and the text in that row in the columns named in the MATCH() list.

By default, the search is performed in a case-insensitive fashion. However, you can perform a case-sensitive full-text search by using a binary collation for the indexed columns. For example, a column that uses the latin1 character set can be assigned the latin1_bin collation to make full-text searches on it case-sensitive.

When MATCH() is used in a WHERE clause, as in the example above, the rows returned are automatically sorted with the highest relevance first. Relevance values are non-negative floating-point numbers. Zero relevance means no similarity. Relevance is computed from the number of words in the row, the number of unique words in that row, the total number of words in the collection, and the number of documents (rows) that contain a particular word.

For natural language full-text searches, the columns named in the MATCH() function must be the same columns included in some FULLTEXT index in your table. For the preceding query, note that the columns named in the MATCH() function (title and body) are the same as those named in the definition of the articles table's FULLTEXT index. If you want to search the title or body separately, you need to create a separate FULLTEXT index for each column.

It is also possible to perform a Boolean search or a search with query expansion. Both search types are described in later sections.

The preceding example is a basic illustration of how to use the MATCH() function, with rows returned in order of decreasing relevance. The next example shows how to retrieve the relevance values explicitly. The returned rows are not ordered because the SELECT statement contains neither a WHERE nor an ORDER BY clause:

mysql> SELECT id, MATCH (title,body) AGAINST ('Tutorial')
    -> FROM articles;
+----+-----------------------------------------+
| id | MATCH (title,body) AGAINST ('Tutorial') |
+----+-----------------------------------------+
|  1 |                        0.65545833110809 |
|  2 |                                       0 |
|  3 |                        0.66266459226608 |
|  4 |                                       0 |
|  5 |                                       0 |
|  6 |                                       0 |
+----+-----------------------------------------+
6 rows in set (0.00 sec)

The following example is more complex. The query returns the relevance values and also sorts the rows in order of decreasing relevance. To achieve this result, specify MATCH() twice: once in the SELECT list and once in the WHERE clause. This causes no additional overhead, because the MySQL optimizer notices that the two MATCH() calls are identical and invokes the full-text search code only once.

mysql> SELECT id, body, MATCH (title,body) AGAINST
    -> ('Security implications of running MySQL as root') AS score
    -> FROM articles WHERE MATCH (title,body) AGAINST
    -> ('Security implications of running MySQL as root');
+----+-------------------------------------+-----------------+
| id | body                                | score           |
+----+-------------------------------------+-----------------+
|  4 | 1. Never run mysqld as root. 2. ... | 1.5219271183014 |
|  6 | When configured properly, MySQL ... | 1.3114095926285 |
+----+-------------------------------------+-----------------+
2 rows in set (0.00 sec)

The MySQL FULLTEXT implementation regards any sequence of true word characters (letters, digits, and underscores) as a word. That sequence may also contain apostrophes ('), but not more than one in a row. This means that aaa'bbb is regarded as one word, but aaa''bbb is regarded as two words. Apostrophes at the beginning or the end of a word are stripped by the FULLTEXT parser; 'aaa'bbb' is parsed as aaa'bbb.

The FULLTEXT parser determines where words start and end by looking for certain delimiter characters, for example ' ' (space), , (comma), and . (period). If words are not separated by delimiters (as in, for example, Chinese), the FULLTEXT parser cannot determine where a word begins or ends. To be able to add words or other indexed terms in such languages to a FULLTEXT index, you must preprocess them so that they are separated by some arbitrary delimiter such as the double quote (").
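The word rules above can be approximated with a regular expression. The following Python sketch is only an illustration of those rules, not MySQL's actual parser; the name tokenize is hypothetical, and Python's \w is used as a stand-in for "true word characters".

```python
import re

# A word is a run of word characters (letters, digits, underscore) that may
# contain single apostrophes, but never two apostrophes in a row.
# Leading and trailing apostrophes are not part of the match.
WORD_RE = re.compile(r"\w+(?:'\w+)*")

def tokenize(text):
    """Split text into words according to the rules described above."""
    return WORD_RE.findall(text)

print(tokenize("aaa'bbb"))     # one word
print(tokenize("aaa''bbb"))    # two words
print(tokenize("'aaa'bbb'"))   # outer apostrophes stripped
```

Because text without delimiters is matched as one long run, this sketch also shows why undelimited languages need preprocessing before indexing.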

Some words are ignored in full-text searches:

Any word that is too short is ignored. The default minimum length of words that can be found by full-text searches is four characters.

Words in the stopword list are ignored. A stopword is a word such as "the" or "some" that is so common that it is considered to have no semantic value. There is a built-in stopword list, but it can be overridden by a user-defined list.
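As a rough illustration of these two rules, the Python sketch below filters a list of words by minimum length and by a stopword set. The names MIN_WORD_LEN, STOPWORDS, and indexable are hypothetical, and the stopword set is a two-word sample taken from the text, not the built-in list.

```python
MIN_WORD_LEN = 4             # default minimum word length stated above
STOPWORDS = {"the", "some"}  # tiny illustrative sample, not the built-in list

def indexable(words):
    """Keep only words that are long enough and are not stopwords."""
    return [
        w for w in words
        if len(w) >= MIN_WORD_LEN and w.lower() not in STOPWORDS
    ]

print(indexable(["the", "some", "SQL", "MySQL", "database"]))
```

Note that "some" is dropped even though it meets the length limit, while "SQL" is dropped only because it is shorter than four characters.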

Every correct word in the collection and in the query is weighted according to its significance in the collection or query. Consequently, a word that is present in many documents has a lower weight (and may even have zero weight), because it has lower semantic value in this particular collection. Conversely, if the word is rare, it receives a higher weight. The weights of the words are then combined to compute the relevance of the row.

This technique works best with large collections (in fact, it was carefully tuned for them). For very small tables, word distribution does not adequately reflect semantic value, and this model may sometimes produce strange results. For example, although the word "MySQL" is present in every row of the articles table, a search for the word produces no results:

mysql> SELECT * FROM articles
    -> WHERE MATCH (title,body) AGAINST ('MySQL');
Empty set (0.00 sec)

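The search result is empty because the word "MySQL" is present in at least 50% of the rows. As such, it is effectively treated as a stopword. For large data sets, this is the most desirable behavior: a natural language query should not return every second row from a 1GB table. For small data sets, it may be less desirable.

A word that matches half of the rows in a table is less likely to locate relevant documents. In fact, it is more likely to find plenty of irrelevant ones. We all know this happens far too often when we try to find something on the Internet with a search engine. By this reasoning, rows containing such a word are assigned a low semantic value for the particular data set in which they occur. A given word may exceed the 50% threshold in one data set but not in another.

The 50% threshold has an important implication when you first try full-text searching to see how it works: if you create a table and insert only one or two rows of text into it, every word in the text occurs in at least 50% of the rows. As a result, no search returns any results. Be sure to insert at least three rows, and preferably many more. Users who need to bypass the 50% limitation can use the Boolean search mode.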
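The effect of the 50% threshold can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not MySQL's weighting code: the function name searchable_terms is hypothetical, rows are split on whitespace only, and a term is simply discarded once it appears in at least half of the rows.

```python
def searchable_terms(rows, query_terms):
    """Return the query terms that occur in fewer than 50% of the rows."""
    total = len(rows)
    kept = []
    for term in query_terms:
        # Count the rows that contain the term as a whole word.
        containing = sum(
            1 for row in rows if term.lower() in row.lower().split()
        )
        if containing * 2 < total:
            kept.append(term)
    return kept

titles = [
    "MySQL Tutorial", "How To Use MySQL Well", "Optimizing MySQL",
    "1001 MySQL Tricks", "MySQL vs. YourSQL", "MySQL Security",
]
print(searchable_terms(titles, ["MySQL", "Tutorial"]))
print(searchable_terms(["apple pie", "apple tart"], ["apple", "pie"]))
```

With the six sample titles only "Tutorial" survives, and with a two-row table no term survives at all, which is why a test table needs at least three rows.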

1. Boolean full-text search

MySQL can also perform Boolean full-text searches using the IN BOOLEAN MODE modifier:

mysql> SELECT * FROM articles WHERE MATCH (title,body)
    -> AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);
+----+-----------------------+-------------------------------------+
| id | title                 | body                                |
+----+-----------------------+-------------------------------------+
|  1 | MySQL Tutorial        | DBMS stands for DataBase ...        |
|  2 | How To Use MySQL Well | After you went through a ...        |
|  3 | Optimizing MySQL      | In this tutorial we will show ...   |
|  4 | 1001 MySQL Tricks     | 1. Never run mysqld as root. 2. ... |
|  6 | MySQL Security        | When configured properly, MySQL ... |
+----+-----------------------+-------------------------------------+

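This query retrieves all the rows that contain the word "MySQL" but do not contain the word "YourSQL".

Boolean full-text searches have these characteristics:

They do not use the 50% threshold.

They do not automatically sort rows in order of decreasing relevance. You can see this in the preceding query result: the row with the highest relevance is the one that contains "MySQL" twice, but it is listed last, not first.

They can work even without a FULLTEXT index, although a search executed in this fashion would be quite slow.

The minimum and maximum word length full-text parameters apply.

The stopword list applies.

The Boolean full-text search capability supports the following operators: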

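+

A leading plus sign indicates that the word must be present in every row that is returned.

-

A leading minus sign indicates that the word must not be present in any of the rows that are returned.

(no operator)

By default (when neither + nor - is specified), the word is optional, but rows that contain it are rated higher. This mimics the behavior of MATCH() ... AGAINST() without the IN BOOLEAN MODE modifier.

> <

These two operators change a word's contribution to the relevance value assigned to a row. The > operator increases the contribution and the < operator decreases it. See the examples below.

( )

Parentheses group words into subexpressions. Parenthesized groups can be nested.

~

A leading tilde acts as a negation operator, causing the word's contribution to the row's relevance to be negative. This is useful for marking "noise" words. A row containing such a word is rated lower than others, but it is not excluded altogether, as it would be with the - operator.

*

The asterisk serves as the truncation operator. Unlike the other operators, it is appended to the word to be affected.

"

A phrase enclosed within double quote (") characters matches only rows that contain the phrase literally, as it was typed. The full-text engine splits the phrase into words and searches the FULLTEXT index for those words. Non-word characters need not match exactly: phrase searching requires only that matches contain exactly the same words as the phrase, in the same order. For example, "test phrase" matches "test, phrase".

If the phrase contains no words that are in the index, the result is empty. For example, if all of its words are either stopwords or shorter than the minimum length of indexed words, the result is empty.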

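The following examples demonstrate some search strings that use Boolean full-text operators:

'apple banana'

Find rows that contain at least one of the two words.

'+apple +juice'

Find rows that contain both words.

'+apple macintosh'

Find rows that contain the word "apple", but rank rows higher if they also contain "macintosh".

'+apple -macintosh'

Find rows that contain the word "apple" but not "macintosh".

'+apple +(>turnover <strudel)'

Find rows that contain the words "apple" and "turnover", or "apple" and "strudel" (in any order), but rank "apple turnover" higher than "apple strudel".

'apple*'

Find rows that contain words such as "apple", "apples", "applesauce", or "applet".

'"some words"'

Find rows that contain the exact phrase "some words" (for example, rows that contain "some words of wisdom" but not "some noise words"). Note that the " characters that enclose the phrase are operator characters that delimit the phrase. They are not the quotes that enclose the search string itself.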
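To make the semantics of the three simplest cases concrete (+word, -word, and a bare word), here is a minimal Python sketch. It is an illustration under stated assumptions, not MySQL's implementation: the function boolean_match is hypothetical, it ignores the > < ( ) ~ * and " operators, and it does no relevance ranking, stopword handling, or length filtering.

```python
import re

def boolean_match(query, text):
    """Return True if text satisfies a query of +word, -word and bare words."""
    words = set(re.findall(r"\w+", text.lower()))
    has_required = False
    matched_optional = False
    for token in query.lower().split():
        if token.startswith("+"):
            # Required word: every returned row must contain it.
            has_required = True
            if token[1:] not in words:
                return False
        elif token.startswith("-"):
            # Excluded word: no returned row may contain it.
            if token[1:] in words:
                return False
        elif token in words:
            # Optional word: enough to match when nothing is required.
            matched_optional = True
    return has_required or matched_optional

print(boolean_match("+MySQL -YourSQL", "MySQL Tutorial"))
print(boolean_match("+MySQL -YourSQL", "MySQL vs. YourSQL"))
```

In this sketch a query made only of excluded words matches nothing, since exclusion alone never selects a row.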

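2. Full-text search with query expansion

Full-text search supports query expansion (in particular, its variant "blind query expansion"). This is generally useful when a search phrase is too short, which often means that the user is relying on implied knowledge that the full-text search engine lacks. For example, a user searching for "database" may really mean that "MySQL", "Oracle", "DB2", and "RDBMS" are all phrases that should match "databases" and should be returned as well. This is implied knowledge.

Blind query expansion (also known as automatic relevance feedback) is enabled by adding WITH QUERY EXPANSION after the search phrase. It works by performing the search twice, where the search phrase for the second search is the original search phrase concatenated with the few most highly relevant documents from the first search. Thus, if one of these documents contains the word "databases" and the word "MySQL", the second search finds the documents that contain the word "MySQL" even if they do not contain the word "database". The following example shows this difference: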

mysql> SELECT * FROM articles
    -> WHERE MATCH (title,body) AGAINST ('database');
+----+-------------------+------------------------------------------+
| id | title             | body                                     |
+----+-------------------+------------------------------------------+
|  5 | MySQL vs. YourSQL | In the following database comparison ... |
|  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
+----+-------------------+------------------------------------------+
2 rows in set (0.00 sec)

mysql> SELECT * FROM articles
    -> WHERE MATCH (title,body)
    -> AGAINST ('database' WITH QUERY EXPANSION);
+----+-------------------+------------------------------------------+
| id | title             | body                                     |
+----+-------------------+------------------------------------------+
|  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
|  5 | MySQL vs. YourSQL | In the following database comparison ... |
|  3 | Optimizing MySQL  | In this tutorial we will show ...        |
+----+-------------------+------------------------------------------+
3 rows in set (0.00 sec)

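Another example could be searching for books by Georges Simenon about Maigret, when a user is not sure how to spell "Maigret". Without query expansion, a search for "Megre and the reluctant witnesses" finds only "Maigret and the Reluctant Witnesses". A search with query expansion finds all books with the word "Maigret" on the second pass.

Note: Because blind query expansion tends to increase noise significantly by returning non-relevant documents, it is worth using only when the search phrase is rather short.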
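The two-pass idea behind blind query expansion can be sketched in Python as follows. This is a deliberately simplified model, not MySQL's algorithm: the helper names are hypothetical, every first-pass hit is used for expansion instead of only the few most relevant documents, and there is no relevance ranking or 50% threshold.

```python
import re

def words(text):
    """Lowercased set of words in a piece of text."""
    return set(re.findall(r"\w+", text.lower()))

def search(rows, terms):
    """Return the rows that share at least one word with terms."""
    return [row for row in rows if words(row) & terms]

def search_with_expansion(rows, query):
    """Search once, then search again with words from the first-pass hits."""
    terms = words(query)
    first_pass = search(rows, terms)
    expanded = set(terms)
    for row in first_pass:
        expanded |= words(row)
    return search(rows, expanded)

docs = ["database MySQL", "MySQL replication", "cooking pasta"]
print(search(docs, words("database")))
print(search_with_expansion(docs, "database"))
```

The second call also returns "MySQL replication", which does not contain "database" at all. The same mechanism explains the noise warning above: every extra word pulled in by the first pass can match unrelated rows.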

3. Full-text stopwords

The following table shows the default list of full-text stopwords:

a's    able    about    above    according    

accordingly    across    actually    after    afterwards    

again    against    ain't    all    allow    

allows    almost    alone    along    already    

also    although    always    am    among    

amongst    an    and    another    any    

anybody    anyhow    anyone    anything    anyway    

anyways    anywhere    apart    appear    appreciate    

appropriate    are    aren't    around    as    

aside    ask    asking    associated    at    

available    away    awfully    be    became    

because    become    becomes    becoming    been    

before    beforehand    behind    being    believe    

below    beside    besides    best    better    

between    beyond    both    brief    but    

by    c'mon    c's    came    can    

can't    cannot    cant    cause    causes    

certain    certainly    changes    clearly    co    

com    come    comes    concerning    consequently    

consider    considering    contain    containing    contains    

corresponding    could    couldn't    course    currently    

definitely    described    despite    did    didn't    

different    do    does    doesn't    doing    

don't    done    down    downwards    during    

each    edu    eg    eight    either    

else    elsewhere    enough    entirely    especially    

et    etc    even    ever    every    

everybody    everyone    everything    everywhere    ex    

exactly    example    except    far    few    

fifth    first    five    followed    following    

follows    for    former    formerly    forth    

four    from    further    furthermore    get    

gets    getting    given    gives    go    

goes    going    gone    got    gotten    

greetings    had    hadn't    happens    hardly    

has    hasn't    have    haven't    having    

he    he's    hello    help    hence    

her    here    here's    hereafter    hereby    

herein    hereupon    hers    herself    hi    

him    himself    his    hither    hopefully    

how    howbeit    however    i'd    i'll    

i'm    i've    ie    if    ignored    

immediate    in    inasmuch    inc    indeed    

indicate    indicated    indicates    inner    insofar    

instead    into    inward    is    isn't    

it    it'd    it'll    it's    its    

itself    just    keep    keeps    kept    

know    knows    known    last    lately    

later    latter    latterly    least    less    

lest    let    let's    like    liked    

likely    little    look    looking    looks    

ltd    mainly    many    may    maybe    

me    mean    meanwhile    merely    might    

more    moreover    most    mostly    much    

must    my    myself    name    namely    

nd    near    nearly    necessary    need    

needs    neither    never    nevertheless    new    

next    nine    no    nobody    non    

none    noone    nor    normally    not    

nothing    novel    now    nowhere    obviously    

of    off    often    oh    ok    

okay    old    on    once    one    

ones    only    onto    or    other    

others    otherwise    ought    our    ours    

ourselves    out    outside    over    overall    

own    particular    particularly    per    perhaps    

placed    please    plus    possible    presumably    

probably    provides    que    quite    qv    

rather    rd    re    really    reasonably    

regarding    regardless    regards    relatively    respectively    

right    said    same    saw    say    

saying    says    second    secondly    see    

seeing    seem    seemed    seeming    seems    

seen    self    selves    sensible    sent    

serious    seriously    seven    several    shall    

she    should    shouldn't    since    six    

so    some    somebody    somehow    someone    

something    sometime    sometimes    somewhat    somewhere    

soon    sorry    specified    specify    specifying    

still    sub    such    sup    sure    

t's    take    taken    tell    tends    

th    than    thank    thanks    thanx    

that    that's    thats    the    their    

theirs    them    themselves    then    thence    

there    there's    thereafter    thereby    therefore    

therein    theres    thereupon    these    they    

they'd    they'll    they're    they've    think    

third    this    thorough    thoroughly    those    

though    three    through    throughout    thru    

thus    to    together    too    took    

toward    towards    tried    tries    truly    

try    trying    twice    two    un    

under    unfortunately    unless    unlikely    until    

unto    up    upon    us    use    

used    useful    uses    using    usually    

value    various    very    via    viz    

vs    want    wants    was    wasn't    

way    we    we'd    we'll    we're    

we've    welcome    well    went    were    

weren't    what    what's    whatever    when    

whence    whenever    where    where's    whereafter    

whereas    whereby    wherein    whereupon    wherever    

whether    which    while    whither    who    

who's    whoever    whole    whom    whose    

why    will    willing    wish    with    

within    without    won't    wonder    would    

would    wouldn't    yes    yet    you    

you'd    you'll    you're    you've    your    

yours    yourself    yourselves    zero    

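4. Full-text restrictions

Full-text searches are supported for MyISAM tables only.

Full-text searches can be used with most multi-byte character sets. The exception is Unicode: the utf8 character set can be used, but not the ucs2 character set.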

Ideographic languages such as Chinese and Japanese do not have word delimiters. Therefore, the FULLTEXT parser cannot determine where words begin and end in these and other such languages.

Although the use of multiple character sets within a single table is supported, all columns in a FULLTEXT index must use the same character set and collation.

The MATCH() column list must exactly match the column list in some FULLTEXT index definition for the table, unless MATCH() is used IN BOOLEAN MODE.

The argument to AGAINST() must be a constant string.

5. Fine-tuning MySQL full-text search

MySQL's full-text search capability has few user-tunable parameters. If you have a MySQL source distribution, you can exert more control over full-text search behavior, since some changes require source code modifications.

Note that full-text search has been carefully tuned for effectiveness. Modifying the default behavior will in most cases only make it less effective. Do not alter the MySQL sources unless you know what you are doing.

Most of the full-text variables described below must be set at server startup. To change them, the server must be restarted; they cannot be modified while the server is running.

Some variable changes require that you rebuild the FULLTEXT indexes in your tables. Instructions for doing so are given at the end of this section.

The ft_min_word_len and ft_max_word_len system variables specify the minimum and maximum length of indexed words. The default minimum value is four characters; the default maximum depends on the version of MySQL in use. If you change either value, you must rebuild your FULLTEXT indexes. For example, if you want three-character words to be searchable, you can set the ft_min_word_len variable by putting the following lines in an option file:

[mysqld]
ft_min_word_len=3

Then restart the server and rebuild your FULLTEXT indexes. Also note the remarks about myisamchk later in this section.

To override the default stopword list, set the ft_stopword_file system variable. The variable value should be the pathname of the file containing the stopword list, or the empty string to disable stopword filtering. After changing the value of this variable or the contents of the stopword file, rebuild your FULLTEXT indexes.

The stopword list is free-form; that is, you can use any non-alphanumeric character such as newline, space, or comma to separate stopwords. The exceptions are the underscore character (_) and the apostrophe ('), which are treated as part of a word. The character set of the stopword list is the server's default character set.

The 50% threshold for natural language searches is determined by the particular weighting scheme chosen. To disable it, look for the following line in myisam/ftdefs.h:

#define GWS_IN_USE GWS_PROB

Change the line to:

#define GWS_IN_USE GWS_FREQ

Then recompile MySQL. There is no need to rebuild the indexes in this case. Note: By doing this you severely reduce MySQL's ability to provide adequate relevance values for the MATCH() function. If you really need to search for such common words, it is better to search using IN BOOLEAN MODE instead, which does not observe the 50% threshold.

To change the operators used for Boolean full-text searches, set the ft_boolean_syntax system variable. This variable can also be changed while the server is running, but you must have the SUPER privilege to do so. No rebuilding of indexes is necessary in this case.

If you modify full-text variables that affect indexing (ft_min_word_len, ft_max_word_len, or ft_stopword_file), or if you change the stopword file itself, you must rebuild your FULLTEXT indexes after making the changes and restarting the server. In this case, a QUICK repair operation is sufficient to rebuild the indexes:

mysql> REPAIR TABLE tbl_name QUICK;

Note that if you use myisamchk to perform an operation that modifies table indexes (such as repair or analyze), the FULLTEXT indexes are rebuilt using the default full-text parameter values for minimum word length, maximum word length, and stopword file, unless you specify otherwise. This can cause queries to fail.

The problem occurs because these parameters are known only to the server. They are not stored in the MyISAM index files. To avoid the problem if you have modified the minimum or maximum word length or the stopword file used by the server, specify the same ft_min_word_len, ft_max_word_len, and ft_stopword_file values for myisamchk that you use for mysqld. For example, if you have set the minimum word length to 3, you can repair a table with myisamchk like this:

shell> myisamchk --recover --ft_min_word_len=3 tbl_name.MYI

To ensure that myisamchk and the server use the same values for the full-text parameters, place each one in both the [mysqld] and [myisamchk] sections of an option file:

[mysqld]
ft_min_word_len=3

[myisamchk]
ft_min_word_len=3

An alternative to using myisamchk is to use REPAIR TABLE, ANALYZE TABLE, OPTIMIZE TABLE, or ALTER TABLE. These statements are performed by the server, which knows the proper full-text parameter values to use.


source:php.cn