A while ago I tried out Sphinx, a full-text search system that can easily be called from various languages (PHP/Python/Ruby/etc). Most of the guides on the Internet cover installing and using it on Linux. A production deployment should of course run in a *nix environment, but for learning and testing, Windows is more convenient.
This article aims to provide a convenient way to install and configure Sphinx under Windows with support for Chinese full-text search. The configuration part applies equally on Linux.
1. About Sphinx
Sphinx is a full-text search engine released under GPLv2. Commercial use (for example, embedding it into other programs) requires contacting the author (Sphinxsearch.com) to obtain a commercial license.
Generally speaking, Sphinx is a standalone search engine intended to give other applications fast, space-efficient full-text search with highly relevant results. Sphinx can be easily integrated with SQL databases and scripting languages.
The current system has built-in support for MySQL and PostgreSQL database data sources, and also supports reading XML data in specific formats from standard input. By modifying the source code, users can add new data sources (for example, native support for other types of DBMS).
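As an illustration of the XML input mentioned above, here is a minimal sketch that builds such a document stream. It assumes the xmlpipe2 layout described in the Sphinx manual (a `sphinx:docset` root, a `sphinx:schema` listing the fields, and one `sphinx:document` per record); the function name `build_docset` and the sample documents are hypothetical, and the exact format should be checked against the manual for your Sphinx version.

```python
from xml.sax.saxutils import escape, quoteattr


def build_docset(field_names, documents):
    """Return an xmlpipe2-style XML string (assumed layout, see lead-in).

    documents is an iterable of (doc_id, {field_name: text}) pairs.
    """
    lines = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
    ]
    # Declare every full-text field once in the schema.
    for name in field_names:
        lines.append("<sphinx:field name=%s/>" % quoteattr(name))
    lines.append("</sphinx:schema>")

    # Emit one document element per record, escaping XML special characters.
    for doc_id, fields in documents:
        lines.append('<sphinx:document id="%d">' % doc_id)
        for name in field_names:
            text = escape(fields.get(name, ""))
            lines.append("<%s>%s</%s>" % (name, text, name))
        lines.append("</sphinx:document>")

    lines.append("</sphinx:docset>")
    return "\n".join(lines)


# Hypothetical sample records, used only to demonstrate the output shape.
SAMPLE_DOCS = [
    (1, {"title": "test one", "content": "this is my test document number one"}),
    (2, {"title": "test two", "content": "a <second> document & more"}),
]

if __name__ == "__main__":
    print(build_docset(["title", "content"], SAMPLE_DOCS))
```

A script like this would write the stream to standard output, where the indexer could read it if the data source is configured as an XML pipe rather than as MySQL.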
The search API supports PHP, Python, Perl, Ruby and Java, and Sphinx can also be used as a MySQL storage engine. The search API is very simple and can be ported to a new language within a few hours.
Sphinx Features:
The Chinese manual is available here; thanks to the translator for the hard work.
2. Installation of Sphinx on Windows
1. Find the latest Windows version at http://www.sphinxsearch.com/downloads.html. What I downloaded is the Win32 release binaries with MySQL support. After downloading, unzip the archive into the D:\sphinx directory;
2. Under D:\sphinx, create a new data directory to store index files and a log directory for log files. Copy D:\sphinx\sphinx.conf.in to D:\sphinx\bin\sphinx.conf (note the changed file name);
3. Modify D:\sphinx\bin\sphinx.conf. Here are the parts that need to be changed:
    type = mysql # data source type; MySQL here
    sql_host = localhost # database server
    sql_user = root # database user name
    sql_pass = '' # database password
    sql_db = test # database name
    sql_port = 3306 # database port

    sql_query_pre = SET NAMES utf8 # uncomment this line if your database is UTF-8 encoded

    index test1
    {
        # where the index files are stored
        path = D:/sphinx/data/
        # encoding
        charset_type = utf-8
        # character table for utf-8
        charset_table = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
        # simple tokenization; only 0 and 1 are supported, set to 1 to search Chinese
        ngram_len = 1
        # characters to split into n-grams; uncomment this to search Chinese
        ngram_chars = U+3000..U+2FA1F
    }

    # index test1stemmed : test1
    # {
        # path = @CONFDIR@/data/test1stemmed
        # morphology = stem_en
    # }

    # if you do not use a distributed index, comment out the following

    # index dist1
    # {
        # 'distributed' index type MUST be specified
        # type = distributed

        # local index to be searched
        # there can be many local indexes configured
        # local = test1
        # local = test1stemmed

        # remote agent
        # multiple remote agents may be specified
        # syntax is 'hostname:port:index1,[index2[,...]]
        # agent = localhost:3313:remote1
        # agent = localhost:3314:remote2,remote3

        # remote agent connection timeout, milliseconds
        # optional, default is 1000 ms, ie. 1 sec
        # agent_connect_timeout = 1000

        # remote agent query timeout, milliseconds
        # optional, default is 3000 ms, ie. 3 sec
        # agent_query_timeout = 3000
    # }

    # the parts of the search daemon section that need to be changed
    searchd
    {
        # log file
        log = D:/sphinx/log/searchd.log

        # PID file, searchd process ID file name
        pid_file = D:/sphinx/log/searchd.pid

        # this must be commented out to start the searchd service on Windows
        # seamless_rotate = 1
    }
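To make the effect of ngram_len = 1 more concrete, the following sketch imitates the idea in Python: every character inside the ngram_chars range becomes its own token, while ordinary letters and digits are grouped into lowercased words. This is only an illustration of the concept under simplifying assumptions (the word-character set covers ASCII only, not the full charset_table), not Sphinx's actual tokenizer; the names `unigram_tokens` and `contains_all` are hypothetical.

```python
import string

# Mirrors ngram_chars = U+3000..U+2FA1F from the configuration above.
NGRAM_FIRST, NGRAM_LAST = 0x3000, 0x2FA1F

# Simplified stand-in for charset_table: ASCII letters, digits and underscore.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def unigram_tokens(text):
    """Split text into tokens: one token per CJK character, whole words otherwise."""
    tokens = []
    word = []

    def flush():
        # Close the word collected so far, if any.
        if word:
            tokens.append("".join(word))
            del word[:]

    for ch in text:
        if NGRAM_FIRST <= ord(ch) <= NGRAM_LAST:
            flush()
            tokens.append(ch)          # each CJK character is indexed on its own
        elif ch in WORD_CHARS:
            word.append(ch.lower())    # case folding, like A..Z->a..z
        else:
            flush()                    # anything else acts as a separator
    flush()
    return tokens


def contains_all(query, document):
    """Rough all-words match: every query token occurs in the document."""
    doc_tokens = set(unigram_tokens(document))
    return all(tok in doc_tokens for tok in unigram_tokens(query))


print(contains_all("中文", "测试中文"))  # True
print(contains_all("搜索", "测试中文"))  # False
```

Because no word boundaries exist between Chinese characters, treating each character as a token is what lets a query for two characters match a longer title that contains them.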
4. Import test data
    C:\Program Files\MySQL\MySQL Server 5.0\bin>mysql -uroot test

5. Create index

    D:\sphinx\bin>indexer.exe --all
    Sphinx 0.9.8-release (r1533)
    Copyright (c) 2001-2008, Andrew Aksyonoff

    using config file './sphinx.conf'...
    indexing index 'test1'...
    collected 4 docs, 0.0 MB
    sorted 0.0 Mhits, 100.0% done
    total 4 docs, 193 bytes
    total 0.101 sec, 1916.30 bytes/sec, 39.72 docs/sec

    D:\sphinx\bin>

6. Search for 'test' and try

    D:\sphinx\bin>search.exe test
    Sphinx 0.9.8-release (r1533)
    Copyright (c) 2001-2008, Andrew Aksyonoff

    using config file './sphinx.conf'...
    index 'test1': query 'test ': returned 3 matches of 3 total in 0.000 sec

    displaying matches:
    1. document=1, weight=2, group_id=1, date_added=Wed Nov 26 14:58:59 2008
        id=1
        group_id=1
        group_id2=5
        date_added=2008-11-26 14:58:59
        title=test one
        content=this is my test document number one. also checking search within phrases.
    2. document=2, weight=2, group_id=1, date_added=Wed Nov 26 14:58:59 2008
        id=2
        group_id=1
        group_id2=6
        date_added=2008-11-26 14:58:59
        title=test two
        content=this is my test document number two
    3. document=4, weight=1, group_id=2, date_added=Wed Nov 26 14:58:59 2008
        id=4
        group_id=2
        group_id2=8
        date_added=2008-11-26 14:58:59
        title=doc number four
        content=this is to test groups

    words:
    1. 'test': 3 documents, 5 hits

    D:\sphinx\bin>

All the expected results came out.

7. Test Chinese search

Modify the documents table in the test database:

    UPDATE `test`.`documents` SET `title` = '测试中文', `content` = 'this is my test document number two, you should be able to find it' WHERE `documents`.`id` = 2;

Rebuild the index:

    D:\sphinx\bin>indexer.exe --all

Try searching for '中文':

    D:\sphinx\bin>search.exe 中文
    Sphinx 0.9.8-release (r1533)
    Copyright (c) 2001-2008, Andrew Aksyonoff

    using config file './sphinx.conf'...
    index 'test1': query '中文': returned 0 matches of 0 total in 0.000 sec

    words:

    D:\sphinx\bin>

Nothing is found. This is because the Windows command line uses GBK encoding while the index is UTF-8, so the query cannot match. We can try it from a program instead: create a new file foo.php under D:\sphinx\api, and make sure it is saved as UTF-8:

    <?php
    // Load the client library shipped in the api directory
    require 'sphinxapi.php';
    $s = new SphinxClient();
    // Connect to the local searchd on the port it listens on
    $s->SetServer('localhost', 3312);
    $result = $s->Query('中文');
    var_dump($result);
    ?>

Start the Sphinx searchd service:

    D:\sphinx\bin>searchd.exe
    Sphinx 0.9.8-release (r1533)
    Copyright (c) 2001-2008, Andrew Aksyonoff

    WARNING: forcing --console mode on Windows
    using config file './sphinx.conf'...
    creating server socket on 0.0.0.0:3312
    accepting connections

Execute the PHP query:

    php d:/sphinx/api/foo.php

Did the results come back? What remains is to read the manual and gradually explore the more advanced configuration options.
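To see why the console search for '中文' came back empty while a UTF-8 script can succeed, it may help to compare the bytes the same two characters produce in each encoding. This is a small illustrative sketch; the helper name `hex_bytes` is hypothetical, and it only shows the encoding difference, not what search.exe does internally.

```python
QUERY = "中文"  # the query string used in this article


def hex_bytes(text, encoding):
    """Return the encoded bytes of text as space-separated hex pairs."""
    return " ".join("%02x" % b for b in text.encode(encoding))


gbk_form = hex_bytes(QUERY, "gbk")      # what a GBK console would send
utf8_form = hex_bytes(QUERY, "utf-8")   # what a UTF-8 index expects

print("gbk  :", gbk_form)   # gbk  : d6 d0 ce c4
print("utf-8:", utf8_form)  # utf-8: e4 b8 ad e6 96 87
```

The two byte sequences share nothing in common, so a query typed in a GBK console does not correspond to the UTF-8 keywords stored in the index.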