使用postgreSQL+bamboo搭建比lucene方便N倍的全文搜索
使用postgreSQL+bamboo搭建比lucene方便N倍的全文搜索 所有用到到包有: cmake-2.6.4.tar.gz (编nlpbamboo用) CRF++-0.53.tar.gz(同上) nlpbamboo-1.1.1.tar.bz2(分词用) postgreSQL-8.3.3.tar.gz(索引用) 安装pgsql tar -zxvf postgreSQL-8.3.3.tar.gz
使用postgreSQL+bamboo搭建比lucene方便N倍的全文搜索
所有用到到包有:
cmake-2.6.4.tar.gz (编nlpbamboo用)
CRF++-0.53.tar.gz(同上)
nlpbamboo-1.1.1.tar.bz2(分词用)
postgreSQL-8.3.3.tar.gz(索引用)
安装pgsql
tar -zxvf postgreSQL-8.3.3.tar.gz
cd postgre-8.3.3
./configure –prefix=/opt/pgsql
make
make install
useradd postgre
chown -R postgre.postgre /opt/pgsql
su – postgre
vi ~postgre/.bash_profile
添加
export PATH
PGLIB=/opt/pgsql/lib
PGDATA=/data/PGSearch
PATH=$PATH:/opt/pgsql/bin
MANPATH=$MANPATH:/opt/pgsql/man
export PGLIB PGDATA PATH MANPATH
# mkdir -p /data/PGSearch
# chown -R postgre.postgre /data/PGSearch
# chown -R postgre.postgre /opt/pgsql
#sudo -u postgre /opt/pgsql/bin/initdb –locale=zh_CN.UTF-8 –encoding=utf8 -D /data/PGSearch
#sudo -u postgre /opt/pgsql/bin/postmaster -i -D /data/PGSearch & //允许网络访问
#sudo -u postgre /opt/pgsql/bin/createdb kxgroup
# vim /data/PGSearch/pg_hba.conf 如下增加可访问的机器: www.2cto.com
host all all 10.2.19.178 255.255.255.0 trust
#su – postgre
$pg_ctl stop
$postmaster -i -D /data/PGSearch &
安装中文分词(Cmake CRF++ bamboo)
Cmake是为了编译bamboo,CRF++是bamboo依赖的。
tar -zxvf cmake-2.6.4.tar.gz
cd cmake-2.6.4
./configure
gmake
make install
tar -zxvf CRF++-0.53.tar.gz
cd CRF++-0.53
./configure
make
make install
tar -jxvf nlpbamboo-1.1.1.tar.bz2
cd nlpbamboo-1.1.1
mkdir build
cd build/
cmake .. -DCMAKE_BUILD_TYPE=release
make all
make install
cp index.tar.bz2 /opt/bamboo/
cd /opt/bamboo/
tar -jxvf index.tar.bz2
#/opt/bamboo/bin/bamboo
如果出现:
ERROR: libcrfpp.so.0: cannot open shared object file: No such file or directory
就执行:
ln -s /usr/local/lib/libcrfpp.so.* /usr/lib/
ldconfig
增加上中文分词扩展到pgsql
#vim /root/.bash_profile 也增加:
PGLIB=/opt/pgsql/lib
PGDATA=/data/PGSearch
PATH=$PATH:/opt/pgsql/bin
MANPATH=$MANPATH:/opt/pgsql/man
export PGLIB PGDATA PATH MANPATH
#source ~/.bash_profile
cd /opt/bamboo/exts/postgres/chinese_parser/
make
make install
su – postgre
cd /opt/pgsql/share/contrib/
touch /opt/pgsql/share/tsearch_data/chinese_utf8.stop
psql kxgroup
\i chinese_parser.sql 导入
再执行下面的sql,已经可以将一段话分词了:
SELECT to_tsvector(’chinesecfg’, ‘结果在命令行下执行bamboo才知道’);
先到这里,下一部分讲述对TEXT字段进行索引和查询,完整构造一整个搜索引擎。
www.2cto.com
一、基础篇
本回从一条sql开始:
select * from dbname where field_name @@ ‘aa|bb’ order by rank(field_name, ‘aa|bb’);
从这个sql字面意思讲解:从 dbname这个表中查field_name匹配aa或者是bb的词,并且按照他们的匹配的RANK排序。
基本上明白上面这段话后,来学习四个概念:tsvector、 tsquery、 @@ 、gin。
1.tsvector:
在postgreSQL 8.3自带支持全文检索功能,在之前的版本中需要安装配置tsearch2才能使用。它提供两个数据类型(tsvector,tsquery),并且通过 动态检索自然语言文档的集合,定位到最匹配的查询结果,tsvector正是其中之一。
一个tsvector的值是唯一分词的分类列表,把一话一句词格式化为不同的词条,在进行分词处理的时候,tsvector会自动去掉分词中重复的词条,按照一定的顺序装入。例如
SELECT ‘a fat cat sat on a mat and ate a fat rat’::tsvector;
tsvector
—————————————————-
‘a’ ‘on’ ‘and’ ‘ate’ ‘cat’ ‘fat’ ‘mat’ ‘rat’ ’sat’
通过tsvector把一个字符串按照空格进行分词,这可以把分词后的词按照出现的次数排成一排(还会按词长度)。
对于英文和中文的全文检索我们还要看下面这条sql:
SELECT to_tsvector(’english’, ‘The Fat Rats’);
to_tsvector
—————–
‘fat’:2 ‘rat’:3
to_tsvector函数来是tsvector规格化的,在其中可指定所使用的分词。
2.tsquery:
顾名思义,tsquery,表示的应该是查询相关的.tsquery是存储用于检索的词条.并且可以联合使用boolean 操作符来连接, & (AND), | (OR), and ! (NOT). 使用括号(),可以强制分为一组.
同时,tsquery 在做搜索的时候,也可以使用权重,并且每个词都可以使用一个或者多个权重标记,这样在检索的时候,会匹配相同权重的信息.跟上面的tsvector相同,tsquery也有一个to_tsquery函数.
3.@@:
在postgresql中全文检索匹配操作使用@@ 操作符,如果一个
tsvector(document) 匹配到 tsquery(query)则返回true.
www.2cto.com
看一个简单的例子:
SELECT ‘a fat cat sat on a mat and ate a fat rat’::tsvector @@ ‘cat & rat’::tsquery;
?column?
———-
t
我们在处理索引的时候还是要使用他们的函数如下:
SELECT to_tsvector(’fat cats ate fat rats’) @@ to_tsquery(’fat & rat’);
?column?
———-
t
并且操作符 @@ 可以使用text作为tsvector和tsquery.如下操作符可以使使用的方法
tsvector @@ tsquery
tsquery @@ tsvector
text @@ tsquery
text @@ text
上面的前两种我们已经使用过了,但是后两种,
text @@ tsquery 等同于 to_tsvector(x) @@ y.
text @@ text 等同于 to_tsvector(x) @@ plainto_tsquery(y).(~)plainto_tsquery在后面讲。。。
4.gin:
gin是一种索引的名称,全文索引用的。
我们可以通过创建gin索引来加速检索速度.例如
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(’english’, body));
创建索引可以有多种方式.索引的创建甚至可以连接两个列:
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(’english’, title || body));
www.2cto.com
二、提高篇
基础知识学完了,应该上阵了,为了实现全文检索,我们需要把一个文档创建一个tsvector 格式,并且通过tsquery实现用户的查询,在查询中我们返回一个按照重要性排序的查询结果。
先看一个to_tsquery的sql:
SELECT to_tsquery(’english’, ‘Fat | Rats:AB’);
to_tsquery
——————
‘fat’ | ‘rat’:AB
可以看出,to_tsquery函数在处理查询文本的时候,查询文本的单个词之间要使用逻辑操作符(& (AND), | (OR) and ! (NOT))连接(或者使用括号)。
如果执行下面这条sql就会出错:
SELECT to_tsquery(’english’, ‘Fat Rats’);
plainto_tsquery函数却可以提供一个标准的tsquery,如上面的例子,plainto_tsquery会自动加上逻辑&操作符。
SELECT plainto_tsquery(’english’, ‘Fat Rats’);
plainto_tsquery
—————–
‘fat’ & ‘rat’
但是plainto_tsquery函数不能够识别逻辑操作符和权重标记。
SELECT plainto_tsquery(’english’, ‘The Fat & Rats:C’);
plainto_tsquery
———————
‘fat’ & ‘rat’ & ‘c’
www.2cto.com
三、终结篇
看完上面的一堆后,千言万语汇成一句话,本文主要讲的是一条sql,在加了第一部分里所讲述的扩展后,使用下面的sql,从一个字段中搜一句话,还要排序出来:
select * from tabname where to_tsvector(’chinesecfg’,textname) @@ plainto_tsquery(’搜点啥’) order by ts_rank(to_tsvector(’chinesecfg’,textname),plainto_tsquery(’搜点啥’)) limit 10;
之前的create table create index就不写了。授人以渔才是关键。

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics



Magnet link is a link method for downloading resources, which is more convenient and efficient than traditional download methods. Magnet links allow you to download resources in a peer-to-peer manner without relying on an intermediary server. This article will introduce how to use magnet links and what to pay attention to. 1. What is a magnet link? A magnet link is a download method based on the P2P (Peer-to-Peer) protocol. Through magnet links, users can directly connect to the publisher of the resource to complete resource sharing and downloading. Compared with traditional downloading methods, magnetic

How to use mdf files and mds files With the continuous advancement of computer technology, we can store and share data in a variety of ways. In the field of digital media, we often encounter some special file formats. In this article, we will discuss a common file format - mdf and mds files, and introduce how to use them. First, we need to understand the meaning of mdf files and mds files. mdf is the extension of the CD/DVD image file, and the mds file is the metadata file of the mdf file.

CrystalDiskMark is a small HDD benchmark tool for hard drives that quickly measures sequential and random read/write speeds. Next, let the editor introduce CrystalDiskMark to you and how to use crystaldiskmark~ 1. Introduction to CrystalDiskMark CrystalDiskMark is a widely used disk performance testing tool used to evaluate the read and write speed and performance of mechanical hard drives and solid-state drives (SSD). Random I/O performance. It is a free Windows application and provides a user-friendly interface and various test modes to evaluate different aspects of hard drive performance and is widely used in hardware reviews

foobar2000 is a software that can listen to music resources at any time. It brings you all kinds of music with lossless sound quality. The enhanced version of the music player allows you to get a more comprehensive and comfortable music experience. Its design concept is to play the advanced audio on the computer The device is transplanted to mobile phones to provide a more convenient and efficient music playback experience. The interface design is simple, clear and easy to use. It adopts a minimalist design style without too many decorations and cumbersome operations to get started quickly. It also supports a variety of skins and Theme, personalize settings according to your own preferences, and create an exclusive music player that supports the playback of multiple audio formats. It also supports the audio gain function to adjust the volume according to your own hearing conditions to avoid hearing damage caused by excessive volume. Next, let me help you

Cloud storage has become an indispensable part of our daily life and work nowadays. As one of the leading cloud storage services in China, Baidu Netdisk has won the favor of a large number of users with its powerful storage functions, efficient transmission speed and convenient operation experience. And whether you want to back up important files, share information, watch videos online, or listen to music, Baidu Cloud Disk can meet your needs. However, many users may not understand the specific use method of Baidu Netdisk app, so this tutorial will introduce in detail how to use Baidu Netdisk app. Users who are still confused can follow this article to learn more. ! How to use Baidu Cloud Network Disk: 1. Installation First, when downloading and installing Baidu Cloud software, please select the custom installation option.

NetEase Mailbox, as an email address widely used by Chinese netizens, has always won the trust of users with its stable and efficient services. NetEase Mailbox Master is an email software specially created for mobile phone users. It greatly simplifies the process of sending and receiving emails and makes our email processing more convenient. So how to use NetEase Mailbox Master, and what specific functions it has. Below, the editor of this site will give you a detailed introduction, hoping to help you! First, you can search and download the NetEase Mailbox Master app in the mobile app store. Search for "NetEase Mailbox Master" in App Store or Baidu Mobile Assistant, and then follow the prompts to install it. After the download and installation is completed, we open the NetEase email account and log in. The login interface is as shown below

After long pressing the play button of the speaker, connect to wifi in the software and you can use it. Tutorial Applicable Model: Xiaomi 12 System: EMUI11.0 Version: Xiaoai Classmate 2.4.21 Analysis 1 First find the play button of the speaker, and press and hold to enter the network distribution mode. 2 Log in to your Xiaomi account in the Xiaoai Speaker software on your phone and click to add a new Xiaoai Speaker. 3. After entering the name and password of the wifi, you can call Xiao Ai to use it. Supplement: What functions does Xiaoai Speaker have? 1 Xiaoai Speaker has system functions, social functions, entertainment functions, knowledge functions, life functions, smart home, and training plans. Summary/Notes: The Xiao Ai App must be installed on your mobile phone in advance for easy connection and use.

MetaMask (also called Little Fox Wallet in Chinese) is a free and well-received encryption wallet software. Currently, BTCC supports binding to the MetaMask wallet. After binding, you can use the MetaMask wallet to quickly log in, store value, buy coins, etc., and you can also get 20 USDT trial bonus for the first time binding. In the BTCCMetaMask wallet tutorial, we will introduce in detail how to register and use MetaMask, and how to bind and use the Little Fox wallet in BTCC. What is MetaMask wallet? With over 30 million users, MetaMask Little Fox Wallet is one of the most popular cryptocurrency wallets today. It is free to use and can be installed on the network as an extension
