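After reading robbin's write-up on implementing Sphinx full-text search in Rails, I wanted to try the same thing with PHP.
Any discussion of Sphinx should start with its strengths and main features.
Judging by the feature set, Sphinx suits my needs very well.
Time to see those strengths in practice. OK, installation.
(I have installed it on both Mac OS X 10.4 and Red Hat 5. It never went smoothly and the problems were different each time, but I found a fix in the end every time.)
Here are the build steps on Red Hat 5:
1. Download the required packages:
Download Sphinx 0.9.8 rc2 from http://www.sphinxsearch.com/downloads.html: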
http://www.sphinxsearch.com/downloads/sphinx-0.9.8-rc2.tar.gz
However, to use libmmseg for Chinese word segmentation, two patches are also needed:
http://www.coreseek.com/uploads/sources/sphinx-0.98rc2.zhcn-support.patch
http://www.coreseek.com/uploads/sources/fix-crash-in-excerpts.patch
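Note: the patches target one specific Sphinx version, and a version mismatch only causes more trouble. The simplest route is to use the Coreseek package that Li Monan (李沫南) has already patched, which gives the same result as patching by hand: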
http://www.coreseek.com/uploads/sources/coreseek_fulltext_2.5.tar.gz
Download libmmseg:
http://www.coreseek.com/uploads/sources/mmseg-0.7.3.tar.gz
2. Sphinx has to be configured with mmseg support (the --with-mmseg options), so build libmmseg first:
tar zxvf mmseg-0.7.3.tar.gz
cd mmseg-0.7.3
./configure --prefix=/home/someone/mmseg
make
make install
Then unpack Sphinx:
tar zxvf sphinx-0.9.8-rc2.tar.gz
cd sphinx-0.9.8-rc2
Apply the patches to Sphinx, then configure:
patch -p1 < ../sphinx-0.98rc2.zhcn-support.patch
patch -p1 < ../fix-crash-in-excerpts.patch
./configure --prefix=/home/someone/sphinx --with-mysql=/usr/local/mysql --with-mysql-includes=/usr/local/mysql/include --with-mysql-libs=/usr/local/mysql/lib/ --with-mmseg-includes=/home/someone/mmseg/include/mmseg/ --with-mmseg-libs=/home/someone/mmseg/lib/
make && make install
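(These are my configure options. --with-mysql and the --with-mmseg-* options are required. I ran into many problems during the build, but most of them came down to the MySQL version and the libmysqlclient.so file, so watch out for that.)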
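Since most of the build trouble came from libmysqlclient.so, it can help to check which copy of the library a built binary resolves to. The following is a minimal sketch, assuming a Linux system with ldd available; the function name check_mysql_link is made up for illustration.

```shell
# Report how a binary links against libmysqlclient.
# Returns 0 if the library resolves, 1 if the binary has no such
# dependency, 2 if the dependency exists but cannot be found.
check_mysql_link() {
    line=$(ldd "$1" 2>/dev/null | grep 'libmysqlclient' || true)
    if [ -z "$line" ]; then
        echo "no libmysqlclient dependency found in $1"
        return 1
    fi
    echo "$line"
    case "$line" in
        *"not found"*) return 2 ;;
    esac
    return 0
}

# Example (path taken from the --prefix used above):
# check_mysql_link /home/someone/sphinx/bin/indexer
```

If the library is reported as not found, adding /usr/local/mysql/lib to LD_LIBRARY_PATH is one common workaround; whether it is the right one depends on how MySQL was installed on your machine.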
OK, the required packages are now installed. Next, edit the sphinx.conf configuration file and build the segmentation dictionary.
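With the installation from the previous post (http://www.jaever.com/diary/7/) done, on to configuration :)
Chinese word segmentation needs a dictionary, so build one first:
(For convenience, add the binaries to PATH: PATH=$HOME/mmseg/bin:$HOME/sphinx/bin:$PATH:$HOME/bin
export PATH)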
mmseg -u /path/to/unigram.txt
This command produces a file named unigram.txt.uni; rename it to uni.lib and the dictionary is done. Note that unigram.txt must be UTF-8 encoded. Then move uni.lib to a directory that is easy to reach (mine: /home/someone/dict/uni.lib).
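The dictionary steps above can be wrapped in a small script that refuses to run on a file that is not valid UTF-8. This is only a sketch: the function names is_utf8 and build_dict and the MMSEG variable are made up for illustration, and the encoding check relies on iconv being installed.

```shell
# Succeeds only if the file decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# build_dict SRC DEST_DIR
# Runs "mmseg -u SRC" and moves the resulting SRC.uni to DEST_DIR/uni.lib.
build_dict() {
    src=$1
    dest_dir=$2
    if ! is_utf8 "$src"; then
        echo "$src is not valid UTF-8" >&2
        return 1
    fi
    "${MMSEG:-mmseg}" -u "$src" || return 1
    mkdir -p "$dest_dir" && mv "$src.uni" "$dest_dir/uni.lib"
}

# Example:
# build_dict /path/to/unigram.txt /home/someone/dict
```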
Next, copy sphinx/etc/sphinx.conf.dist to sphinx/etc/sphinx.conf and adjust the settings by following the comments in the file. For Chinese support, pay attention to the following:
....
charset_dictpath = /Users/tian/Dict/lib
charset_type = zh_cn.utf-8
....
These should be commented out:
#ngram_len = 1
#ngram_chars =
# charset_table = 0..9, A..Z->a..z, _, a..z, U+A8->U+B8, U+B8, U+C0..U+DF->U+E0..U+FF, U+E0..U+FF....
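A quick way to confirm that none of these three settings is still active is to grep the config for uncommented occurrences. A minimal sketch; the function name find_conflicting_settings is made up for illustration.

```shell
# Print (with line numbers) any active ngram_len, ngram_chars or
# charset_table setting in the given config file.
# Exit status is 0 if at least one is found, 1 if the file is clean.
find_conflicting_settings() {
    grep -nE '^[[:space:]]*(ngram_len|ngram_chars|charset_table)[[:space:]]*=' "$1"
}

# Example:
# find_conflicting_settings /path/to/sphinx/etc/sphinx.conf
```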
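Save sphinx.conf and run indexer --config /path/to/sphinx/etc/sphinx.conf somesource (or --all to build every index), then run search test to check that everything works.
Sphinx also ships APIs for PHP, Python, Java, Rails and so on, so it can be used from applications written in several languages. So far I have only tested PHP (you can try it at http://app.chinavisual.com/app/site/seek/search_design); before long I plan to try the Python API on jaever. Onward!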
Here I use the "main index" + "delta index" merge scheme.
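The sources in this configuration read and write a helper table named sph_counter, which has to exist before the first indexing run. A minimal sketch of creating it: the column names come from the queries in the config, while the column types and the IF NOT EXISTS guard are my assumptions, and the function name counter_table_sql is made up for illustration.

```shell
# Print the DDL for the counter table used by the main/delta sources.
counter_table_sql() {
    cat <<'SQL'
CREATE TABLE IF NOT EXISTS sph_counter (
    counter_id INTEGER PRIMARY KEY NOT NULL,
    max_doc_id INTEGER NOT NULL
);
SQL
}

# Example (user and database names are placeholders):
# counter_table_sql | mysql -u someuser -p somedb
```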
source picmain
{
... ....
sql_query_pre = SET NAMES utf8
sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(aid) FROM pictures
sql_query = \
SELECT aid, aname, atags, UNIX_TIMESTAMP(ctime) AS date_added, atype, nid,ntitle \
FROM pictures WHERE aid <= ( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
sql_attr_uint = nid
sql_attr_timestamp = date_added
sql_ranged_throttle = 0
sql_query_info = SELECT * FROM pictures WHERE aid=$id
}
source picdelta : picmain
{
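# overriding sql_query_pre drops every inherited pre-query, so re-issue the encoding setup
sql_query_pre = SET NAMES utf8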
sql_query = SELECT aid, aname, atags, UNIX_TIMESTAMP(ctime) AS date_added, atype, nid,ntitle \
FROM pictures WHERE aid > ( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}
#search content
source contentmain
{
... ....
sql_query_pre = SET NAMES utf8
sql_query_pre = REPLACE INTO sph_counter SELECT 2, MAX(id) FROM articles
sql_query = SELECT id,title,tags,body,created_on,type,created_by,summary,site_id FROM articles \
WHERE state=2 AND id <= ( SELECT max_doc_id FROM sph_counter WHERE counter_id=2 )
sql_attr_uint = site_id
sql_attr_timestamp = created_on
sql_ranged_throttle = 0
sql_query_info = SELECT * FROM articles WHERE id=$id
}
source contentdelta : contentmain
{
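# overriding sql_query_pre drops every inherited pre-query, so re-issue the encoding setup
sql_query_pre = SET NAMES utf8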
sql_query = SELECT id,title,tags,body,created_on,type,created_by,summary,site_id FROM articles \
WHERE state=2 AND id > ( SELECT max_doc_id FROM sph_counter WHERE counter_id=2 )
}
# local index picture
index picmain
{
source = picmain
path = /path/to/sphinx/var/data/picmain
docinfo = extern
mlock = 0
morphology = none
min_word_len = 2
charset_type = zh_cn.utf-8
charset_dictpath = /path/to/dict
html_strip = 1
html_remove_elements = style, script
preopen = 1
}
index picdelta : picmain
{
source = picdelta
path = /path/to/sphinx/var/data/picdelta
}
#index content
index contentmain
{
source = contentmain
path = /path/to/sphinx/var/data/contentmain
docinfo = extern
mlock = 0
morphology = none
min_word_len = 2
charset_type = zh_cn.utf-8
charset_dictpath = /path/to/dict
html_strip = 1
html_remove_elements = style, script
preopen = 1
}
index contentdelta : contentmain
{
source = contentdelta
path = /path/to/sphinx/var/data/contentdelta
}
indexer
{
# memory limit, in bytes, kilobytes (16384K) or megabytes (256M)
# optional, default is 32M, max is 2047M, recommended is 256M to 1024M
mem_limit = 64M
}
searchd
{
port = 3312
log = /path/to/sphinx/var/log/searchd.log
query_log = /path/to/sphinx/var/log/query.log
read_timeout = 5
max_children = 30
pid_file = /path/to/sphinx/var/log/searchd.pid
max_matches = 1000
seamless_rotate = 1
preopen_indexes = 0
unlink_old = 1
}
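With this configuration in place, the main + delta scheme is usually driven from cron: rebuild the small delta indexes often and fold them into the main indexes now and then. The following is a rough sketch, assuming searchd is already running; the function names and the INDEXER and CONF variables are made up for illustration, while --rotate and --merge are indexer's own options.

```shell
# Locations are assumptions; adjust them to your install.
INDEXER="${INDEXER:-indexer}"
CONF="${CONF:-/path/to/sphinx/etc/sphinx.conf}"

# Rebuild only the delta indexes and let searchd pick them up.
reindex_delta() {
    "$INDEXER" --config "$CONF" picdelta contentdelta --rotate
}

# Fold each delta index into its main index.
merge_delta() {
    "$INDEXER" --config "$CONF" --merge picmain picdelta --rotate &&
    "$INDEXER" --config "$CONF" --merge contentmain contentdelta --rotate
}

# Example crontab usage (schedule is a placeholder):
# */5 * * * *  . /path/to/this/script && reindex_delta
# 30 3 * * *   . /path/to/this/script && merge_delta
```

Note that merging does not touch sph_counter. Only a full rebuild of picmain or contentmain runs the REPLACE INTO pre-query and advances the counter, so until then the delta sources keep selecting the same rows.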