Welcome to jaever.com/diary

Full-text search in PHP with Sphinx + libmmseg Chinese word segmentation: installation

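After reading robbin's write-up on implementing Sphinx full-text search in Rails, I wanted to try the same thing with PHP.

Since we are talking about Sphinx, a quick rundown of its strengths is in order.

Sphinx's main features include: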

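  • Fast indexing (up to 10 MB/sec)
    • (My own first indexing run: collected 37594 docs, 104.0 MB
      sorted 19.9 Mhits, 100.0% done
      total 37594 docs, 103965976 bytes
      total 45.268 sec, 2296662.25 bytes/sec, 830.47 docs/sec)
  • High-performance search (on 2-4 GB of text, results come back in 0.1 seconds on average), which is plenty for small and medium-sized site search
  • High scalability (on a single CPU, up to 100 GB of text has been indexed in practice; a single index can hold up to 100M documents)
  • Distributed search
  • Combined result ranking based on both phrases and statistics
  • Any number of document fields (numeric attributes or full-text fields)
  • Several search modes ("match all", "match phrase" and "match any")
  • Can be used as a MySQL storage engine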
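As a sanity check, the rates in the indexer output quoted above can be recomputed from the totals it reports. The sketch below does this with awk. The helper name `rate` is only illustrative, and 1 MB is assumed to mean 1048576 bytes.

```shell
#!/bin/sh
# Recompute the throughput from the totals in the indexer output quoted above.
# Usage: rate AMOUNT SECONDS  -> prints AMOUNT per second with two decimals
rate() {
    awk -v amount="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", amount / secs }'
}

bytes_per_sec=$(rate 103965976 45.268)
docs_per_sec=$(rate 37594 45.268)

echo "bytes/sec: $bytes_per_sec"
echo "docs/sec:  $docs_per_sec"
# Convert to MB/sec (assuming 1 MB = 1048576 bytes) to compare with the 10 MB/sec figure
echo "MB/sec:    $(rate "$bytes_per_sec" 1048576)"
```

The recomputed values differ from the indexer's own in the last digits, presumably because the elapsed time it prints is rounded. Either way, this run works out to roughly 2.2 MB/sec, well under the advertised peak.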

 

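Feature-wise, Sphinx is a good fit for what I need.

Time to see those strengths in practice. OK, on to the installation.

(I have installed it on both Mac OS X 10.4 and Red Hat 5. It never went smoothly, and the problems were different each time, but I found a fix in every case.)

Here are the build steps on Red Hat 5: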

1. Download the required packages:

    Get Sphinx 0.9.8 rc2 from http://www.sphinxsearch.com/downloads.html:
http://www.sphinxsearch.com/downloads/sphinx-0.9.8-rc2.tar.gz

    To get Chinese word segmentation through libmmseg, two patches are also needed:
    http://www.coreseek.com/uploads/sources/sphinx-0.98rc2.zhcn-support.patch
    http://www.coreseek.com/uploads/sources/fix-crash-in-excerpts.patch

   Note: the patches target one specific Sphinx version, and a version mismatch only makes things harder. The simplest route is to use the Coreseek package that Li Monan (李沫南) has already patched; the result is the same as applying the patches yourself:
http://www.coreseek.com/uploads/sources/coreseek_fulltext_2.5.tar.gz

   Download libmmseg:
http://www.coreseek.com/uploads/sources/mmseg-0.7.3.tar.gz

2. Sphinx has to be compiled with mmseg support, so build libmmseg first:

     tar zxvf mmseg-0.7.3.tar.gz
     cd mmseg-0.7.3
     ./configure --prefix=/home/someone/mmseg
     make
     make install

     Then go back to the download directory and unpack Sphinx:

     cd ..
     tar zxvf sphinx-0.9.8-rc2.tar.gz
     cd sphinx-0.9.8-rc2

     Apply the patches to Sphinx:
     patch -p1 < ../sphinx-0.98rc2.zhcn-support.patch
     patch -p1 < ../fix-crash-in-excerpts.patch

     ./configure --prefix=/home/someone/sphinx \
         --with-mysql=/usr/local/mysql \
         --with-mysql-includes=/usr/local/mysql/include \
         --with-mysql-libs=/usr/local/mysql/lib/ \
         --with-mmseg-includes=/home/someone/mmseg/include/mmseg/ \
         --with-mmseg-libs=/home/someone/mmseg/lib/

     make && make install

     (These are my own configure options; the MySQL and mmseg options are required. I ran into many problems while compiling, but most of them came down to the MySQL version and the libmysqlclient.so file, so watch out for those.)
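Since most of the trouble traced back to libmysqlclient.so, it can help to see which shared libraries the freshly built binary actually resolves. The sketch below is one way to check with ldd. The helper name `missing_libs` is made up for illustration, and the indexer path is an assumption derived from the --prefix used above.

```shell
#!/bin/sh
# Read `ldd` output on stdin and print only the libraries it could not resolve.
missing_libs() {
    grep 'not found' || true
}

# Assumed install location, derived from --prefix=/home/someone/sphinx.
INDEXER_BIN=${INDEXER_BIN:-/home/someone/sphinx/bin/indexer}

if [ -x "$INDEXER_BIN" ]; then
    unresolved=$(ldd "$INDEXER_BIN" | missing_libs)
    if [ -n "$unresolved" ]; then
        echo "Unresolved libraries:"
        echo "$unresolved"
        # One common fix: make the MySQL lib directory visible to the loader.
        echo "Hint: add /usr/local/mysql/lib to LD_LIBRARY_PATH"
    else
        echo "All shared libraries resolved."
    fi
fi
```

If libmysqlclient.so shows up as "not found", the binary was linked against a library the runtime loader cannot see, which matches the kind of failure described above.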

     OK, the required packages are now installed. Next up: editing the sphinx.conf configuration file and building the segmentation dictionary.

INFO: 3 days ago | purpen

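Full-text search in PHP with Chinese word segmentation: configuration

With the installation from the previous post (http://www.jaever.com/diary/7/) done, it is time to configure :)

Chinese word segmentation needs a dictionary, so build that first: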

(For convenience, add the binaries to your PATH. The directories follow the --prefix values used during installation:
 PATH=$HOME/mmseg/bin:$HOME/sphinx/bin:$PATH:$HOME/bin
 export PATH
)

 mmseg -u /path/to/unigram.txt

This command produces a file named unigram.txt.uni. Rename it to uni.lib and the dictionary is done. Note that unigram.txt must be UTF-8 encoded. Then move uni.lib to a directory that is easy to reach (mine is /home/someone/dict/uni.lib).
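The dictionary steps above can be wrapped in a small script that checks the encoding before calling mmseg. This is only a sketch: the helper names `is_utf8` and `build_dict` are made up, iconv is assumed to be installed, and the .uni file is assumed to be written next to the source file.

```shell
#!/bin/sh
# Return 0 if the file is valid UTF-8, non-zero otherwise.
# iconv exits with an error when it meets a byte sequence
# that is not valid in the declared source encoding.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Build uni.lib from a unigram file and move it into a dictionary directory.
# Usage: build_dict /path/to/unigram.txt /home/someone/dict
build_dict() {
    src=$1
    dict_dir=$2
    if ! is_utf8 "$src"; then
        echo "error: $src is not valid UTF-8" >&2
        return 1
    fi
    mmseg -u "$src" || return 1       # assumed to write "$src.uni"
    mkdir -p "$dict_dir" || return 1
    mv "$src.uni" "$dict_dir/uni.lib"
}
```

Checking the encoding first gives a clear error message up front, rather than a dictionary that was built from mis-encoded input.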

Next, copy sphinx/etc/sphinx.conf.dist to sphinx/etc/sphinx.conf and adjust the settings by following the comments in the file. For Chinese support, pay attention to the following:

....

# charset_dictpath is the directory that contains uni.lib
charset_dictpath        = /home/someone/dict
charset_type            = zh_cn.utf-8

....

The following should be commented out:

#ngram_len = 1

#ngram_chars =  

# charset_table         = 0..9, A..Z->a..z, _, a..z, U+A8->U+B8, U+B8, U+C0..U+DF->U+E0..U+FF, U+E0..U+FF....

Save sphinx.conf and run indexer --config /path/to/sphinx/etc/sphinx.conf someindex (or pass --all instead of an index name to build every index). Then run search test to check that everything works.
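Before running indexer, the settings listed above can be checked mechanically. The following is a rough sketch with grep. The function name `check_zh_conf` is made up, and the check looks at the whole file rather than at each index section separately.

```shell
#!/bin/sh
# Check a sphinx.conf for the Chinese-segmentation settings described above.
# Prints one line per problem and returns 0 only when none are found.
check_zh_conf() {
    conf=$1
    status=0
    # charset_type must be set to zh_cn.utf-8 on an uncommented line
    if ! grep -Eq '^[[:space:]]*charset_type[[:space:]]*=[[:space:]]*zh_cn\.utf-8' "$conf"; then
        echo "missing: charset_type = zh_cn.utf-8"
        status=1
    fi
    # charset_dictpath must be present on an uncommented line
    if ! grep -Eq '^[[:space:]]*charset_dictpath[[:space:]]*=' "$conf"; then
        echo "missing: charset_dictpath"
        status=1
    fi
    # these three settings must be commented out
    for key in ngram_len ngram_chars charset_table; do
        if grep -Eq "^[[:space:]]*$key[[:space:]]*=" "$conf"; then
            echo "still active: $key"
            status=1
        fi
    done
    return $status
}
```

Call it as check_zh_conf /path/to/sphinx/etc/sphinx.conf and run indexer only when it prints nothing.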

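Sphinx also provides APIs for PHP, Python, Java, Rails and others, so it can be used from applications written in many languages. So far I have only tested the PHP one (you can try it at http://app.chinavisual.com/app/site/seek/search_design). Soon I plan to try the Python API on jaever. Onward!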

INFO: 3 days ago | purpen

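Full-text search in PHP with Chinese word segmentation: a sample configuration file

This setup uses the "main index" + "delta index" merge scheme.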
source picmain
{
        ... ....

        sql_query_pre           = SET NAMES utf8
        sql_query_pre           = REPLACE INTO sph_counter SELECT 1, MAX(aid) FROM pictures
        sql_query               = \
                SELECT aid, aname, atags, UNIX_TIMESTAMP(ctime) AS date_added, atype, nid, ntitle \
                FROM pictures WHERE aid <= ( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
        sql_attr_uint           = nid
        sql_attr_timestamp      = date_added
        sql_ranged_throttle     = 0
        sql_query_info          = SELECT * FROM pictures WHERE aid=$id
}

source picdelta : picmain
{
        # Override the inherited pre-queries: keep SET NAMES utf8, drop the REPLACE INTO
        sql_query_pre           = SET NAMES utf8
        sql_query               = \
                SELECT aid, aname, atags, UNIX_TIMESTAMP(ctime) AS date_added, atype, nid, ntitle \
                FROM pictures WHERE aid > ( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}

#search content
source contentmain
{
      ... ....
        sql_query_pre           = SET NAMES utf8
        sql_query_pre           = REPLACE INTO sph_counter SELECT 2, MAX(id) FROM articles
        sql_query               = \
                SELECT id, title, tags, body, created_on, type, created_by, summary, site_id FROM articles \
                WHERE state=2 AND id <= ( SELECT max_doc_id FROM sph_counter WHERE counter_id=2 )
        sql_attr_uint           = site_id
        sql_attr_timestamp      = created_on
        sql_ranged_throttle     = 0
        sql_query_info          = SELECT * FROM articles WHERE id=$id
}

source contentdelta : contentmain
{
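        # Override the inherited pre-queries: keep SET NAMES utf8, drop the REPLACE INTO
        sql_query_pre           = SET NAMES utf8
        sql_query               = \
                SELECT id, title, tags, body, created_on, type, created_by, summary, site_id FROM articles \
                WHERE state=2 AND id > ( SELECT max_doc_id FROM sph_counter WHERE counter_id=2 )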
}
# local index picture
index picmain
{
        source                  = picmain
        path                    = /path/to/sphinx/var/data/picmain
        docinfo                 = extern
        mlock                   = 0
        morphology              = none
        min_word_len            = 2
        charset_type            = zh_cn.utf-8
        charset_dictpath = /path/to/dict

        html_strip                              = 1
        html_remove_elements    = style, script
        preopen                         = 1
}

index picdelta : picmain
{
    source = picdelta
    path = /path/to/sphinx/var/data/picdelta
}
#index content
index contentmain
{
        source                  = contentmain
        path                    = /path/to/sphinx/var/data/contentmain
        docinfo                 = extern
        mlock                   = 0
        morphology              = none
        min_word_len            = 2
        charset_type            = zh_cn.utf-8
        charset_dictpath =/path/to/dict

        html_strip                              = 1
        html_remove_elements    = style, script
        preopen                         = 1
}
index contentdelta : contentmain
{
    source = contentdelta
    path = /path/to/sphinx/var/data/contentdelta
}
indexer
{
        # memory limit, in bytes, kilobytes (16384K) or megabytes (256M)
        # optional, default is 32M, max is 2047M, recommended is 256M to 1024M
        mem_limit                       = 64M
}
searchd
{
        port                            = 3312
        log                                     = /path/to/sphinx/var/log/searchd.log
        query_log                       = /path/to/sphinx/var/log/query.log
        read_timeout            = 5
        max_children            = 30
        pid_file                        = /path/to/sphinx/var/log/searchd.pid
        max_matches                     = 1000
        seamless_rotate         = 1
        preopen_indexes         = 0
        unlink_old                      = 1
}
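With this configuration the delta indexes are meant to be rebuilt often and folded into the main indexes from time to time. The sketch below shows how that schedule might be driven with indexer's --rotate and --merge options. The function names and the cron-style split are assumptions, not part of the original setup, and --rotate assumes searchd is already running. The config also expects a sph_counter table with counter_id and max_doc_id columns.

```shell
#!/bin/sh
# Config path and indexer binary can be overridden from the environment.
SPHINX_CONF=${SPHINX_CONF:-/path/to/sphinx/etc/sphinx.conf}
INDEXER=${INDEXER:-indexer}

# Rebuild only the small delta index (e.g. every few minutes).
# $1 is the index prefix from the config above: "pic" or "content".
reindex_delta() {
    "$INDEXER" --config "$SPHINX_CONF" "${1}delta" --rotate
}

# Fold the delta index into the main index (e.g. once a night).
merge_delta() {
    "$INDEXER" --config "$SPHINX_CONF" --merge "${1}main" "${1}delta" --rotate
}

# Full rebuild of the main index; its REPLACE INTO pre-query
# also refreshes the matching row in sph_counter.
reindex_main() {
    "$INDEXER" --config "$SPHINX_CONF" "${1}main" --rotate
}
```

For example, reindex_delta pic refreshes picdelta and merge_delta content folds contentdelta into contentmain. Merging does not touch sph_counter, so the delta keeps selecting the same rows until the main index is rebuilt with reindex_main.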

INFO: 3 days ago | purpen

Copyright © 2008 Jaever. All rights reserved.
