Sphinx Full-Text Index Installation Tutorial

First, some background on Sphinx full-text indexing:
Official website: http://www.sphinxsearch.com/
Official documentation: http://www.sphinxsearch.com/docs/
Chinese support: http://www.coreseek.cn/
Chinese manual download: http://www.coreseek.cn/uploads/pdf/sphinx_doc_zhcn_0.9.pdf

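Reading the official documentation and the Chinese manual above should be enough to install and use Sphinx full-text indexing. Some details, of course, still take repeated searching on Google and Baidu, so to save everyone time, here is a complete Sphinx installation tutorial, together with a guide to using it with PHPWind (supported in PHPWind 7.5).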

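Let's begin the Sphinx technical journey!

Given how Sphinx full-text indexing is used in practice, this tutorial focuses on its Chinese-language support.
Thanks go to Li Monan (李沫南) for contributing Chinese support to Sphinx full-text indexing!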

I. Installing Sphinx on Windows

1. Preparation
Source: http://www.coreseek.cn/products/ft_down/
Download csft3.1: http://www.coreseek.cn/uploads/csft/3.1/win32/csft3.1.bin.zip
Download the standard dictionary: http://www.coreseek.cn/uploads/csft/3.1/data.zip
Extract csft3.1.bin.zip into C:\csft3.1
Extract data.zip into C:\csft3.1\data [the word-segmentation package]

You also need to create a new log folder.

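(1) Copy C:\csft3.1\conf\csft.conf.in to the C:\csft3.1\bin\ directory and rename it to csft.conf.
Note the lines in csft.conf that look like: path = @CONFDIR@/data/test1
Replace @CONFDIR@ with the install directory, C:\csft3.1, so that the line above becomes: path = C:\csft3.1\data\test1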
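If a POSIX shell with sed is at hand (for example Git Bash on Windows, or the Linux setup in part II), the placeholder substitution can be scripted instead of edited by hand. This is only a minimal sketch: the function name fill_confdir and its three arguments are made up for this example, and it writes the path with forward slashes (C:/csft3.1), which is assumed here, not confirmed by the original tutorial, to be accepted by the Windows build.

```shell
# Sketch: replace every @CONFDIR@ placeholder in a csft.conf template.
# fill_confdir is a hypothetical helper, not part of csft.
#   $1 = template file (e.g. csft.conf.in)
#   $2 = output file   (e.g. csft.conf)
#   $3 = install directory to substitute for @CONFDIR@
fill_confdir() {
    template=$1
    output=$2
    confdir=$3
    # '|' is used as the sed delimiter so slashes in the path need no escaping.
    # Assumes the directory contains no '|', '&' or backslash characters.
    sed "s|@CONFDIR@|${confdir}|g" "$template" > "$output"
}
```

Example call: fill_confdir csft.conf.in csft.conf C:/csft3.1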

(2) Import the test data C:\csft3.1\conf\example.sql into the database [everyone should know how to do this].

(3) Build the index. At a DOS (command) prompt, run: indexer.exe --all

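While building the index, double-check that the database settings in csft.conf are correct, for example:
sql_host = localhost #database host address
sql_user = test #database user with full privileges on the database
sql_pass =
sql_db = test #database name
sql_port = 3306 #port in use; usually does not need changing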
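Before running the indexer it can help to print the connection settings the config file actually contains. The following is a small sketch for a POSIX shell with awk; show_sql_settings is a name invented for this example, and it deliberately skips sql_pass so the password is not echoed.

```shell
# Sketch: list the sql_host / sql_user / sql_db / sql_port values in a config file.
# show_sql_settings is a hypothetical helper; $1 is the path to csft.conf.
show_sql_settings() {
    conf=$1
    awk -F'=' '
        /^[ \t]*sql_(host|user|db|port)[ \t]*=/ {
            key = $1; val = $2
            sub(/#.*/, "", val)               # drop a trailing "#comment"
            gsub(/^[ \t]+|[ \t]+$/, "", key)  # trim whitespace around the key
            gsub(/^[ \t]+|[ \t]+$/, "", val)  # trim whitespace around the value
            print key "=" val
        }' "$conf"
}
```

Example call: show_sql_settings /usr/local/csft/etc/csft.conf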

Leave the other settings at their defaults for now and try out Sphinx full-text indexing first.

(4) Test that search works by running: search.exe test


If everything works, the search results are returned.

(5) Start the search daemon by running: searchd.exe

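The Sphinx full-text search service is now available. That is the whole of the simple procedure. To index Chinese text you need to set the corresponding parameters; see the Chinese manual for details. To make the relevant settings easier to understand, you can look at the configuration file PHPWind uses for Sphinx full-text indexing and read it side by side with the manual [for Chinese support specifically, see the Linux installation section].

Attached: the configuration PHPWind uses to support Sphinx full-text indexing.

Installing Sphinx on Windows with csft is very simple. If you are interested, you can also download and install it from the official Sphinx site [www.sphinxsearch.com], but that is somewhat more complicated and is not covered here; experts can explore it at their own pace.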

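II. Installing Sphinx full-text indexing on Linux, using CentOS 5.3 as an example

Installing Sphinx on Windows is really just for trying it out; Linux is the proper place to run Sphinx.
To go through the CentOS installation in detail, I reinstalled CentOS and walked through the entire Sphinx installation from scratch.
The Coreseek full-text search server already bundles Sphinx and the Chinese word-segmentation patch; you only need to download MMSeg and Coreseek Fulltext Server (source code) to get a working Sphinx service.
Download: http://www.coreseek.cn/products/ft_down/

Installing from source is recommended.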

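1. Preparation [skip anything that is already installed; if something you need is missing from the list below, add it]
1) Install mysql
2) Install php
3) Install apache
4) Install python
5) Install libiconv
6) Install gcc-c++
7) Download Coreseek Fulltext Server (source code): http://www.coreseek.cn/uploads/csft/3.1/Source/csft-3.1.tar.gz
8) Download Coreseek Mmseg (source code): http://www.coreseek.cn/uploads/csft/3.1/Source/mmseg-3.1.tar.gz

Run the following command (on CentOS the development package is named python-devel)
yum install python python-devel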

2. Installation steps
(1) Download CSFT and MMSeg
#wget http://www.coreseek.cn/uploads/csft/3.1/Source/mmseg-3.1.tar.gz
#wget http://www.coreseek.cn/uploads/csft/3.1/Source/csft-3.1.tar.gz

(2) Install MMSeg Chinese word segmentation
# pwd
/usr/local [confirm the current installation directory]
# wget http://www.coreseek.cn/uploads/csft/3.1/Source/mmseg-3.1.tar.gz
# tar xzvf mmseg-3.1.tar.gz
# mkdir /usr/local/mmseg
# cd mmseg-3.1
# ./configure --prefix=/usr/local/mmseg
# make
# make install

Run the following to check whether mmseg was installed successfully
# /usr/local/mmseg/bin/mmseg
Coreseek COS(tm) MM Segment 1.0
Copyright By Coreseek.com All Right Reserved.
Usage: /usr/local/mmseg/bin/mmseg <option> <file>
-u <unidict> Unigram Dictionary
-r Combine with -u, used a plain text build Unigram Dictionary, default Off
-b <Synonyms> Synonyms Dictionary
-h print this help and exit


(3) Install csft-3.1
# pwd
/usr/local
# wget http://www.coreseek.cn/uploads/csft/3.1/Source/csft-3.1.tar.gz
# tar xzvf csft-3.1.tar.gz
# mkdir /usr/local/csft
# cd csft-3.1
# ./configure --prefix=/usr/local/csft --with-mmseg=/usr/local/mmseg/bin/mmseg --with-mmseg-includes=/usr/local/mmseg/include/mmseg/ --with-mmseg-libs=/usr/local/mmseg/lib/
# make
# make install

make may fail at this point. To troubleshoot:
1. Check that the following packages are installed
# yum install mysql mysql-devel php-mysql qt4-mysql [the MySQL environment must be installed first]
# yum install python python-devel

2. Check whether libiconv is installed
Download: http://savannah.gnu.org/projects/libiconv/

3. If errors remain, open src/Makefile and edit it
# vi src/Makefile and go to line 182
Change
LIBS = -lm -lz -lexpat -L/usr/local/lib -lpthread
to
LIBS = -lm -lz -lexpat -liconv -L/usr/local/lib -lpthread
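The Makefile change can also be applied with sed rather than by hand, which avoids depending on the exact line number. This is a rough sketch under the assumption that the LIBS line contains -lexpat as in the tutorial's example; add_iconv_to_libs is a name invented here, and the function keeps a .bak copy of the original file.

```shell
# Sketch: add -liconv to the LIBS line of a Makefile (e.g. src/Makefile).
# add_iconv_to_libs is a hypothetical helper; $1 is the Makefile path.
add_iconv_to_libs() {
    makefile=$1
    # Nothing to do if a LIBS line already links against libiconv.
    if grep -q '^LIBS *=.*-liconv' "$makefile"; then
        return 0
    fi
    # Keep a backup, then insert -liconv right after -lexpat on LIBS lines only.
    cp "$makefile" "$makefile.bak"
    sed '/^LIBS *=/ s/-lexpat/-lexpat -liconv/' "$makefile.bak" > "$makefile"
}
```

Example call, from inside the csft-3.1 source directory: add_iconv_to_libs src/Makefile, then run make again.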

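If all of that goes smoothly, you can start configuring your Sphinx full-text index server [if you run into installation problems, feel free to ask on the official PHPWind site]!

3. Next comes configuration
#cp /usr/local/csft/etc/sphinx-min.conf.dist /usr/local/csft/etc/sphinx.conf
Edit the database parameters in sphinx.conf the same way as on Windows [the later steps and the tool output in this tutorial call this file csft.conf; if the tools cannot find their configuration, try copying it to /usr/local/csft/etc/csft.conf instead]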
sql_host = localhost
sql_user = root
sql_pass =
sql_db = test

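4. Import the sample data /usr/local/csft/etc/example.sql into the database [everyone should know how to do this step]
5. Build the index
# /usr/local/csft/bin/indexer --all

6. Test search
# /usr/local/csft/bin/search test
If the test returns results, congratulations: your Sphinx full-text index server is configured correctly.

7. Next, configure and enable Chinese support

UTF-8 example [if you already have a utf8 database there is no need to create a new one; this is only an example]
1) Create a new database using utf8_general_ci, e.g. phpwind
2) Import some existing GBK data, e.g. pw_threads
3) Configure csft.conf as follows
source (data source) section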
sql_host = localhost
sql_user = root
sql_pass =
sql_db = phpwind
sql_query_pre = SET NAMES utf8
sql_query_pre = SET SESSION query_cache_type=OFF
sql_query = SELECT tid,fid,authorid,subject FROM pw_threads
sql_attr_uint = fid
sql_attr_uint = authorid

index section
charset_type = zh_cn.utf-8
charset_dictpath = /usr/local/csft/
min_prefix_len = 0
min_infix_len = 0
min_word_len = 2

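4) Build the dictionary
#pwd
/usr/local/mmseg-3.1/data [the data directory inside the directory where you unpacked mmseg]
Run the following commands (the full path is used in case mmseg is not on your PATH)
# /usr/local/mmseg/bin/mmseg -u unigram.txt
# ll
total 10152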
-rwxr-xr-x 1 root root 715 06-06 18:40 build_unigram.py
-rwxr-xr-x 1 root root 32674 06-06 18:40 char.stat.txt
-rwxr-xr-x 1 root root 1051268 06-06 18:40 Lexicon_full_words.txt
-rwxr-xr-x 1 root root 1826251 06-06 18:40 unigram.txt
-rw-r--r-- 1 root root 3729280 09-16 20:20 unigram.txt.uni

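This generates the file unigram.txt.uni
# mv unigram.txt.uni uni.lib
# cp uni.lib /usr/local/csft/ [this is the charset_dictpath used in the index configuration above]

Leave everything else at its defaults and build the index the same way as before
# /usr/local/csft/bin/indexer --all

Test whether it worked
# /usr/local/csft/bin/search 測試

That completes the full-text index setup for UTF-8 encoding.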

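GBK example

This is the same as the UTF-8 case, except that the database and tables use gbk encoding.
Only the following parts of the configuration [csft.conf] need to change

source (data source) section
sql_query_pre = SET NAMES gbk

index section
charset_type = zh_cn.gbk

One thing to note: to test GBK support, you can write a PHP file that calls the API provided by Sphinx. The searchd process must be running for this.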

# /usr/local/csft/bin/searchd

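Write the following code [it must be placed in the same directory as sphinxapi.php]
sphinxapi.php is under /usr/local/csft-3.1/api/
You can also test directly with test.php in the api directory
<?php
// Load the PHP client shipped in the api directory
require_once 'sphinxapi.php';
$sc = new SphinxClient();
// Host and port that searchd listens on
$sc->SetServer('127.0.0.1',3312);
$sc->SetConnectTimeout(1);
$sc->SetWeights(array(100,1));
// Require every query word to match
$sc->SetMatchMode(SPH_MATCH_ALL);
// Return matches as a plain array
$sc->SetArrayResult(TRUE);
$res = $sc->query("簡單");
// false means the query failed
var_dump($res);
?>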

You can also run the search tool directly [UTF-8 version], as follows

[root@localhost ~]# /usr/local/csft/bin/search 便宜
Coreseek Full Text Server 3.1
Copyright (c) 2006-2008 coreseek.com
using config file '/usr/local/csft/etc/csft.conf'...
index 'test1': query '便宜 ': returned 4 matches of 4 total in 0.015 sec

displaying matches:
1. document=3, weight=1, fid=7, authorid=1
2. document=97, weight=1, fid=35, authorid=1
3. document=108, weight=1, fid=32, authorid=1
4. document=146, weight=1, fid=7, authorid=1

words:
1. '便宜': 4 documents, 4 hits

If false is returned, check whether the searchd process is running. If results come back, congratulations: you are now a Sphinx user, so move on to the next level!
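One way to check whether searchd is running is to read its pid file and probe the process. The sketch below assumes the pid_file location used in the appendix configuration (/usr/local/csft/var/log/searchd.pid); searchd_running is a name invented for this example.

```shell
# Sketch: succeed (exit status 0) if the process recorded in the pid file is alive.
# searchd_running is a hypothetical helper; $1 optionally overrides the pid file path.
searchd_running() {
    pid_file=${1:-/usr/local/csft/var/log/searchd.pid}
    # No readable pid file: treat searchd as not running.
    [ -r "$pid_file" ] || return 1
    pid=$(cat "$pid_file")
    [ -n "$pid" ] || return 1
    # kill -0 sends no signal; it only tests whether the process exists.
    # It can also fail if the process belongs to another user and you are not root.
    kill -0 "$pid" 2>/dev/null
}
```

Example use: searchd_running || /usr/local/csft/bin/searchd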

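III. Afterword
I really wanted to make a video tutorial for the installation, but time is limited. Some detail-level problems are bound to come up during installation, but if you follow the steps above one at a time, I believe you can get Sphinx working. If you run into problems, you can find more help at http://www.sphinxsearch.com/ and http://www.coreseek.cn/, and you can also consult the Chinese manual.

You can also ask questions and share your installation experience on the official phpwind site, www.phpwind.net. Spelling out every detail helps others and helps yourself. BY liuhui.php@gmail.com 2009-9-17

Other links
Build a custom search engine with PHP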
http://www.ibm.com/developerworks/cn/opensource/os-php-sphinxsearch/index.html

MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm
http://technology.chtsai.org/mmseg/

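Appendix: phpwind configuration example [GBK version]
PHPWind Sphinx search configuration example [with a few parameters changed it can be used directly with PHPWind]

Notes:
The full-text index below uses a main index plus an incremental (delta) index; consult the manual for the details.

A table needs to be created [choose your own encoding; the one below uses gbk]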
CREATE TABLE IF NOT EXISTS `search_counter` (
`counterid` int(11) NOT NULL DEFAULT '0',
`max_doc_id` int(11) NOT NULL DEFAULT '0',
`min_doc_id` int(10) NOT NULL DEFAULT '0',
PRIMARY KEY (`counterid`)
) ENGINE=MyISAM DEFAULT CHARSET=gbk;


The csft.conf configuration file

source tmsgs
{
type = mysql
sql_host = localhost
sql_user = root
sql_pass = xxxx
sql_db = phpwind
sql_port = 3307 # optional, default is 3306
sql_sock = /tmp/mysql3307.sock
sql_query_pre = SET NAMES gbk
sql_query_pre = SET SESSION query_cache_type=OFF
sql_query_pre = REPLACE INTO search_counter SELECT 1,MAX(tid),MIN(tid) FROM pw_tmsgs
sql_query_range = SELECT min_doc_id, max_doc_id FROM search_counter WHERE counterid = 1
sql_range_step = 1000
sql_query = SELECT th.tid,th.subject,th.authorid,th.postdate,th.lastpost,th.fid,th.digest,th.hits,th.replies,t.content FROM pw_threads th LEFT JOIN pw_tmsgs t USING(tid) WHERE th.tid >= $start AND th.tid <= $end

sql_attr_uint = authorid
sql_attr_uint = hits
sql_attr_uint = replies
sql_attr_uint = fid
sql_attr_timestamp = postdate
sql_attr_timestamp = lastpost
sql_attr_uint = digest
}

source addtmsgs
{
type = mysql
sql_host = localhost
sql_user = root
sql_pass = xxxx
sql_db = phpwind
sql_port = 3307 # optional, default is 3306
sql_sock = /tmp/mysql3307.sock
sql_query_pre = SET NAMES gbk
sql_query_pre = SET SESSION query_cache_type=OFF
sql_query_range = SELECT max_doc_id, max_doc_id+100000 FROM search_counter WHERE counterid = 1
sql_range_step = 100000
sql_query = SELECT th.tid,th.subject,th.authorid,th.postdate,th.lastpost,th.fid,th.digest,th.hits,th.replies,t.content FROM pw_threads th LEFT JOIN pw_tmsgs t USING(tid) WHERE th.tid > $start AND th.tid <= $end

sql_attr_uint = authorid
sql_attr_uint = hits
sql_attr_uint = replies
sql_attr_uint = fid
sql_attr_timestamp = postdate
sql_attr_timestamp = lastpost
sql_attr_uint = digest
sql_query_post = REPLACE INTO search_counter SELECT 1,MAX(tid),MIN(tid) FROM pw_tmsgs
#sql_attr_uint = tid
}

source threads
{
type = mysql
sql_host = localhost
sql_user = root
sql_pass = xxxxxxx
sql_db = phpwind
sql_port = 3307 # optional, default is 3306
sql_sock = /tmp/mysql3307.sock
sql_query_pre = SET NAMES gbk
sql_query_pre = SET SESSION query_cache_type=OFF
sql_query_pre = REPLACE INTO search_counter SELECT 3,MAX(tid),MIN(tid) FROM pw_threads
sql_query_range = SELECT min_doc_id, max_doc_id FROM search_counter WHERE counterid = 3
sql_range_step = 1000
sql_query = SELECT th.tid,th.subject,th.authorid,th.postdate,th.lastpost,th.fid,th.digest,th.hits,th.replies FROM pw_threads th WHERE th.tid >= $start AND th.tid <= $end
sql_attr_uint = authorid
sql_attr_uint = hits
sql_attr_uint = replies
sql_attr_uint = fid
sql_attr_timestamp = postdate
sql_attr_timestamp = lastpost
sql_attr_uint = digest
}

source addthreads
{
type = mysql
sql_host = localhost
sql_user = root
sql_pass = xxx
sql_db = phpwind
sql_port = 3307 # optional, default is 3306
sql_sock = /tmp/mysql3307.sock
sql_query_pre = SET NAMES gbk
sql_query_pre = SET SESSION query_cache_type=OFF
sql_query_range = SELECT max_doc_id, max_doc_id+100000 FROM search_counter WHERE counterid = 3
sql_range_step = 100000
sql_query = SELECT th.tid,th.subject,th.authorid,th.postdate,th.lastpost,th.fid,th.digest,th.hits,th.replies FROM pw_threads th WHERE th.tid > $start AND th.tid <= $end

sql_attr_uint = authorid
sql_attr_uint = hits
sql_attr_uint = replies
sql_attr_uint = fid
sql_attr_timestamp = postdate
sql_attr_timestamp = lastpost
sql_attr_uint = digest
sql_query_post = REPLACE INTO search_counter SELECT 3,MAX(tid),MIN(tid) FROM pw_threads
#sql_attr_uint = tid
}

index tmsgsindex
{
source = tmsgs
path = /usr/local/csft/var/data/tmsgs
docinfo = extern
charset_type = zh_cn.gbk
#min_prefix_len = 0
#min_infix_len = 2
#ngram_len = 2
charset_dictpath = /usr/local/csft/
min_prefix_len = 0
min_infix_len = 0
min_word_len = 2
}

index addtmsgsindex
{
source = addtmsgs
path = /usr/local/csft/var/data/addtmsgs
docinfo = extern
charset_type = zh_cn.gbk
#min_infix_len = 2
#ngram_len = 2
charset_dictpath = /usr/local/csft/
min_prefix_len = 0
min_infix_len = 0
min_word_len = 2
}
index threadsindex
{
source = threads
path = /usr/local/csft/var/data/threads
docinfo = extern
charset_type = zh_cn.gbk
#min_prefix_len = 0
#min_infix_len = 2
#ngram_len = 2
charset_dictpath = /usr/local/csft/
min_prefix_len = 0
min_infix_len = 0
min_word_len = 2
}

index addthreadsindex
{
source = addthreads
path = /usr/local/csft/var/data/addthreads
docinfo = extern
charset_type = zh_cn.gbk
#min_infix_len = 2
#ngram_len = 2
charset_dictpath = /usr/local/csft/
min_prefix_len = 0
min_infix_len = 0
min_word_len = 2
}
indexer
{
mem_limit = 128M
}

searchd
{
port = 3312
log = /usr/local/csft/var/log/searchd.log
query_log = /usr/local/csft/var/log/query.log
read_timeout = 5
max_children = 30
pid_file = /usr/local/csft/var/log/searchd.pid
max_matches = 1000
seamless_rotate = 1
preopen_indexes = 0
unlink_old = 1
}
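
With a main-plus-incremental layout like the one above, the incremental indexes are typically rebuilt often and the main indexes rarely, for example from cron. The following is a sketch of how that might be wrapped, not the author's own procedure: the function names and the DRY_RUN switch are invented for this example, the index names come from the configuration above, and it relies on the indexer's --rotate option so that a running searchd picks up the rebuilt files.

```shell
# Sketch: rebuild the incremental or main indexes defined in the config above.
# CSFT_BIN, DRY_RUN, run, reindex_delta and reindex_main are names made up here.
CSFT_BIN=${CSFT_BIN:-/usr/local/csft/bin}

# Run a command, or only print it when DRY_RUN=1 (handy for checking cron entries).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Frequent job: rebuild only the two incremental indexes.
reindex_delta() {
    run "$CSFT_BIN/indexer" addtmsgsindex addthreadsindex --rotate
}

# Occasional job: rebuild the two main indexes.
reindex_main() {
    run "$CSFT_BIN/indexer" tmsgsindex threadsindex --rotate
}
```

Setting DRY_RUN=1 before calling reindex_delta prints the indexer command without running it. How often each job should run depends on how quickly new threads arrive and is not specified in this tutorial.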