Chinese Full-Text Search in PostgreSQL (Part 2)

Date: 2019-11-12

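The previous post (http://my.oschina.net/Kenyon/blog/80904) covered the PostgreSQL full-text search environment and some examples, all based on the built-in configurations. The current version does not support Chinese full-text search out of the box, but real-world use will certainly call for it. Fortunately the community support is strong, and with third-party tools Chinese full-text search in PG is straightforward to set up.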

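Chinese full-text search in PG also comes down to three main steps:
1. Segment the Chinese text into words
2. Convert the tokens and drop the meaningless ones
3. Order the results, and build an index to speed up queries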

Section 1: Test environment and tools
VMWARE 6.0
PostgreSQL 9.1.2
CRF++-0.57, download: http://crfpp.googlecode.com/svn/trunk/doc/index.html
nlpbamboo-1.1.2, download: http://code.google.com/p/nlpbamboo/downloads/list
index.tar.bz2, download: http://code.google.com/p/nlpbamboo/downloads/list

Section 2: Deployment (as root)
1. Install CRF++ first

cd CRF++-0.57
./configure
make
make install

2. Install nlpbamboo

cd nlpbamboo
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=release
make all
make install

3. Download the segmentation data files
Download index.tar.bz2 and extract it to /opt/bamboo/index

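4. Verify
Once installation finishes, check the default install paths to confirm everything is in place. The main default paths are listed here; as the listings further down show, CRF++ itself lands in /usr/local/lib and is symlinked from /usr/lib.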
/usr/lib
/usr/include/
/opt/bamboo/

[postgres@localhost ~]$ cd /usr/local/lib
[postgres@localhost lib]$ ll
total 788
-rw-r--r--. 1 root root 516882 Sep  3 19:57 libcrfpp.a
-rwxr-xr-x. 1 root root    952 Sep  3 19:57 libcrfpp.la
lrwxrwxrwx. 1 root root     17 Sep  3 19:57 libcrfpp.so -> libcrfpp.so.0.0.0
lrwxrwxrwx. 1 root root     17 Sep  3 19:57 libcrfpp.so.0 -> libcrfpp.so.0.0.0
-rwxr-xr-x. 1 root root 280760 Sep  3 19:57 libcrfpp.so.0.0.0

[postgres@localhost lib]$ cd /usr/lib
[postgres@localhost lib]$ ll lib*
-rw-r--r--. 1 root root 1027044 Sep  3 20:02 libbamboo.a
lrwxrwxrwx. 1 root root      14 Sep  3 20:03 libbamboo.so -> libbamboo.so.2
-rwxr-xr-x. 1 root root  250140 Sep  3 20:02 libbamboo.so.2
lrwxrwxrwx. 1 root root      25 Sep  3 23:56 libcrfpp.a -> /usr/local/lib/libcrfpp.a
lrwxrwxrwx. 1 root root      26 Sep  3 23:56 libcrfpp.so -> /usr/local/lib/libcrfpp.so
lrwxrwxrwx. 1 root root      28 Sep  3 23:56 libcrfpp.so.0 -> /usr/local/lib/libcrfpp.so.0

[postgres@localhost bamboo]$ cd /opt/bamboo/
[postgres@localhost bamboo]$ ll
total 17412
drwxr-xr-x. 2 postgres postgres     4096 Sep  3 20:03 bin
drwxr-xr-x. 2 postgres postgres     4096 Aug 15 01:52 etc
drwxr-xr-x. 4 postgres postgres     4096 Aug 15 01:52 exts
drwxr-sr-x. 2 postgres postgres     4096 Apr  1  2009 index
-rw-r--r--. 1 postgres postgres 17804377 Sep  3 23:52 index.tar.bz2
drwxr-xr-x. 2 postgres postgres     4096 Sep  3 20:03 processor
drwxr-xr-x. 2 postgres postgres     4096 Aug 15 01:52 template

5. Edit the Chinese stop-word list
The stop-word list keeps meaningless words such as 'a', '的', and '得' from being matched in searches.

[postgres@localhost tsearch_data]$ touch /home/postgres/share/tsearch_data/chinese_utf8.stop

[postgres@localhost tsearch_data]$ pwd
/home/postgres/share/tsearch_data
[postgres@localhost tsearch_data]$ more chinese_utf8.stop 
的
我
咱們

6. Compile

cd /opt/bamboo/exts/postgres/pg_tokenize
make
make install
cd /opt/bamboo/exts/postgres/chinese_parser
make
make install

7. Load the tokenizer function and the parser module

[postgres@localhost ~]$ psql
postgres=# \i /home/postgres/share/contrib/pg_tokenize.sql
SET
CREATE FUNCTION
postgres=#  \i /home/postgres/share/contrib/chinese_parser.sql
SET
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE TEXT SEARCH PARSER
CREATE TEXT SEARCH CONFIGURATION
CREATE TEXT SEARCH DICTIONARY
ALTER TEXT SEARCH CONFIGURATION
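
One way to confirm that the two scripts registered everything is to query the system catalogs. This is only a minimal sketch: the catalogs are standard PostgreSQL, the name chinesecfg is taken from the examples later in this post, and the exact parser and dictionary names depend on what chinese_parser.sql creates, so they are not spelled out here.

```sql
-- Text search configurations; chinesecfg should appear in this list
SELECT cfgname FROM pg_ts_config ORDER BY cfgname;

-- Parsers; the one created by chinese_parser.sql should sit next to "default"
SELECT prsname FROM pg_ts_parser ORDER BY prsname;

-- Dictionaries, including the one created by the script
SELECT dictname FROM pg_ts_dict ORDER BY dictname;
```

In psql, \dF, \dFp and \dFd show the same information.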

8. Graphical view after installation (screenshot not included)

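9. Installing Chinese segmentation in other databases
If one machine hosts several databases, you only need to load the tokenizer function and the parser module into each new database.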

Section 3: Usage
1. Test Chinese segmentation with tokenize

[postgres@localhost ~]$ psql -p 5432
psql (9.1.2)
Type "help" for help.

postgres=# select tokenize('中文詞分浙江人民海量的人');
            tokenize            
---------------------------------
中文詞 分 浙江 人民 海量 的 人
(1 row)

postgres=# SELECT to_tsvector('chinesecfg', '我愛北京天安門');
            to_tsvector           
-----------------------------------
'北京':3 '天安門':4 '我':1 '愛':2
(1 row)

postgres=# select tokenize('南京市長江大橋');
     tokenize     
-------------------
南京市 長江 大橋
(1 row)

postgres=# select tokenize('南京市長');
  tokenize 
------------
南京 市長
(1 row)

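The segmentation quality is fairly good. Most notably, 南京市長江大橋 (Nanjing Yangtze River Bridge) was not split into 南京, 市長, 江大橋 or something similar.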
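
To see what the configuration does with each token, rather than only the raw segmentation, the built-in ts_debug function can help. A small sketch, assuming chinesecfg was created by the steps above; the output is not shown because it depends on the installed dictionary and stop-word file.

```sql
-- Show each token, its type alias, the dictionary that handled it,
-- and the lexemes it produced.
-- An empty lexemes array ({}) means the token was dropped as a stop word;
-- NULL means no dictionary recognized it.
SELECT alias, token, dictionary, lexemes
FROM ts_debug('chinesecfg', '南京市長江大橋');
```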

2. Take an ordinary test table and add a tsvector column to hold the segmented data

ALTER TABLE t_store_adv add column index_col_ts tsvector;
UPDATE t_store_adv SET index_col_ts =
to_tsvector('chinesecfg', coalesce(adv_title,'') || ' ' || coalesce(adv_content,''));

3. Create the index

CREATE INDEX t_store_adv_idx ON t_store_adv USING gin(index_col_ts);

4. Query

[postgres@localhost ~]$ psql  -p 5432
psql (9.1.2)
Type "help" for help.

postgres=# select count(1) from t_store_adv;
 count 
-------
  38803
(1 row)

postgres=# SELECT count(1) FROM t_store_adv WHERE index_col_ts @@ to_tsquery('南京');
 count 
-------
    16
(1 row)

postgres=# explain SELECT count(1) FROM t_store_adv WHERE index_col_ts @@ to_tsquery('南京');
                                      QUERY PLAN                                      
--------------------------------------------------------------------------------------
 Aggregate  (cost=108.61..108.62 rows=1 width=0)
   ->  Bitmap Heap Scan on t_store_adv  (cost=12.21..108.55 rows=27 width=0)
         Recheck Cond: (index_col_ts @@ to_tsquery('南京'::text))
         ->  Bitmap Index Scan on t_store_adv_idx  (cost=0.00..12.21 rows=27 width=0)
               Index Cond: (index_col_ts @@ to_tsquery('南京'::text))
(5 rows)

-- Plain text search with LIKE
postgres=# select count(1) from t_store_adv where (adv_content like '%南京%' or adv_title like '%南京%');
 count 
-------
    17
(1 row)

postgres=# explain select count(1) from t_store_adv where (adv_content like '%南京%' or adv_title like '%南京%');
                                             QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
 Aggregate  (cost=1348.05..1348.06 rows=1 width=0)
   ->  Seq Scan on t_store_adv  (cost=0.00..1348.05 rows=1 width=0)
         Filter: (((adv_content)::text ~~ '%南京%'::text) OR ((adv_title)::text ~~ '%南京%'::text))
(3 rows)

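The data set in this test is not large, but the execution plans already tell the story: the estimated cost of the full-text query is far lower (108.62 versus 1348.06 for the sequential scan), at the price of some extra storage. The two queries are not strictly equivalent, which is presumably why the counts differ (16 versus 17): LIKE matches the substring anywhere, whereas full-text search matches only whole tokens produced by the segmenter. With larger data sets the gain from the index is much more pronounced; for a worked example see: http://www.oschina.net/question/96003_19020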
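
The queries above call to_tsquery without a configuration, so they rely on default_text_search_config, and they do not show the ordering mentioned in step 3 of the outline. The following sketch, assuming the t_store_adv table and chinesecfg configuration from this post, passes the configuration explicitly and orders matches with the built-in ts_rank function. The LIMIT of 10 is an arbitrary choice for illustration.

```sql
-- Parse the query with the same configuration used to build index_col_ts,
-- then return the most relevant matches first.
SELECT adv_title,
       ts_rank(index_col_ts, query) AS rank
FROM t_store_adv,
     to_tsquery('chinesecfg', '南京') AS query
WHERE index_col_ts @@ query
ORDER BY rank DESC
LIMIT 10;
```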

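Section 4: Summary
The examples left out the trigger that keeps the tsvector column up to date. Chinese full-text search can markedly speed up Chinese text queries; it is just not built in yet, so it takes a manual install of third-party tools. There are also quite a few segmentation options available, so pick whichever suits you best.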
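
For reference, the omitted trigger could look roughly like the sketch below, which uses PostgreSQL's built-in tsvector_update_trigger function. The trigger name t_store_adv_tsv_update is made up for illustration, and the public schema is an assumption about where chinesecfg was created; the function requires a schema-qualified configuration name and character-type source columns.

```sql
-- Keep index_col_ts in sync on every INSERT or UPDATE.
-- The listed columns are concatenated and NULLs are skipped, which mirrors
-- the coalesce(...) expression in the one-off UPDATE shown earlier.
-- 'public' is an assumed schema; adjust it to where chinesecfg actually lives.
CREATE TRIGGER t_store_adv_tsv_update
BEFORE INSERT OR UPDATE ON t_store_adv
FOR EACH ROW
EXECUTE PROCEDURE tsvector_update_trigger(
    index_col_ts, 'public.chinesecfg', adv_title, adv_content);
```

Rows that existed before the trigger was created still need the one-off UPDATE from step 2.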

Section 5: References
http://www.cnblogs.com/shuaixf/archive/2011/09/10/2173260.html
http://www.54chen.com/_linux_/postgresql-bamboo-lucene-part2.html ----kenyon never gone!
