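Before full-text search was available, text columns were usually searched with LIKE, ~, or ILIKE. That works for small amounts of text, but it does not scale to large data sets.
Drawbacks of ordinary pattern matching:
1. There is no real linguistic support, even for English. For example, a whole-word search for friend does not find friends or friendly.
2. There is no useful way to rank the results.
3. There is little index support, so queries are slow. In particular, a pattern with % at both ends cannot use an ordinary B-tree index at all.
PostgreSQL has had built-in full-text search since version 8.3. Processing happens in three main steps:
1. Parse documents into tokens.
2. Convert tokens into lexemes, for example by stripping plural suffixes such as s/es and by removing stop words (very common words, such as '的' in Chinese) so that they are not indexed.
3. Store the preprocessed documents in a form optimized for searching: documents are stored as tsvector and queried with tsquery.
Note: tokens are the raw pieces produced by the parser and may include common words that carry no meaning; lexemes are the normalized terms that are actually worth indexing.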
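The three steps can be observed directly from psql. The following is a minimal sketch, assuming the built-in english configuration; the sample sentence is made up for illustration. ts_debug shows the raw tokens and the lexemes each one is reduced to, and to_tsvector shows the final stored form.

```sql
-- Steps 1 and 2: raw tokens from the parser and the lexemes the dictionaries return.
-- A stop word such as 'The' comes back with an empty lexeme array.
SELECT alias, token, lexemes
FROM ts_debug('english', 'The fat cats');

-- Step 3: the searchable form. The stop word is dropped, but it still
-- occupies position 1, so 'fat' is at 2 and 'cat' at 3.
SELECT to_tsvector('english', 'The fat cats');
-- 'cat':3 'fat':2
```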
1. Full-text search environment and examples:
postgres=# show default_text_search_config ;
default_text_search_config
----------------------------
pg_catalog.english
(1 row)
-- List the full-text search configurations
postgres=# \dF
List of text search configurations
Schema | Name | Description
------------+------------+---------------------------------------
pg_catalog | danish | configuration for danish language
pg_catalog | dutch | configuration for dutch language
pg_catalog | english | configuration for english language
pg_catalog | finnish | configuration for finnish language
pg_catalog | french | configuration for french language
pg_catalog | german | configuration for german language
pg_catalog | hungarian | configuration for hungarian language
pg_catalog | italian | configuration for italian language
pg_catalog | norwegian | configuration for norwegian language
pg_catalog | portuguese | configuration for portuguese language
pg_catalog | romanian | configuration for romanian language
pg_catalog | russian | configuration for russian language
pg_catalog | simple | simple configuration
pg_catalog | spanish | configuration for spanish language
pg_catalog | swedish | configuration for swedish language
pg_catalog | turkish | configuration for turkish language
(16 rows)
-- Show the details of the russian configuration
postgres=# \dF+ russian
Text search configuration "pg_catalog.russian"
Parser: "pg_catalog.default"
Token | Dictionaries
-----------------+--------------
asciihword | english_stem
asciiword | english_stem
email | simple
file | simple
float | simple
host | simple
hword | russian_stem
hword_asciipart | english_stem
hword_numpart | simple
hword_part | russian_stem
int | simple
numhword | simple
numword | simple
sfloat | simple
uint | simple
url | simple
url_path | simple
version | simple
word | russian_stem
-- List the text search templates
postgres=# \dFt+
List of text search templates
Schema | Name | Init | Lexize | Description
------------+-----------+----------------+------------------+-----------------------------------------------------------
pg_catalog | ispell | dispell_init | dispell_lexize | ispell dictionary
pg_catalog | simple | dsimple_init | dsimple_lexize | simple dictionary: just lower case and check for stopword
pg_catalog | snowball | dsnowball_init | dsnowball_lexize | snowball stemmer
pg_catalog | synonym | dsynonym_init | dsynonym_lexize | synonym dictionary: replace word by its synonym
pg_catalog | thesaurus | thesaurus_init | thesaurus_lexize | thesaurus dictionary: phrase by phrase substitution
(5 rows)
-- List the text search dictionaries
postgres=# \dFd+
List of text search dictionaries
Schema | Name | Template | Init options | Description
------------+-----------------+---------------------+---------------------------------------------------+-----------------------------------------------------------
pg_catalog | danish_stem | pg_catalog.snowball | language = 'danish', stopwords = 'danish' | snowball stemmer for danish language
pg_catalog | dutch_stem | pg_catalog.snowball | language = 'dutch', stopwords = 'dutch' | snowball stemmer for dutch language
pg_catalog | english_stem | pg_catalog.snowball | language = 'english', stopwords = 'english' | snowball stemmer for english language
pg_catalog | finnish_stem | pg_catalog.snowball | language = 'finnish', stopwords = 'finnish' | snowball stemmer for finnish language
pg_catalog | french_stem | pg_catalog.snowball | language = 'french', stopwords = 'french' | snowball stemmer for french language
pg_catalog | german_stem | pg_catalog.snowball | language = 'german', stopwords = 'german' | snowball stemmer for german language
pg_catalog | hungarian_stem | pg_catalog.snowball | language = 'hungarian', stopwords = 'hungarian' | snowball stemmer for hungarian language
pg_catalog | italian_stem | pg_catalog.snowball | language = 'italian', stopwords = 'italian' | snowball stemmer for italian language
pg_catalog | norwegian_stem | pg_catalog.snowball | language = 'norwegian', stopwords = 'norwegian' | snowball stemmer for norwegian language
pg_catalog | portuguese_stem | pg_catalog.snowball | language = 'portuguese', stopwords = 'portuguese' | snowball stemmer for portuguese language
pg_catalog | romanian_stem | pg_catalog.snowball | language = 'romanian' | snowball stemmer for romanian language
pg_catalog | russian_stem | pg_catalog.snowball | language = 'russian', stopwords = 'russian' | snowball stemmer for russian language
pg_catalog | simple | pg_catalog.simple | | simple dictionary: just lower case and check for stopword
pg_catalog | spanish_stem | pg_catalog.snowball | language = 'spanish', stopwords = 'spanish' | snowball stemmer for spanish language
pg_catalog | swedish_stem | pg_catalog.snowball | language = 'swedish', stopwords = 'swedish' | snowball stemmer for swedish language
pg_catalog | turkish_stem | pg_catalog.snowball | language = 'turkish', stopwords = 'turkish' | snowball stemmer for turkish language
-- List the text search parsers; add a plus sign (\dFp+) to see the details
postgres=# \dFp
List of text search parsers
Schema | Name | Description
------------+---------------+---------------------
pg_catalog | chineseparser |
pg_catalog | default | default word parser
(2 rows)
The related configuration files are usually found under $PGHOME/share; stop-word files are stored in $PGHOME/share/tsearch_data.
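The same information is also reachable from plain SQL, which helps when working outside psql. This is a small sketch using the system catalog pg_ts_config and the built-in ts_lexize function; the test words are chosen only for illustration.

```sql
-- Catalog equivalent of \dF: list the available configurations.
SELECT cfgname FROM pg_catalog.pg_ts_config ORDER BY cfgname;

-- Change the default configuration for the current session only.
SET default_text_search_config = 'pg_catalog.english';

-- Test one dictionary in isolation: ts_lexize returns the lexemes for a single token.
SELECT ts_lexize('english_stem', 'stars');
-- {star}

-- A stop word yields an empty array.
SELECT ts_lexize('english_stem', 'a');
-- {}
```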
2. Practical examples, using English text
postgres=# SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery as search;
search
--------
t
(1 row)
postgres=# SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector as search;
search
--------
f
(1 row)
postgres=# SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat') as search;
search
--------
t
(1 row)
postgres=# SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat & rat') as search;
search
--------
f
(1 row)
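-- With the default english configuration: to_tsvector differs from a ::tsvector cast in that the former normalizes the text, while the latter assumes the input is already normalized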
postgres=# SELECT to_tsvector('english','fat cats ate fat rats') @@ to_tsquery('english','fat & rat') as search;
search
--------
t
(1 row)
-- plainto_tsquery does not recognize operators or weight labels in its input
postgres=# SELECT plainto_tsquery('english', 'The Fat & Rats:C');
plainto_tsquery
---------------------
'fat' & 'rat' & 'c'
(1 row)
-- plainto_tsquery ignores punctuation and inserts & between the remaining words; ::tsquery and to_tsquery require explicit operators between the terms
postgres=# SELECT plainto_tsquery('english', 'The Fat Rats');
plainto_tsquery
-----------------
'fat' & 'rat'
(1 row)
postgres=# SELECT 'The & Fat & Rats'::tsquery;
tsquery
------------------------
'The' & 'Fat' & 'Rats'
(1 row)
postgres=# SELECT to_tsquery('english', 'The & Fat & Rats');
to_tsquery
---------------
'fat' & 'rat'
(1 row)
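Besides &, a tsquery can combine terms with | (OR), ! (NOT), and parentheses, and can match on a prefix with :*. The following sketch reuses the sample sentence from above; the query terms are made up for illustration.

```sql
-- OR and NOT in addition to AND: 'fat' is present, 'rat' satisfies (rat | dog),
-- and 'cow' is absent, so the whole query matches.
SELECT to_tsvector('english', 'fat cats ate fat rats')
       @@ to_tsquery('english', 'fat & (rat | dog) & !cow') AS search;
-- t

-- Prefix matching: ':*' matches any lexeme that starts with the given prefix,
-- here the lexeme 'cat'.
SELECT to_tsvector('english', 'fat cats ate fat rats')
       @@ to_tsquery('english', 'ca:*') AS search;
-- t
```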
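3. Creating indexes for full-text search
There are two approaches. The first is to build an expression index that applies the conversion function to the existing document column. The second is to add a new tsvector column holding the converted document (kept up to date by a trigger) and to index that column. The second approach is recommended.
Method 1. Index the original column (the statements below are alternatives that share the name pgweb_idx, so create only one of them)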
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', body));
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_name, body)); -- per-row configuration: config_name is a column of table pgweb
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', title || ' ' || body));
Method 2. Add a converted column and index it
ALTER TABLE pgweb ADD COLUMN textsearchable_index_col tsvector; -- the new column has type tsvector
UPDATE pgweb SET textsearchable_index_col = to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));
CREATE INDEX textsearch_idx ON pgweb USING gin(textsearchable_index_col);
SELECT title FROM pgweb WHERE textsearchable_index_col @@ to_tsquery('create & table') ORDER BY last_mod_date DESC LIMIT 10;
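Notes:
a. With the separate-column approach, a trigger is also needed to keep the new column up to date whenever the source columns change.
b. The advantage of the expression index is that it is simple and takes less space. The drawback is that to_tsvector may have to be called again at query time to verify index matches.
c. The advantage of the separate-column approach is faster queries (to_tsvector does not have to be called each time), especially with a GiST index. The drawback is that the extra column consumes more storage.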
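How a query has to be written depends on which approach is used. The sketch below assumes the pgweb table from above; the search term and the names textsearchable_gen_col and textsearch_gen_idx are hypothetical. The last statement shows a stored generated column, which is available from PostgreSQL 12 and keeps the tsvector current without a trigger.

```sql
-- Expression index (method 1): the query must repeat the indexed expression,
-- including the configuration name, or the index cannot be used.
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');

-- Check the plan; on a small table the planner may still prefer a sequential scan.
EXPLAIN
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');

-- Separate column (method 2) on PostgreSQL 12 or later: a stored generated
-- column is recomputed automatically when title or body changes.
ALTER TABLE pgweb
    ADD COLUMN textsearchable_gen_col tsvector
    GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
    ) STORED;
CREATE INDEX textsearch_gen_idx ON pgweb USING gin (textsearchable_gen_col);
```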
4. Examples of built-in utility functions
These include to_tsvector, to_tsquery, tsvector_update_trigger, tsvector_update_trigger_column, ts_stat, and others.
-- tsvector_update_trigger example
CREATE TABLE messages (
title text,
body text,
tsv tsvector
);
CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON messages FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);
INSERT INTO messages VALUES('title here','the body text is here');
postgres=# select * from messages;
title | body | tsv
------------+-----------------------+----------------------------
title here | the body text is here | 'bodi':4 'text':5 'titl':1
(1 row)
postgres=# SELECT title, body FROM messages WHERE tsv @@ to_tsquery('title & body');
title | body
------------+-----------------------
title here | the body text is here
(1 row)
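tsvector_update_trigger treats all source columns alike. When a match in the title should count for more than a match in the body, a custom trigger function can assign weights with setweight, and ts_rank can then order the results. The following is a sketch only: it uses a separate, hypothetical table weighted_messages so that the messages table above stays unchanged, and the function and trigger names are also made up.

```sql
CREATE TABLE weighted_messages (
    title text,
    body  text,
    tsv   tsvector
);

-- Build the tsvector by hand: title lexemes get weight A, body lexemes weight D.
CREATE FUNCTION weighted_messages_tsv() RETURNS trigger AS $$
BEGIN
    NEW.tsv :=
        setweight(to_tsvector('pg_catalog.english', coalesce(NEW.title, '')), 'A') ||
        setweight(to_tsvector('pg_catalog.english', coalesce(NEW.body, '')), 'D');
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER weighted_messages_tsv_trg BEFORE INSERT OR UPDATE
    ON weighted_messages FOR EACH ROW EXECUTE PROCEDURE weighted_messages_tsv();

INSERT INTO weighted_messages (title, body)
VALUES ('kenyon retired', 'an open-source lover'),
       ('tennis news', 'a former number 1 player retired');

-- Rows whose title matches should rank above rows where only the body matches.
SELECT title, ts_rank(tsv, query) AS rank
FROM weighted_messages, to_tsquery('english', 'retired') AS query
WHERE tsv @@ query
ORDER BY rank DESC;
```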
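-- Using ts_stat
-- List the words that occur in the documents, ordered by frequency
-- nentry is the total number of occurrences of the word
-- ndoc is the number of documents (tsvectors) the word occurs in; repeats within one document count once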
postgres=# select * from messages;
title | body | tsv
----------------------+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------
title here | the body text is here | 'bodi':4 'text':5 'titl':1
kenyon | a chinese boy | 'boy':4 'chines':3 'kenyon':1
Andy Roddick retired | Andy Roddick retired,a former rank number 1 player in tennis | '1':11 'andi':1,4 'former':8 'number':10 'player':12 'rank':9 'retir':3,6 'roddick':2,5 'tenni':14
kenyon retired | kenyon retired,a open-source lover,inserting in this area | 'area':13 'insert':10 'kenyon':1,3 'lover':9 'open':7 'open-sourc':6 'retir':2,4 'sourc':8
Michael Jordan | MJ is an American former professional basketball player | 'american':6 'basketbal':9 'former':7 'jordan':2 'michael':1 'mj':3 'player':10 'profession':8
(5 rows)
postgres=# SELECT * FROM ts_stat('SELECT tsv FROM messages') ORDER BY nentry DESC, ndoc DESC, word LIMIT 10;
word | ndoc | nentry
-----------+------+--------
retir | 2 | 4
kenyon | 2 | 3
former | 2 | 2
player | 2 | 2
andi | 1 | 2
roddick | 1 | 2
1 | 1 | 1
american | 1 | 1
area | 1 | 1
basketbal | 1 | 1
(10 rows)
5. Limitations of full-text search
1.The length of each lexeme must be less than 2K bytes
2.The length of a tsvector (lexemes + positions) must be less than 1 megabyte
3.The number of lexemes must be less than 2^64
4.Position values in tsvector must be greater than 0 and no more than 16,383
5.No more than 256 positions per lexeme
6.The number of nodes (lexemes + operators) in a tsquery must be less than 32,768
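To get a rough idea of how close stored documents are to these limits, the tsvector column can be inspected directly. This is a sketch against the messages table from section 4. length() on a tsvector returns the number of lexemes, while pg_column_size() returns the stored size, which may reflect compression and is therefore only an approximate indicator.

```sql
-- Largest tsvectors first: lexeme count and on-disk size per row.
SELECT title,
       length(tsv)         AS lexeme_count,
       pg_column_size(tsv) AS stored_bytes
FROM messages
ORDER BY stored_bytes DESC
LIMIT 5;
```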
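6. Summary: The above covers the environment and practical usage of PostgreSQL's built-in full-text search. The built-in search does not currently support Chinese, but there are good third-party tools that can be combined with it. The next post continues with setting up and using Chinese full-text search in PostgreSQL.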