要經過PostgreSQL實現相似Google搜索自動提示的功能,例如要實現一個查詢海量數據中的商品名字,每次輸入就提示用戶各類相關搜索選項,例如淘寶、京東等電商查詢html
這個功能能夠用 PostgreSQL的實時全文檢索和分詞、類似搜索、前模糊匹配等特性實現。具體策略是,定義一個搜索提示的最大數量。首先經過前模糊匹配查詢獲取數據,若是未滿最大數量,則進行全文檢索分詞查詢補償,若是全文檢索補償查詢的數據未滿最大提示數量,最後就加入相識查詢的結果。固然這裏是一個簡單的思路,複雜的還得根據實際需求實現。sql
新建一張商品表,插入一千萬條數據,name就是商品名字。dom
create table goods(id int, name varchar); insert into goods select generate_series(1,10000000),md5(random()::varchar);
實現SQL,每次輸入就做爲前綴模糊查詢:oop
select * from goods where name like '123%' ;
這個簡單的前模糊匹配SQL,可使用B-Tree來加速優化模糊查詢。post
未創建索引時查詢"123%"的商品名字,執行計劃顯示耗時大約爲575ms:優化
explain (analyze,verbose,timing,costs,buffers) select * from goods where name like '123%' ; ======================================== Gather (cost=1000.00..136516.59 rows=1000 width=37) (actual time=1.390..572.857 rows=2364 loops=1) Output: id, name Workers Planned: 2 Workers Launched: 2 Buffers: shared hit=83334 -> Parallel Seq Scan on public.goods (cost=0.00..135416.59 rows=417 width=37) (actual time=0.750..528.116 rows=788 loops=3) Output: id, name Filter: ((goods.name)::text ~~ '123%'::text) Rows Removed by Filter: 3332545 Buffers: shared hit=83334 Worker 0: actual time=1.032..511.776 rows=676 loops=1 Buffers: shared hit=24201 Worker 1: actual time=0.145..511.737 rows=755 loops=1 Buffers: shared hit=26101 Planning time: 0.065 ms Execution time: 573.157 ms
經過lc_collate方式創建索引、也就是B-Tree索引。插件
創建索引腳本以下postgresql
create index idx_c on goods(name collate "C");
執行計劃顯示耗時爲10ms之內:code
explain (analyze,verbose,timing,costs,buffers) select * from goods where name like '123%' collate "C";
創建索引腳本以下htm
create index idx_varchar on goods(name varchar_pattern_ops);
執行計劃顯示耗時爲5ms之內:
explain (analyze,verbose,timing,costs,buffers) select * from goods where name like '123%' collate "C"; ====================================== Bitmap Heap Scan on public.goods (cost=86.60..7681.10 rows=1000 width=37) (actual time=0.740..4.628 rows=2364 loops=1) Output: id, name Filter: ((goods.name)::text ~~ '123%'::text) Heap Blocks: exact=2330 Buffers: shared hit=2351 -> Bitmap Index Scan on idx_varchar (cost=0.00..86.35 rows=2179 width=0) (actual time=0.487..0.487 rows=2364 loops=1) Index Cond: (((goods.name)::text ~>=~ '123'::text) AND ((goods.name)::text ~<~ '124'::text)) Buffers: shared hit=21 Planning time: 0.139 ms Execution time: 4.891 ms
注意:全文檢索和下面的相識搜索都須要pg_trgm插件。因此先要執行:
create extension pg_trgm;
具體SQL以下,每次輸入空格用&符號代替,最後接:*表示模糊檢索。to_tsvector ,to_tsquery參閱postgresql全文檢索文檔。
SELECT name FROM goods WHERE to_tsvector('English',name) @@ to_tsquery('English','aaa&bbb&cc:*')
經過執行計劃查看速度:接近8秒
。。。。。。。 Planning time: 0.129 ms Execution time: 7986.176 ms
經過gin索引來優化加速,這裏to_tsvector('English',name)就是一個表達式索引。
CREATE INDEX name_idx ON goods USING GIN(to_tsvector('English',name));
優化後後的執行計劃,速度爲13毫秒左右:
explain (analyze,verbose,timing,costs,buffers) SELECT name FROM goods WHERE to_tsvector('English',name) @@ to_tsquery('English','aaa&bbb&cc:*') ================================================= Bitmap Heap Scan on public.goods (cost=88.04..109.24 rows=5 width=33) (actual time=17.343..17.353 rows=4 loops=1) Output: name Recheck Cond: (to_tsvector('english'::regconfig, (goods.name)::text) @@ '''aaa'' & ''bbb'' & ''cc'':*'::tsquery) Heap Blocks: exact=1 Buffers: shared hit=473 -> Bitmap Index Scan on name_idx (cost=0.00..88.04 rows=5 width=0) (actual time=17.334..17.334 rows=4 loops=1) Index Cond: (to_tsvector('english'::regconfig, (goods.name)::text) @@ '''aaa'' & ''bbb'' & ''cc'':*'::tsquery) Buffers: shared hit=472 Planning time: 0.222 ms Execution time: 13.381 ms
具體實現SQL,經過查詢結果能夠看到越類似,類似度越小,能夠看到,在搜索aaa bbb的時候搜索出了aaa b6b,這就是類似搜索。
SELECT name ,name <-> 'aaa bbb' FROM goods WHERE name <-> 'aaa bbb' < 0.7 LIMIT 10
aaa bbb 0 aaa bbb ccc 0.333333 aaa ccc bbb 0.333333 aaa bbb ccc ddd 0.5 aaa b6b ccc 0.666667 aaa bbb ccsdsd 0.466667 aaa 0.5
在第一種模糊查詢中,可使用關鍵字ILIKE替換LIKE, ILIKE表示字符串匹配時與大小寫無關。這是一個PostgreSQL擴展、並非標準SQL語法。