Using the SQLite JSON1 and FTS5 Extensions with Python

Date: 2019-12-09


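Back in September, word got around programming circles about a file named json1.c that had appeared in the SQLite source repository. I also wrote up some tricks for compiling pysqlite with the new json1 extension. Now that SQLite 3.9.0 has been released, none of that effort is necessary any more.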

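SQLite 3.9.0 is a big release. Besides the much-anticipated json1 extension, it adds fts5, a new version of the full-text search extension. fts5 improves the performance of complex queries and provides an out-of-the-box implementation of the BM25 ranking algorithm, which also matters well beyond SQLite for relevance ranking. See the release notes for the full list of new features.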

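This post shows how to compile SQLite with the json1 and fts5 extensions. We will then build Python drivers against the new SQLite library and use the new features from Python. Because I like both pysqlite and apsw, the steps below include instructions for building each of them. Finally, we will run queries that use the json1 and fts5 extensions through the peewee ORM.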

Getting started

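Start by getting the latest SQLite source code. One way is to use fossil, SQLite's source-code management system; the other is to download a snapshot tarball. SQLite uses tcl and awk to build the source amalgamation, so before starting you need to install the following tools:

tcl
awk (available on most unix systems)
fossil (optional)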

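There are several steps involved, and I will try to break them down as finely as possible. First, set aside a fresh directory for the new library. I put mine in ~/bin/jqlite, but you can choose whatever location you prefer.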

export JQLITE="$HOME/bin/jqlite"
mkdir -p $JQLITE
cd $JQLITE

To get the source with fossil, run the following commands:

fossil clone http://www.sqlite.org/cgi/src sqlite.fossil
fossil open sqlite.fossil

To get a snapshot tarball instead, run the following commands:

curl 'https://www.sqlite.org/src/tarball/sqlite.tar.gz?ci=trunk' | tar xz
mv sqlite/* .

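If you would rather use an official release, download the autoconf tarball from the SQLite download page and extract its contents into the $JQLITE directory.

Compiling SQLite with json1 and fts5

Once the code is downloaded, you should be in a shell in the same directory as the SQLite source tree. SQLite supports a large number of compile-time configuration options, and besides json1 and fts5 there are quite a few other useful ones.

Compilation follows the typical configure -> make -> make install sequence: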

export CFLAGS="-DSQLITE_ENABLE_COLUMN_METADATA=1 \
-DSQLITE_ENABLE_DBSTAT_VTAB=1 \
-DSQLITE_ENABLE_FTS3=1 \
-DSQLITE_ENABLE_FTS3_PARENTHESIS=1 \
-DSQLITE_ENABLE_FTS5=1 \
-DSQLITE_ENABLE_JSON1=1 \
-DSQLITE_ENABLE_RTREE=1 \
-DSQLITE_ENABLE_UNLOCK_NOTIFY \
-DSQLITE_ENABLE_UPDATE_DELETE_LIMIT \
-DSQLITE_SECURE_DELETE \
-DSQLITE_SOUNDEX \
-DSQLITE_TEMP_STORE=3 \
-fPIC"
LIBS="-lm" ./configure --prefix=$JQLITE --enable-static --enable-shared
make
make install

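After the build there should be a lib/libsqlite3.a file in your SQLite source checkout. If the file is missing, check the console output for errors. This has worked for me on both arch and ubuntu, but I am not sure whether it works on fapple or windoze.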
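One way to confirm which of these flags actually made it into a build is to ask SQLite itself. The following is a minimal sketch, not part of the original post. It assumes Python 3 and the standard-library sqlite3 module (the same statement works through any driver built later in this post) and uses PRAGMA compile_options, which lists option names without the SQLITE_ prefix. Recent SQLite releases build the JSON functions in by default, so ENABLE_JSON1 may be absent from the list even when JSON support is present.

```python
import sqlite3


def compile_options(conn):
    """Return the compile-time options reported by the linked SQLite library."""
    return [row[0] for row in conn.execute("PRAGMA compile_options")]


conn = sqlite3.connect(":memory:")
options = compile_options(conn)

print("SQLite version:", sqlite3.sqlite_version)
for name in ("ENABLE_FTS5", "ENABLE_JSON1", "ENABLE_RTREE"):
    # Option names are reported without the leading "SQLITE_".
    print(name, name in options)
```

The output depends entirely on how the linked library was compiled, so no expected output is shown here.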

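Building pysqlite

Most Python developers will be familiar with pysqlite, which is more or less the same as the sqlite3 module in the Python standard library. To build pysqlite against the new libsqlite3, the only thing you need to do is edit setup.cfg so that it points at the include and lib directories we just created.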

git clone https://github.com/ghaering/pysqlite
cd pysqlite/
cp ../sqlite3.c .
echo -e "library_dirs=$JQLITE/lib" >> setup.cfg
echo -e "include_dirs=$JQLITE/include" >> setup.cfg
LIBS="-lm" python setup.py build_static

To test the build, change into the build/lib.linux-xfoobar/ directory, start a Python interpreter, and run the following:

>>> from pysqlite2 import dbapi2 as sqlite
>>> conn = sqlite.connect(':memory:')
>>> conn.execute('CREATE VIRTUAL TABLE testing USING fts5(data);')
<pysqlite2.dbapi2.Cursor object at 0x7ff7d0a2dc60>
>>> conn.execute('SELECT json(?)', (1337,)).fetchone()
(u'1337',)
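The same smoke test can be wrapped in a small reusable function. This is a hedged sketch rather than anything from the original post: probe_extensions is a hypothetical helper name, it is written for Python 3, and it is shown with the standard-library sqlite3 module. Any DB-API module that exposes connect() and OperationalError, such as pysqlite2.dbapi2, should be usable in its place.

```python
import sqlite3


def probe_extensions(dbapi):
    """Report whether json1 and fts5 are usable through a DB-API module.

    `dbapi` is any module exposing connect() and OperationalError,
    for example the stdlib sqlite3 module.
    """
    checks = {
        "json1": "SELECT json('[1, 2]')",
        "fts5": "CREATE VIRTUAL TABLE probe_fts USING fts5(data)",
    }
    conn = dbapi.connect(":memory:")  # throwaway database, discarded afterwards
    results = {}
    try:
        for name, sql in checks.items():
            try:
                conn.execute(sql)
                results[name] = True
            except dbapi.OperationalError:
                # Raised for "no such function" or "no such module".
                results[name] = False
    finally:
        conn.close()
    return results


print(probe_extensions(sqlite3))
```

A result of False for either key suggests that the driver is linked against a SQLite library built without that extension.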

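What you do next is up to you. You can run python setup.py install, or you can symlink the newly built pysqlite2 (look in build/lib.linux.../) into your $PYTHONPATH. If you want to use the new pysqlite inside a virtualenv, activate the virtualenv first, then go back to the pysqlite directory and run setup.py install.

Building apsw

Building apsw is almost identical to building pysqlite.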

cd $JQLITE
git clone https://github.com/rogerbinns/apsw
cd apsw
cp ../sqlite3{ext.h,.h,.c} .
echo -e "library_dirs=$JQLITE/lib" >> setup.cfg
echo -e "include_dirs=$JQLITE/include" >> setup.cfg
LIBS="-lm" python setup.py build

To test the new apsw library, change into the build/libXXX directory, start a Python interpreter, and run the following:

>>> import apsw
>>> conn = apsw.Connection(':memory:')
>>> cursor = conn.cursor()
>>> cursor.execute('CREATE VIRTUAL TABLE testing USING fts5(data);')
<apsw.Cursor at 0x7fcf6b17fa80>
>>> cursor.execute('SELECT json(?)', (1337,)).fetchone()
(u'1337',)

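You can install the new apsw system-wide by running python setup.py install, or symlink the apsw.so library (look in build/lib.linux.../) into your $PYTHONPATH. If you want to use apsw inside a virtualenv, activate the virtualenv first, then go back to the apsw directory and run setup.py install.

Using the JSON extension

The json1 extension has some neat features, particularly the json_tree and json_each functions/virtual tables (see the SQLite documentation for details). To show off these new features, we will use peewee, a small Python ORM, to store some JSON data and then query it.

I originally planned to pull test data from GitHub's API, but to keep things as concise as possible I wrote a small JSON file instead. Its structure looks like this: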

[{
   "title": "My List of Python and SQLite Resources",
   "url": "http://charlesleifer.com/blog/my-list-of-python-and-sqlite-resources/", 
   "metadata": {"tags": ["python", "sqlite"]}
 }, 
 {
   "title": "Using SQLite4's LSM Storage Engine as a Stand-alone NoSQL Database with Python",
   "url": "http://charlesleifer.com/blog/using-sqlite4-s-lsm-storage-engine-as-a-stand-alone-nosql-database-with-python/", 
   "metadata": {"tags": ["nosql", "python", "sqlite", "cython"]}
  },
  ...]

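If you would rather follow along in IPython notebook format, see the notebook linked from the original post.

Populating the database

Fetch the JSON data file and decode it: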

>>> import json, urllib2
>>> fh = urllib2.urlopen('http://media.charlesleifer.com/downloads/misc/blogs.json')
>>> data = json.loads(fh.read())
>>> data[0]
{u'metadata': {u'tags': [u'python', u'sqlite']},
 u'title': u'My List of Python and SQLite Resources',
 u'url': u'http://charlesleifer.com/blog/my-list-of-python-and-sqlite-resources/'}

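Now we need to tell peewee how to access our database, which means subclassing the SQLite database class so that it uses our custom pysqlite interface. I import the pysqlite2 we just built under the alias jqlite so that there is no confusion with other SQLite drivers. After defining the database class, we create an in-memory database. (Note: in the upcoming 2.6.5 release, peewee will automatically use pysqlite2 if it was compiled against a newer SQLite version than sqlite3.)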

>>> from pysqlite2 import dbapi2 as jqlite
>>> from peewee import *
>>> from playhouse.sqlite_ext import *
>>> class JQLiteDatabase(SqliteExtDatabase):
...     def _connect(self, database, **kwargs):
...         conn = jqlite.connect(database, **kwargs)
...         conn.isolation_level = None
...         self._add_conn_hooks(conn)
...         return conn
...
>>> db = JQLiteDatabase(':memory:')

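Populating the database with the JSON data is simple. First create a generic table with a single TEXT field. At this time SQLite does not expose a separate column or data type for JSON, so we use TextField: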

>>> class Entry(Model):
...     data = TextField()
...     class Meta:
...         database = db
... 
>>> Entry.create_table()
>>> with db.atomic():
...     for entry_json in data:
...         Entry.create(data=json.dumps(entry_json))
...
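Because the JSON lives in an ordinary TEXT column, nothing stops malformed JSON from being stored. As a hedged sketch outside of peewee, assuming Python 3 and a SQLite build with the JSON functions available, a CHECK constraint using json1's json_valid() can reject bad rows. The table and column names mirror the Entry model, but the schema is illustrative and is not the one peewee generates.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# json_valid() returns 1 for well-formed JSON, so the CHECK rejects anything else.
conn.execute(
    "CREATE TABLE entry ("
    " id INTEGER PRIMARY KEY,"
    " data TEXT NOT NULL CHECK (json_valid(data)))"
)

# A well-formed document is accepted.
conn.execute("INSERT INTO entry (data) VALUES (?)",
             (json.dumps({"title": "ok"}),))

# A malformed document violates the CHECK constraint.
try:
    conn.execute("INSERT INTO entry (data) VALUES (?)", ("{not json",))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

count = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
print(rejected, count)  # True 1
```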

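JSON functions

Let's start with json_extract(). It takes a dotted/bracketed path describing the element to find (postgres uses [] for this). Each Entry in the database has a single data column, and each data column holds one JSON object. Each JSON object has a title, a URL, and a top-level metadata key. Here is how to extract the titles of the entries: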

>>> title = fn.json_extract(Entry.data, '$.title')
>>> query = (Entry
...          .select(title.alias('title'))
...          .order_by(title)
...          .limit(5))
...
>>> [row for row in query.dicts()]
[{'title': u'A Tour of Tagging Schemas: Many-to-many, Bitmaps and More'},
 {'title': u'Alternative Redis-Like Databases with Python'},
 {'title': u'Building the SQLite FTS5 Search Extension'},
 {'title': u'Connor Thomas Leifer'},
 {'title': u'Extending SQLite with Python'}]

This corresponds to the following SQL query:

SELECT json_extract("t1"."data", '$.title') AS title 
FROM "entry" AS t1 
ORDER BY json_extract("t1"."data", '$.title')
LIMIT 5
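For readers who want to try json_extract() without peewee, the same query can be run through any SQLite driver. The sketch below assumes Python 3 and the standard-library sqlite3 module linked against a SQLite with JSON support. The two sample entries are made up for illustration and are not taken from the blogs.json file.

```python
import json
import sqlite3

# Two made-up entries shaped like the sample file.
entries = [
    {"title": "Second post", "url": "http://example.com/2",
     "metadata": {"tags": ["python"]}},
    {"title": "First post", "url": "http://example.com/1",
     "metadata": {"tags": ["sqlite", "python"]}},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, data TEXT NOT NULL)")
with conn:  # commit the inserts as one transaction
    conn.executemany("INSERT INTO entry (data) VALUES (?)",
                     [(json.dumps(e),) for e in entries])

# '$' is the document root; '$.title' selects the top-level "title" key.
titles = [row[0] for row in conn.execute(
    "SELECT json_extract(data, '$.title') AS title "
    "FROM entry ORDER BY title LIMIT 5")]
print(titles)  # ['First post', 'Second post']
```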

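In the next example we extract the entries that carry a particular tag. To search the list of tags we use the json_each() function. It behaves like a table (it is in fact a virtual table) and returns one row for each direct child of the given JSON path. Here is how to retrieve the titles of entries tagged 'sqlite':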

>>> from peewee import Entity
>>> tags_src = fn.json_each(Entry.data, '$.metadata.tags').alias('tags')
>>> tags_ref = Entity('tags')

>>> query = (Entry
...          .select(title.alias('title'))
...          .from_(Entry, tags_src)
...          .where(tags_ref.value == 'sqlite')
...          .order_by(title))
... 
>>> [row for row, in query.tuples()]
[u'Building the SQLite FTS5 Search Extension',
 u'Extending SQLite with Python',
 u'Meet Scout, a Search Server Powered by SQLite',
 u'My List of Python and SQLite Resources',
 u'Querying Tree Structures in SQLite using Python and the Transitive Closure Extension',
 u"Using SQLite4's LSM Storage Engine as a Stand-alone NoSQL Database with Python",
 u'Web-based SQLite Database Browser, powered by Flask and Peewee']

The SQL for the above query helps clarify what is going on:

SELECT json_extract("t1"."data", '$.title') AS title 
FROM
    "entry" AS t1, 
    json_each("t1"."data", '$.metadata.tags') AS tags 
WHERE ("tags"."value" = 'sqlite') 
ORDER BY json_extract("t1"."data", '$.title')
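The same join can be exercised directly from Python. This is a minimal sketch assuming Python 3 and a standard-library sqlite3 module with JSON support. The sample rows and the titles_tagged helper are hypothetical, introduced only to show json_each() acting as a table in the FROM clause.

```python
import json
import sqlite3

# Made-up rows; only the shape matches the article's data.
entries = [
    {"title": "A", "metadata": {"tags": ["python", "sqlite"]}},
    {"title": "B", "metadata": {"tags": ["nosql", "python"]}},
    {"title": "C", "metadata": {"tags": ["sqlite"]}},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, data TEXT NOT NULL)")
conn.executemany("INSERT INTO entry (data) VALUES (?)",
                 [(json.dumps(e),) for e in entries])


def titles_tagged(conn, tag):
    """Return titles of entries whose metadata.tags array contains `tag`."""
    sql = """
        SELECT json_extract(entry.data, '$.title') AS title
        FROM entry, json_each(entry.data, '$.metadata.tags') AS tags
        WHERE tags.value = ?
        ORDER BY title
    """
    # json_each() yields one row per array element, joined to its entry row.
    return [row[0] for row in conn.execute(sql, (tag,))]


print(titles_tagged(conn, "sqlite"))  # ['A', 'C']
```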

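As queries grow more complex, being able to encapsulate pieces of them in Peewee objects becomes more useful and makes the code easier to reuse.

Here is another example using json_each(). This time we select the title of each entry and build a comma-separated string of its tags. We reuse the tags_src and tags_ref defined above.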

>>> query = (Entry
...          .select(
...              title.alias('title'),
...              fn.group_concat(tags_ref.value, ', ').alias('tags'))
...          .from_(Entry, tags_src)
...          .group_by(title)
...          .limit(5))
...
>>> [row for row in query.tuples()]
[(u'A Tour of Tagging Schemas: Many-to-many, Bitmaps and More',
  u'peewee, sql, python'),
 (u'Alternative Redis-Like Databases with Python',
  u'python, walrus, redis, nosql'),
 (u'Building the SQLite FTS5 Search Extension',
  u'sqlite, search, python, peewee'),
 (u'Connor Thomas Leifer', u'thoughts'),
 (u'Extending SQLite with Python', u'peewee, python, sqlite')]

For clarity, here is the corresponding SQL query:

SELECT 
    json_extract("t1"."data", '$.title') AS title, 
    group_concat("tags"."value", ', ') AS tags 
FROM 
    "entry" AS t1, 
    json_each("t1"."data", '$.metadata.tags') AS tags 
GROUP BY json_extract("t1"."data", '$.title') 
LIMIT 5
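A rough stand-alone version of this aggregation, again assuming Python 3 and a standard-library sqlite3 with JSON support, is sketched below. The rows are invented. Because SQLite does not promise any particular element order inside group_concat(), the sketch splits the result back into a set instead of relying on the order of the tags.

```python
import json
import sqlite3

# Made-up rows; only the shape matches the article's data.
entries = [
    {"title": "A", "metadata": {"tags": ["python", "sqlite"]}},
    {"title": "B", "metadata": {"tags": ["redis"]}},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, data TEXT NOT NULL)")
conn.executemany("INSERT INTO entry (data) VALUES (?)",
                 [(json.dumps(e),) for e in entries])

sql = """
    SELECT json_extract(entry.data, '$.title') AS title,
           group_concat(tags.value, ', ') AS tags
    FROM entry, json_each(entry.data, '$.metadata.tags') AS tags
    GROUP BY title
    ORDER BY title
"""
# Map each title to the set of tags recovered from the concatenated string.
tags_by_title = {title: set(tags.split(", "))
                 for title, tags in conn.execute(sql)}
print(tags_by_title)
```

One consequence of the join is that an entry whose tags array is empty produces no json_each() rows and therefore does not appear in the result at all.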

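The last function I will cover is json_tree(). Like json_each(), json_tree() is a multi-valued function that behaves like a table. The difference is that json_each() only returns the direct children of the specified path, while json_tree() recursively walks the whole object and returns all of its descendants.

If the tags key were nested at an arbitrary location in the entry, here is how we could match entries that carry a given tag: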

>>> tree = fn.json_tree(Entry.data, '$').alias('tree')
>>> parent = fn.json_tree(Entry.data, '$').alias('parent')

>>> tree_ref = Entity('tree')
>>> parent_ref = Entity('parent')

>>> query = (Entry
...          .select(title.alias('title'))
...          .from_(Entry, tree, parent)
...          .where(
...              (tree_ref.parent == parent_ref.id) &
...              (parent_ref.key == 'tags') &
...              (tree_ref.value == 'sqlite'))
...          .order_by(title))
...
>>> [title for title, in query.tuples()]
[u'Building the SQLite FTS5 Search Extension',
 u'Extending SQLite with Python',
 u'Meet Scout, a Search Server Powered by SQLite',
 u'My List of Python and SQLite Resources',
 u'Querying Tree Structures in SQLite using Python and the Transitive Closure Extension',
 u"Using SQLite4's LSM Storage Engine as a Stand-alone NoSQL Database with Python",
 u'Web-based SQLite Database Browser, powered by Flask and Peewee']

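The code above selects the Entry itself along with two trees representing the Entry's child nodes. Because each tree node contains a reference to its parent, we can simply search for a parent node named 'tags' that has a child node whose value is 'sqlite'.
Here it is in SQL: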

SELECT json_extract("t1"."data", '$.title') AS title 
FROM 
    "entry" AS t1, 
    json_tree("t1"."data", '$') AS tree, 
    json_tree("t1"."data", '$') AS parent 
WHERE (
    ("tree"."parent" = "parent"."id") AND 
    ("parent"."key" = 'tags') AND 
    ("tree"."value" = 'sqlite')) 
ORDER BY json_extract("t1"."data", '$.title')
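To see why json_tree() helps when the tags key can sit at any depth, here is a small sketch with invented documents that place tags at different levels. It assumes Python 3 and a standard-library sqlite3 with JSON support, and titles_with_nested_tag is a hypothetical helper name.

```python
import json
import sqlite3

# Made-up documents with "tags" at different depths.
entries = [
    {"title": "Top-level tags", "tags": ["sqlite"]},
    {"title": "Nested tags",
     "metadata": {"extra": {"tags": ["python", "sqlite"]}}},
    {"title": "No match", "metadata": {"tags": ["redis"]}},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, data TEXT NOT NULL)")
conn.executemany("INSERT INTO entry (data) VALUES (?)",
                 [(json.dumps(e),) for e in entries])


def titles_with_nested_tag(conn, tag):
    """Match `tag` under any key named "tags", regardless of nesting depth."""
    sql = """
        SELECT DISTINCT json_extract(entry.data, '$.title') AS title
        FROM entry,
             json_tree(entry.data, '$') AS tree,
             json_tree(entry.data, '$') AS parent
        WHERE tree.parent = parent.id   -- link each node to its parent node
          AND parent.key = 'tags'       -- the parent must be a "tags" array
          AND tree.value = ?            -- the child is the tag we want
        ORDER BY title
    """
    return [row[0] for row in conn.execute(sql, (tag,))]


print(titles_with_nested_tag(conn, "sqlite"))
# ['Nested tags', 'Top-level tags']
```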

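This only scratches the surface of what the json1 extension can do, and I plan to try out more of its features over the coming weeks. Feel free to leave me a comment, or if you have specific questions about the extension, ask on the sqlite-users mailing list.

FTS5 with Python

The code in this section builds on the JSON examples above: we take the titles from the Entry data and use them to populate a search index. peewee 2.6.5 will include an FTS5Model class, which is currently available on the GitHub master branch.

Going back to the JSON example, we create another table to serve as a search index over the Entry data.

The fts5 extension requires that columns have no types or constraints. The only extra information that can be attached to a column is UNINDEXED, which means the column's data is stored but cannot be searched.

Let's define a search index for the entry model that lets us search the titles to find the associated URL. To do that, the url field needs to be declared UNINDEXED.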

class EntryIndex(FTS5Model):
    title = SearchField()
    url = SearchField(unindexed=True)
    class Meta:
        database = db
        options = {'tokenize': 'porter', 'prefix': '2,3'}

EntryIndex.create_table()

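For fts5, the options dictionary supplies additional metadata: how to tokenize the fields, and which prefix lengths to index for fast prefix queries. The SQL that creates the table looks like this: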

CREATE VIRTUAL TABLE "entryindex" USING fts5 (
    "title" ,
    "url"  UNINDEXED,
    prefix=2,3,
    tokenize=porter)

To populate the index, we use a pair of JSON functions to copy data over from the Entry model:

title = fn.json_extract(Entry.data, '$.title').alias('title')
url = fn.json_extract(Entry.data, '$.url').alias('url')
query = Entry.select(title, url).dicts()
with db.atomic():
    for entry in query:
        EntryIndex.create(**entry)
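Without peewee, the index can be created and filled in a single INSERT ... SELECT, letting SQLite do the JSON extraction. The following is a sketch under a few assumptions: Python 3, a standard-library sqlite3 linked against a SQLite with both JSON and FTS5 support, and made-up sample rows. The quoted option values (tokenize = 'porter', prefix = '2 3') follow the form shown in the FTS5 documentation.

```python
import json
import sqlite3

# Made-up rows standing in for the Entry data.
entries = [
    {"title": "Extending things", "url": "http://example.com/a"},
    {"title": "Searching things", "url": "http://example.com/b"},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, data TEXT NOT NULL)")
conn.executemany("INSERT INTO entry (data) VALUES (?)",
                 [(json.dumps(e),) for e in entries])

# Columns carry no types; url is stored but left out of the full-text index.
conn.execute("""
    CREATE VIRTUAL TABLE entryindex USING fts5(
        title,
        url UNINDEXED,
        tokenize = 'porter',
        prefix = '2 3')
""")

# Copy title and url out of the JSON documents into the index.
with conn:
    conn.execute("""
        INSERT INTO entryindex (title, url)
        SELECT json_extract(data, '$.title'), json_extract(data, '$.url')
        FROM entry
    """)

indexed = conn.execute("SELECT COUNT(*) FROM entryindex").fetchone()[0]
print(indexed)  # 2
```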

With the index populated, let's run some queries:

>>> query = EntryIndex.search('sqlite').limit(3)
>>> for result in query:
...     print result.title

Extending SQLite with Python
Building the SQLite FTS5 Search Extension
My List of Python and SQLite Resources

The SQL for the above query is:

SELECT "t1"."title", "t1"."url" 
FROM "entryindex" AS t1 
WHERE ("entryindex" MATCH 'sqlite') 
ORDER BY rank

We can also retrieve the relevance scores along with the results:

>>> query = EntryIndex.search('sqlite AND python', with_score=True)
>>> for result in query:
...     print round(result.score, 3), result.title

-1.259 Extending SQLite with Python
-1.059 My List of Python and SQLite Resources
-0.838 Querying Tree Structures in SQLite using Python and the Transitive Closure Extension

These results look quite accurate. The SQL for the above query is:

SELECT "t1"."title", "t1"."url", rank AS score 
FROM "entryindex" AS t1 
WHERE ("entryindex" MATCH 'sqlite AND python') 
ORDER BY rank
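The scored search can also be sketched with plain SQL from Python. This example assumes Python 3 and a standard-library sqlite3 with FTS5 enabled. The indexed titles and example.com URLs are placeholders, and search is a hypothetical helper. In FTS5 the built-in rank column holds the BM25 score multiplied by -1, so better matches have lower values and ORDER BY rank returns the best match first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE entryindex USING "
    "fts5(title, url UNINDEXED, tokenize = 'porter')")

# Placeholder rows; not the article's real data set.
conn.executemany("INSERT INTO entryindex (title, url) VALUES (?, ?)", [
    ("Extending SQLite with Python", "http://example.com/1"),
    ("A list of Python and SQLite resources", "http://example.com/2"),
    ("Python packaging notes", "http://example.com/3"),
    ("Redis data structures", "http://example.com/4"),
    ("Notes on Go channels", "http://example.com/5"),
    ("Rust ownership basics", "http://example.com/6"),
])


def search(conn, terms):
    """Return (title, url, score) rows for an FTS5 query, best match first."""
    sql = """
        SELECT title, url, rank AS score
        FROM entryindex
        WHERE entryindex MATCH ?
        ORDER BY rank
    """
    return conn.execute(sql, (terms,)).fetchall()


results = search(conn, "sqlite AND python")
for title, url, score in results:
    print(score, title)
```

Exact scores depend on the contents of the index, so none are shown here.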

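This post has only touched on the basics of the fts5 extension. If you read through the documentation you will find many more powerful features. Here are some examples:

Multi-column indexes, with different weights assigned to each column when ranking
Prefix queries, quoted phrases, and NEAR queries for terms that appear close together
Combining the query types above with boolean operators
The default unicode61 tokenizer and the porter stemming tokenizer, both available out of the box
A new C API for defining custom ranking functions and tokenizers
A vocabulary table for querying term counts and inspecting the index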
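A few of the features in this list can be tried in a handful of lines. The sketch below is illustrative only: it assumes Python 3 and a standard-library sqlite3 with FTS5 enabled, and the docs table, its rows, and the search helper are invented for the example. It shows a prefix query, a column filter, a boolean query, per-column weights passed to bm25(), and an fts5vocab table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body, prefix = '2 3')")
conn.executemany("INSERT INTO docs (title, body) VALUES (?, ?)", [
    ("SQLite tips", "Using sqlite from python"),
    ("Python notes", "Redis and sqlite"),
    ("Redis intro", "Key value store"),
    ("Go basics", "Goroutines and channels"),
    ("Rust intro", "Ownership and borrowing"),
])


def search(match, order_by="rank"):
    """Return matching titles; order_by must be a trusted SQL expression."""
    sql = "SELECT title FROM docs WHERE docs MATCH ? ORDER BY " + order_by
    return [row[0] for row in conn.execute(sql, (match,))]


prefix_hits = search("sql*")              # prefix query
title_only = search("title:redis")        # restrict the match to one column
boolean_hits = search("redis NOT python") # boolean operators
# Weight matches in title ten times more heavily than matches in body.
weighted = search("python", order_by="bm25(docs, 10.0, 1.0)")

# A "row" vocabulary table exposes per-term statistics for the index.
conn.execute("CREATE VIRTUAL TABLE docs_vocab USING fts5vocab('docs', 'row')")
doc_count, total = conn.execute(
    "SELECT doc, cnt FROM docs_vocab WHERE term = 'sqlite'").fetchone()

print(sorted(prefix_hits), title_only, boolean_hits, weighted[0])
print(doc_count, total)  # 2 3
```

In the vocabulary table, doc is the number of rows containing the term and cnt is the total number of occurrences across the table.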

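Thanks for reading

Adding a JSON extension to SQLite is a good thing for both the project and its users. Postgresql and MySQL both support a JSON data type already, and I am glad to see SQLite following in their footsteps. JSON in the database is not always what you need, though; in some cases a dedicated embedded document store such as UnQLite is a better fit.

The json1.c file itself is also worth noting. Dr. Hipp has said that json1.c is only a first step and that there is much more room to grow. So whatever problems the current version may have, I am confident that future releases will improve a great deal in both performance and APIs. I also believe he will consider using a more efficient binary format.

It is also great to see SQLite continuing to improve its full-text search extension, giving users a built-in ranking algorithm along with an API for adding whatever else they need.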

Original article: http://charlesleifer.com/blog/using-the-sqlite-json1-and-fts5-extensions-with-python/

This article is reposted from the official OneAPM blog.