Python內存數據庫/引擎

時間 2019-11-26

原文原文鏈接

1 初探

　　在平時的開發工做中，咱們可能會有這樣的需求：咱們但願有一個內存數據庫或者數據引擎，用比較Pythonic的方式進行數據庫的操做（好比說插入和查詢）。html

　　舉個具體的例子，分別向數據庫db中插入兩條數據，"a=1, b=1" 和 "a=1, b=2", 而後想查詢a=1的數據可能會使用這樣的語句db.query(a=1)，結果就是返回前面插入的兩條數據；若是想查詢a=1, b=2的數據，就使用這樣的語句db.query(a=1, b=2)，結果就返回前面的第二條數據。python

　　那麼是否擁有實現上述需求的現成的第三方庫呢？幾經查找，發現PyDbLite可以知足這樣的需求。其實，PyDbLite和Python自帶的SQLite均支持內存數據庫模式，只是前者是Pythonic的用法，然後者則是典型的SQL用法。
他們具體的用法是這樣的：git

PyDbLitegithub

import pydblite
# 使用內存數據庫
pydb = pydblite.Base(':memory:')
# 建立a,b,c三個字段
pydb.create('a', 'b', 'c')
# 爲字段a,b建立索引
pydb.create_index('a', 'b')
# 插入一條數據
pydb.insert(a=-1, b=0, c=1)
# 查詢符合特定要求的數據
results = pydb(a=-1, b=0)

SQLitesql

import sqlite3
# 使用內存數據庫
con = sqlite3.connect(':memory:')
# 建立a,b,c三個字段
cur = con.cursor()
cur.execute('create table test (a char(256), b char(256), c char(256));')
# 爲字段a,b建立索引
cur.execute('create index a_index on test(a)')
cur.execute('create index b_index on test(b)')
# 插入一條數據
cur.execute('insert into test values(?, ?, ?)', (-1,0,1))
# 查詢符合特定要求的數據
cur.execute('select * from test where a=? and b=?',(-1, 0))

2 pydblite和sqlite的性能

　　毫無疑問，pydblite的使用方式很是地Pythonic，可是它的效率如何呢？因爲咱們主要關心的是數據插入和查詢速度，因此不妨僅對這兩項作一個對比。寫一個簡單的測試腳本：數據庫

import time
count = 100000

def timeit(func):
    def wrapper(*args, **kws):
        t = time.time()
        func(*args)
        print time.time() - t, kws['des']
    return wrapper

@timeit
def test_insert(mdb, des=''):
    for i in xrange(count):
        mdb.insert(a=i-1, b=i, c=i+1)

@timeit
def test_query_object(mdb, des=''):
    for i in xrange(count):
        c = mdb(a=i-1, b=i)

@timeit
def test_sqlite_insert(cur, des=''):
    for i in xrange(count):
        cur.execute('insert into test values(?, ?, ?)', (i-1, i, i+1))

@timeit
def test_sqlite_query(cur, des=''):
    for i in xrange(count):
        cur.execute('select * from test where a=? and b=?', (i-1, i))

print '-------pydblite--------'
import pydblite
pydb = pydblite.Base(':memory:')
pydb.create('a', 'b', 'c')
pydb.create_index('a', 'b')
test_insert(pydb, des='insert')
test_query_object(pydb, des='query, object call')


print '-------sqlite3--------'
import sqlite3
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('create table test (a char(256), b char(256), c char(256));')
cur.execute('create index a_index on test(a)')
cur.execute('create index b_index on test(b)')
test_sqlite_insert(cur, des='insert')
test_sqlite_query(cur, des='query')

　　在建立索引的狀況下，10w次的插入和查詢的時間以下：數據結構

-------pydblite--------
1.14199995995 insert
0.308000087738 query, object call
-------sqlite3--------
0.411999940872 insert
0.30999994278 query

　　在未建立索引的狀況（把建立索引的測試語句註釋掉）下，1w次的插入和查詢時間以下：app

-------pydblite--------
0.0989999771118 insert
5.15300011635 query, object call
-------sqlite3--------
0.0169999599457 insert
7.43400001526 query

　　咱們不可貴出以下結論：ide

　　sqlite的插入速度是pydblite的3-5倍；而在創建索引的狀況下，sqlite的查詢速度和pydblite至關；在未創建索引的狀況下，sqlite的查詢速度比pydblite慢1.5倍左右。函數

3 優化

　　咱們的目標很是明確，使用Pythonic的內存數據庫，提升插入和查詢效率，而不考慮持久化。那麼可否既擁有pydblite的pythonic的使用方式，又同時具有pydblite和sqlite中插入和查詢速度快的那一方的速度？針對咱們的目標，看看可否對pydblite作一些優化。

　　閱讀pydblite的源碼，首先映入眼簾的是對python2和3作了一個簡單的區分。給外部調用的Base基於_BasePy2或者_BasePy3，它們僅僅是在__iter__上有細微差別，最終調用的是_Base這個類。

class _BasePy2(_Base):

    def __iter__(self):
        """Iteration on the records"""
        return iter(self.records.itervalues())


class _BasePy3(_Base):

    def __iter__(self):
        """Iteration on the records"""
        return iter(self.records.values())

if sys.version_info[0] == 2:
    Base = _BasePy2
else:
    Base = _BasePy3

　　而後看下_Base的構造函數，作了簡單的初始化文件的操做，因爲咱們就是使用內存數據庫，因此文件相關的內容徹底能夠拋棄。

class _Base(object):

    def __init__(self, path, protocol=pickle.HIGHEST_PROTOCOL, save_to_file=True,
                 sqlite_compat=False):
        """protocol as defined in pickle / pickle.
        Defaults to the highest protocol available.
        For maximum compatibility use protocol = 0

        """
        self.path = path
        """The path of the database in the file system"""
        self.name = os.path.splitext(os.path.basename(path))[0]
        """The basename of the path, stripped of its extension"""
        self.protocol = protocol
        self.mode = None
        if path == ":memory:":
            save_to_file = False
        self.save_to_file = save_to_file
        self.sqlite_compat = sqlite_compat
        self.fields = []
        """The list of the fields (does not include the internal
        fields __id__ and __version__)"""
        # if base exists, get field names
        if save_to_file and self.exists():
            if protocol == 0:
                _in = open(self.path)  # don't specify binary mode !
            else:
                _in = open(self.path, 'rb')
            self.fields = pickle.load(_in)

　　緊接着比較重要的是create（建立字段）、create_index（建立索引）兩個函數：

    def create(self, *fields, **kw):
        """
        Create a new base with specified field names.

        Args:
            - \*fields (str): The field names to create.
            - mode (str): the mode used when creating the database.

        - if mode = 'create' : create a new base (the default value)
        - if mode = 'open' : open the existing base, ignore the fields
        - if mode = 'override' : erase the existing base and create a
          new one with the specified fields

        Returns:
            - the database (self).
        """
        self.mode = kw.get("mode", 'create')
        if self.save_to_file and os.path.exists(self.path):
            if not os.path.isfile(self.path):
                raise IOError("%s exists and is not a file" % self.path)
            elif self.mode is 'create':
                raise IOError("Base %s already exists" % self.path)
            elif self.mode == "open":
                return self.open()
            elif self.mode == "override":
                os.remove(self.path)
            else:
                raise ValueError("Invalid value given for 'open': '%s'" % open)

        self.fields = []
        self.default_values = {}
        for field in fields:
            if type(field) is dict:
                self.fields.append(field["name"])
                self.default_values[field["name"]] = field.get("default", None)
            elif type(field) is tuple:
                self.fields.append(field[0])
                self.default_values[field[0]] = field[1]
            else:
                self.fields.append(field)
                self.default_values[field] = None

        self.records = {}
        self.next_id = 0
        self.indices = {}
        self.commit()
        return self

    def create_index(self, *fields):
        """
        Create an index on the specified field names

        An index on a field is a mapping between the values taken by the field
        and the sorted list of the ids of the records whose field is equal to
        this value

        For each indexed field, an attribute of self is created, an instance
        of the class Index (see above). Its name it the field name, with the
        prefix _ to avoid name conflicts

        Args:
            - fields (list): the fields to index
        """
        reset = False
        for f in fields:
            if f not in self.fields:
                raise NameError("%s is not a field name %s" % (f, self.fields))
            # initialize the indices
            if self.mode == "open" and f in self.indices:
                continue
            reset = True
            self.indices[f] = {}
            for _id, record in self.records.items():
                # use bisect to quickly insert the id in the list
                bisect.insort(self.indices[f].setdefault(record[f], []), _id)
            # create a new attribute of self, used to find the records
            # by this index
            setattr(self, '_' + f, Index(self, f))
        if reset:
            self.commit()

　　能夠看出，pydblite在內存中維護了一個名爲records的字典變量，用來存放一條條的數據。它的key是內部維護的id，從0開始自增；而它的value則是用戶插入的數據，爲了後續查詢和記錄的方便，這裏在每條數據中額外又加入了__id__和__version__。其次，內部維護的indices字典變量則是是個索引表，它的key是字段名，而value則是這樣一個字典：其key是這個字段全部已知的值，value是這個值所在的那條數據的id。

　　舉個例子，假設咱們插入了「a=-1,b=0,c=1」和「a=0,b=1,c=2」兩條數據，那麼records和indices的內容會是這樣的：

# records
{0: {'__id__': 0, '__version__': 0, 'a': -1, 'b': 0, 'c': 1},
 1: {'__id__': 1, '__version__': 0, 'a': 0, 'b': 1, 'c': 2}}

# indices
{'a': {-1: [0], 0: [1]}, 'b': {0: [0], 1: [1]}}

　　比方說如今咱們想查找a=0的數據，那麼就會在indices中找key爲'a'的value，即{-1: set([0]), 0: set([1])}，而後在這裏面找key爲0的value，即[1]，由此咱們直到了咱們想要的這條數據它的id是1（也可能會有多個）；假設咱們對數據還有其餘要求好比a=0,b=1，那麼它會繼續上述的查找過程，找到a=0和b=1分別對應的ids，作交集，就獲得了知足這兩個條件的ids，而後再到records里根據ids找到全部對應的數據。

　　明白了原理，咱們再看看有什麼可優化的地方：

　　數據結構，總體的records和indeices數據結構已經挺精簡了，暫時不須要優化。其中的__version__能夠不要，由於咱們並不關注這個數據被修改了幾回。其次是因爲indices中最終的ids是個list，在查詢和插入的時候會比較慢，咱們知道內部維護的id必定是惟一的，因此這裏改爲set會好一些。

　　python語句，不難看出，整個_Base爲了同時兼容python2和python3，不得不使用了2和3都支持的語句，這就致使在部分語句上針對特定版本的python就會形成浪費或者說是性能開銷。好比說，d是個字典，那麼爲了同事兼容python2和3，做者使用了相似與for key in d.keys()這樣的語句，在python2中，d.keys()會首先產生一個list，用d.iterkeys是個更明智的方案。再如，做者會使用相似set(d.keys()) - set([1])這樣的語句，可是python2中，使用d.viewkeys() - set([1])效率將會更高，由於它不須要將list轉化成set。

　　對特定版本python的優化語句就不一一舉例，歸納地說，從數據結構，python語句以及是否須要某些功能等方面能夠對pydblite作進一步的優化。前面只是說了create和create_index兩個函數，包括insert和__call__的優化也十分相似。此外，用普通方法來代替魔法方法，也能稍微提高下效率，因此在後續的優化中將__call__改寫爲了query。

　　優化後的代碼，請見MemLite。

4 memlite、pydblite和sqlite的性能

　　讓咱們在上文的測試代碼中加入對memlite的測試：

@timeit
def test_query_method(mdb, des=''):
    for i in xrange(count):
        c = mdb.query(a=i-1, b=i)

print '-------memlite-------'
import memlite
db = memlite.Base()
db.create('a', 'b', 'c')
db.create_index('a', 'b')
test_insert(db, des='insert')
test_query_method(db, des='query, method call')

在建立索引的狀況下，10w次的插入和查詢的時間以下：

-------memlite-------
0.378000020981 insert
0.285000085831 query, method call
-------pydblite--------
1.3140001297 insert
0.309000015259 query, object call
-------sqlite3--------
0.414000034332 insert
0.3109998703 query

　　在未建立索引的狀況（把建立索引的測試語句註釋掉）下，1w次的插入和查詢時間以下：

-------memlite-------
0.0179998874664 insert
5.90199995041 query, method call
-------pydblite--------
0.0980000495911 insert
4.87400007248 query, object call
-------sqlite3--------
0.0170001983643 insert
7.42399978638 query

　　能夠看出，在建立索引的狀況下，memlite的插入和查詢性能在sqlite和pydblite之上；而在未建立索引的狀況下，memlite的插入性能和sqlite同樣，好於pydblite，memlite的查詢性能比pydblite稍差，但好於sqlite。綜合來看，memlite即擁有pydblite的pythonic的使用方式，又擁有pydblite和sqlite中性能較高者的效率，符合預期的優化目標。

轉載請註明出處：http://www.cnblogs.com/dreamlofter/p/5843355.html 謝謝！