編寫Postgres擴展之三:調試


在上一篇關於編寫Postgres Extensions的文章中,咱們從頭開始建立了一個新的數據類型base36。 然而,咱們留下了一個致使咱們的服務器崩潰的嚴重bug。html

如今讓咱們用調試器來查找這個錯誤並完成測試工具。node

咱們根據這些系列中關於編寫PostgreSQL擴展的內容建立了一個專用的github repo。上一篇文章中的代碼能夠在分支部分ii上找到,今天的更改在分支部分iii上。git

Bug

首先讓咱們復現這個bug。github

test=# CREATE EXTENSION base36;
test=# CREATE TABLE base36_test(val base36);
test=#  EXPLAIN SELECT * FROM base36_test where '3c'::base36 > val;
server closed the connection unexpectedly
  This probably means the server terminated abnormally
  before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 680,225 ms
!>

咱們固然不但願這種狀況發生在咱們的生產數據庫上,因此讓咱們找出問題所在。咱們只編寫了兩個相對簡單的C函數base36_outbase36_in。若是咱們假設咱們並不比PostgreSQL核心團隊的人更聰明 - 至少對我我的來講這是一個合理的假設 - 那麼bug必定在其中一箇中。sql

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(base36_in);
Datum
base36_in(PG_FUNCTION_ARGS)
{
    long result;
    char *str = PG_GETARG_CSTRING(0);
    result = strtol(str, NULL, 36);
    PG_RETURN_INT32((int32)result);
}

PG_FUNCTION_INFO_V1(base36_out);
Datum
base36_out(PG_FUNCTION_ARGS)
{
    int32 arg = PG_GETARG_INT32(0);
    if (arg < 0)
        ereport(ERROR,
            (
             errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
             errmsg("negative values are not allowed"),
             errdetail("value %d is negative", arg),
             errhint("make it positive")
            )
        );
    char base36[36] = "0123456789abcdefghijklmnopqrstuvwxyz";

    /* max 6 char + '\0' */
    char *buffer        = palloc(7 * sizeof(char));
    unsigned int offset = 7 * sizeof(char);
    buffer[--offset]    = '\0';

    do {
        buffer[--offset] = base36[arg % 36];
    } while (arg /= 36);

    PG_RETURN_CSTRING(&buffer[offset]);
}

設置調試環境

爲了使用想LLDB這樣的調試器,你須要開啓調試符號編譯Postgresql。下面的簡短指南適應於安裝了Postgresql和Homebrew的MacOS。經過Xcode使用LLDB。shell

首先,咱們先關閉全部正在運行的Postgresql實例-你不想搞亂你現有的數據庫或其餘工做:)數據庫

$ cd /usr/local/opt/postgresql
$ launchctl unload homebrew.mxcl.postgresql.plist
# Double check it’s not running:
$ psql some_db
psql: could not connect to server: No such file or directory
  Is the server running locally and accepting
  connections on Unix domain socket "/tmp/.s.PGSQL.5432"?

而後,咱們經過運行下面的腳本下載postrgesql源碼。安全

$ cd ~
$ curl https://ftp.postgresql.org/pub/source/v9.4.4/postgresql-9.4.4.tar.bz2 | bzip2 -d | tar x
$ cd postgresql-9.4.4

並啓用調試選項進行構建。服務器

$  ./configure --enable-cassert --enable-debug CFLAGS="-ggdb"
$ make
$ sudo make install

咱們將跳過Postgres文檔推薦的adduser命令。相反,我將使用本身的用戶賬戶運行Postgres,以便更輕鬆地進行調試。app

$ sudo chown $(whoami) /usr/local/pgsql

而後初始化數據目錄

$ /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data

接着啓動服務器

$ /usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l /usr/local/pgsql/data/postmaster.log start

將新安裝的pgsql/bin 路徑添加到path環境變量中

$ export PATH=/usr/local/pgsql/bin:$PATH

安裝擴展(因爲上面的export,此次使用新安裝的pgxn

$ make && make install

如今咱們能夠建立test數據庫了

$ /usr/local/pgsql/bin/createdb test

而後鏈接到這個數據庫

$ /usr/local/pgsql/bin/psql test

檢查是否正常

test=# CREATE EXTENSION base36 ;
CREATE EXTENSION
test=# CREATE TABLE base36_test(val base36);
CREATE TABLE
test=# INSERT INTO base36_test VALUES ('123'), ('3c'), ('5A'), ('zZz');
INSERT 0 4
test=# EXPLAIN SELECT * FROM base36_test where val='3c';
server closed the connection unexpectedly
  This probably means the server terminated abnormally
  before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

調試

如今咱們已經配置好了調試環境,讓咱們開始實際追蹤問題。首先,咱們看一下日誌文件。也就是咱們用-l指定的文件。根據上面的步驟,文件路徑是/usr/local/pgsql/data/postmaster.log

TRAP: FailedAssertion("!(pointer == (void *) (((uintptr_t) ((pointer)) + ((8) - 1)) & ~((uintptr_t) ((8) - 1))))", File: "mcxt.c", Line: 699)
LOG:  server process (PID 6515) was terminated by signal 6: Abort trap
DETAIL:  Failed process was running: EXPLAIN SELECT * FROM base36_test where val='3c';
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted; last known up at 2015-10-09 15:11:18 CEST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/22D0868
LOG:  record with zero length at 0/2359140
LOG:  redo done at 0/2359110
LOG:  last completed transaction was at log time 2015-10-09 15:12:01.344859+02
LOG:  MultiXact member wraparound protections are now enabled
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

從新鏈接到數據庫並找出當前數據庫會話的pid

test=# SELECT pg_backend_pid();
 pg_backend_pid
----------------
           6644
(1 row)

將LLDB與pid鏈接(在另外一個終端中)

lldb -p 6644

在psql會話中運行以前失敗命令

EXPLAIN SELECT * FROM base36_test where val='3c';

繼續LLDB

(lldb) c
Process 6644 resuming
Process 6644 stopped
* thread #1: tid = 0x84aa4, 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fff906d2286 <+10>: jae    0x7fff906d2290            ; <+20>
    0x7fff906d2288 <+12>: movq   %rax, %rdi
    0x7fff906d228b <+15>: jmp    0x7fff906cdc53            ; cerror_nocancel
    0x7fff906d2290 <+20>: retq

從LLDB中獲取到一個回溯

(lldb) bt
* thread #1: tid = 0x84aa4, 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff910f39f9 libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x00007fff848f19b3 libsystem_c.dylib`abort + 129
    frame #3: 0x0000000108328549 postgres`ExceptionalCondition(conditionName="!(pointer == (void *) (((uintptr_t) ((pointer)) + ((8) - 1)) & ~((uintptr_t) ((8) - 1))))", errorType="FailedAssertion", fileName="mcxt.c", lineNumber=699) + 137 at assert.c:54
    frame #4: 0x000000010836129d postgres`pfree(pointer=0x00007ff9e1813674) + 173 at mcxt.c:699
    frame #5: 0x00000001082ab9e3 postgres`get_const_expr(constval=0x00007ff9e1806708, context=0x00007fff57e824c8, showtype=0) + 707 at ruleutils.c:8002
    frame #6: 0x00000001082a5f79 postgres`get_rule_expr(node=0x00007ff9e1806708, context=0x00007fff57e824c8, showimplicit='\x01') + 281 at ruleutils.c:6647
    frame #7: 0x00000001082acf22 postgres`get_rule_expr_paren(node=0x00007ff9e1806708, context=0x00007fff57e824c8, showimplicit='\x01', parentNode=0x00007ff9e1806788) + 146 at ruleutils.c:6600

...more

咱們獲得了什麼?異常在mcxt.c:699定義的pfree拋出。pfree在ruleutils.c:8002中被get_const_expr調用。若是咱們去調用堆棧四次。 咱們最終到了這裏:

(lldb) up
frame #4: 0x000000010836129d postgres`pfree(pointer=0x00007ff9e1813674) + 173 at mcxt.c:699
   696     * allocated chunk.
   697     */
   698    Assert(pointer != NULL);
-> 699    Assert(pointer == (void *) MAXALIGN(pointer));
   700
   701    /*
   702     * OK, it's probably safe to look at the chunk header.

讓咱們看看mcxt.c中的源碼

文件名mcxt.c:699

/*
 * pfree
 *    Release an allocated chunk.
 */
void
pfree(void *pointer)
{
  MemoryContext context;

  /*
   * Try to detect bogus pointers handed to us, poorly though we can.
   * Presumably, a pointer that isn't MAXALIGNED isn't pointing at an
   * allocated chunk.
   */
  Assert(pointer != NULL);
  Assert(pointer == (void *) MAXALIGN(pointer));

  /*
   * OK, it's probably safe to look at the chunk header.
   */
  context = ((StandardChunkHeader *)
         ((char *) pointer - STANDARDCHUNKHEADERSIZE))->context;

  AssertArg(MemoryContextIsValid(context));

  (*context->methods->free_p) (context, pointer);
  VALGRIND_MEMPOOL_FREE(context, pointer);
}

Postgres使用pfree從當前內存上下文釋放內存。不知何故,咱們把內存搞砸了。

咱們看一下指針的內容

(lldb) print (char *)pointer
(char *) $0 = 0x00007ff9e1813674 "3c"

這確實是咱們的搜索條件3c。因此咱們這裏作錯了什麼?正如第一篇文章提到的,pfreepalloc是Postgres中freemalloc的對等體,用於在當前內存上下文中安全地分配和釋放內存。不知何故咱們搞錯了。在base36_out中咱們使用

char *buffer = palloc0(7 * sizeof(char));

來分配了7個字節的內存。最終咱們偏移4返回一個指針

PG_RETURN_CSTRING(&buffer[offset]);

mcxt.c:699中的斷言以下:

Assert(pointer == (void *) MAXALIGN(pointer));

確保要釋放的數據正確對齊。這裏的條件是

(pointer == (void *) (((uintptr_t) ((pointer)) + ((8) - 1)) & ~((uintptr_t) ((8) - 1))))

當指針以8字節的倍數開始時讀取?因爲返回的地址與分配的地址不一樣,所以pfree會因指針沒有對齊而報錯。

讓咱們來修復他!

文件名:base36.c

PG_FUNCTION_INFO_V1(base36_out);
Datum
base36_out(PG_FUNCTION_ARGS)
{
    int32 arg = PG_GETARG_INT32(0);
    if (arg < 0)
        ereport(ERROR,
            (
             errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
             errmsg("negative values are not allowed"),
             errdetail("value %d is negative", arg),
             errhint("make it positive")
            )
        );
    char base36[36] = "0123456789abcdefghijklmnopqrstuvwxyz";

    /* max 6 char + '\0' */
    char buffer[7];
    unsigned int offset = sizeof(buffer);
    buffer[--offset] = '\0';

    do {
        buffer[--offset] = base36[arg % 36];
    } while (arg /= 36);

    PG_RETURN_CSTRING(pstrdup(&buffer[offset]));
}

如今咱們從堆棧中分配緩衝區(第18行),最後咱們使用pstrdup複製字符串到新分配的內存(第26行)。這種實現更接近 - 幾乎至關於維基百科

您可能已經猜到pstrdup是Postgres對應的strdup。它經過palloc安全地從當前內存上下文中獲取內存,並在事務結束時自動釋放。

類型轉換

如今咱們能夠爲咱們的類型輸入和輸出數據了。從其餘類型轉換或者轉換爲其餘類型都行。

文件名:base36–0.0.1.sql

-- type and operators omitted

CREATE CAST (integer as base36) WITHOUT FUNCTION AS IMPLICIT;
CREATE CAST (base36 as integer) WITHOUT FUNCTION AS IMPLICIT;

這是相對容易的。 因爲整數和base36是二進制可強制的(即二進制內部表示相同),轉換能夠免費完成(WITHOUT FUNCTION)。咱們還將此轉換標記爲隱式轉換,從而告訴postgres它能夠在適當的時候自動執行轉換。例如,考慮這個查詢:

test=# SELECT 10::integer + '5a'::base36;
 ?column?
----------
      200
(1 row)

沒有定義整數+ base36運算符,可是經過將base36隱式轉換爲integer Postgres可使用integer + integer運算符並將結果轉換爲integer。然而,隱式轉換應該當心定義,由於某些操做的結果多是可疑的。對於上面的操做,用戶不知道結果是整數仍是base36,所以可能會誤解它。若是咱們後來決定添加一個返回base36的運算符integer + base36,查詢將徹底中斷。

更使人困惑的多是這個查詢結果

test=# SELECT -50::base36;
 ?column?
----------
      -50
(1 row)

雖然咱們不容許負值,但咱們在這裏獲得了一個,這怎麼可能呢?內部Postgres執行如下操做:

test=# SELECT -(50::base36)::integer;
 ?column?
----------
      -50

咱們能夠也應該避免這種使人困惑的行爲。一種選擇是向base36輸出添加前綴(就像十六進制或八進制數同樣),或者將責任交給用戶,只容許顯式強制轉換。

澄清事情的另外一個選擇是標記zhuanhuan AS ASSIGNMENT.
只有當您爲base36類型指定一個整數時,纔會自動執行該轉換,反之亦然。這一般適用於INSERT或UPDATE語句。讓咱們試試這個

文件名:base36–0.0.1.sql

-- type and operators omitted

CREATE CAST (integer as base36) WITHOUT FUNCTION AS ASSIGNMENT;
CREATE CAST (base36 as integer) WITHOUT FUNCTION AS ASSIGNMENT;

填充咱們的表格

test=# INSERT INTO base36_test SELECT i FROM generate_series(1,100) i;
INSERT 0 100
SELECT * FROM base36_test ORDER BY val LIMIT 12;
 val
-----
 1
 2
 3
 4
 5
 6
 7
 8
 9
 a
 b
 c
(12 rows)

更多內容

You have seen how important it is to test everything, not only to find bugs that in the worst case might crash the server, but also to specify the expected output from certain operations such as casts. In the next post we’ll elaborate on that creating a full-coverage test suite.

相關文章
相關標籤/搜索