PostgreSQL database fails to start

Date: 2019-12-04

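When a database fails to start, the error message usually tells you what to do. This article covers cases where the error reported at startup is serious and the fix is less obvious.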

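To simulate the failure, look up the on-disk location of one of the indexes on the pg_class system catalog, then change the file's contents arbitrarily with vim.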

postgres=# select pg_relation_filepath('pg_class_oid_index');
 pg_relation_filepath
----------------------
 base/13219/36870
(1 row)

$ cd $PGDATA
$ vim  base/13219/36870

Restart the database.

$ pg_ctl restart -m fast
waiting for server to shut down.... done
server stopped
waiting for server to start....2019-03-12 11:59:17.312 CST [5688] LOG:  00000: listening on IPv4 address "0.0.0.0", port 1921
2019-03-12 11:59:17.312 CST [5688] LOCATION:  StreamServerPort, pqcomm.c:593
2019-03-12 11:59:17.312 CST [5688] LOG:  00000: listening on IPv6 address "::", port 1921
2019-03-12 11:59:17.312 CST [5688] LOCATION:  StreamServerPort, pqcomm.c:593
2019-03-12 11:59:17.314 CST [5688] LOG:  00000: listening on Unix socket "./.s.PGSQL.1921"
2019-03-12 11:59:17.314 CST [5688] LOCATION:  StreamServerPort, pqcomm.c:587
2019-03-12 11:59:17.400 CST [5688] LOG:  00000: redirecting log output to logging collector process
2019-03-12 11:59:17.400 CST [5688] HINT:  Future log output will appear in directory "log".
2019-03-12 11:59:17.400 CST [5688] LOCATION:  SysLogger_Start, syslogger.c:667
 done
server started

The database starts normally, and the log shows no errors.

Connecting to the database, however, fails with an error:

$ psql
psql: FATAL:  could not read block 1 in file "base/13219/36870": read only 32756 of 32768 bytes
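
The message itself hints at the kind of damage. PostgreSQL reads relation files in fixed-size blocks (32768 bytes in this cluster, as the message shows; a default build uses 8192), so a healthy relation file should be a whole number of blocks. Here the read of block 1 came back 12 bytes short. A minimal sketch of a check for this, assuming a POSIX shell; `check_block_alignment` is a hypothetical helper name and not part of PostgreSQL:

```shell
# Hypothetical helper: report whether a relation file is a whole number of blocks.
# Usage: check_block_alignment FILE [BLOCK_SIZE]
# BLOCK_SIZE defaults to 8192; the cluster in this article uses 32768.
check_block_alignment() {
    file=$1
    block_size=${2:-8192}
    # Read from stdin so wc prints only the byte count; tr drops any padding.
    size=$(wc -c < "$file" | tr -d ' ')
    remainder=$((size % block_size))
    if [ "$remainder" -eq 0 ]; then
        echo "ok: $((size / block_size)) full blocks"
    else
        echo "truncated: last block has $remainder of $block_size bytes"
        return 1
    fi
}

# Example (commented out; needs access to the data directory):
# check_block_alignment "$PGDATA/base/13219/36870" 32768
```

The block size of a cluster can be read from the "Database block size" line of pg_controldata. A file that passes this check can still be corrupt; the check only catches files whose length has changed.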

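Because the failure above was simulated, we already know which table or index is damaged. If you hit this problem unexpectedly and cannot get into the database, use oid2name to identify the database and the object involved.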

$ oid2name
All databases:
    Oid  Database Name  Tablespace
----------------------------------
  13219       postgres  pg_default
  16393           swrd  pg_default
  13218      template0  pg_default
      1      template1  pg_default
$ oid2name  -f 36870
From database "postgres":
  Filenode          Table Name
------------------------------
     36870  pg_class_oid_index
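
The two numbers that oid2name needs are both in the error message: in a path of the form base/<database OID>/<filenode>, the first is the database OID and the second is the filenode. A minimal sketch of extracting them, assuming the object lives in the default tablespace (paths under pg_tblspc look different); `parse_relation_path` is a hypothetical helper name:

```shell
# Hypothetical helper: pull the database OID and the filenode out of an
# error message that mentions a path like base/<database oid>/<filenode>.
parse_relation_path() {
    printf '%s\n' "$1" |
        sed -n 's|.*base/\([0-9][0-9]*\)/\([0-9][0-9]*\).*|\1 \2|p'
}

msg='psql: FATAL:  could not read block 1 in file "base/13219/36870": read only 32756 of 32768 bytes'
result=$(parse_relation_path "$msg")
db_oid=${result% *}      # text before the space
filenode=${result#* }    # text after the space
echo "database OID: $db_oid, filenode: $filenode"
```

Look the database OID up in the plain oid2name listing to get the database name, then pass that name with -d and the filenode with -f.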

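In my case the database starts but cannot be connected to. The approaches below also apply when the database fails to start with a similar error.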

1. Start the database in single-user mode

Enter the database in single-user mode:

$ postgres --single   -P -d 1

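-P makes the backend ignore system indexes when reading the system catalogs, so a damaged catalog index does not block access.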

-d 1 sets the debug level to 1. Levels run from 1 to 5; the higher the number, the more detailed the log.

$ postgres --single -P -d 1
2019-03-12 11:17:16.677 CST [1092] DEBUG:  00000: mmap(12998148096) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2019-03-12 11:17:16.677 CST [1092] LOCATION:  CreateAnonymousSegment, pg_shmem.c:485
2019-03-12 11:17:16.759 CST [1092] NOTICE:  00000: database system was shut down at 2019-03-12 11:16:54 CST
2019-03-12 11:17:16.759 CST [1092] LOCATION:  StartupXLOG, xlog.c:6363
2019-03-12 11:17:16.759 CST [1092] DEBUG:  00000: checkpoint record is at 2/67000028
2019-03-12 11:17:16.759 CST [1092] LOCATION:  StartupXLOG, xlog.c:6646
2019-03-12 11:17:16.760 CST [1092] DEBUG:  00000: redo record is at 2/67000028; shutdown true
2019-03-12 11:17:16.760 CST [1092] LOCATION:  StartupXLOG, xlog.c:6724
2019-03-12 11:17:16.760 CST [1092] DEBUG:  00000: next transaction ID: 0:46060157; next OID: 36864
2019-03-12 11:17:16.760 CST [1092] LOCATION:  StartupXLOG, xlog.c:6728
2019-03-12 11:17:16.760 CST [1092] DEBUG:  00000: next MultiXactId: 1; next MultiXactOffset: 0
2019-03-12 11:17:16.760 CST [1092] LOCATION:  StartupXLOG, xlog.c:6731
2019-03-12 11:17:16.760 CST [1092] DEBUG:  00000: oldest unfrozen transaction ID: 561, in database 1
2019-03-12 11:17:16.760 CST [1092] LOCATION:  StartupXLOG, xlog.c:6734
2019-03-12 11:17:16.760 CST [1092] DEBUG:  00000: oldest MultiXactId: 1, in database 1
2019-03-12 11:17:16.760 CST [1092] LOCATION:  StartupXLOG, xlog.c:6737
2019-03-12 11:17:16.760 CST [1092] DEBUG:  00000: commit timestamp Xid oldest/newest: 0/0
2019-03-12 11:17:16.760 CST [1092] LOCATION:  StartupXLOG, xlog.c:6741
2019-03-12 11:17:16.760 CST [1092] DEBUG:  00000: transaction ID wrap limit is 2147484208, limited by database with OID 1
2019-03-12 11:17:16.760 CST [1092] LOCATION:  SetTransactionIdLimit, varsup.c:368
2019-03-12 11:17:16.760 CST [1092] DEBUG:  00000: MultiXactId wrap limit is 2147483648, limited by database with OID 1
2019-03-12 11:17:16.760 CST [1092] LOCATION:  SetMultiXactIdLimit, multixact.c:2269
2019-03-12 11:17:16.760 CST [1092] DEBUG:  00000: starting up replication slots
2019-03-12 11:17:16.760 CST [1092] LOCATION:  StartupReplicationSlots, slot.c:1110
2019-03-12 11:17:16.760 CST [1092] DEBUG:  00000: MultiXactId wrap limit is 2147483648, limited by database with OID 1
2019-03-12 11:17:16.760 CST [1092] LOCATION:  SetMultiXactIdLimit, multixact.c:2269
2019-03-12 11:17:16.760 CST [1092] DEBUG:  00000: MultiXact member stop limit is now 4294757632 based on MultiXact 1
2019-03-12 11:17:16.760 CST [1092] LOCATION:  SetOffsetVacuumLimit, multixact.c:2632

PostgreSQL stand-alone backend 11.0
backend> reindex table pg_class;
2019-03-12 11:18:34.181 CST [1092] DEBUG:  00000: building index "pg_class_oid_index" on table "pg_class" serially
2019-03-12 11:18:34.181 CST [1092] LOCATION:  index_build, index.c:2297
2019-03-12 11:18:34.188 CST [1092] DEBUG:  00000: building index "pg_class_relname_nsp_index" on table "pg_class" serially
2019-03-12 11:18:34.188 CST [1092] LOCATION:  index_build, index.c:2297
2019-03-12 11:18:34.191 CST [1092] DEBUG:  00000: building index "pg_class_tblspc_relfilenode_index" on table "pg_class" serially
2019-03-12 11:18:34.191 CST [1092] LOCATION:  index_build, index.c:2297
backend> 2019-03-12 11:18:47.832 CST [1092] NOTICE:  00000: shutting down
2019-03-12 11:18:47.832 CST [1092] LOCATION:  ShutdownXLOG, xlog.c:8459
2019-03-12 11:18:47.986 CST [1092] LOG:  00000: checkpoint starting: shutdown immediate
2019-03-12 11:18:47.986 CST [1092] LOCATION:  LogCheckpointStart, xlog.c:8508
2019-03-12 11:18:47.988 CST [1092] DEBUG:  00000: performing replication slot checkpoint
2019-03-12 11:18:47.988 CST [1092] LOCATION:  CheckPointReplicationSlots, slot.c:1074
2019-03-12 11:18:48.000 CST [1092] DEBUG:  00000: checkpoint sync: number=1 file=base/13219/1259 time=1.022 msec
2019-03-12 11:18:48.000 CST [1092] LOCATION:  mdsync, md.c:1251
2019-03-12 11:18:48.004 CST [1092] LOG:  00000: checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.010 s, sync=0.001 s, total=0.019 s; sync files=1, longest=0.001 s, average=0.001 s; distance=16384 kB, estimate=16384 kB
2019-03-12 11:18:48.004 CST [1092] LOCATION:  LogCheckpointEnd, xlog.c:8590
2019-03-12 11:18:48.029 CST [1092] NOTICE:  00000: database system is shut down
2019-03-12 11:18:48.029 CST [1092] LOCATION:  UnlinkLockFiles, miscinit.c:860
$ pg_ctl start
waiting for server to start....2019-03-12 11:18:53.104 CST [1185] LOG:  00000: listening on IPv4 address "0.0.0.0", port 1921
2019-03-12 11:18:53.104 CST [1185] LOCATION:  StreamServerPort, pqcomm.c:593
2019-03-12 11:18:53.104 CST [1185] LOG:  00000: listening on IPv6 address "::", port 1921
2019-03-12 11:18:53.104 CST [1185] LOCATION:  StreamServerPort, pqcomm.c:593
2019-03-12 11:18:53.107 CST [1185] LOG:  00000: listening on Unix socket "./.s.PGSQL.1921"
2019-03-12 11:18:53.107 CST [1185] LOCATION:  StreamServerPort, pqcomm.c:587
2019-03-12 11:18:53.191 CST [1185] LOG:  00000: redirecting log output to logging collector process
2019-03-12 11:18:53.191 CST [1185] HINT:  Future log output will appear in directory "log".
2019-03-12 11:18:53.191 CST [1185] LOCATION:  SysLogger_Start, syslogger.c:667
 done
server started

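The output shows that the pg_class indexes have been rebuilt. After that, start the database normally and it is back in working order. This test used an index on a system catalog. For user-defined (non-system) objects, even a deleted file causes no error when the database starts or when you connect; the error appears only when the object is used. If the damaged object is an index and the disk has no bad sectors, running REINDEX after the error is usually enough.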
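
The interactive session above can also be run without typing at the backend> prompt, because the single-user backend reads its commands from standard input. A minimal sketch, assuming the server is stopped and PGDATA points at the data directory; `reindex_in_single_user_mode` and the `POSTGRES_BIN` override are hypothetical names introduced here for illustration:

```shell
# Hypothetical wrapper: rebuild the indexes of one table in single-user mode.
# POSTGRES_BIN is an assumption of this sketch so the binary can be overridden.
reindex_in_single_user_mode() {
    table=$1
    database=${2:-postgres}
    # -P makes the backend ignore system indexes while reading the catalogs,
    # which is what lets a damaged catalog index be rebuilt.
    printf 'REINDEX TABLE %s;\n' "$table" |
        "${POSTGRES_BIN:-postgres}" --single -P "$database"
}

# Example (commented out because it needs a stopped cluster):
# reindex_in_single_user_mode pg_class postgres
```

The backend exits when it reaches end of input, which triggers the same shutdown checkpoint seen in the log above.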

2. Restore from a physical backup

Move the old data directory aside with mv, create a new data directory, then restore the backup into it and start the database.
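
These steps can be sketched as follows. This is only an illustration under stated assumptions: the backup is a single tar archive of the data directory (for example the base.tar written by pg_basebackup in tar format), the server is already stopped, and `restore_from_tar_backup` is a hypothetical helper name. The WAL needed to make the backup consistent must also be available; it is not handled here.

```shell
# Hypothetical helper: move a damaged data directory aside and unpack a
# tar-format base backup in its place. Stop the server before calling it.
restore_from_tar_backup() {
    data_dir=${1%/}          # drop a trailing slash, if any
    backup_tar=$2
    aside="${data_dir}.broken.$(date +%Y%m%d%H%M%S)"

    [ -f "$backup_tar" ] || { echo "backup not found: $backup_tar" >&2; return 1; }
    mv "$data_dir" "$aside" || return 1     # keep the old files for later analysis
    mkdir "$data_dir" || return 1
    chmod 700 "$data_dir"                   # the data directory must not be world-accessible
    tar -xf "$backup_tar" -C "$data_dir" || return 1
    echo "old data directory kept at $aside"
}

# Example (commented out; the backup path is a placeholder):
# pg_ctl stop -m fast
# restore_from_tar_backup "$PGDATA" /path/to/base.tar
# pg_ctl start
```

Keeping the damaged directory instead of deleting it leaves room to investigate the original failure later.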

3. You may have hit a PostgreSQL bug; try upgrading to the latest minor release of your major version.
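
To see which major series you are on and how far behind its latest minor release you might be, split the version number first. A minimal sketch for release version numbers only (not beta or development builds); `split_pg_version` is a hypothetical helper name:

```shell
# Hypothetical helper: split a PostgreSQL release version into the major
# series and the minor release. From version 10 on, the first number is the
# major version; before that, the first two numbers are.
split_pg_version() {
    version=$1
    first=${version%%.*}
    if [ "$first" -ge 10 ]; then
        major=$first
    else
        rest=${version#*.}
        major="$first.${rest%%.*}"
    fi
    minor=${version##*.}
    echo "major=$major minor=$minor"
}

split_pg_version 11.0    # the version used in this article

# Example with a live installation (commented out):
# split_pg_version "$(postgres --version | awk '{print $3}')"
```

Compare the minor number with the latest release listed for that major series on the PostgreSQL website.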

4. Search the mailing lists, or ask for help there.

5. Modify the source code, downgrading the error to a warning so that the database can start.

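I searched the source and changed a few places, but too many places raise the same kind of error, so I did not change them all. Below is the error reported after changing a few of them.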

$ /opt/pgsql11_modify/bin/psql
WARNING:  could not read block 1 in file "base/13219/36864": read only 32756 of 32768 bytes
psql: FATAL:  could not open file "base/13219/36864.1" (target block 131072): previous segment is only 1 blocks