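Recently I ran into a rather strange problem. While everyone else was napping, my phone suddenly rang. To avoid waking anyone, I picked it up and checked the monitoring alert, and to my dismay the database was down. This was a database server that had been running for a long time. I logged in to the server and tried to restart MySQL, but it failed with the error "Starting MySQL..... ERROR! The server quit without updating PID file (/usr/local/mysql/data/BigData_ZT_PY_92.pid).". I then went through the error log and tried other troubleshooting steps. In the middle of that, another monitoring alert came in: "xxx host has just been restarted". I tried to ping the host and got no reply. I was completely stunned: the server had rebooted itself for no apparent reason, and it went on to reboot several more times in a row. In the end we contacted the data-center staff and asked them to connect a monitor so we could see what was going on.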
After a lot of back and forth, the machine finally came back up and we started troubleshooting. The error log showed the following:
InnoDB: End of page dump
2018-05-23 21:10:08 7f6786710700 InnoDB: uncompressed page, stored checksum in field1 2222046951, calculated checksums for field1: crc32 2624418990, innodb 1255280539, none 3735928559, stored checksum in field2 1914065653, calculated checksums for field2: crc32 2624418990, innodb 3045085343, none 3735928559, page LSN 555 2748030571, low 4 bytes of LSN at page end 2748030571, page number (if stored to page already) 84692, space id (if created with >= MySQL-4.1.1 and stored already) 2618
InnoDB: Page may be an index page where index id is 8005
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 84692.
InnoDB: You may have to recover from a backup.
InnoDB: It is also possible that your operating
InnoDB: system has corrupted its own file cache
InnoDB: and rebooting your computer removes the
InnoDB: error.
InnoDB: If the corrupt page is an index page
InnoDB: you can also try to fix the corruption
InnoDB: by dumping, dropping, and reimporting
InnoDB: the corrupt table. You can use CHECK
InnoDB: TABLE to scan your table for corruption.
InnoDB: See also http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
InnoDB: Ending processing because of a corrupt database page.
2018-05-23 21:10:08 7f6786710700 InnoDB: Assertion failure in thread 140082613913344 in file buf0buf.cc line 4201
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
13:10:08 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=0
max_threads=1024
thread_count=0
connection_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 415416 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/usr/local/mysql/bin/mysqld(my_print_stacktrace+0x2c)[0x8f339c]
/usr/local/mysql/bin/mysqld(handle_fatal_signal+0x364)[0x66e3e4]
/lib64/libpthread.so.0(+0xf5e0)[0x7f6b9c5b45e0]
/lib64/libc.so.6(gsignal+0x37)[0x7f6b9b3ba1f7]
/lib64/libc.so.6(abort+0x148)[0x7f6b9b3bb8e8]
/usr/local/mysql/bin/mysqld[0xa9c5c5]
/usr/local/mysql/bin/mysqld[0xadecd6]
/usr/local/mysql/bin/mysqld[0xa400c8]
/lib64/libpthread.so.0(+0x7e25)[0x7f6b9c5ace25]
/lib64/libc.so.6(clone+0x6d)[0x7f6b9b47d34d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
180523 21:10:09 mysqld_safe mysqld from pid file /usr/local/mysql/data/BigData_ZT_PY_92.pid ended
180523 21:44:59 mysqld_safe Starting mysqld daemon with databases from /usr/local/mysql/data
2018-05-23 21:44:59 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
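The key point above is that InnoDB found a corrupt page: for page 84692 (space id 2618) the stored checksums do not match any of the calculated ones, so InnoDB stops processing and deliberately crashes with an assertion failure (signal 6). At first I read it as an error while processing rollback information; after looking into it further, it seemed likely that the binary data files had been damaged.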
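To make the checksum comparison in the log concrete, here is a minimal Python sketch that extracts the stored and calculated checksums from the page-dump line above (with the wrapped lines rejoined) and reports which algorithm, if any, matches both stored fields. It is a simplified check for illustration only: InnoDB's real validation also depends on innodb_checksum_algorithm and on legacy page formats, and the helper names are hypothetical.

```python
import re

# Page-dump line from the error log above (the wrapped lines rejoined).
LOG_LINE = (
    "InnoDB: uncompressed page, stored checksum in field1 2222046951, "
    "calculated checksums for field1: crc32 2624418990, innodb 1255280539, "
    "none 3735928559, stored checksum in field2 1914065653, "
    "calculated checksums for field2: crc32 2624418990, innodb 3045085343, "
    "none 3735928559, page LSN 555 2748030571, "
    "low 4 bytes of LSN at page end 2748030571, "
    "page number (if stored to page already) 84692"
)

ALGORITHMS = ("crc32", "innodb", "none")  # "none" is the magic value 0xDEADBEEF


def parse_checksums(line):
    """Return {field: {"stored": int, "crc32": int, "innodb": int, "none": int}}."""
    result = {}
    for field in ("field1", "field2"):
        stored = re.search(rf"stored checksum in {field} (\d+)", line)
        calc = re.search(
            rf"calculated checksums for {field}: "
            r"crc32 (\d+), innodb (\d+), none (\d+)",
            line,
        )
        if stored is None or calc is None:
            raise ValueError(f"no checksum data for {field}")
        values = {"stored": int(stored.group(1))}
        values.update(zip(ALGORITHMS, map(int, calc.groups())))
        result[field] = values
    return result


def matching_algorithms(checksums):
    """Algorithms whose calculated value equals the stored value in BOTH fields."""
    return [
        algo
        for algo in ALGORITHMS
        if all(checksums[f]["stored"] == checksums[f][algo] for f in ("field1", "field2"))
    ]


if __name__ == "__main__":
    checksums = parse_checksums(LOG_LINE)
    print("matching algorithms:", matching_algorithms(checksums))
    # prints: matching algorithms: []
```

An empty list means none of the crc32, innodb or none checksums agrees with what is stored on the page, which is exactly why InnoDB declared page 84692 corrupt.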
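So I decided to use InnoDB forced recovery.
Here is how it is used: add the option to the [mysqld] section of the configuration file: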
[mysqld]
innodb_force_recovery = 1
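Editing the file by hand is all that is needed here. If you want to script the toggle (setting the value before the restart and commenting it out again once recovery is done), a minimal Python sketch might look like this. set_force_recovery is a hypothetical helper; it assumes a simple my.cnf without !include directives and edits the text line by line rather than using configparser, so existing comments are preserved.

```python
def set_force_recovery(cnf_text, level):
    """Set innodb_force_recovery under [mysqld]; level=None comments it out."""
    if level is not None and not 0 <= level <= 6:
        raise ValueError("innodb_force_recovery must be between 0 and 6")
    new_line = f"innodb_force_recovery = {level}"
    out, section, handled = [], None, False
    for line in cnf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Leaving [mysqld] without having seen the option: add it here.
            if section == "mysqld" and not handled and level is not None:
                out.append(new_line)
                handled = True
            section = stripped[1:-1].strip()
            out.append(line)
            continue
        # Option files treat '-' and '_' in option names as equivalent.
        key = stripped.split("=", 1)[0].strip().replace("-", "_")
        if section == "mysqld" and key == "innodb_force_recovery":
            handled = True
            out.append(f"# {stripped}" if level is None else new_line)
            continue
        out.append(line)
    # [mysqld] was the last section in the file.
    if section == "mysqld" and not handled and level is not None:
        out.append(new_line)
    return "\n".join(out) + "\n"
```

Typical use: call set_force_recovery(text, 1) and write the result back before starting MySQL, then set_force_recovery(text, None) before the normal restart.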
Warning
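Only set innodb_force_recovery to a value greater than 0 in an emergency, so that you can start InnoDB and dump your tables. Before doing so, make sure you have a backup copy of your database in case you need to recreate it. Values of 4 or greater can permanently corrupt data files. Only use an innodb_force_recovery setting of 4 or greater on a production server instance after you have successfully tested the setting on a separate physical copy of your database. When forcing InnoDB recovery, you should always start with innodb_force_recovery=1 and only increase the value incrementally, as necessary.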
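innodb_force_recovery is 0 by default (normal startup without forced recovery). The permissible nonzero values are 1 to 6. A larger value includes the functionality of all smaller values; for example, a value of 3 includes all of the functionality of values 1 and 2.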
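If you are able to dump your tables with an innodb_force_recovery value of 3 or less, you are relatively safe: only some data on the corrupt individual pages is lost. A value of 4 or greater is considered dangerous because data files can be permanently corrupted. A value of 6 is considered drastic because database pages are left in an obsolete state, which in turn may introduce more corruption into B-trees and other database structures.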
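The escalation rule in the warning (start at 1, raise the value one step at a time, and only go to 4 or above after testing on a separate physical copy) can be sketched in Python as follows. escalation_plan and risk are hypothetical helpers that only encode the rules quoted above; they do not talk to MySQL.

```python
def risk(level):
    """Risk label for an innodb_force_recovery value, per the text above."""
    if not 0 <= level <= 6:
        raise ValueError("innodb_force_recovery must be between 0 and 6")
    if level == 0:
        return "normal startup"
    if level <= 3:
        return "relatively safe"  # only data on corrupt pages is lost
    if level <= 5:
        return "dangerous"  # data files can be permanently corrupted
    return "drastic"  # pages left obsolete, may corrupt B-trees further


def escalation_plan(max_level=6, tested_on_copy=False):
    """Values to try in order, starting at 1 and increasing one step at a time.

    Values of 4 or more are only included if they were tested on a
    separate physical copy of the database first.
    """
    plan = []
    for level in range(1, max_level + 1):
        if level >= 4 and not tested_on_copy:
            break
        plan.append((level, risk(level)))
    return plan


if __name__ == "__main__":
    for level, label in escalation_plan():
        print(level, label)
```

In practice you stop at the first value that lets mysqld start and lets you dump the tables; in this incident that was already 1.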
As a safety measure, InnoDB prevents INSERT, UPDATE, and DELETE operations when innodb_force_recovery is greater than 0. As of MySQL 5.6.15, setting innodb_force_recovery to 4 or greater places InnoDB in read-only mode.
1 (SRV_FORCE_IGNORE_CORRUPT)
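Lets the server run even if it detects a corrupt page. Tries to make SELECT * FROM tbl_name skip over corrupt index records and pages, which helps in dumping tables.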
2 (SRV_FORCE_NO_BACKGROUND)
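Prevents the master thread and any purge threads from running. If a crash would occur during the purge operation, this recovery value prevents it.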
3 (SRV_FORCE_NO_TRX_UNDO)
Does not run transaction rollbacks after crash recovery.
4 (SRV_FORCE_NO_IBUF_MERGE)
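Prevents insert buffer merge operations; if they would cause a crash, they are not performed. Does not calculate table statistics. This value can permanently corrupt data files. After using this value, be prepared to drop and recreate all secondary indexes. As of MySQL 5.6.15, sets InnoDB to read-only.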
5 (SRV_FORCE_NO_UNDO_LOG_SCAN)
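Does not look at undo logs when starting the database: InnoDB treats even incomplete transactions as committed. This value can permanently corrupt data files. As of MySQL 5.6.15, sets InnoDB to read-only.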
6 (SRV_FORCE_NO_LOG_REDO)
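Does not do the redo log roll-forward in connection with recovery. This value can permanently corrupt data files. It leaves database pages in an obsolete state, which in turn may introduce more corruption into B-trees and other database structures. As of MySQL 5.6.15, sets InnoDB to read-only.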
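Because a larger value includes the behavior of all smaller ones, the list above can be modelled as a cumulative set of flags. A minimal Python sketch: the enum reuses the SRV_FORCE_* names from the list, while active_behaviors and innodb_read_only are hypothetical helpers, and versions are assumed to be (major, minor, patch) tuples.

```python
from enum import IntEnum


class ForceRecovery(IntEnum):
    SRV_FORCE_IGNORE_CORRUPT = 1
    SRV_FORCE_NO_BACKGROUND = 2
    SRV_FORCE_NO_TRX_UNDO = 3
    SRV_FORCE_NO_IBUF_MERGE = 4
    SRV_FORCE_NO_UNDO_LOG_SCAN = 5
    SRV_FORCE_NO_LOG_REDO = 6


def active_behaviors(level):
    """All behaviors enabled by a given value (larger values include smaller ones)."""
    if not 0 <= level <= 6:
        raise ValueError("innodb_force_recovery must be between 0 and 6")
    return [flag for flag in ForceRecovery if flag <= level]


def innodb_read_only(level, version):
    """As of MySQL 5.6.15, values of 4 or more put InnoDB in read-only mode."""
    return level >= 4 and version >= (5, 6, 15)


if __name__ == "__main__":
    print([flag.name for flag in active_behaviors(3)])
    # prints: ['SRV_FORCE_IGNORE_CORRUPT', 'SRV_FORCE_NO_BACKGROUND', 'SRV_FORCE_NO_TRX_UNDO']
```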
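You can SELECT from tables to dump them. With an innodb_force_recovery value of 3 or less you can DROP or CREATE tables. As of MySQL 5.6.27, DROP TABLE is also supported with innodb_force_recovery values greater than 3.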
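If you know that a given table is causing a crash on rollback, you can drop it. If you encounter a runaway rollback caused by a failing mass import or ALTER TABLE, you can kill the mysqld process and set innodb_force_recovery to 3 to bring the database up without the rollback, and then DROP the table that is causing the runaway rollback.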
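The statement rules from these paragraphs, together with the earlier restriction on INSERT, UPDATE and DELETE, can be collected into one small Python function. is_allowed is a hypothetical helper covering only the statements discussed here for MySQL 5.6; even where it returns True, complex SELECTs may still fail on badly damaged tables.

```python
def is_allowed(statement, level, version=(5, 6, 27)):
    """Whether a statement type is permitted at a given innodb_force_recovery value."""
    stmt = statement.strip().upper()
    if level == 0:
        return True  # normal startup, no restrictions
    if stmt == "SELECT":
        return True  # dumping tables is the point of forced recovery
    if stmt in ("INSERT", "UPDATE", "DELETE"):
        return False  # blocked whenever the value is greater than 0
    if stmt == "CREATE TABLE":
        return level <= 3
    if stmt == "DROP TABLE":
        # As of 5.6.27, DROP TABLE also works with values greater than 3.
        return level <= 3 or version >= (5, 6, 27)
    raise ValueError(f"this sketch has no rule for {statement!r}")
```

For the runaway-rollback case, is_allowed("DROP TABLE", 3) is True on any 5.6 release, which is why 3 is the value suggested there.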
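If corruption within the table data prevents you from dumping the entire table contents, a query with an ORDER BY primary_key DESC clause might be able to dump the portion of the table after the corrupted part.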
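The idea is to read the table twice: once with a plain SELECT, which stops at the corrupt region, and once with ORDER BY primary_key DESC, which reaches the same region from the other end. A minimal Python sketch of stitching the two partial reads together; the table name t, the key column id and the helper names are hypothetical, and the row lists stand in for whatever your client returned before each query failed.

```python
def salvage_queries(table, pk):
    """Forward and reverse dump queries for a hypothetical table and key column."""
    return (
        f"SELECT * FROM {table}",
        f"SELECT * FROM {table} ORDER BY {pk} DESC",
    )


def merge_partial_dumps(forward_rows, reverse_rows, pk="id"):
    """Combine both partial reads, de-duplicated and sorted by primary key."""
    by_key = {}
    for row in forward_rows + reverse_rows:
        by_key.setdefault(row[pk], row)
    return [by_key[key] for key in sorted(by_key)]


def lost_key_range(forward_rows, reverse_rows, pk="id"):
    """Keys strictly between the returned bounds could not be read.

    Returns None when the two scans overlap, i.e. nothing was lost.
    """
    if not forward_rows or not reverse_rows:
        raise ValueError("both partial reads must contain at least one row")
    last_forward = max(row[pk] for row in forward_rows)
    first_reverse = min(row[pk] for row in reverse_rows)
    return (last_forward, first_reverse) if last_forward < first_reverse else None
```

Reporting the lost key range makes it easier to decide whether the missing rows can be recovered from a backup or a replica.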
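If a high innodb_force_recovery value is required to start InnoDB, there may be corrupted data structures that cause complex queries (queries containing WHERE, ORDER BY, or other clauses) to fail. In this case, you may only be able to run basic SELECT * FROM t queries.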
Then start the database:
[root@databases ~]# /etc/init.d/mysql start
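Once the database was up, I logged in and ran SHOW SLAVE STATUS\G, which showed that replication on this slave had not started. I then commented out innodb_force_recovery = 1 in /etc/my.cnf and restarted the database, and after that everything was fine.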
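Checking replication after a forced-recovery restart is easy to script. A minimal Python sketch that parses the vertical output (for example as captured from mysql -e "SHOW SLAVE STATUS\G") and checks the Slave_IO_Running and Slave_SQL_Running fields; the helper names are hypothetical.

```python
def parse_vertical_output(text):
    """Parse one row of MySQL's vertical-format output into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("*"):
            continue  # skip blank lines and the "*** 1. row ***" banner
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def replication_running(status):
    """Both replication threads must report Yes."""
    return (
        status.get("Slave_IO_Running") == "Yes"
        and status.get("Slave_SQL_Running") == "Yes"
    )
```

Splitting only on the first colon keeps values that themselves contain colons, such as error messages, intact.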
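Later investigation suggested that a hardware fault on the server had stopped the database and may also have damaged the binary data files.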
In addition, /etc/my.cnf contains the following setting:
innodb_flush_log_at_trx_commit = 1
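# Key parameter. 0: the log is written and flushed to disk roughly once per second, so a database crash can lose about 1 second of transactions. 1: the log is written and flushed to disk at every transaction commit; this has high I/O overhead because each commit waits for the log write, so throughput is low. 2: the log is only written to the OS cache at each commit and flushed to disk roughly once per second; this is very efficient, and transaction data is lost only if the server itself (OS or hardware) fails.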
With a value of 1, I/O performance on this host would be very poor, so this host can only use 2.
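The trade-off between the three values can be summarized as the worst-case loss of committed transactions for two kinds of failure. A minimal Python sketch based on the description above; the roughly once-per-second flush interval is controlled by innodb_flush_log_at_timeout (default 1 second, MySQL 5.6.6 and later), and "none" assumes the storage actually honors fsync. It also shows why 2 is a compromise on this particular host: a hardware fault that reboots the machine, as happened here, counts as an OS-level failure, so up to about a second of committed transactions can be lost.

```python
# When InnoDB writes the redo log to the OS cache, and when it flushes (fsyncs) it.
POLICY = {
    0: ("about once per second", "about once per second"),
    1: ("at each commit", "at each commit"),
    2: ("at each commit", "about once per second"),
}

FAILURES = ("mysqld_crash", "os_crash")  # os_crash also covers power loss / hardware reset


def worst_case_loss(value, failure):
    """Worst-case loss of committed transactions for innodb_flush_log_at_trx_commit."""
    if value not in POLICY or failure not in FAILURES:
        raise ValueError("unknown setting or failure type")
    written_each_commit = POLICY[value][0] == "at each commit"
    flushed_each_commit = POLICY[value][1] == "at each commit"
    if flushed_each_commit:
        return "none"
    if failure == "mysqld_crash" and written_each_commit:
        return "none"  # the OS cache still holds the log and flushes it later
    return "up to about 1 second"
```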