xpb備份過程FTWRL超時問題

現象:

PXC版本:5.5.34
xtrabackup版本:2.0.6

xtrabackup夜間備份失敗,查看備份日誌以下:
    ...
    ...
    >> log scanned up to (386464269338)
    >> log scanned up to (386464269338)
    >> log scanned up to (386464269338)
    >> log scanned up to (386464269338)
    innobackupex: Error: Connection to mysql child process (pid=14432) timedout. (Time limit of 900 seconds exceeded. You may adjust time limit by editing the value of parameter "$mysql_response_timeout" in this script.) while waiting for reply to MySQL request: 'FLUSH TABLES WITH READ LOCK;' at /usr/bin/innobackupex line 381.

分析排查:

發現備份開始時間爲00:12分,根據備份日誌文件時間戳發現innobackupex超時結束時間是01:50
懷疑是否在此期間有大量DML或者大事務的寫入致使FTWRL超時。此段時間binlog寫入較小,排查慢日誌發現此段時間有執行

# Time: 190627  2:00:21
User@Host: pull_data[pull_data] @  [10.x.x.x]
Thread_id: 6984770  Schema: as_tv_online  Last_errno: 0  Killed: 0
Query_time: 2190.463271  Lock_time: 0.000220  Rows_sent: 6791540  Rows_examined: 6791540  Rows_affected: 0  Rows_read: 6791540
Bytes_sent: 5164166545
SET timestamp=1561572021;
select `id`,`services_no`,`order_type`,`order_create_time`,`order_stage`,`order_status`,`source_doc_no`,`implement_no`,`source_doc_type`,`source_info`,`dispatch_info`,`urgency_level`,`order_level`,`service_level`,`upgrade_status`,`checkup_nature`,`checkup_result`,`visit_flag`,`complaint_level`,`present_vip_card`,`mall_wechat_no`,`mall_apply_order_no`,`mall_prolong_contractno`,`demand_desc`,`remarks`,`other_remarks`,`one_level_item`,`two_level_item`,`three_level_item`,`repair_time`,`req_service_begin_time`,`req_service_end_time`,`dispatch_time`,`add_msg_times`,`related_order`,`related_order_type`,`reminder_numbers`,`complaint_numbers`,`subprocess_flag`,`subprocess_numbers`,`net_id`,`net_full_path`,`org_full_path`,`services_id`,`end_userid`,`end_time`,`servicer_receive_flag`,`servicer_refuse_reason`,`cancel_reason`,`cancel_time`,`valid_flag`,`create_name`,`is_del`,`create_time`,`create_id`,`last_modify_time`,`last_modify_id`,`user_need`,`client`,`org_id`,`net_name`,`net_parent_name`,`net_super_name`,`org_name`,`one_level_item_name`,`two_level_item_name`,`three_level_item_name`,`end_user_name`,`settlement_name`,`app_service_type`,`cancel_reason_desc`,`base_fee_amt`,`reminder_time`,`reminder_desc`,`sync_flag`,`cancel_reason_name`,`repair_result`,`adnet_id`,`adnet_name`,`adnet_full_path`,`adnet_org_full_path`,`adorder_id`,`adorder_status`,`product_version`,`rma_no`,`last_modify_name` from t_ser_order_his where 1=1;

Time: 190627  2:04:05
User@Host: pull_data[pull_data] @  [10.x.x.x]
Thread_id: 6984771  Schema: as_tv_online  Last_errno: 0  Killed: 0
Query_time: 2408.901693  Lock_time: 0.000205  Rows_sent: 6609499  Rows_examined: 6609499  Rows_affected: 0  Rows_read: 6609499
Bytes_sent: 3631997475
SET timestamp=1561572245;
select `id`,`ser_order_id`,`service_result`,`org_code`,`handjob_time`,`emp_id`,`call_customer_time`,`change_appoint_status`,`change_appoint_time`,`one_legacy_reason`,`two_legacy_reason`,`detail_reason`,`receive_time`,`contact_express_flag`,`notice_express_time`,`pickup_pay_user`,`pickup_express_no`,`pickup_express_name`,`store_location`,`notice_user_time`,`cust_receive_time`,`send_pay_user`,`send_express_no`,`send_express_name`,`warranty_type`,`service_process_desc`,`one_level_code`,`two_level_code`,`reback_type`,`reback_reason`,`reback_fee_amt`,`fee_total_amt`,`reback_amt`,`reback_time`,`debug_flag`,`nodebug_reason`,`attachments_flag`,`dispatched_flag`,`todoor_flag`,`first_door_time`,`finish_time`,`fault_type`,`fault_type_desc`,`one_level_handle_type`,`two_level_handle_type`,`new_softversion`,`charge_flag`,`invoice_flag`,`charge_amt`,`base_fee_amt`,`msg_fee_amt`,`examine_fee_amt`,`special_approve_amt`,`enjoy_le_amt`,`contact_name`,`contact_phone`,`customer_reciews`,`remote_distance`,`remote_fee`,`appraisal_no`,`finish_order_time`,`old_ser_order_id`,`old_ser_order_no`,`old_ser_order_netid`,`old_ser_order_netcode`,`old_ser_order_empid`,`old_ser_order_empname`,`repeat_ser_order_id`,`repeat_ser_order_no`,`repeat_ser_order_netid`,`repeat_ser_order_netcode`,`repeat_ser_order_empid`,`repeat_ser_order_empname`,`le_pylons_flag`,`main_wall_flag`,`prolong_materiel_no`,`bak_address`,`second_unit_name`,`order_flag`,`collect_mac_type`,`net_id`,`net_full_path`,`is_del`,`create_time`,`create_id`,`last_modify_time`,`last_modify_id`,`charge_details`,`duty_judgment`,`one_level_handle_type_name`,`two_level_handle_type_name`,`one_level_code_name`,`two_level_code_name`,`one_legacy_reason_name`,`two_legacy_reason_name`,`receive_name`,`appoint_time`,`receive_detail`,`damage_reason`,`fault_type_name`,`fault_type_desc_name`,`emp_name`,`emp_mobile_phone`,`reback_reason_name`,`duty_judgment_name`,`send_type`,`ser_product_id`,`flag_rights`,`rights_order_id`,`rights_no`,`rights_name`,`rights_type`,`operate_date`,`rights_time`,`money_used`,`rights_amount`,`repair_result`,`screen_broken`,`is_to_door`,`delivery_user_time`,`return_warehouse_time`,`recv_new_tv_time`,`identi_create_time`,`fault_phenomena`,`is_fixed_screen`,`customer_provides_materials`,`current_reason`,`fixing_position` from t_ser_order_handle_his where 1=1;

User@Host: backup[backup] @  [127.0.0.1]
Thread_id: 6985104  Schema: mysql  Last_errno: 0  Killed: 0
Query_time: 1725.026800  Lock_time: 0.000000  Rows_sent: 0  Rows_examined: 0  Rows_affected: 0  Rows_read: 0
Bytes_sent: 11
use mysql;
SET timestamp=1561572245;   ------->轉換後爲02:04:05
FLUSH TABLES WITH READ LOCK;

經過監控發現DB活躍線程數在1:15分開始陡增,一點35分繼續陡增、1:36之後趨於平穩,到2:00和2:04分別階梯恢復。

問題定位:

經過慢日誌定位問題緣由
slowlog中發現,前邊兩條查詢分別跑了2190.463271s和2408.901693s、flush 跑了1725.026800s
定位慢SQL開始時間點都在1:20分左右。
定位FTWRL開始及結束時間:結束時間02:04:05(SET timestamp=1561572245轉換後)  開始時間:1:35分左右(根據執行時間1725.026800往前推28分鐘左右)
查看備份信息:備份日誌實際結束(備份未成功)時間爲1:50
經過以上時間段能夠初步定位慢查詢時間太久致使FTWRL超過900s失敗。

過程分析:

00:12:02時xtrabackup開始備份,先copy innodb ibd文件
01:20:00左右慢SQL跑起
01:35:00左右xtrabackup copy完成innodb ibd文件,加FTWRL(此過程需加全局鎖、關閉open tables、打上阻塞commit的標誌)。由於加全局鎖成功,但關閉open tables時,由於慢SQL一直未結束,一直夯住(01:35分以後全部的寫入都被阻塞,應用不停建立鏈接,thread running升高,與監控發現的1:35分開始線程陡增想吻合,2點和2點04慢SQL跑完,線程恢復正常)
01:50左右FTWRL鎖超時(900s),致使備份結束(與備份日誌結束時間吻合)

總結:

因爲太久慢SQL致使FTWRL 關閉open tables失敗,使得備份異常退出。
flush table with read lock失敗,thread running爲何沒有在flush失敗時間點降低?
    flush table with read lock 由xpb發起超時後xpb會結束,flush table with read lock會一直等待慢SQL執行完畢會話才結束,而後寫入慢日誌,
slowlog記錄FTWRL的時間爲什麼不是超時以後就記錄而是等待慢查詢結束以後再記錄?
    理解問題,xpb發起FTWRL命令後超時,XPB會結束,而發起的FTWRL命令會在會話層面一直等待慢日誌結束,而後退出會話
相關文章
相關標籤/搜索