中午12點半,接到了線上MongoDB 數據庫異常的告警通報:node
「CPU不間斷飆升到百分百,業務也相應出現了抖動現象。」mongodb
經過排查數據庫主節點的日誌,發現了這樣的一個慢語句:數據庫
2019-03-07T10:56:43.470+0000 I COMMAND [conn2544772] command nlp.ApplicationDevice appName: "nlp" command: find { find: "ApplicationDevice", filter: { appId: "Gf93VvCfOdttrxSOemt_03ff", tags.tagName: "pipeline", tags.tagValue: "multi", _id: { $gt: ObjectId('000000000000000000000000') } }, projection: { $sortKey: { $meta: "sortKey" } }, sort: { _id: 1 }, limit: 10, shardVersion: [ Timestamp 1000|1000, ObjectId('5c64f941c81e2b922e48e347') ] } planSummary: IXSCAN { appId: 1, tags.tagName: 1, tags.tagValue: 1, _id: 1 } keysExamined:1000002 docsExamined:1000001 hasSortStage:1 cursorExhausted:1 numYields:7829 nreturned:10 reslen:7102 locks:{ Global: { acquireCount: { r: 15660 } }, Database: { acquireCount: { r: 7830 } }, Collection: { acquireCount: { r: 7830 } } } protocol:op_command 4008ms
從語句中初步判斷,"keysExamined"和docsExamined 顯示掃描了100W 條記錄,其中也用到了下面的索引:數組
{ appId: 1, tags.tagName: 1, tags.tagValue: 1, _id: 1 }
跟研發兄弟確認事後,該查詢的目的是 找到某些應用下帶指定標籤的設備信息,按ID分段去獲取,每次只查詢10條。
關於索引的設計也已經確認過是最優的了,並且此前在開發環境中一直沒有出現過問題,不知道爲何此次就出問題了。app
接下來,看了下該集合的模型,大體是長下面的樣子:測試
/* 1 */ { "appId" : "Gf93VvCfOdttrxSOemt_03ff", "deviceId" : "bbc-lmc-03991933", "nodeType" : "FACTORY", "creationTime" : ISODate("2019-03-01T10:11:39.852Z"), "lastModifiedTime" : ISODate("2019-03-03T10:45:40.249Z"), "tags" : [ { "tagName" : "pipeline", "tagValue" : "multi", "tagType" : 1 } ], ... }
說明
除了其餘的屬性以外,tags字段採用了嵌套文檔數組的結構;
每個元素都對應了一個tag對象,包含 tagName/tagValue/tagType幾個字段。ui
而後是查詢的模式:設計
//過濾條件 { appId: "Gf93VvCfOdttrxSOemt_03ff", tags.tagName: "pipeline", tags.tagValue: "multi", _id: { $gt: ObjectId('000000000000000000000000') } //排序 sort: { _id: 1 } //限制條數 limit: 10
這從索引的前綴匹配來看,是應該沒有問題的,索引的定義以下所示:日誌
{ appId: 1, tags.tagName: 1, tags.tagValue: 1, _id: 1 }
爲了不對線上環境形成影響,咱們找了一個測試環境來作了嘗試復現,執行:code
db.ApplicationDevice.find({ "appId":"Gf93VvCfOdttrxSOemt_03ff", "tags.tagName":"pipeline", "tags.tagValue":"multi", _id:{$gt:ObjectId("000000000000000000000000")}}) .sort({"_id" : 1}) .explain()
結果卻跟線上的狀況不大同樣,此次選中的是**_id**索引!
"winningPlan" : { "stage" : "LIMIT", "limitAmount" : 10, "inputStage" : { "stage" : "SHARDING_FILTER", "inputStage" : { "stage" : "FETCH", "filter" : { "$and" : [ { "appId" : { "$eq" : "Gf93VvCfOdttrxSOemt_03ff" } }, { "tags.tagName" : { "$eq" : "pipeline" } }, { "tags.tagValue" : { "$eq" : "multi" } } ] }, "inputStage" : { "stage" : "IXSCAN", "keyPattern" : { "_id" : 1 }, "indexName" : "_id_", "isMultiKey" : false, "multiKeyPaths" : { "_id" : [ ] }, "isUnique" : true, "isSparse" : false, "isPartial" : false, "indexVersion" : 2, "direction" : "forward", "indexBounds" : { "_id" : [ "(ObjectId('000000000000000000000000'), ObjectId('ffffffffffffffffffffffff')]" ] } } } }
而一樣的是也掃描了100W+的記錄數,因而你們認爲可能索引的選擇器出了問題,但就算是選擇器的問題也仍然沒辦法解釋線上出現的現象(線上的索引但是命中的)
爲了一探究竟,咱們使用 hint 強制讓查詢命中 **appId_1_tags.tagName_1_tags.tagValue_1__id_1**這個索引:
db.ApplicationDevice.find({ "appId":"Gf93VvCfOdttrxSOemt_03ff", "tags.tagName":"pipeline","tags.tagValue":"multi", _id:{$gt:ObjectId("000000000000000000000000")}}) .sort({"_id" : 1}).limit(10) .hint("appId_1_tags.tagName_1_tags.tagValue_1__id_1") .explain("executionStats")
這一次的結果顯示確實命中了對應的索引:
"winningPlan" : { "stage" : "SORT", "sortPattern" : { "_id" : 1.0 }, "limitAmount" : 10, "inputStage" : { "stage" : "SORT_KEY_GENERATOR", "inputStage" : { "stage" : "FETCH", "filter" : { "tags.tagValue" : { "$eq" : "multi" } }, "inputStage" : { "stage" : "IXSCAN", "keyPattern" : { "appId" : 1.0, "tags.tagName" : 1.0, "tags.tagValue" : 1.0, "_id" : 1.0 }, "indexName" : "appId_1_tags.tagName_1_tags.tagValue_1__id_1" ... "executionStats" : { "executionSuccess" : true, "nReturned" : 10, "executionTimeMillis" : 3665, "totalKeysExamined" : 1000002, "totalDocsExamined" : 1000001, "executionStages" : { "stage" : "SORT", "nReturned" : 10, "executionTimeMillisEstimate" : 3513, "works" : 1000014, "sortPattern" : { "_id" : 1.0 }, "memUsage" : 6660, "memLimit" : 33554432, "limitAmount" : 10, "inputStage" : { "stage" : "SORT_KEY_GENERATOR", "nReturned" : 500001, "executionTimeMillisEstimate" : 3333, "works" : 1000003, "advanced" : 500001, "needTime" : 500001, "inputStage" : { "stage" : "FETCH", "filter" : { "tags.tagValue" : { "$eq" : "multi" } }, "nReturned" : 500001, "executionTimeMillisEstimate" : 3087, "works" : 1000002, "advanced" : 500001, "needTime" : 500000, "docsExamined" : 1000001, "alreadyHasObj" : 0, "inputStage" : { "stage" : "IXSCAN", "nReturned" : 1000001, "executionTimeMillisEstimate" : 1117, "works" : 1000002, "advanced" : 1000001, "keyPattern" : { "appId" : 1.0, "tags.tagName" : 1.0, "tags.tagValue" : 1.0, "_id" : 1.0 }, "indexName" : "appId_1_tags.tagName_1_tags.tagValue_1__id_1", "isMultiKey" : true, "multiKeyPaths" : { "appId" : [], "tags.tagName" : [ "tags" ], "tags.tagValue" : [ "tags" ], "_id" : [] }, "indexBounds" : { "appId" : [ "[\"Gf93VvCfOdttrxSOemt_03ff\", \"Gf93VvCfOdttrxSOemt_03ff\"]" ], "tags.tagName" : [ "[\"pipeline\", \"pipeline\"]" ], "tags.tagValue" : [ "[MinKey, MaxKey]" ], "_id" : [ "(ObjectId('000000000000000000000000'), ObjectId('ffffffffffffffffffffffff')]" ] }, "keysExamined" : 1000002, "dupsTested" : 1000001, } } }
然而,在整個執行過程當中(executionStats),出現了內存排序(SORT)。
並且,從一開始命中** appId_1_tags.tagName_1_tags.tagValue_1__id_1 **這個索引的階段中,就已經掃描了100W條記錄,簡直難以想象!
但同時,咱們也從indexBounds的指示中找到了端倪:
appId、tags.tagName 都命中了單值,在 tags.tagValue 的路徑節點上卻覆蓋了所有範圍!
**因爲中間索引節點出現了大範圍覆蓋,致使最終須要在內存中對大量的數據作 _id字段的排序**,這個就是致使慢操做的緣由!
既然從前面的分析中找到了問題的來源,咱們的推論以下:
再次拿出前面的查詢條件:
{ appId: "Gf93VvCfOdttrxSOemt_03ff", tags.tagName: "pipeline", tags.tagValue: "multi", _id: { $gt: ObjectId('000000000000000000000000') }
在索引的匹配中,只能單鍵命中tags.tagName: "pipeline" 這一個條件,那麼因爲 tags是一個嵌套文檔的數組,
對於上面的查詢,語義上是指那些 包含某個元素 可命中tagName,且包含某個元素 可命中 tagValue的文檔,這裏面並不要求 同一個元素同時命中tagName和tagValue。
但 MongoDB 在嵌套數組索引的構建上是按照同一個元素的字段組合去構建的。 關於這點,能夠參考下面的地址:
https://docs.mongodb.com/manual/core/index-multikey/#multikey-embedded-documents
對於數組元素的條件匹配,應該使用 $elemMatch,其用法可參考這裏
爲此,咱們構建了下面的查詢:
db.ApplicationDevice.find({ "appId":"Gf93VvCfOdttrxSOemt_03ff", "tags": {$elemMatch: { "tagName":"pipeline","tagValue":"multi" }}, _id:{$gt:ObjectId("000000000000000000000000")}}) .sort({"_id" : 1}).limit(10) .explain("executionStats")
執行後輸出以下:
"winningPlan" : { "stage" : "LIMIT", "limitAmount" : 10, "inputStage" : { "stage" : "IXSCAN", "keyPattern" : { "appId" : 1.0, "tags.tagName" : 1.0, "tags.tagValue" : 1.0, "_id" : 1.0 }, "indexName" : "appId_1_tags.tagName_1_tags.tagValue_1__id_1", } }, "executionStats" : { "executionSuccess" : true, "nReturned" : 10, "executionTimeMillis" : 3, "totalKeysExamined" : 10, "totalDocsExamined" : 10, "executionStages" : { "stage" : "LIMIT", "nReturned" : 10, "inputStage" : { "stage" : "FETCH", "filter" : {...}, "nReturned" : 10,, "inputStage" : { "stage" : "IXSCAN", "nReturned" : 10, "executionTimeMillisEstimate" : 0, "works" : 10, "advanced" : 10, "isEOF" : 0, "indexName" : "appId_1_tags.tagName_1_tags.tagValue_1__id_1", "isMultiKey" : true, "indexBounds" : { "appId" : [ "[\"Gf93VvCfOdttrxSOemt_03ff\", \"Gf93VvCfOdttrxSOemt_03ff\"]" ], "tags.tagName" : [ "[\"pipeline\", \"pipeline\"]" ], "tags.tagValue" : [ "[\"multi\", \"multi\"]" ], "_id" : [ "(ObjectId('000000000000000000000000'), ObjectId('ffffffffffffffffffffffff')]" ] } ...
這個結果是使人滿意的,除了自動命中合適的索引以外,這個查詢過程也達到了最優的路徑匹配,掃描記錄數才10條!
最後,根據該方案調整了查詢模式,線上的問題獲得恢復。
看似很簡單的一個查詢語句,沒想到會出現這麼大的坑,其實不管是做爲開發人員仍是DBA,都應當謹慎對待你的SQL。 重要的事情說三遍!!! SQl查詢上線前務必 explain、務必分析到位,這難道沒有道理?