When you need to iterate over every record in a collection, the first thing that comes to mind is Model.all.each. Once the data set grows large, though (tens of thousands of records?), this is no longer a good fit: Model.all.each loads every record in one go and instantiates each as a Model object, which obviously drives up memory usage and can even exhaust memory.
For ActiveRecord, find_each exists precisely for this problem. Under the hood, find_each relies on find_in_batches, which loads records in batches, 1000 per batch by default.
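A minimal sketch of what that looks like in practice (User and do_something are placeholders, not from the original post):

```ruby
# Yields users one at a time, while fetching them from the database
# in batches of 1000 (the default) behind the scenes.
User.find_each do |user|
  user.do_something # placeholder for the real per-record work
end

# Or take each batch as an array, with an explicit batch size.
User.find_in_batches(batch_size: 500) do |users|
  users.each(&:do_something)
end
```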
With Mongoid, you can actually just use Person.all.each: it automatically uses a cursor to load records in batches for you. There is one gotcha to watch out for, though: a cursor has a 10-minute idle timeout. That means an iteration running longer than 10 minutes is living dangerously and may well die partway through with a "no cursor" (CursorNotFound) error.
```ruby
# gems/mongo-2.2.4/lib/mongo/collection.rb:218
#
# @option options [ true, false ] :no_cursor_timeout The server normally times out idle cursors
#   after an inactivity period (10 minutes) to prevent excess memory use. Set this option to prevent that.
```
You can work around the timeout with Model.all.no_timeout.each, but I would not recommend it. Also, the default batch_size may not suit your workload; you can set it explicitly, like Model.all.no_timeout.batch_size(500).each. Be careful, though: MongoDB's default batch sizing is fairly involved. Quoting the MongoDB documentation:

(The MongoDB server returns the query results in batches. Batch size will not exceed the maximum BSON document size. For most queries, the first batch returns 101 documents or just enough documents to exceed 1 megabyte. Subsequent batch size is 4 megabytes. To override the default size of the batch, see batchSize() and limit(). For queries that include a sort operation without an index, the server must load all the documents in memory to perform the sort before returning any results. As you iterate through the cursor and reach the end of the returned batch, if there are more results, cursor.next() will perform a getmore operation to retrieve the next batch.)
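To make that concrete with an illustrative (assumed) document size: at roughly 16 KB per document, the first batch would return only about 64–65 documents (just enough to exceed 1 MB) rather than the 101-document default, and each subsequent batch about 256 documents (4 MB).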
So Model.all.each { print '.' } produces exactly this shape of traffic on the server: an initial query returning the first batch, followed by a getmore for each batch after that.
A superficially similar approach is to use skip and limit, something like Model.all.skip(m).limit(n). Unfortunately, this falls apart once the data volume is large, because queries get slower and slower as the skip value grows. Again from the MongoDB documentation: (The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As the offset (e.g. pageNumber above) increases, cursor.skip() will become slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.) This reminds me of a post I once read about will_paginate with far too many pages (around 10,000): clicking through the last few pages was noticeably slow, essentially because the pagination is built on offset underneath, and the larger the offset, the slower the query.
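The usual cure is range-based (keyset) pagination: remember the last id seen and query forward from it, so every page costs roughly the same. A minimal sketch with Mongoid (the page size and variable names are just for illustration):

```ruby
# Offset-based: the server must walk past 9,900 documents first.
page = Model.asc(:id).skip(9_900).limit(100).to_a

# Keyset-based: constant cost per page, backed by the index on _id.
last_id   = page.last.id
next_page = Model.gt(id: last_id).asc(:id).limit(100).to_a
```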
Which brings us back to ActiveRecord's find_each. Rails thought this through: under the hood it does not use offset at all. Instead, it takes the id of the last record in each batch and uses it as the primary_key_offset for the next batch's query:
```ruby
# gems/activerecord-4.2.5.1/lib/active_record/relation/batches.rb:98
def find_in_batches(options = {})
  options.assert_valid_keys(:start, :batch_size)

  relation = self
  start = options[:start]
  batch_size = options[:batch_size] || 1000

  unless block_given?
    return to_enum(:find_in_batches, options) do
      total = start ? where(table[primary_key].gteq(start)).size : size
      (total - 1).div(batch_size) + 1
    end
  end

  if logger && (arel.orders.present? || arel.taken.present?)
    logger.warn("Scoped order and limit are ignored, it's forced to be batch order and batch size")
  end

  relation = relation.reorder(batch_order).limit(batch_size)
  records = start ? relation.where(table[primary_key].gteq(start)).to_a : relation.to_a

  while records.any?
    records_size = records.size
    primary_key_offset = records.last.id
    raise "Primary key not included in the custom select clause" unless primary_key_offset

    yield records

    break if records_size < batch_size
    records = relation.where(table[primary_key].gt(primary_key_offset)).to_a
  end
end
```
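Two details of this implementation are worth pointing out: reorder(batch_order) forces the relation to be ordered by primary key (any scoped order or limit is deliberately ignored, with a warning), which is what makes the id > primary_key_offset condition a correct, index-friendly way to resume; and the loop breaks as soon as a batch comes back smaller than batch_size, skipping one final empty query.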
The Model.all.no_timeout.batch_size(1000).each shown earlier is server-side batching. We can imitate it on the client side as well, i.e., build a Mongoid version of find_each:
```ruby
# /config/initializers/mongoid_batches.rb
module Mongoid
  module Batches
    def find_each(batch_size = 1000)
      return to_enum(:find_each, batch_size) unless block_given?

      find_in_batches(batch_size) do |documents|
        documents.each { |document| yield document }
      end
    end

    def find_in_batches(batch_size = 1000)
      return to_enum(:find_in_batches, batch_size) unless block_given?

      documents = self.limit(batch_size).asc(:id).to_a
      while documents.any?
        documents_size = documents.size
        primary_key_offset = documents.last.id

        yield documents

        break if documents_size < batch_size
        documents = self.gt(id: primary_key_offset).limit(batch_size).asc(:id).to_a
      end
    end
  end
end

Mongoid::Criteria.include Mongoid::Batches
```
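With this initializer in place, any criteria gains the two methods. A quick usage sketch (Person and the query are placeholders):

```ruby
Person.where(:age.gte => 21).find_each do |person|
  # handle one document at a time; only one batch is held in memory
end

Person.all.find_in_batches(500) do |people|
  # handle 500 documents per round trip
end
```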
Finally, for time-consuming work you can also consider throwing parallelism at each batch, along these lines (Parallel comes from the parallel gem):
```ruby
Model.all.find_each { ... }

Model.all.find_in_batches do |items|
  Parallel.each items, in_processes: 4 do |item|
    # ...
  end
end
```

(Note that since the module above is only included into Mongoid::Criteria, the methods are called through a criteria such as Model.all.)
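One caveat with in_processes: Parallel forks worker processes, and database connections generally should not be shared across a fork. Depending on your driver version you may need to re-establish the MongoDB connection inside each worker; treat this as something to verify against your own setup rather than a given.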