

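Many Elasticsearch users care about how much storage their data will occupy in ES, and often ask: how much disk space will xx TB of data take once it is indexed into ES? This is hard to answer up front; the real footprint can only be observed after the data has been written. For the same 1 TB of source data, the size in ES can vary enormously: it may shrink to 300-400 GB, or grow to 6-7 TB. Why such a large gap? To find out, let us look at how Elasticsearch stores data. This article uses Elasticsearch 2.3 as the example, which corresponds to Lucene 5.5. Elasticsearch has since reached version 6.5, and storage structures such as numeric types and columnar storage have changed somewhat, but the basic concepts have changed little and the content here still applies.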







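Each shard corresponds to one Lucene index. On top of that, Elasticsearch adds a translog per shard, similar to the HBase WAL, which holds intermediate data during the write path; all other data is managed inside the Lucene index.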





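  • segment: Data inside Lucene is organized as a series of segments. Data written to Lucene is not flushed to disk immediately; it is first buffered in memory, and after each refresh interval Lucene turns everything written during that period into a new segment. As segments accumulate, they are merged into larger segments. A Lucene query visits every segment. Because writes complete in memory, write throughput is very high, but there is a risk of losing data. Elasticsearch therefore implements the translog, and only deletes the corresponding translog after the segment data has been persisted to disk.
  • doc: a doc is one record in Lucene.
  • field: a field is one column of a record; a doc consists of several fields.
  • term: the term is the smallest unit of indexing in Lucene. If a field is a full-text field, its content is analyzed (tokenized) and the resulting tokens are the terms. If the field is not analyzed, the whole field value is a single term.
  • inverted index: the general name for Lucene's index, i.e. the mapping from a term to a doc list.
  • forward (stored) data: the general search-engine name for the original data; it can be thought of as a doc list.
  • docvalues: the name of the columnar storage in Elasticsearch. Besides the original data and the inverted index, Elasticsearch also stores a copy of the data as docvalues, used for analytics (aggregations) and sorting.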
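
To make the difference between forward data and the inverted index concrete, here is a minimal sketch of both views over a single not-analyzed field. It is a toy model written for this article, not Lucene code: the class name `IndexConceptsSketch` and its methods are hypothetical, and each document is assumed to carry exactly one value, which is therefore exactly one term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

/** Toy model of forward data and an inverted index for one not-analyzed field. Not Lucene code. */
final class IndexConceptsSketch {
    /** Forward (stored) data: the list index is the doc id, the element is the original value. */
    final List<String> stored = new ArrayList<>();
    /** Inverted index: term -> ascending list of doc ids that contain the term. */
    final TreeMap<String, List<Integer>> inverted = new TreeMap<>();

    /** Adds one single-valued doc and returns the doc id assigned to it. */
    int addDoc(String value) {
        int docId = stored.size();
        stored.add(value);
        // Not analyzed: the whole value is one term.
        inverted.computeIfAbsent(value, t -> new ArrayList<>()).add(docId);
        return docId;
    }

    /** Term lookup: which docs contain this term? */
    List<Integer> search(String term) {
        return inverted.getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        IndexConceptsSketch idx = new IndexConceptsSketch();
        idx.addDoc("red");
        idx.addDoc("blue");
        idx.addDoc("red");
        System.out.println(idx.search("red"));  // doc ids for the term "red"
        System.out.println(idx.stored.get(1));  // original value of doc 1
    }
}
```

The point of the sketch is only the direction of the two mappings: a query goes term -> docs through the inverted index, while fetching a hit goes doc -> content through the forward data.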



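Name | Extension | Brief Description
Segment Info | .si | Segment metadata file
Compound File | .cfs, .cfe | A segment consists of the files listed in this table. To reduce the number of open files, when a segment is small all of its files are packed into the .cfs file; the .cfe file records where each Lucene file sits inside the .cfs file
Fields | .fnm | Information about the fields
Field Index | .fdx | Index (metadata) for the stored-field data file
Field Data | .fdt | Stored-field (forward) data; the original documents are stored here
Term Dictionary | .tim | The term dictionary; stores the terms of the inverted index and their metadata
Term Index | .tip | The index into the term dictionary
Frequencies | .doc | The doc id list of each term and the term's frequency within each doc
Positions | .pos | Stores position information about where a term occurs in the index
Payloads | .pay | Stores additional per-position metadata information such as character offsets and user payloads
Norms | .nvd, .nvm | Weighting (boost and length normalization) data for indexed fields
Per-Document Values | .dvd, .dvm | Lucene's docvalues files, i.e. the columnar storage used for aggregations and sorting
Term Vector Data | .tvx, .tvd, .tvf | Term vector files; the .tvx index stores offsets into the document data file
Live Documents | .liv | Records which docs in the segment have been deleted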






PUT test_field
{
  "settings": {
    "index": {
      "number_of_shards": "1",
      "number_of_replicas": "0",
      "refresh_interval": "30s"
    }
  },
  "mappings": {
    "type": {
      "_all": {
        "enabled": false
      },
      "properties": {
        "uuid": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}


health status index      pri rep docs.count docs.deleted store.size pri.store.size 
green  open   test_field   1   0    1000000            0    122.7mb        122.7mb 

-rw-r--r--  1 weizijun  staff    41M Aug 19 21:23 _8.fdt
-rw-r--r--  1 weizijun  staff    17K Aug 19 21:23 _8.fdx
-rw-r--r--  1 weizijun  staff   688B Aug 19 21:23 _8.fnm
-rw-r--r--  1 weizijun  staff   494B Aug 19 21:23 _8.si
-rw-r--r--  1 weizijun  staff   265K Aug 19 21:23 _8_Lucene50_0.doc
-rw-r--r--  1 weizijun  staff    44M Aug 19 21:23 _8_Lucene50_0.tim
-rw-r--r--  1 weizijun  staff   340K Aug 19 21:23 _8_Lucene50_0.tip
-rw-r--r--  1 weizijun  staff    37M Aug 19 21:23 _8_Lucene54_0.dvd
-rw-r--r--  1 weizijun  staff   254B Aug 19 21:23 _8_Lucene54_0.dvm
-rw-r--r--  1 weizijun  staff   195B Aug 19 21:23 segments_2
-rw-r--r--  1 weizijun  staff     0B Aug 19 21:20 write.lock




health status index      pri rep docs.count docs.deleted store.size pri.store.size 
green  open   test_field   1   0    1000000            0     13.2mb         13.2mb 

-rw-r--r--  1 weizijun  staff   5.5M Aug 19 21:29 _6.fdt
-rw-r--r--  1 weizijun  staff    15K Aug 19 21:29 _6.fdx
-rw-r--r--  1 weizijun  staff   688B Aug 19 21:29 _6.fnm
-rw-r--r--  1 weizijun  staff   494B Aug 19 21:29 _6.si
-rw-r--r--  1 weizijun  staff   309K Aug 19 21:29 _6_Lucene50_0.doc
-rw-r--r--  1 weizijun  staff   7.0M Aug 19 21:29 _6_Lucene50_0.tim
-rw-r--r--  1 weizijun  staff   195K Aug 19 21:29 _6_Lucene50_0.tip
-rw-r--r--  1 weizijun  staff   244K Aug 19 21:29 _6_Lucene54_0.dvd
-rw-r--r--  1 weizijun  staff   252B Aug 19 21:29 _6_Lucene54_0.dvm
-rw-r--r--  1 weizijun  staff   195B Aug 19 21:29 segments_2
-rw-r--r--  1 weizijun  staff     0B Aug 19 21:26 write.lock


So in Elasticsearch, the higher the cardinality (count distinct) of an indexed field, the more disk space it consumes.
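
A rough way to see why cardinality matters is to count how many bytes of distinct terms a field produces, since every distinct term needs its own entry in the term dictionary. The sketch below is a back-of-the-envelope model only: `CardinalitySketch`, `distinctTermBytes` and the synthetic `value-%08d` terms are hypothetical names invented for illustration, and the estimate ignores the prefix compression that Lucene applies in practice.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.TreeSet;

/** Back-of-the-envelope model: raw bytes of the distinct terms of one field. */
final class CardinalitySketch {
    /**
     * Generates numDocs synthetic values that cycle through `cardinality` distinct
     * terms and returns the total UTF-8 size of the distinct terms.
     */
    static long distinctTermBytes(int numDocs, int cardinality) {
        TreeSet<String> terms = new TreeSet<>();
        for (int doc = 0; doc < numDocs; doc++) {
            // Hypothetical synthetic value; each term is exactly 14 ASCII bytes.
            terms.add(String.format(Locale.ROOT, "value-%08d", doc % cardinality));
        }
        long bytes = 0;
        for (String term : terms) {
            bytes += term.getBytes(StandardCharsets.UTF_8).length;
        }
        return bytes;
    }

    public static void main(String[] args) {
        int numDocs = 100_000;
        System.out.println("high cardinality: " + distinctTermBytes(numDocs, numDocs));
        System.out.println("low cardinality:  " + distinctTermBytes(numDocs, 1_000));
    }
}
```

With the same number of documents, the uncompressed term bytes grow linearly with the number of distinct values, which matches the direction of the effect described above even though real on-disk sizes also depend on compression.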


health status index      pri rep docs.count docs.deleted store.size pri.store.size 
green  open   test_field   1   0    1000000            0     13.6mb         13.6mb 

-rw-r--r--  1 weizijun  staff   6.1M Aug 28 10:19 _42.fdt
-rw-r--r--  1 weizijun  staff    22K Aug 28 10:19 _42.fdx
-rw-r--r--  1 weizijun  staff   688B Aug 28 10:19 _42.fnm
-rw-r--r--  1 weizijun  staff   503B Aug 28 10:19 _42.si
-rw-r--r--  1 weizijun  staff   2.8M Aug 28 10:19 _42_Lucene50_0.doc
-rw-r--r--  1 weizijun  staff   2.2M Aug 28 10:19 _42_Lucene50_0.tim
-rw-r--r--  1 weizijun  staff    83K Aug 28 10:19 _42_Lucene50_0.tip
-rw-r--r--  1 weizijun  staff   2.5M Aug 28 10:19 _42_Lucene54_0.dvd
-rw-r--r--  1 weizijun  staff   228B Aug 28 10:19 _42_Lucene54_0.dvm
-rw-r--r--  1 weizijun  staff   196B Aug 28 10:19 segments_2
-rw-r--r--  1 weizijun  staff     0B Aug 28 10:16 write.lock



health status index      pri rep docs.count docs.deleted store.size pri.store.size 
green  open   test_field   1   0    1000000            0    107.2mb        107.2mb 

-rw-r--r--  1 weizijun  staff    25M Aug 20 12:30 _5.fdt
-rw-r--r--  1 weizijun  staff   6.0K Aug 20 12:30 _5.fdx
-rw-r--r--  1 weizijun  staff   688B Aug 20 12:31 _5.fnm
-rw-r--r--  1 weizijun  staff   500B Aug 20 12:31 _5.si
-rw-r--r--  1 weizijun  staff   265K Aug 20 12:31 _5_Lucene50_0.doc
-rw-r--r--  1 weizijun  staff    44M Aug 20 12:31 _5_Lucene50_0.tim
-rw-r--r--  1 weizijun  staff   322K Aug 20 12:31 _5_Lucene50_0.tip
-rw-r--r--  1 weizijun  staff    37M Aug 20 12:31 _5_Lucene54_0.dvd
-rw-r--r--  1 weizijun  staff   254B Aug 20 12:31 _5_Lucene54_0.dvm
-rw-r--r--  1 weizijun  staff   224B Aug 20 12:31 segments_4
-rw-r--r--  1 weizijun  staff     0B Aug 20 12:00 write.lock



health status index      pri rep docs.count docs.deleted store.size pri.store.size 
green  open   test_field   1   0    1000000            0    162.4mb        162.4mb 

-rw-r--r--  1 weizijun  staff    41M Aug 18 22:59 _20.fdt
-rw-r--r--  1 weizijun  staff    18K Aug 18 22:59 _20.fdx
-rw-r--r--  1 weizijun  staff   777B Aug 18 22:59 _20.fnm
-rw-r--r--  1 weizijun  staff    59B Aug 18 22:59 _20.nvd
-rw-r--r--  1 weizijun  staff    78B Aug 18 22:59 _20.nvm
-rw-r--r--  1 weizijun  staff   539B Aug 18 22:59 _20.si
-rw-r--r--  1 weizijun  staff   7.2M Aug 18 22:59 _20_Lucene50_0.doc
-rw-r--r--  1 weizijun  staff   4.2M Aug 18 22:59 _20_Lucene50_0.pos
-rw-r--r--  1 weizijun  staff    73M Aug 18 22:59 _20_Lucene50_0.tim
-rw-r--r--  1 weizijun  staff   832K Aug 18 22:59 _20_Lucene50_0.tip
-rw-r--r--  1 weizijun  staff    37M Aug 18 22:59 _20_Lucene54_0.dvd
-rw-r--r--  1 weizijun  staff   254B Aug 18 22:59 _20_Lucene54_0.dvm
-rw-r--r--  1 weizijun  staff   196B Aug 18 22:59 segments_2
-rw-r--r--  1 weizijun  staff     0B Aug 18 22:53 write.lock














public final class SegmentInfos implements Cloneable, Iterable<SegmentCommitInfo> {
  // generation is the version of the segments_N commit file, parsed from the file name; in this example: 2t (base 36) / 101
  private long generation;     // generation of the "segments_N" for the next commit

  private long lastGeneration; // generation of the "segments_N" file we last successfully read
                               // or wrote; this is normally the same as generation except if
                               // there was an IOException that had interrupted a commit

  /** Id for this commit; only written starting with Lucene 5.0 */
  private byte[] id;

  /** Which Lucene version wrote this commit, or null if this commit is pre-5.3. */
  private Version luceneVersion;

  /** Counts how often the index has been changed.  */
  public long version;

  /** Used to name new segments. */
  // TODO: should this be a long ...?
  public int counter;

  /** Version of the oldest segment in the index, or null if there are no segments. */
  private Version minSegmentLuceneVersion;

  private List<SegmentCommitInfo> segments = new ArrayList<>();

  /** Opaque Map&lt;String, String&gt; that user can specify during IndexWriter.commit */
  public Map<String,String> userData = Collections.emptyMap();
}

/** Embeds a [read-only] SegmentInfo and adds per-commit
 *  fields.
 *  @lucene.experimental */
public class SegmentCommitInfo {

  /** The {@link SegmentInfo} that we wrap. */
  public final SegmentInfo info;

  // How many deleted docs in the segment:
  private int delCount;

  // Generation number of the live docs file (-1 if there
  // are no deletes yet):
  private long delGen;

  // Normally 1+delGen, unless an exception was hit on last
  // attempt to write:
  private long nextWriteDelGen;

  // Generation number of the FieldInfos (-1 if there are no updates)
  private long fieldInfosGen;

  // Normally 1+fieldInfosGen, unless an exception was hit on last attempt to
  // write
  private long nextWriteFieldInfosGen; //fieldInfosGen == -1 ? 1 : fieldInfosGen + 1;

  // Generation number of the DocValues (-1 if there are no updates)
  private long docValuesGen;

  // Normally 1+dvGen, unless an exception was hit on last attempt to
  // write
  private long nextWriteDocValuesGen; //docValuesGen == -1 ? 1 : docValuesGen + 1;

  // TODO should we add .files() to FieldInfosFormat, like we have on
  // LiveDocsFormat?
  // track the fieldInfos update files
  private final Set<String> fieldInfosFiles = new HashSet<>();

  // Track the per-field DocValues update files
  private final Map<Integer,Set<String>> dvUpdatesFiles = new HashMap<>();

  // Track the per-generation updates files
  private final Map<Long,Set<String>> genUpdatesFiles = new HashMap<>();

  private volatile long sizeInBytes = -1;
}










/**
 * Information about a segment such as its name, directory, and files related
 * to the segment.
 * @lucene.experimental
 */
public final class SegmentInfo {

  // _bl
  public final String name;

  /** Where this segment resides. */
  public final Directory dir;

  /** Id that uniquely identifies this segment. */
  private final byte[] id;

  private Codec codec;

  // Tracks the Lucene version this segment was created with, since 3.1. Null
  // indicates an older than 3.0 index, and it's used to detect a too old index.
  // The format expected is "x.y" - "2.x" for pre-3.0 indexes (or null), and
  // specific versions afterwards ("3.0.0", "3.1.0" etc.).
  // see o.a.l.util.Version.
  private Version version;

  private int maxDoc;         // number of docs in seg

  private boolean isCompoundFile;

  private Map<String,String> diagnostics;

  private Set<String> setFiles;

  private final Map<String,String> attributes;
}










/**
 *  Access to the Field Info file that describes document fields and whether or
 *  not they are indexed. Each segment has a separate Field Info file. Objects
 *  of this class are thread-safe for multiple readers, but only one thread can
 *  be adding documents at a time, with no other reader or writer threads
 *  accessing this object.
 */
public final class FieldInfo {
  /** Field's name */
  public final String name;

  /** Internal field number */
  public final int number;

  // DocValues type of this field
  private DocValuesType docValuesType = DocValuesType.NONE;

  // True if any document indexed term vectors
  private boolean storeTermVector;

  private boolean omitNorms; // omit norms associated with indexed fields 

  private IndexOptions indexOptions = IndexOptions.NONE;

  private boolean storePayloads; // whether this field stores payloads together with term positions 

  private final Map<String,String> attributes;

  // generation of the DocValues
  private long dvGen;
}


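File extensions: .fdx, .fdt

The index file is .fdx and the data file is .fdt. The purpose of these stored-field files is to return the content of a document given its automatically assigned document id. Search engines conventionally call this forward data, i.e. doc_id -> content. Elasticsearch's _source is stored here.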
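
The doc_id -> content lookup can be pictured as two steps: a small index tells us which chunk of the data file holds the document, and the chunk is then read from that offset. Below is a minimal sketch of the first step. It is a simplification under stated assumptions: the class `StoredFieldsIndexSketch` is hypothetical, the array names `docBases` and `startPointers` are borrowed from the Lucene field names for readability, the arrays are kept as plain values rather than packed deltas, and `docId` is assumed to be at least `docBases[0]`.

```java
import java.util.Arrays;

/** Simplified chunk index: maps a doc id to the chunk (and file offset) that contains it. */
final class StoredFieldsIndexSketch {
    /** First doc id of each chunk, in ascending order. */
    private final int[] docBases;
    /** Offset in the data file at which each chunk starts. */
    private final long[] startPointers;

    StoredFieldsIndexSketch(int[] docBases, long[] startPointers) {
        this.docBases = docBases;
        this.startPointers = startPointers;
    }

    /** Returns the index of the chunk whose doc id range contains docId. */
    int chunkOf(int docId) {
        int pos = Arrays.binarySearch(docBases, docId);
        // On a miss, binarySearch returns -(insertionPoint) - 1;
        // the containing chunk is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Returns the data-file offset of the chunk that contains docId. */
    long startPointer(int docId) {
        return startPointers[chunkOf(docId)];
    }

    public static void main(String[] args) {
        StoredFieldsIndexSketch index = new StoredFieldsIndexSketch(
                new int[] {0, 100, 250}, new long[] {0L, 4096L, 9000L});
        System.out.println(index.chunkOf(99));        // chunk holding doc 99
        System.out.println(index.startPointer(260));  // offset of the chunk holding doc 260
    }
}
```

Because only one entry per chunk is kept, the index stays small compared with the data file, which is consistent with the .fdx files in the listings above being kilobytes while the .fdt files are megabytes.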







/**
 * Random-access reader for {@link CompressingStoredFieldsIndexWriter}.
 * @lucene.internal
 */
public final class CompressingStoredFieldsIndexReader implements Cloneable, Accountable {
  private static final long BASE_RAM_BYTES_USED = RamUsageEstimator.shallowSizeOfInstance(CompressingStoredFieldsIndexReader.class);

  final int maxDoc;

  final int[] docBases;

  final long[] startPointers;

  final int[] avgChunkDocs;

  final long[] avgChunkSizes;

  final PackedInts.Reader[] docBasesDeltas; // delta from the avg

  final PackedInts.Reader[] startPointersDeltas; // delta from the avg
}

/**
 * {@link StoredFieldsReader} impl for {@link CompressingStoredFieldsFormat}.
 * @lucene.experimental
 */
public final class CompressingStoredFieldsReader extends StoredFieldsReader {

  private final int version;

  // basic information about the fields
  private final FieldInfos fieldInfos;

  private final CompressingStoredFieldsIndexReader indexReader;

  private final long maxPointer;

  private final IndexInput fieldsStream;

  private final int chunkSize;

  private final int packedIntsVersion;

  private final CompressionMode compressionMode;

  private final Decompressor decompressor;

  private final int numDocs;

  private final boolean merging;

  private final BlockState state;
  private final long numChunks; // number of compressed blocks written

  // number of dirty chunks
  private final long numDirtyChunks; // number of incomplete compressed blocks written

  private boolean closed;
}




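In Lucene 5.5.0 the term index of the inverted index is implemented as an FST (finite state transducer). The biggest advantage of an FST is its very low memory footprint; for details see this article: http://www.cnblogs.com/bonelee/p/6226185.html

http://examples.mikemccandless.com/fst.py?terms=&cmd=Build+it is an interactive FST demo that builds the FST graph for whatever data you enter.

The data fed into the FST is: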
String inputValues[] = {"mop","moth","pop","star","stop","top"};
long outputValues[] = {0,1,2,3,4,5};
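
As a rough intuition for why sharing structure between terms saves space, here is a minimal sketch that loads the same six terms into a plain prefix trie. This is deliberately not an FST: a trie only shares common prefixes (such as "mo" in mop/moth), while a real FST also shares common suffixes (such as "op"), so the FST is smaller still. The class `PrefixTrieSketch` and its methods are hypothetical names used only for this illustration.

```java
import java.util.TreeMap;

/** Plain prefix trie mapping terms to long outputs. A simplification, not an FST. */
final class PrefixTrieSketch {
    private static final class Node {
        final TreeMap<Character, Node> children = new TreeMap<>();
        long output = -1; // -1 means "no term ends at this node"
    }

    private final Node root = new Node();
    private int nodeCount = 0; // number of nodes, not counting the root

    /** Inserts a term; nodes for an already existing prefix are reused. */
    void add(String term, long output) {
        Node cur = root;
        for (char c : term.toCharArray()) {
            Node next = cur.children.get(c);
            if (next == null) {
                next = new Node();
                cur.children.put(c, next);
                nodeCount++;
            }
            cur = next;
        }
        cur.output = output;
    }

    /** Returns the output stored for the term, or -1 if the term is absent. */
    long get(String term) {
        Node cur = root;
        for (char c : term.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) {
                return -1;
            }
        }
        return cur.output;
    }

    int nodeCount() {
        return nodeCount;
    }

    public static void main(String[] args) {
        String[] inputValues = {"mop", "moth", "pop", "star", "stop", "top"};
        long[] outputValues = {0, 1, 2, 3, 4, 5};
        PrefixTrieSketch trie = new PrefixTrieSketch();
        for (int i = 0; i < inputValues.length; i++) {
            trie.add(inputValues[i], outputValues[i]);
        }
        // The six terms contain 21 characters in total but need only 17 trie nodes.
        System.out.println(trie.nodeCount());
        System.out.println(trie.get("stop"));
    }
}
```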

The resulting FST graph is:












public final class BlockTreeTermsReader extends FieldsProducer {
  // Open input to the main terms dict file (_X.tib)
  final IndexInput termsIn;
  // Reads the terms dict entries, to gather state to
  // produce DocsEnum on demand
  final PostingsReaderBase postingsReader;
  private final TreeMap<String,FieldReader> fields = new TreeMap<>();

  /** File offset where the directory starts in the terms file. */
  private long dirOffset;
  /** File offset where the directory starts in the index file. */

  private long indexDirOffset;

  final String segment;

  final int version;

  // For 5.3.x indexes we record up front whether any auto-prefix terms may have been written; in this example the value is false
  final boolean anyAutoPrefixTerms;
}

/**
 * BlockTree's implementation of {@link Terms}.
 * @lucene.internal
 */
public final class FieldReader extends Terms implements Accountable {

  final long numTerms;

  final FieldInfo fieldInfo;

  final long sumTotalTermFreq;

  final long sumDocFreq;

  final int docCount;

  final long indexStartFP;

  final long rootBlockFP;

  final BytesRef rootCode;

  final BytesRef minTerm;

  final BytesRef maxTerm;

  //longs:metadata buffer, holding monotonic values
  final int longsSize;

  final BlockTreeTermsReader parent;

  final FST<BytesRef> index;
}


File extensions: .doc, .pos, .pay

The .doc file stores the doc id list of each term and the frequency of the term within each doc.
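
Doc id lists compress well because the ids are sorted, so only the gaps between neighbours need to be stored. The sketch below shows that idea with delta encoding plus variable-length integers. It is a simplified stand-in, not the actual file format: as far as I understand, Lucene's postings format bit-packs full blocks of doc deltas and only uses variable-length integers for the remainder. The class `PostingsDeltaSketch` is a hypothetical name, and the input is assumed to be ascending, non-negative doc ids.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Delta + variable-length integer encoding of an ascending doc id list. */
final class PostingsDeltaSketch {
    /** Encodes gaps between doc ids; each byte carries 7 data bits, high bit = more bytes follow. */
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int docId : sortedDocIds) {
            int delta = docId - prev;
            prev = docId;
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }

    /** Reverses encode(): reads the gaps and rebuilds the absolute doc ids. */
    static int[] decode(byte[] bytes) {
        List<Integer> ids = new ArrayList<>();
        int prev = 0;
        int i = 0;
        while (i < bytes.length) {
            int delta = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[i++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;
            ids.add(prev);
        }
        int[] result = new int[ids.size()];
        for (int k = 0; k < result.length; k++) {
            result[k] = ids.get(k);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] docIds = {5, 8, 300};
        byte[] encoded = encode(docIds);
        // Three ints (12 bytes raw) fit into 4 bytes: gaps 5 and 3 take 1 byte each, gap 292 takes 2.
        System.out.println(encoded.length);
    }
}
```

This also hints at why a low-cardinality field has a larger .doc file in the listings above: fewer distinct terms means each term has a long doc id list to store.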








/**
 * Concrete class that reads docId(maybe frq,pos,offset,payloads) list
 * with postings format.
 * @lucene.experimental
 */
public final class Lucene50PostingsReader extends PostingsReaderBase {
  private static final long BASE_RAM_BYTES_USED = RamUsageEstimator.shallowSizeOfInstance(Lucene50PostingsReader.class);
  private final IndexInput docIn;
  private final IndexInput posIn;
  private final IndexInput payIn;
  final ForUtil forUtil;
  private int version;

  final class BlockDocsEnum extends PostingsEnum {
    // ...
  }

  final class BlockPostingsEnum extends PostingsEnum {
    // ...
  }

  final class EverythingEnum extends PostingsEnum {
    // ...
  }
}


File extensions: .dvm, .dvd



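  • 1. NONE: docvalues are not enabled for the field.
  • 2. NUMERIC: a single numeric value per document (int, long, float, double).
  • 3. BINARY: an arbitrary byte[] per document; depending on the codec, the maximum length may exceed 32766 bytes.
  • 4. SORTED: a pre-sorted byte[] per document. Only the distinct values are stored, plus a per-document pointer into them. Each value must be at most 32766 bytes.
  • 5. SORTED_NUMERIC: a sorted array of numeric values per document.
  • 6. SORTED_SET: a sorted set of byte[] values per document, so it can hold the docvalues of multi-valued fields.
  • 7. A not_analyzed string field uses the SORTED_SET type, and numeric fields use the SORTED_NUMERIC type.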

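For SORTED_SET, the single-valued format (SORTED_SINGLE_VALUED) consists of two kinds of data, binary + numeric: the binary part is the list of terms sorted by ord, and the numeric part is the mapping from doc to ord.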
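
The binary + numeric split can be sketched in a few lines: collect the distinct values in sorted order (the position in that list is the ord), then store one ord per document. This is an illustrative model only; `SortedDocValuesSketch` and its field names are hypothetical, it assumes exactly one value per document, and it sorts by Java string order whereas Lucene compares the encoded bytes.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Single-valued sorted docvalues: a sorted term list plus a doc -> ord mapping. */
final class SortedDocValuesSketch {
    /** "binary" part: distinct terms in sorted order; the array index is the ord. */
    final String[] termsByOrd;
    /** "numeric" part: doc id -> ord of that doc's value. */
    final int[] docToOrd;

    SortedDocValuesSketch(String[] valuePerDoc) {
        TreeSet<String> distinct = new TreeSet<>(Arrays.asList(valuePerDoc));
        termsByOrd = distinct.toArray(new String[0]);
        docToOrd = new int[valuePerDoc.length];
        for (int doc = 0; doc < valuePerDoc.length; doc++) {
            // Every value is present in termsByOrd, so the search always hits.
            docToOrd[doc] = Arrays.binarySearch(termsByOrd, valuePerDoc[doc]);
        }
    }

    /** Column-style access: the value of one doc, resolved through its ord. */
    String valueOf(int docId) {
        return termsByOrd[docToOrd[docId]];
    }

    public static void main(String[] args) {
        SortedDocValuesSketch dv = new SortedDocValuesSketch(new String[] {"b", "a", "b", "c"});
        System.out.println(Arrays.toString(dv.termsByOrd));
        System.out.println(Arrays.toString(dv.docToOrd));
    }
}
```

Sorting and aggregating can then work on the small integer ords instead of the strings, and repeated values cost only one ord each, which is why a low-cardinality field produces a much smaller .dvd file.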






/** reader for {@link Lucene54DocValuesFormat} */
final class Lucene54DocValuesProducer extends DocValuesProducer implements Closeable {
  private final Map<String,NumericEntry> numerics = new HashMap<>();

  private final Map<String,BinaryEntry> binaries = new HashMap<>();

  private final Map<String,SortedSetEntry> sortedSets = new HashMap<>();

  private final Map<String,SortedSetEntry> sortedNumerics = new HashMap<>();

  private final Map<String,NumericEntry> ords = new HashMap<>();

  // per-field ord index used to resolve docId -> address -> ord
  private final Map<String,NumericEntry> ordIndexes = new HashMap<>();

  private final int numFields;

  private final AtomicLong ramBytesUsed;

  private final IndexInput data;

  private final int maxDoc;
  // memory-resident structures
  private final Map<String,MonotonicBlockPackedReader> addressInstances = new HashMap<>();
  private final Map<String,ReverseTermsIndex> reverseIndexInstances = new HashMap<>();
  private final Map<String,DirectMonotonicReader.Meta> directAddressesMeta = new HashMap<>();

  private final boolean merging;

/** metadata entry for a numeric docvalues field */
  static class NumericEntry {
    private NumericEntry() {}
    /** offset to the bitset representing docsWithField, or -1 if no documents have missing values */
    long missingOffset;

    /** offset to the actual numeric values */
    public long offset;

    /** end offset to the actual numeric values */
    public long endOffset;

    /** bits per value used to pack the numeric values */
    public int bitsPerValue;

    int format;
    /** count of values written */
    public long count;
    /** monotonic meta */
    public DirectMonotonicReader.Meta monotonicMeta;

    long minValue;

    //Compressed by computing the GCD
    long gcd;

    //Compressed by giving IDs to unique values.
    long table[];
    /** for sparse compression */
    long numDocsWithValue;
    NumericEntry nonMissingValues;
    NumberType numberType;
  }

  /** metadata entry for a binary docvalues field */
  static class BinaryEntry {
    private BinaryEntry() {}
    /** offset to the bitset representing docsWithField, or -1 if no documents have missing values */
    long missingOffset;
    /** offset to the actual binary values */
    long offset;
    int format;
    /** count of values written */
    public long count;

    int minLength;

    int maxLength;
    /** offset to the addressing data that maps a value to its slice of the byte[] */
    public long addressesOffset, addressesEndOffset;
    /** meta data for addresses */
    public DirectMonotonicReader.Meta addressesMeta;
    /** offset to the reverse index */
    public long reverseIndexOffset;
    /** packed ints version used to encode addressing information */
    public int packedIntsVersion;
    /** packed ints blocksize */
    public int blockSize;
  }
}
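
The `minValue`, `gcd` and `bitsPerValue` fields of `NumericEntry` hint at how numeric docvalues are shrunk: subtract the minimum, divide by the greatest common divisor, and bit-pack what is left. Here is a minimal sketch of that arithmetic. It is my own simplified illustration, not the Lucene implementation: `GcdCompressionSketch` and its methods are hypothetical, the input is assumed to be non-empty and non-negative, and `value - min` is assumed not to overflow.

```java
/** Estimates bits per value after min-subtraction and GCD division. */
final class GcdCompressionSketch {
    /** Euclid's algorithm; gcd(0, x) is x. */
    static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return Math.abs(a);
    }

    /** Bits needed to store a non-negative value (0 needs 0 bits in this model). */
    static int bitsRequired(long value) {
        return 64 - Long.numberOfLeadingZeros(value);
    }

    /** Bits per value if each value is stored as (value - min) / gcd; 0 if all values are equal. */
    static int bitsPerValue(long[] values) {
        long min = values[0];
        long max = values[0];
        for (long v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        long g = 0;
        for (long v : values) {
            g = gcd(g, v - min);
        }
        if (g == 0) {
            return 0; // every value equals min, nothing to store per document
        }
        return bitsRequired((max - min) / g);
    }

    public static void main(String[] args) {
        long[] values = {1000, 2000, 3000, 5000};
        System.out.println(bitsRequired(5000));    // 13 bits for the largest raw value
        System.out.println(bitsPerValue(values));  // 3 bits after min and GCD reduction
    }
}
```

Values that share a common step, such as timestamps rounded to whole seconds or prices in whole cents, benefit the most from this kind of reduction.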


