MapReduce Study Notes

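MapOutputBuffer has a field named mapOutputFile. In the sortAndSpill function (which is called by flush), the spill file path is obtained through this field and the intermediate results are written out. Inside that method, the data is written by calling writer.append(key, value) on the writer discussed below. As far as I can tell, there is no encryption step anywhere in this path.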

When shuffle.run() executes, the map outputs are fetched and merged, which ends with a call to merger.close(). That call actually lands in the close method of MergeManagerImpl, shown below:

@Override
  public RawKeyValueIterator close() throws Throwable {
    // Wait for on-going merges to complete
    if (memToMemMerger != null) { 
      memToMemMerger.close();
    }
    inMemoryMerger.close();
    onDiskMerger.close();

    List<InMemoryMapOutput<K, V>> memory = 
      new ArrayList<InMemoryMapOutput<K, V>>(inMemoryMergedMapOutputs);
    inMemoryMergedMapOutputs.clear();
    memory.addAll(inMemoryMapOutputs);
    inMemoryMapOutputs.clear();
    List<CompressAwarePath> disk = new ArrayList<CompressAwarePath>(onDiskMapOutputs);
    onDiskMapOutputs.clear();
    return finalMerge(jobConf, rfs, memory, disk);
  }
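
To make the last step of close() more concrete, here is a minimal sketch of merging several already-sorted segments into one sorted sequence. It is not Hadoop's finalMerge: it uses only the JDK, the class name KWayMergeSketch is hypothetical, and both the in-memory and the on-disk outputs are modelled as plain sorted lists of String keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only: a k-way merge over sorted segments.
public class KWayMergeSketch {

  // One cursor per segment: its current head plus the remaining elements.
  private static final class Cursor {
    String head;
    final Iterator<String> rest;

    Cursor(Iterator<String> it) {
      this.rest = it;
      this.head = it.next();
    }
  }

  public static List<String> merge(List<List<String>> segments) {
    // The heap always exposes the segment whose head is smallest.
    PriorityQueue<Cursor> heap =
        new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
    for (List<String> segment : segments) {
      if (!segment.isEmpty()) {
        heap.add(new Cursor(segment.iterator()));
      }
    }

    List<String> merged = new ArrayList<>();
    while (!heap.isEmpty()) {
      Cursor smallest = heap.poll();
      merged.add(smallest.head);
      if (smallest.rest.hasNext()) {
        smallest.head = smallest.rest.next();
        heap.add(smallest); // re-insert with its new head
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    List<List<String>> segments = new ArrayList<>();
    segments.add(Arrays.asList("a", "d", "g")); // stands in for an in-memory output
    segments.add(Arrays.asList("b", "e"));      // stands in for another in-memory output
    segments.add(Arrays.asList("c", "f", "h")); // stands in for an on-disk output
    System.out.println(merge(segments)); // prints [a, b, c, d, e, f, g, h]
  }
}
```

The heap keeps the cost per emitted record at O(log k) for k segments, which is why this shape is commonly used when many sorted runs have to be combined.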

So we see three different mergers here, memToMemMerger, inMemoryMerger, and onDiskMerger, declared as follows:

private IntermediateMemoryToMemoryMerger memToMemMerger;
private final MergeThread<InMemoryMapOutput<K,V>, K,V> inMemoryMerger;
private final OnDiskMerger onDiskMerger;

IntermediateMemoryToMemoryMerger extends MergeThread<InMemoryMapOutput<K, V>, K, V>, and the close and run methods of MergeThread look like this:

public synchronized void close() throws InterruptedException {
  closed = true;
  waitForMerge();
  interrupt();
}


public void run() {
  while (true) {
    List<T> inputs = null;
    try {
      // Wait for notification to start the merge...
      synchronized (pendingToBeMerged) {
        while(pendingToBeMerged.size() <= 0) {
          pendingToBeMerged.wait();
        }
        // Pickup the inputs to merge.
        inputs = pendingToBeMerged.removeFirst();
      }

      // Merge
      merge(inputs);
    } catch (InterruptedException ie) {
      numPending.set(0);
      return;
    } catch(Throwable t) {
      numPending.set(0);
      reporter.reportException(t);
      return;
    } finally {
      synchronized (this) {
        numPending.decrementAndGet();
        notifyAll();
      }
    }
  }
}
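
The interplay between close() and run() is easier to follow in a stripped-down form. The following is a minimal sketch of the same wait/notify pattern using only the JDK. The class MergeThreadSketch, its startMerge, waitForMerge, and result methods, and the "merge" that simply appends integers are all assumptions of this sketch, not Hadoop code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only: a worker thread that merges queued batches.
public class MergeThreadSketch extends Thread {
  private final LinkedList<List<Integer>> pendingToBeMerged = new LinkedList<>();
  private final AtomicInteger numPending = new AtomicInteger(0);
  private final List<Integer> merged = new ArrayList<>(); // guarded by "this"

  // Producer side: queue a batch and wake the worker.
  public void startMerge(List<Integer> inputs) {
    numPending.incrementAndGet();
    synchronized (pendingToBeMerged) {
      pendingToBeMerged.addLast(inputs);
      pendingToBeMerged.notifyAll();
    }
  }

  // Block until every queued batch has been processed.
  public synchronized void waitForMerge() throws InterruptedException {
    while (numPending.get() > 0) {
      wait();
    }
  }

  // Drain outstanding work first, then interrupt the idle worker.
  public synchronized void close() throws InterruptedException {
    waitForMerge();
    interrupt();
  }

  public synchronized List<Integer> result() {
    return new ArrayList<>(merged);
  }

  @Override
  public void run() {
    while (true) {
      List<Integer> inputs;
      try {
        synchronized (pendingToBeMerged) {
          while (pendingToBeMerged.isEmpty()) {
            pendingToBeMerged.wait();
          }
          inputs = pendingToBeMerged.removeFirst();
        }
      } catch (InterruptedException ie) {
        return; // close() interrupts the worker once nothing is pending
      }
      synchronized (this) {
        merged.addAll(inputs); // the "merge" step of this sketch
        numPending.decrementAndGet();
        notifyAll(); // wake anyone blocked in waitForMerge()
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    MergeThreadSketch worker = new MergeThreadSketch();
    worker.start();
    worker.startMerge(Arrays.asList(1, 2));
    worker.startMerge(Arrays.asList(3));
    worker.close();
    worker.join();
    System.out.println(worker.result()); // prints [1, 2, 3]
  }
}
```

The point to notice is the ordering inside close(): it waits until the pending counter reaches zero and only then interrupts, so the interrupt only ever ends a worker that has no queued work left.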

inMemoryMerger, on the other hand, is created by the createInMemoryMerger function and is in fact an instance of InMemoryMerger.

In their merge methods, all three create a Writer and call Merger.writeFile(iter, writer, reporter, jobConf), then call writer.close() to finish. The close function is implemented as follows:

public void close() throws IOException {

  // When IFile writer is created by BackupStore, we do not have
  // Key and Value classes set. So, check before closing the
  // serializers
  if (keyClass != null) {
    keySerializer.close();
    valueSerializer.close();
  }

  // Write EOF_MARKER for key/value length
  WritableUtils.writeVInt(out, EOF_MARKER);
  WritableUtils.writeVInt(out, EOF_MARKER);
  decompressedBytesWritten += 2 * WritableUtils.getVIntSize(EOF_MARKER);

  //Flush the stream
  out.flush();

  if (compressOutput) {
    // Flush
    compressedOut.finish();
    compressedOut.resetState();
  }

  // Close the underlying stream iff we own it...
  if (ownOutputStream) {
    out.close();
  }
  else {
    // Write the checksum
    checksumOut.finish();
  }

  compressedBytesWritten = rawOut.getPos() - start;

  if (compressOutput) {
    // Return back the compressor
    CodecPool.returnCompressor(compressor);
    compressor = null;
  }

  out = null;
  if(writtenRecordsCounter != null) {
    writtenRecordsCounter.increment(numRecordsWritten);
  }
}
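
The role of the two EOF_MARKER writes is easier to see from the reader's side. Below is a minimal sketch of length-prefixed key/value records terminated by a pair of markers. It makes several assumptions that are not taken from Hadoop: fixed 4-byte lengths via DataOutputStream.writeInt instead of variable-length VInts, a marker value of -1, no checksum, and the hypothetical class name RecordFramingSketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: framed records ending with two EOF markers.
public class RecordFramingSketch {
  static final int EOF_MARKER = -1; // assumed value for this sketch

  public static byte[] write(List<String[]> records) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    for (String[] kv : records) {
      byte[] key = kv[0].getBytes(StandardCharsets.UTF_8);
      byte[] value = kv[1].getBytes(StandardCharsets.UTF_8);
      out.writeInt(key.length);   // key length
      out.writeInt(value.length); // value length
      out.write(key);
      out.write(value);
    }
    // Two markers, one in the key-length slot and one in the value-length slot.
    out.writeInt(EOF_MARKER);
    out.writeInt(EOF_MARKER);
    out.flush();
    return bytes.toByteArray();
  }

  public static List<String> read(byte[] data) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
    List<String> result = new ArrayList<>();
    while (true) {
      int keyLength = in.readInt();
      int valueLength = in.readInt();
      if (keyLength == EOF_MARKER && valueLength == EOF_MARKER) {
        break; // clean end of the record stream
      }
      byte[] key = new byte[keyLength];
      byte[] value = new byte[valueLength];
      in.readFully(key);
      in.readFully(value);
      result.add(new String(key, StandardCharsets.UTF_8) + "="
          + new String(value, StandardCharsets.UTF_8));
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    List<String[]> records = new ArrayList<>();
    records.add(new String[] {"a", "1"});
    records.add(new String[] {"bb", "22"});
    System.out.println(read(write(records))); // prints [a=1, bb=22]
  }
}
```

Because a real length can never be negative, the reader can tell the marker apart from a record header without any extra framing.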

We can see that the key piece here is out. It is created as follows:

if (codec != null) {
  this.compressor = CodecPool.getCompressor(codec);
  if (this.compressor != null) {
    this.compressor.reset();
    this.compressedOut = codec.createOutputStream(checksumOut, compressor);
    this.out = new FSDataOutputStream(this.compressedOut, null);
    this.compressOutput = true;
  } else {
    LOG.warn("Could not obtain compressor from CodecPool");
    this.out = new FSDataOutputStream(checksumOut, null);
  }
} else {
  this.out = new FSDataOutputStream(checksumOut, null);
}

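This part explains how the intermediate results get compressed when a compression codec is passed in: the compressing stream is wrapped around checksumOut, so the data is compressed first and the checksum is computed over the compressed bytes.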
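
The same layering can be tried out with JDK classes alone. The sketch below is only an analogy and assumes nothing about Hadoop's codecs: GZIPOutputStream stands in for codec.createOutputStream, CheckedOutputStream with CRC32 stands in for checksumOut, and the class name StreamStackSketch is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only: compression layered on top of a checksum stream.
public class StreamStackSketch {

  public static byte[] write(byte[] payload, boolean compress) throws IOException {
    ByteArrayOutputStream raw = new ByteArrayOutputStream();
    // The checksum stream sits directly above the raw sink, so it sees
    // exactly the bytes that get stored (compressed or not).
    CheckedOutputStream checksumOut = new CheckedOutputStream(raw, new CRC32());
    // With a "codec", wrap the checksum stream; without one, write to it directly.
    OutputStream out = compress ? new GZIPOutputStream(checksumOut) : checksumOut;
    out.write(payload);
    out.close(); // finishes the GZIP stream and flushes everything down to raw
    return raw.toByteArray();
  }

  public static byte[] readCompressed(byte[] stored) throws IOException {
    ByteArrayOutputStream result = new ByteArrayOutputStream();
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(stored))) {
      byte[] buffer = new byte[4096];
      int n;
      while ((n = in.read(buffer)) != -1) {
        result.write(buffer, 0, n);
      }
    }
    return result.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 1000; i++) {
      sb.append("key\tvalue\n");
    }
    byte[] payload = sb.toString().getBytes(StandardCharsets.UTF_8);

    byte[] plain = write(payload, false);
    byte[] packed = write(payload, true);
    System.out.println("plain bytes:  " + plain.length);
    System.out.println("packed bytes: " + packed.length);
    System.out.println("round trip ok: "
        + Arrays.equals(readCompressed(packed), payload));
  }
}
```

The only thing that changes between the two branches is which stream the records are written to; the code that produces the records does not need to know whether compression is on.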

A few conclusions:

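  • The output should be compressed into a specific format based on the settings in the Job Configuration. See: StackOverflow
  • Using the map's intermediate results directly is probably not feasible either, unless you modify the source code yourself. See: StackOverflow. That said, it may be worth trying a custom IFile implementation as an experiment.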