Anyone who has looked at the MapReduce processing flow knows that the input is divided into splits, and the number of splits determines the number of mappers. So where do these splits come from? When we write MapReduce code there is no interface for setting the number of splits directly. Some people say it equals the number of blocks — is that really true? Let's look at the source code:
```java
public List<InputSplit> getSplits(JobContext job) throws IOException {
    StopWatch sw = new StopWatch().start();
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);

    // generate splits
    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<FileStatus> files = listStatus(job);
    for (FileStatus file : files) {
        Path path = file.getPath();
        long length = file.getLen();
        if (length != 0) {
            BlockLocation[] blkLocations;
            if (file instanceof LocatedFileStatus) {
                blkLocations = ((LocatedFileStatus) file).getBlockLocations();
            } else {
                FileSystem fs = path.getFileSystem(job.getConfiguration());
                blkLocations = fs.getFileBlockLocations(file, 0, length);
            }
            if (isSplitable(job, path)) {
                long blockSize = file.getBlockSize();
                long splitSize = computeSplitSize(blockSize, minSize, maxSize); // the split size is computed here

                long bytesRemaining = length;
                while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                    splits.add(makeSplit(path, length - bytesRemaining, splitSize,
                            blkLocations[blkIndex].getHosts(),
                            blkLocations[blkIndex].getCachedHosts()));
                    bytesRemaining -= splitSize;
                }

                if (bytesRemaining != 0) {
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                    splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                            blkLocations[blkIndex].getHosts(),
                            blkLocations[blkIndex].getCachedHosts()));
                }
            } else { // not splitable
                splits.add(makeSplit(path, 0, length,
                        blkLocations[0].getHosts(),
                        blkLocations[0].getCachedHosts()));
            }
        } else {
            // Create empty hosts array for zero length files
            splits.add(makeSplit(path, 0, length, new String[0]));
        }
    }
    // Save the number of input files for metrics/loadgen
    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
    sw.stop();
    if (LOG.isDebugEnabled()) {
        LOG.debug("Total # of splits generated by getSplits: " + splits.size()
                + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
    }
    return splits;
}
```
```java
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize)); // clamp blockSize between minSize and maxSize
}
```
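The clamping logic above is easy to verify in isolation. The following is a minimal sketch (the class and method names here are our own, not Hadoop's) showing how `computeSplitSize` behaves under the default configuration and when the min/max bounds are changed:

```java
// Standalone demo of the computeSplitSize formula:
// max(minSize, min(maxSize, blockSize))
public class SplitSizeDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // 128 MB default HDFS block size

        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): split size == block size
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));          // 134217728

        // Raising minSize above the block size yields larger splits (fewer mappers)
        System.out.println(computeSplitSize(blockSize, 256L * 1024 * 1024, Long.MAX_VALUE)); // 268435456

        // Lowering maxSize below the block size yields smaller splits (more mappers)
        System.out.println(computeSplitSize(blockSize, 1L, 64L * 1024 * 1024));       // 67108864
    }
}
```

With the default bounds the block size always wins, which is why splits normally line up with blocks.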
Here, getMinSplitSize and getMaxSplitSize return the minimum and maximum InputSplit sizes. They are controlled by the configuration parameters mapreduce.input.fileinputformat.split.minsize (default 1L) and mapreduce.input.fileinputformat.split.maxsize (default Long.MAX_VALUE).
Since the default block size in Hadoop is 128 MB, the split size normally works out to the block size, so the number of mappers is roughly the total number of blocks across the input files — one mapper per split, not one per file.
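We can make the split count concrete by replaying the loop from getSplits above on its own. This is only a sketch of the arithmetic (the class is ours, not Hadoop's), using the same SPLIT_SLOP factor of 1.1 — a tail smaller than 10% of a split is folded into the last split rather than getting its own mapper:

```java
// Replays the split-counting loop from FileInputFormat.getSplits
public class SplitCountDemo {
    static final double SPLIT_SLOP = 1.1; // 10% slack, same as Hadoop's constant

    static int countSplits(long length, long splitSize) {
        int splits = 0;
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            bytesRemaining -= splitSize;
            splits++;
        }
        if (bytesRemaining != 0) {
            splits++; // the tail (up to 1.1 * splitSize) becomes the final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // 300 MB file with 128 MB splits -> 3 splits (128 MB, 128 MB, 44 MB)
        System.out.println(countSplits(300 * mb, 128 * mb)); // 3
        // 130 MB file -> within the 10% slack, so only 1 split (and 1 mapper)
        System.out.println(countSplits(130 * mb, 128 * mb)); // 1
    }
}
```

Note the second case: a 130 MB file does not produce two mappers even though it spans two blocks, because 130/128 is under the 1.1 slop threshold.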
This alignment of splits with blocks preserves data locality, which improves efficiency.
1. What if, for some special requirement, you need to control the number of mappers yourself?
Your only levers are the block size and the minimum/maximum split sizes; adjusting those is how you influence the number of mappers.
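For example, a job driver could tune these in either of two equivalent ways — through the configuration keys named above, or through the convenience setters on FileInputFormat. This is a sketch of a driver fragment, not a complete runnable job (paths and job setup are omitted):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitTuningDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Option 1: set the config keys directly.
        // Forcing minsize above the 128 MB block size -> larger splits, fewer mappers.
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024);

        Job job = Job.getInstance(conf, "split tuning demo");

        // Option 2: use FileInputFormat's setters (they write the same keys).
        // Capping maxsize below the block size -> smaller splits, more mappers.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

        // ... set mapper/reducer classes, input/output paths, then submit.
    }
}
```

Remember that min takes precedence over max in computeSplitSize's formula, so setting contradictory bounds (min greater than max) silently resolves in favor of the minimum.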