Partitioning, Shuffle and sort

時間 2019-11-18

標籤 partitioning shuffle sort 简体版

原文原文鏈接

Partitioning, Shuffle and sort what happened?node

- Partitioning

Partitioning is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine for all of its output (key, value) pairs which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same

Problem: how dose the hadoop make it? Use a hash function ? what is the function?

here is code~

1 public class HashPartitioner<K, V> extends Partitioner<K, V> {
2     public int getPartition(K key, V value, int numReduceTasks) {
3         return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
4     }
5 }

解釋：將key均勻分佈在ReduceTasks上，舉例若是Key爲Text的話，Text的hashcode方法跟String的基本一致，都是採用的Horner公式計算，獲得一個int，string太大的話這個int值可能會溢出變成負數，因此與上Integer.MAX_VALUE（即0111111111111111），而後再對reduce個數取餘，這樣就可讓相同key分佈在一個節點上，而且較爲均勻的分佈在reduce上算法

Horner規則:算法導論上有介紹這個，百度之app

think about BloomFilter~ 保證這個任務任務分發的均勻是關鍵，因此要設計優秀的hash函數是關鍵less

- Shuffle

After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.

- Sort

Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。