MapReduce

The prototypical MapReduce example counts the appearance of each word in a set of documents:[14]

function map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    emit (w, 1)

function reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  sum = 0
  for each pc in partialCounts:
    sum += pc
  emit (word, sum)

 

Here, each document is split into words, and each word is counted by the map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to reduce. Thus, this function just needs to sum all of its input values to find the total appearances of that word.
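
The pseudocode above can be exercised end to end with an ordinary single-process driver. The following Python sketch is illustrative only: the map_reduce driver and the sample documents are assumptions made for the demonstration, not part of any particular framework.

from collections import defaultdict

def map_word_count(name, document):
    # name: document name; document: document contents.
    for word in document.split():
        yield (word, 1)

def reduce_word_count(word, partial_counts):
    # Sum the partial counts emitted for one word.
    return (word, sum(partial_counts))

def map_reduce(map_fn, reduce_fn, inputs):
    # Toy driver: run map, group values by key, then run reduce per key.
    grouped = defaultdict(list)
    for name, document in inputs.items():
        for key, value in map_fn(name, document):
            grouped[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(grouped.items())]

docs = {"doc1": "the quick brown fox", "doc2": "the lazy dog and the fox"}
print(map_reduce(map_word_count, reduce_word_count, docs))
# [('and', 1), ('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]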

As another example, imagine that for a database of 1.1 billion people, one would like to compute the average number of social contacts a person has according to age. In SQL, such a query could be expressed as:

 SELECT age, AVG(contacts)
    FROM social.person
GROUP BY age
ORDER BY age

Using MapReduce, the same computation can be expressed with the Map and Reduce functions below, where each input key K1 identifies a batch of 1 million records and the intermediate key is a person's age in years:

function Map is
    input: integer K1 between 1 and 1100, representing a batch of 1 million social.person records
    for each social.person record in the K1 batch do
        let Y be the person's age
        let N be the number of contacts the person has
        produce one output record (Y,(N,1))
    repeat
end function

function Reduce is
    input: age (in years) Y
    for each input record (Y,(N,C)) do
        Accumulate in S the sum of N*C
        Accumulate in Cnew the sum of C
    repeat
    let A be S/Cnew
    produce one output record (Y,(A,Cnew))
end function
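
Note that in the Reduce function, C is the count of people having a total of N contacts, which is why the Map function emits C = 1: every output pair refers to the contacts of a single person.

A minimal single-process rendering of these two functions in Python might look as follows; the in-memory batches and the grouping loop are illustrative assumptions standing in for the framework's distributed shuffle:

from collections import defaultdict

def map_batch(k1, batch):
    # For each social.person record in batch K1, emit (age, (contacts, 1)).
    for person in batch:
        yield (person["age"], (person["contacts"], 1))

def reduce_age(age, records):
    # Weighted average: S accumulates N*C, Cnew accumulates C.
    s = sum(n * c for n, c in records)
    c_new = sum(c for _, c in records)
    return (age, (s / c_new, c_new))

# Two tiny in-memory "batches" stand in for the 1100 batches of 1 million records.
batches = {
    1: [{"age": 10, "contacts": 9}, {"age": 10, "contacts": 9}],
    2: [{"age": 10, "contacts": 10}, {"age": 12, "contacts": 7}],
}
grouped = defaultdict(list)
for k1, batch in batches.items():
    for age, record in map_batch(k1, batch):
        grouped[age].append(record)
for age in sorted(grouped):
    print(reduce_age(age, grouped[age]))
# (10, (9.333333333333334, 3))
# (12, (7.0, 1))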

The count in each output record matters whenever the reduction is applied more than once. If the count were omitted, the computed averages would be wrong. For example, consider the following map outputs:

-- map output #1: age, quantity of contacts
10, 9
10, 9
10, 9

 

-- map output #2: age, quantity of contacts
10, 9
10, 9

 

-- map output #3: age, quantity of contacts
10, 10

If map outputs #1 and #2 are reduced first, the result is an average of 9 contacts for a 10-year-old person ((9+9+9+9+9)/5 = 9):

-- reduce step #1: age, average of contacts
10, 9

 

If this intermediate result were then reduced together with map output #3 without the counts, the average would come out as (9 + 10) / 2 = 9.5, which is wrong. Carrying the count C through each pass gives the correct answer:

(9*5 + 10*1) / (5 + 1) = (9*3 + 9*2 + 10*1) / (3 + 2 + 1) = 55/6 ≈ 9.17
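
The same pitfall is easy to reproduce in a few lines of Python; the data below mirrors the map outputs above, and reduce_pairs is an illustrative name:

# Map outputs for age 10, as (contacts, count) pairs with count = 1 each.
out1 = [(9, 1), (9, 1), (9, 1)]
out2 = [(9, 1), (9, 1)]
out3 = [(10, 1)]

def reduce_pairs(records):
    # Correct: carry (weighted sum of contacts, count) through each pass.
    s = sum(n * c for n, c in records)
    c = sum(c for _, c in records)
    return (s / c, c)

# Reducing in two stages still gives the right answer, because the
# count is carried along with the partial average.
avg12, c12 = reduce_pairs(out1 + out2)            # (9.0, 5)
avg_all, c_all = reduce_pairs([(avg12, c12)] + out3)
print(avg_all)                                    # 9.1666... == 55/6

# Dropping the count and averaging the averages gives the wrong answer.
print((avg12 + 10) / 2)                           # 9.5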

 


 

 

Dataflow

The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are the following (a minimal sketch of how they fit together appears after the list):

  • an input reader
  • a Map function
  • a partition function
  • a compare function
  • a Reduce function
  • an output writer
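
To make this division of labor concrete, here is a hedged sketch of such an interface in Python; the function names and the driver are illustrative assumptions, not the API of any real MapReduce implementation:

from collections import defaultdict

def run_mapreduce(input_reader, map_fn, partition_fn, compare_key,
                  reduce_fn, output_writer, num_reducers=2):
    # Wire the six application-defined hot spots around the framework's
    # fixed group-and-sort core.
    shards = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in input_reader():                    # input reader
        for out_key, out_value in map_fn(key, value):    # Map function
            shard = partition_fn(out_key, num_reducers)  # partition function
            shards[shard][out_key].append(out_value)
    results = []
    for shard in shards:
        for key in sorted(shard, key=compare_key):       # compare (sort) step
            results.append(reduce_fn(key, shard[key]))   # Reduce function
    output_writer(results)                               # output writer

# Example wiring: word count over two in-memory "files".
run_mapreduce(
    input_reader=lambda: [("d1", "a b a"), ("d2", "b c")],
    map_fn=lambda name, text: ((w, 1) for w in text.split()),
    partition_fn=lambda key, n: hash(key) % n,
    compare_key=lambda key: key,
    reduce_fn=lambda key, values: (key, sum(values)),
    output_writer=print,
)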

 

 

 

 


^ "Our inspiration comes from the ancient map and reduce operations found in Lisp and other functional programming languages." - "MapReduce: Simplified Data Processing on Large Clusters"

MapReduce is a software framework proposed by Google for the parallel processing of large-scale data sets (greater than 1 TB). The concepts "Map" and "Reduce", and their main ideas, are borrowed from functional programming languages, together with features borrowed from vector programming languages.[1]

Current software implementations work by specifying a Map function, which maps a set of key/value pairs into a new set of key/value pairs, and a concurrent Reduce function, which guarantees that all of the mapped key/value pairs sharing the same key are grouped together.

 

Map and Reduce

Simply put, a Map operation applies a specified operation to every element of a conceptual list of independent elements (for example, a list of test scores). Suppose someone finds that every student's score was overstated by one point; they can define a "subtract one" Map function to correct the error. In fact, each element is operated on independently, and the original list is never modified, because a new list is created to hold the new answers. This means that a Map operation is highly parallelizable, which is very useful for high-performance applications and for the field of parallel computing.

而概括操做指的是對一個列表的元素進行適當的合併(繼續看前面的例子,若是有人想知道班級的平均分該怎麼作?他能夠定義一個概括函數,經過讓列表中的奇數(odd)或偶數(even)元素跟本身的相鄰的元素相加的方式把列表減半,如此遞歸運算直到列表只剩下一個元素,而後用這個元素除以人數,就獲得了平均分)。雖然他不如映射函數那麼並行,可是由於概括老是有一個簡單的答案,大規模的運算相對獨立,因此概括函數在高度並行環境下也頗有用。

 

Distribution and reliability

MapReduce achieves reliability by parceling out the large-scale operations on the data set to every node in the network; each node periodically reports back the work it has completed and its status updates. If a node stays silent longer than a preset interval, the master node (analogous to the master server in the Google File System) records that node as dead and sends the data assigned to that node to other nodes. Each operation uses atomic operations on named files to ensure that no conflicts arise between parallel threads; when files are renamed, the system may also copy them to a name other than the task name (to avoid side effects).

概括操做工做方式很相似,可是因爲概括操做在並行能力較差,主節點會盡可能把概括操做調度在一個節點上,或者離須要操做的數據儘量近的節點上了;這個特性能夠知足Google的需求,由於他們有足夠的帶寬,他們的內部網絡沒有那麼多的機器。

 

 

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.[1][2]

A MapReduce program is composed of a Map() procedure (method) that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.

The model is a specialization of the split-apply-combine strategy for data analysis.[3] It is inspired by the map and reduce functions commonly used in functional programming,[4] although their purpose in the MapReduce framework is not the same as in their original forms.[5] The key contributions of the MapReduce framework are not the actual map and reduce functions (which, for example, resemble the 1995 Message Passing Interface standard's[6] reduce[7] and scatter[8] operations), but the scalability and fault-tolerance achieved for a variety of applications by optimizing the execution engine. As such, a single-threaded implementation of MapReduce will usually not be faster than a traditional (non-MapReduce) implementation; any gains are usually only seen with multi-threaded implementations.[9] The use of this model is beneficial only when the optimized distributed shuffle operation (which reduces network communication cost) and fault tolerance features of the MapReduce framework come into play. Optimizing the communication cost is essential to a good MapReduce algorithm.[10]

MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation that has support for distributed shuffles is part of Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology, but has since been genericized. By 2014, Google was no longer using MapReduce as their primary Big Data processing model,[11] and development on Apache Mahout had moved on to more capable and less disk-oriented mechanisms that incorporated full map and reduce capabilities.[12]
