I'll try to give a very crude overview of how the pieces fit together, because the details span multiple books. Please forgive me for some oversimplifications.
- MapReduce is the Google paper that started it all. It's a paradigm for writing distributed code, inspired by some elements of functional programming. You don't have to do things this way, but it neatly fits a lot of problems we try to solve in a distributed way. The Google-internal implementation is called MapReduce, and Hadoop is its open-source implementation. Amazon's hosted Hadoop offering is called Elastic MapReduce (EMR) and has plugins for multiple languages. (A minimal word-count sketch in this style appears after this list.)
- HDFS is an implementation inspired by the Google File System (GFS) for storing files across a bunch of machines when they're too big for one. Hadoop consumes data stored in HDFS (the Hadoop Distributed File System).
- Apache Spark is an emerging platform that has more flexibility than MapReduce but more structure than a basic message passing interface. It relies on the concept of distributed data structures (which it calls RDDs, Resilient Distributed Datasets) and operators on them. See the Apache Software Foundation's Spark page for more. (An RDD sketch appears after this list.)
- Because Spark itself is a fairly low-level thing that sits on top of a message passing interface, it has higher-level libraries to make it more accessible to data scientists. The machine learning library built on top of it is called MLlib, and there's a distributed graph library called GraphX.
- Pregel and its open-source twin Giraph are a way to do graph algorithms on billions of nodes and trillions of edges over a cluster of machines. Notably, the MapReduce model is not well suited to graph processing, so Hadoop/MapReduce is avoided in this model, but HDFS/GFS is still used as a data store. (A toy vertex-centric sketch appears after this list.)
- ZooKeeper is a coordination and synchronization service with which a distributed set of computers can make decisions by consensus, handle failures, etc. (A small client sketch appears after this list.)
- Flume and Scribe are logging services: Flume is an Apache project and Scribe is an open-source Facebook project. Both aim to make it easy to collect tons of logged data, analyze it, tail it, move it around, and store it in a distributed store.
- Google BigTable and its open-source twin HBase were meant to be read-write distributed databases, originally built for the Google crawler, that sit on top of GFS/HDFS and MapReduce/Hadoop. Google Research Publication: BigTable. (A put/get sketch appears after this list.)
- Hive and Pig are abstractions on top of Hadoop designed to help with analysis of tabular data stored in a distributed file system (think of Excel sheets too big to store on one machine). They operate on top of a data warehouse, so the high-level idea is to dump data once and analyze it by reading and processing it, rather than updating individual cells, rows, and columns. Hive has a language similar to SQL, while Pig is inspired by Google's Sawzall - Google Research Publication: Sawzall. You generally don't update a single cell in a table when processing it with Hive or Pig. (A Hive query sketch appears after this list.)
- Hive and Pig turned out to be slow because they were built on Hadoop, which optimizes for the volume of data moved around, not for latency. To get around this, engineers bypassed MapReduce and went straight to HDFS. They also threw in some memory and caching, and this resulted in Google's Dremel (Dremel: Interactive Analysis of Web-Scale Datasets), F1 (F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business), Facebook's Presto (Presto | Distributed SQL Query Engine for Big Data), Apache Spark SQL, Cloudera Impala (Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real), Amazon's Redshift, etc. They all have slightly different semantics, but are essentially programmer- or analyst-friendly abstractions for analyzing tabular data stored in distributed data warehouses. (A Spark SQL sketch appears after this list.)
- Mahout (Scalable machine learning and data mining) is a collection of machine learning libraries written in the MapReduce paradigm, specifically for Hadoop. Google has its own internal version, but they haven't published a paper on it as far as I know.
- Oozie is a workflow scheduler. The oversimplified description would be that it's something that puts together a pipeline of the tools described above. For example, you can write an Oozie script that scrapes your production HBase data into a Hive warehouse nightly, then has a Mahout script train on that data. At the same time, you might use Pig to pull the test set into another file, and when Mahout is done creating a model, you can pass the testing data through the model and get results. You specify the dependency graph of these tasks through Oozie (I may be messing up terminology, since I've never used Oozie but have used the Facebook equivalent). (A toy dependency-graph sketch appears after this list.)
- Lucene is a bunch of search-related and NLP tools, but its core feature is a search index and retrieval system. It takes data from a store like HBase and indexes it for fast retrieval from a search query. Solr uses Lucene under the hood to provide a convenient REST API for indexing and searching data. ElasticSearch is similar to Solr. (A Solr REST sketch appears after this list.)
- Sqoop is a command-line interface for moving SQL data into a distributed warehouse. It's what you might use to snapshot and copy your database tables to a Hive warehouse every night. (A sketch appears after this list.)
- Hue is a web-based GUI to a subset of the above tools - http://gethue.com/
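To make some of these pieces concrete, here are a few minimal sketches. First, the map/reduce paradigm itself: a single-machine version of the classic word-count job. The in-memory "shuffle" step stands in for what Hadoop actually does across a network of machines, and the documents are made up for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word, like a Hadoop mapper."""
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    """Sum all counts for one key, like a Hadoop reducer."""
    return word, sum(counts)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group every emitted value by key (Hadoop does this over the network).
groups = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        groups[word].append(count)

# Reduce each group independently -- this independence is what makes the model parallelizable.
word_counts = dict(reduce_phase(w, c) for w, c in groups.items())
print(word_counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```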
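Next, the same word count expressed as Spark RDD operators: a minimal sketch assuming a local PySpark installation, with made-up input lines.

```python
from pyspark import SparkContext

sc = SparkContext("local", "wordcount-sketch")

# An RDD is a distributed collection; transformations like flatMap/map/reduceByKey
# are lazy and only execute when an action (collect, count, ...) is called.
lines = sc.parallelize(["the quick brown fox", "the lazy dog"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())  # [('the', 2), ('quick', 1), ...] in some order
sc.stop()
```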
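For Pregel/Giraph, here is a toy "think like a vertex" loop computing connected components by label propagation. The graph is made up, and a real system would run each superstep across a cluster, passing the labels as messages over the network.

```python
# Toy vertex-centric loop in the spirit of Pregel/Giraph (single machine only).
edges = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
label = {v: v for v in edges}             # each vertex starts labeled with its own id

changed = True
while changed:                            # one iteration == one Pregel "superstep"
    changed = False
    for v, neighbors in edges.items():
        # A vertex reads its neighbors' messages (their labels) and keeps the minimum.
        best = min([label[v]] + [label[n] for n in neighbors])
        if best != label[v]:
            label[v] = best
            changed = True

print(label)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4} -- two connected components
```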
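For ZooKeeper, a small coordination sketch using the third-party kazoo client. It assumes a ZooKeeper server on 127.0.0.1:2181, and the znode path is hypothetical; ephemeral znodes are the usual building block for liveness and group membership.

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (the host/port are assumptions for this sketch).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Ephemeral znodes vanish when the session dies, which is how workers announce liveness.
zk.create("/workers/worker-1", b"alive", ephemeral=True, makepath=True)

# Any client can list the group and watch for membership changes (failure handling).
print(zk.get_children("/workers"))  # ['worker-1']

zk.stop()
```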
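For HBase, a put/get sketch using the third-party happybase client. It assumes a local HBase with its Thrift gateway running; the table name, row key, and column are hypothetical.

```python
import happybase

# The Thrift gateway address is an assumption; HBase must be running with Thrift enabled.
connection = happybase.Connection("localhost")
table = connection.table("crawl")

# BigTable/HBase rows are keyed byte strings mapping column-family:qualifier to values.
table.put(b"com.example/index.html", {b"contents:html": b"<html>...</html>"})
row = table.row(b"com.example/index.html")
print(row[b"contents:html"])
```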
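For Hive, a query sketch via the third-party PyHive client, assuming a HiveServer2 on localhost:10000; the page_views table and its columns are hypothetical. Note the scan-and-aggregate shape: no individual cells are updated.

```python
from pyhive import hive

# HiveServer2 host/port and the table are assumptions for this sketch.
cursor = hive.connect(host="localhost", port=10000).cursor()

# A typical Hive query: read and aggregate an append-only table in bulk.
cursor.execute("""
    SELECT country, COUNT(*) AS visits
    FROM page_views
    WHERE view_date = '2015-01-01'
    GROUP BY country
""")
for country, visits in cursor.fetchall():
    print(country, visits)
```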
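For the SQL-on-distributed-data family, a Spark SQL sketch assuming a local PySpark installation (the SparkSession entry point is from Spark 2.x, newer than the era this answer describes); the data is made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

# Register an in-memory DataFrame as a temporary table and query it with SQL.
df = spark.createDataFrame([("alice", 3), ("bob", 5)], ["user", "clicks"])
df.createOrReplaceTempView("clicks")
spark.sql("SELECT user FROM clicks WHERE clicks > 4").show()  # prints the row for bob
spark.stop()
```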
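Real Oozie workflows are XML documents, so purely to illustrate the dependency-graph idea, here is a toy topological-order runner in Python; the task names mirror the HBase → Hive → Mahout example above and are hypothetical.

```python
# Toy dependency-graph runner illustrating what a workflow scheduler does;
# each task maps to the list of tasks that must finish before it can start.
tasks = {
    "dump_hbase_to_hive": [],
    "pull_test_set_with_pig": [],
    "train_mahout_model": ["dump_hbase_to_hive"],
    "score_test_set": ["train_mahout_model", "pull_test_set_with_pig"],
}

done = set()
while len(done) < len(tasks):
    for task, deps in tasks.items():
        if task not in done and all(d in done for d in deps):
            print("running", task)  # a real scheduler would launch a cluster job here
            done.add(task)
```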
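For Solr, an indexing-and-search sketch against its REST API over plain HTTP. It assumes a local Solr with a core named "articles"; the core name, document fields, and query are all assumptions, and field handling depends on your schema.

```python
import requests

# Assumes a local Solr instance with a core named "articles".
solr = "http://localhost:8983/solr/articles"

# Index a document over the REST API (Solr uses Lucene under the hood).
requests.post(f"{solr}/update?commit=true",
              json=[{"id": "1", "title": "Intro to Hadoop", "body": "HDFS and MapReduce"}])

# Full-text search, also over REST.
hits = requests.get(f"{solr}/select", params={"q": "body:hadoop", "wt": "json"}).json()
print(hits["response"]["numFound"])
```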
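Finally, a nightly-snapshot sketch invoking Sqoop from Python; the JDBC connection string and table name are hypothetical, and Sqoop must be installed and on the PATH.

```python
import subprocess

# Snapshot one table from a relational database into the Hive warehouse.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",  # hypothetical JDBC URL
    "--table", "orders",                       # hypothetical table
    "--hive-import",                           # land the imported data directly in Hive
], check=True)
```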