Meet Hadoop
1.1 Data!
Most of the data is locked up in the largest web properties (like search engines), or scientific or financial institutions, isn’t it? Does the advent of "Big Data," as it is being called, affect smaller organizations or individuals?
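Ordinary people have not yet benefited from this vast sea of data. The data sits on the network or is held by large research institutions, which is why big data mining emerged.
From my point of view, as the volume of data keeps growing, reading and filtering it consumes a great deal of time.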
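1.2 Data Storage and Analysis
Although the read speed of hard disks and other storage media keeps improving, it has not kept pace with the growth in data volume, so retrieving and filtering data still takes a long time.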
It takes a long time to read all the data on a single drive, and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.
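The arithmetic behind the "under two minutes" figure can be sketched as follows. The drive size (1 TB) and transfer rate (100 MB/s) are assumed figures chosen for illustration; these notes do not state them.

```python
# Assumed figures, for illustration only: a 1 TB drive read at 100 MB/s.
DATA_MB = 1_000_000        # 1 TB expressed in MB (decimal units)
TRANSFER_MB_PER_S = 100    # assumed sustained read rate of one drive


def read_time_seconds(data_mb, rate_mb_per_s, drives=1):
    """Time to read data_mb when it is split evenly across `drives`
    drives that are all read in parallel at the same rate."""
    return data_mb / drives / rate_mb_per_s


single = read_time_seconds(DATA_MB, TRANSFER_MB_PER_S)
parallel = read_time_seconds(DATA_MB, TRANSFER_MB_PER_S, drives=100)

print(f"1 drive:    {single / 3600:.2f} hours")
print(f"100 drives: {parallel:.0f} seconds")
```

Under these assumptions a single drive needs roughly 2.8 hours, while 100 drives read in parallel finish in 100 seconds, which is consistent with the claim above.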
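Reading data from a single drive is slow, and the most obvious remedy is to read from multiple drives at once. However, raising the read rate this way also lowers the utilization of each piece of hardware.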
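Reading from multiple drives in parallel also carries risks:
1. Reads can fail because of hardware failure. Redundant copies of the data are kept by the system so that in the event of failure, there is another copy available.
2. Combining the data read from different drives is also a major challenge.
This is what leads to MapReduce.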
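To make the second challenge concrete, here is a minimal sketch of the map, shuffle, and reduce steps as a word count. It is plain in-process Python, not the Hadoop API, and it has no parallelism or fault tolerance. The `DRIVES` data and all function names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical input: each inner list stands for the lines stored on one drive.
DRIVES = [
    ["big data", "data storage"],
    ["data analysis", "big clusters"],
]


def map_line(line):
    """Map step: emit a (word, 1) pair for every word in one line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key; this is where results that came from
    different drives are brought together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: combine all values for one key into a total."""
    return key, sum(values)


def word_count(drives):
    """Run map over every line on every drive, then shuffle and reduce."""
    pairs = [pair for chunk in drives for line in chunk for pair in map_line(line)]
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


counts = word_count(DRIVES)
print(counts)
```

Because each line is mapped independently, the map step could in principle run where the data lives, and only the grouped pairs need to be combined across drives.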
1.3 Comparison with Other Systems
MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative.
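RDBMS (relational database management system)
Grid Computing
Grid computing is a form of distributed computing, a computing approach proposed in recent years. Distributed computing means that two or more pieces of software share information with one another; they can run on the same computer or on several computers connected by a network.
Volunteer Computing
Volunteer computing is a computing approach in which members of the public around the world volunteer their idle PC time over the Internet to take part in scientific computation or data analysis. It offers an effective way to tackle basic science problems that involve large-scale computation and demand a lot of computing resources.
For scientists, volunteer computing means nearly free and virtually unlimited computing resources. For volunteers, it is a chance to learn about and take part in science, which promotes public understanding of science.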
1.4 A Brief History of Hadoop
Apache Lucene
1.5 Apache Hadoop and the Hadoop Ecosystem
Common: A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
Avro: A serialization system for efficient, cross-language RPC, and persistent data storage.
MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
HDFS: A distributed filesystem that runs on large clusters of commodity machines.
Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates to MapReduce jobs) for querying the data.
HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
Sqoop: A tool for efficiently moving data between relational databases and HDFS.
1.6 Hadoop Releases