As we now know, many prominent internet companies, most notably Google, Amazon, Yahoo!, and Facebook, were at the forefront of this explosion of data. Some generated their own data, and others collected what was freely available; but managing these vastly different kinds of datasets became core to doing business. They all started by building on the technology available at the time, but the limitations of this technology became limitations on the continued growth and success of these businesses. Although data management technology wasn’t core to the businesses, it became essential for doing business. The ensuing internal investment in technical research resulted in many new experiments in data technology.
Although many companies kept their research closely guarded, Google chose to talk about its successes. The publications that shook things up were the Google File System and MapReduce papers. Taken together, these papers represented a novel approach to the storage and processing of data. Shortly thereafter, Google published the Bigtable paper, which provided a complement to the storage paradigm provided by its file system. Other companies built on this momentum, adopting both the ideas and the habit of publishing their successful experiments. As Google’s publications provided insight into indexing the internet, Amazon published Dynamo, demystifying a fundamental component of the company’s shopping cart.
It didn’t take long for all these new ideas to begin condensing into open source implementations. In the years following, the data management space has come to host all manner of projects. Some focus on fast key-value stores, whereas others provide native data structures or document-based abstractions. Equally diverse are the intended access patterns and data volumes these technologies support. Some forgo writing data to disk, sacrificing immediate persistence for performance. Most of these technologies don’t hold ACID guarantees as sacred. Although proprietary products do exist, the vast majority of the technologies are open source projects. As a collection, these technologies have come to be known as NoSQL.
Where does HBase fit in? HBase does qualify as a NoSQL store. It provides a key-value API, although with a twist not common in other key-value stores. It promises strong consistency, so clients can see data immediately after it’s written. HBase runs on multiple nodes in a cluster instead of on a single machine. It doesn’t expose this detail to its clients. Your application code doesn’t know if it’s talking to 1 node or 100, which makes things simpler for everyone. HBase is designed for terabytes to petabytes of data, so it optimizes for this use case. It’s a part of the Hadoop ecosystem and depends on some key features, such as data redundancy and batch processing, to be provided by other parts of Hadoop.
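The idea that application code looks the same whether the store has 1 node or 100 can be illustrated with a toy model. The sketch below is not HBase and does not use its client API: `ToyCluster`, `record_login`, and the hash-based routing are all hypothetical, chosen only to show a client-facing key-value interface that hides the node count and returns a value immediately after it is written.

```python
import hashlib


class ToyCluster:
    """Hypothetical in-memory stand-in for a multi-node key-value store."""

    def __init__(self, node_count):
        # Each "node" is just a dict here; a real cluster uses separate servers.
        self._nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        # Assumption for illustration: route by a stable hash of the key,
        # so the same key always lands on the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._nodes)
        return self._nodes[index]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key):
        return self._node_for(key).get(key)


def record_login(store, user, timestamp):
    # Application code: it never asks how many nodes sit behind the store.
    key = "login:" + user
    store.put(key, timestamp)
    # Read back right after the write; the toy store returns the new value.
    return store.get(key)
```

Calling `record_login(ToyCluster(1), ...)` and `record_login(ToyCluster(100), ...)` runs exactly the same application code, which is the property the paragraph describes; how a real system routes requests and keeps replicas consistent is outside the scope of this sketch.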
Now that you have some context for the environment at large, let’s consider specifically the beginnings of HBase.
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System,” Google Research Publications, http://research.google.com/archive/gfs.html.
Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Google Research Publications, http://research.google.com/archive/mapreduce.html.
Fay Chang et al., “Bigtable: A Distributed Storage System for Structured Data,” Google Research Publications, http://research.google.com/archive/bigtable.html.
Werner Vogels, “Amazon’s Dynamo,” All Things Distributed, www.allthingsdistributed.com/2007/10/amazons_dynamo.html.