Data often trickles in and is added to an existing data store for further usage, such as analytics, processing, and serving. Many HBase use cases fall into this category: using HBase as the data store that captures incremental data coming in from various data sources. These data sources can be, for example, web crawls (the canonical Bigtable use case we talked about), advertisement impression data containing information about which user saw what advertisement and for how long, or time-series data generated from recording metrics of various kinds. Let's talk about a few successful use cases and the companies behind these projects.
CAPTURING METRICS: OPENTSDB
Web-based products serving millions of users typically have hundreds or thousands of servers in their back-end infrastructure. These servers are spread across various functions: serving traffic, capturing logs, storing data, processing data, and so on. To keep the products up and running, it's critical to monitor the health of the servers as well as the software running on them (from the OS right up to the application the user is interacting with). Monitoring the entire stack at scale requires systems that can collect and store metrics of all kinds from these different sources. Every company has its own way of achieving this. Some use proprietary tools to collect and visualize metrics; others use open source frameworks.
StumbleUpon built an open source framework that allows the company to collect metrics of all kinds into a single system. Metrics collected over time can be thought of as time-series data: that is, data collected and recorded over time. The framework that StumbleUpon built is called OpenTSDB, which stands for Open Time Series Database. It uses HBase at its core to store and access the collected metrics. The intention behind the framework was to have an extensible metrics collection system that could store metrics and make them available for access over a long period of time, and that could accommodate all sorts of new metrics as more features are added to the product. StumbleUpon uses OpenTSDB to monitor all of its infrastructure and software, including its HBase clusters. We cover OpenTSDB in detail in chapter 7 as a sample application built on top of HBase.
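To make the time-series pattern concrete, here is a minimal sketch of how a single metric data point could be written to HBase with the Java client. The table name tsdb_demo, the column family t, and the row-key layout are illustrative assumptions for this example; OpenTSDB's real schema packs metric and tag identifiers into compact binary UIDs, but the idea of grouping a metric's data points into wide, time-bucketed rows is the same.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MetricWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table metrics = conn.getTable(TableName.valueOf("tsdb_demo"))) {
            long now = System.currentTimeMillis() / 1000;  // seconds since the epoch
            long baseHour = now - (now % 3600);            // each row holds one hour of data
            // Row key: metric name plus the hour the row covers; each data point
            // becomes a column whose qualifier is the offset within that hour.
            byte[] rowKey = Bytes.add(Bytes.toBytes("sys.cpu.user#"), Bytes.toBytes(baseHour));
            Put p = new Put(rowKey);
            p.addColumn(Bytes.toBytes("t"),                      // column family
                        Bytes.toBytes((int) (now - baseHour)),   // offset qualifier
                        Bytes.toBytes(42.5d));                   // the recorded value
            metrics.put(p);
        }
    }
}
```

Bucketing a metric's points into hourly rows keeps recent data for one metric physically adjacent, so reading a time range becomes a short sequential scan rather than many scattered point reads.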
CAPTURING USER-INTERACTION DATA: FACEBOOK AND STUMBLEUPON
Metrics captured for monitoring are one category. There are also metrics about user interaction with a product. How do you keep track of the site activity of millions of people? How do you know which site features are most popular? How do you use one page view to directly influence the next? For example, who saw what, and how many times was a particular button clicked? Remember the Like button in Facebook and the Stumble and +1 buttons in StumbleUpon? Does this smell like a counting problem? They increment a counter every time a user likes a particular topic.
StumbleUpon had its start with MySQL, but as the service became more popular, that technology choice failed it. The online load from a growing user base was too much for the MySQL clusters, and ultimately StumbleUpon chose HBase to replace them. At the time, HBase didn't directly support the necessary features. StumbleUpon implemented atomic increment in HBase and contributed it back to the project.
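As a concrete reference for the counting pattern, here is a minimal sketch of an atomic counter bump using the HBase Java client's increment support. The table name page_counters, the column family stats, and the row key are hypothetical; only the incrementColumnValue call reflects the actual API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LikeCounter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table counters = conn.getTable(TableName.valueOf("page_counters"))) {
            // Atomically add 1 to the "likes" counter for this page. The region
            // server performs the read-modify-write, so concurrent clients don't
            // clobber each other's updates.
            long newCount = counters.incrementColumnValue(
                    Bytes.toBytes("page#12345"),   // row key: one row per page (hypothetical)
                    Bytes.toBytes("stats"),        // column family
                    Bytes.toBytes("likes"),        // qualifier holding the counter
                    1L);                           // amount to add
            System.out.println("likes = " + newCount);
        }
    }
}
```

Because the increment happens server-side in a single atomic step, thousands of concurrent like or +1 events can hit the same counter without any client-side locking.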
Facebook uses the counters in HBase to count the number of times people like a particular page. Content creators and page owners can get near real-time metrics about how many users like their pages. This allows them to make more informed decisions about what content to generate. Facebook built a system called Facebook Insights, which needs to be backed by a scalable storage system. The company looked at various options, including RDBMS, in-memory counters, and Cassandra, before settling on HBase. This way, Facebook can scale horizontally and provide the service to millions of users, as well as use its existing experience in running large-scale HBase clusters. The system handles tens of billions of events per day and records hundreds of metrics.
TELEMETRY: MOZILLA AND TREND MICRO
Operational and software-quality data includes more than just metrics. Crash reports are an example of useful software-operational data that can be used to gain insights into the quality of the software and plan the development roadmap. This isn’t necessarily related to web servers serving applications. HBase has been successfully used to capture and store crash reports that are generated from software crashes on users’ computers.
The Mozilla Foundation is responsible for the Firefox web browser and Thunderbird email client. These tools are installed on millions of computers worldwide and run on a wide variety of OSs. When one of these tools crashes, it may send a crash report back to Mozilla in the form of a bug report. How does Mozilla collect these reports? What use are they once collected? The reports are collected via a system called Socorro and are used to direct development efforts toward more stable products. Socorro's data storage and analytics are built on HBase.
The introduction of HBase enabled basic analysis over far more data than was previously possible. This analysis was used to direct Mozilla’s developer focus to great effect, resulting in the most bug-free release ever.
Trend Micro provides internet security and threat-management services to corporate clients. A key aspect of security is awareness, and log collection and analysis are critical for providing that awareness in computer systems. Trend Micro uses HBase to manage its web reputation database, which requires both row-level updates and support for batch processing with MapReduce. Much like Mozilla's Socorro, HBase is also used to collect and analyze log activity, collecting billions of records every day. The flexible schema in HBase allows data to easily evolve over time, and Trend Micro can add new attributes as analysis processes are refined.
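The flexible-schema point is easy to see in code: HBase fixes only the column families of a table up front, and individual columns are created implicitly the first time they're written, so a new attribute requires no migration. The sketch below assumes a hypothetical web_reputation table with one row per domain and a rep column family.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReputationUpdate {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table reputation = conn.getTable(TableName.valueOf("web_reputation"))) {
            Put p = new Put(Bytes.toBytes("example.com"));  // row key: the domain being scored
            // An attribute the analysis pipeline has always produced.
            p.addColumn(Bytes.toBytes("rep"), Bytes.toBytes("score"), Bytes.toBytes(87L));
            // A new attribute added as the analysis is refined; because columns
            // are created on write, no ALTER TABLE or data migration is needed.
            p.addColumn(Bytes.toBytes("rep"), Bytes.toBytes("category"), Bytes.toBytes("benign"));
            reputation.put(p);  // a single row-level update carries both attributes
        }
    }
}
```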
ADVERTISEMENT IMPRESSIONS AND CLICK STREAM
Over the last decade or so, online advertisements have become a major source of revenue for web-based products. The model has been to provide free services to users but have ads linked to them that are targeted to the user using the service at the time. This kind of targeting requires detailed capturing and analysis of user-interaction data to understand the user’s profile. The ad to be displayed is then selected based on that profile. Fine-grained user-interaction data can lead to building better models, which in turn leads to better ad targeting and hence more revenue. But this kind of data has two properties: it comes in the form of a continuous stream, and it can be easily partitioned based on the user. In an ideal world, this data should be available to use as soon as it’s generated, so the user-profile models can be improved continuously without delay—that is, in an online fashion.
Online vs. offline systems
The terms online and offline have come up a couple times. For the uninitiated, these terms describe the conditions under which a software system is expected to perform. Online systems have low-latency requirements. In some cases, it’s better for these systems to respond with no answer than to take too long producing the correct answer. You can think of a system as online if there’s a user at the other end impatiently tapping their foot. Offline systems don’t have this low-latency requirement. There’s a user waiting for an answer, but that response isn’t expected immediately.
The intent to be an online or an offline system influences many technology decisions when implementing an application. HBase is an online system. Its tight integration with Hadoop MapReduce makes it equally capable of offline access as well.
These factors make collecting user-interaction data a perfect fit for HBase, and HBase has been successfully used to capture raw clickstream and user-interaction data incrementally and then process it (clean it, enrich it, use it) using different processing mechanisms (MapReduce being one of them). If you look for companies that do this, you’ll find plenty of examples.
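To sketch the batch-processing side, the example below runs a Hadoop MapReduce job over a hypothetical clickstream table and counts page views per page. The table name, the click column family, and the page qualifier are assumptions made for the example; TableMapReduceUtil is the standard glue that feeds an HBase scan into a MapReduce job.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class ClickstreamPageCounts {
    // The mapper receives one HBase row (one click event) per call.
    public static class ClickMapper extends TableMapper<Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(ImmutableBytesWritable row, Result result, Context ctx)
                throws IOException, InterruptedException {
            byte[] page = result.getValue(Bytes.toBytes("click"), Bytes.toBytes("page"));
            if (page != null) {
                ctx.write(new Text(Bytes.toString(page)), ONE);  // emit (page, 1)
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "clickstream-page-counts");
        job.setJarByClass(ClickstreamPageCounts.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // fetch rows from the scanner in larger batches
        scan.setCacheBlocks(false);  // don't churn the region servers' block cache

        TableMapReduceUtil.initTableMapperJob(
                "clickstream", scan, ClickMapper.class, Text.class, IntWritable.class, job);
        job.setReducerClass(IntSumReducer.class);  // sums the per-page 1s
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same table can keep serving low-latency reads and writes while a job like this scans it, which is the online/offline combination described in the sidebar above.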