As a pioneer of the Apache Hadoop 2.0 community, Hortonworks has built its own Hadoop ecosystem: HDFS for data storage; YARN as the resource management framework; MapReduce and Tez as computation models; Pig, Hive, and HCatalog serving the data platform; the HBase database; Flume and Sqoop for importing and exporting data held in HDFS; Ambari for cluster monitoring; Falcon for data lifecycle management; and Oozie for job scheduling. This article briefly introduces each of these systems. Most of them are open-sourced through Apache, so readers can download and try them out.
The Hortonworks Hadoop ecosystem architecture is shown in Figure 1.
Figure 1: Hortonworks Hadoop ecosystem architecture
HDFS, YARN, MapReduce, and Tez are not covered further here.
1. HDP (Hortonworks Data Platform)
Hortonworks Data Platform, abbreviated HDP, is Hortonworks' distribution bundling the Hadoop ecosystem components introduced below.
2. Apache™ Accumulo
Apache™ Accumulo is a high-performance data storage and retrieval system with cell-level access control. It is a scalable implementation of Google's Big Table design that works on top of Apache Hadoop® and Apache ZooKeeper.
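To make the cell-level access control concrete, here is a minimal sketch using the classic Accumulo 1.x client API; the instance name, ZooKeeper quorum, credentials, and table name are all placeholders.

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.io.Text;

public class AccumuloCellLevelDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder instance name and ZooKeeper quorum.
        Connector conn = new ZooKeeperInstance("hdp", "zk1:2181")
                .getConnector("user", new PasswordToken("secret"));

        BatchWriter writer = conn.createBatchWriter("records", new BatchWriterConfig());
        Mutation m = new Mutation(new Text("row1"));
        // The ColumnVisibility argument is the cell-level control: only scanners
        // whose authorizations satisfy "admin|audit" will see this cell.
        m.put(new Text("attr"), new Text("name"),
              new ColumnVisibility("admin|audit"),
              new Value("alice".getBytes()));
        writer.addMutation(m);
        writer.close();
    }
}
```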
3. Apache™ Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple, flexible architecture based on streaming data flows, and it is robust and fault-tolerant, with tunable reliability mechanisms for failover and recovery.
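Flume agents themselves are configured declaratively (sources, channels, sinks), but applications can also push events into an agent's Avro source through the Flume client SDK. A hedged sketch; the host, port, and event body are placeholders.

```java
import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a Flume agent whose Avro source listens on flume-host:41414 (placeholder).
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            Event event = EventBuilder.withBody("one log line", StandardCharsets.UTF_8);
            client.append(event);  // the agent's channel and sink carry this toward HDFS
        } finally {
            client.close();
        }
    }
}
```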
4. Apache™ HBase
Apache™ HBase is a non-relational (NoSQL) database that runs on top of the Hadoop® Distributed File System (HDFS). It is columnar and provides fault-tolerant storage and quick access to large quantities of sparse data. It also adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes.
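The update/insert/read/delete path described above maps onto the HBase Java client roughly as follows; a sketch using the Connection/Table API, with a made-up table, column family, and qualifier.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Insert/update: a Put against a row key and a column.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Quick random read by row key.
            Result r = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

            // Delete removes the row (or a single column, if narrowed).
            table.delete(new Delete(Bytes.toBytes("row1")));
        }
    }
}
```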
5. Apache™ HCatalog
Apache™ HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools (Apache Pig, Apache MapReduce, and Apache Hive) to more easily read and write data on the grid. HCatalog's table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored. HCatalog displays data from RCFile format, text files, or sequence files in a tabular view. It also provides REST APIs so that external systems can access these tables' metadata.
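The REST layer mentioned above is served by WebHCat, which conventionally listens on port 50111; assuming that default, table metadata can be fetched with a plain HTTP GET. A hedged sketch; the host, database, and user are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHCatDemo {
    public static void main(String[] args) throws Exception {
        // WebHCat (HCatalog's REST server) endpoint listing tables in the "default" database.
        URL url = new URL("http://webhcat-host:50111/templeton/v1/ddl/database/default/table"
                          + "?user.name=guest");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // JSON list of table names
            }
        }
    }
}
```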
6. Apache Hive
Apache Hive is data warehouse infrastructure built on top of Apache™ Hadoop® for providing data summarization, ad-hoc query, and analysis of large datasets. It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL). Hive eases integration between Hadoop and tools for business intelligence and visualization.
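HiveQL can be issued from any JDBC client through HiveServer2. A minimal sketch, assuming HiveServer2 on its default port 10000 and a hypothetical page_views table; the hive-jdbc driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlDemo {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; 10000 is the default port.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // A typical summarization query: HiveQL looks like SQL but compiles
            // into jobs that run over files in HDFS.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```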
7. Apache™ Mahout
Apache™ Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop® and using the MapReduce paradigm. Machine learning is a discipline of artificial intelligence focused on enabling machines to learn without being explicitly programmed, and it is commonly used to improve future performance based on previous outcomes. Once big data is stored on the Hadoop Distributed File System (HDFS), Mahout provides the data science tools to automatically find meaningful patterns in those big data sets. The Apache Mahout project aims to make it faster and easier to turn big data into big information.
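Mahout's distributed algorithms are usually launched as MapReduce drivers from the command line, but its in-memory "Taste" recommender API illustrates the pattern-finding idea compactly. A sketch, assuming a hypothetical ratings.csv of userID,itemID,rating rows.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderDemo {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));  // userID,itemID,rating
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top-3 items for user 1, inferred from the ratings of similar users.
        List<RecommendedItem> items = recommender.recommend(1L, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```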
8. Apache™ Pig
Apache™ Pig allows you to write complex MapReduce transformations using a simple scripting language. Pig Latin (the language) defines a set of transformations on a data set such as aggregate, join and sort. Pig translates the Pig Latin script into MapReduce so that it can be executed within Hadoop®. Pig Latin is sometimes extended using UDFs (User Defined Functions), which the user can write in Java or a scripting language and then call directly from Pig Latin.
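Pig Latin scripts can also be driven from Java through the PigServer class; the sketch below embeds a small aggregate-and-sort pipeline, with placeholder input and output paths.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLatinDemo {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);  // compile to MapReduce jobs

        // Pig Latin: load, aggregate, sort -- each statement is one transformation.
        pig.registerQuery("clicks = LOAD '/data/clicks' AS (user:chararray, n:int);");
        pig.registerQuery("grouped = GROUP clicks BY user;");
        pig.registerQuery("totals = FOREACH grouped GENERATE group AS user, SUM(clicks.n) AS total;");
        pig.registerQuery("ranked = ORDER totals BY total DESC;");

        pig.store("ranked", "/data/click_totals");  // triggers execution on the cluster
        pig.shutdown();
    }
}
```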
9. Apache Sqoop
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop imports data from external structured datastores into HDFS or related systems like Hive and HBase. Sqoop can also be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.
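A typical import is a one-line sqoop command; Sqoop 1.x also exposes the same tool runner from Java, sketched below. The JDBC URL, credentials, table, and target directory are placeholders.

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportDemo {
    public static void main(String[] args) {
        // Equivalent to: sqoop import --connect ... --table users --target-dir /data/users
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://db-host/shop",
            "--username", "etl",
            "--password", "secret",
            "--table", "users",
            "--target-dir", "/data/users"
        };
        int exitCode = Sqoop.runTool(importArgs);  // launches the MapReduce import job
        System.exit(exitCode);
    }
}
```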
10. Apache Ambari
Apache Ambari is a 100-percent open source operational framework for provisioning, managing and monitoring Apache Hadoop clusters, and it is the management component selected for the Hortonworks Data Platform. Ambari operates services including Hadoop MapReduce, HDFS, HBase, Pig, Hive, HCatalog, and ZooKeeper, and it includes an intuitive collection of operator tools and a robust set of APIs that hide the complexity of Hadoop, simplifying the operation of clusters.
10.1 Manage a Hadoop cluster
Ambari provides tools to simplify cluster management. The Web interface allows you to start, stop and test Hadoop services, change configurations, and manage the ongoing growth of your cluster.
10.2 Monitor a Hadoop cluster
Gain instant insight into the health of your cluster. Ambari pre-configures alerts for watching Hadoop services and visualizes cluster operational data in a simple Web interface.
Ambari also includes job diagnostic tools to visualize job interdependencies and view task timelines as a way to troubleshoot historic job performance.
10.3 Integrate Hadoop with other applications
Ambari provides a RESTful API that enables integration with existing tools, such as Microsoft System Center and Teradata Viewpoint. Ambari also leverages standard technologies and protocols, integrating with Nagios and Ganglia for deeper customization.
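That RESTful API can be exercised with a simple authenticated GET. A sketch, assuming the Ambari server on its default port 8080 and placeholder admin credentials.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class AmbariApiDemo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://ambari-host:8080/api/v1/clusters");  // default Ambari port
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
        conn.setRequestProperty("Authorization", "Basic " + auth);
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // JSON describing the managed clusters
            }
        }
    }
}
```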
11. Apache™ Falcon
Apache™ Falcon is a data management framework for simplifying data lifecycle management and processing pipelines on Apache Hadoop®. It enables users to configure, manage and orchestrate data motion, pipeline processing, disaster recovery, and data retention workflows. Instead of hard-coding complex data lifecycle capabilities, Hadoop applications can now rely on the well-tested Apache Falcon framework for these functions. Falcon's simplification of data management is quite useful to anyone building apps on Hadoop.
12. Apache™ Oozie
Apache™ Oozie is a Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. It is integrated with the Hadoop stack and supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. It can also be used to schedule jobs specific to a system, like Java programs or shell scripts.
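Workflows are usually submitted with the oozie CLI, but the Java client shows the moving parts: point the client at the Oozie server, set the workflow's HDFS application path, and run. Hosts and paths below are placeholders.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitDemo {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");  // default port 11000

        Properties conf = oozie.createConfiguration();
        // HDFS directory containing workflow.xml (the job's DAG definition).
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/etl/app");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "rm-host:8050");

        String jobId = oozie.run(conf);              // submit and start the workflow
        WorkflowJob status = oozie.getJobInfo(jobId);
        System.out.println(jobId + " -> " + status.getStatus());
    }
}
```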
13. Apache ZooKeeper
Apache ZooKeeper provides operational services for a Hadoop cluster: a distributed configuration service, a synchronization service, and a naming registry for distributed systems. Distributed applications use ZooKeeper to store and mediate updates to important configuration information.
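The "store and mediate updates to configuration" idea maps to a small znode write and read; a sketch with a placeholder quorum address and key.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigDemo {
    public static void main(String[] args) throws Exception {
        // Connect to the quorum (placeholder host) with a 3-second session timeout.
        ZooKeeper zk = new ZooKeeper("zk1:2181", 3000, event -> {});

        // Store a piece of configuration in a persistent znode (fails if it already exists).
        byte[] value = "batch.size=500".getBytes();
        zk.create("/app-config", value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any process in the distributed system can now read (and watch) the same value.
        byte[] read = zk.getData("/app-config", false, null);
        System.out.println(new String(read));
        zk.close();
    }
}
```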
14. Apache™ Knox
The Knox Gateway ("Knox") is a system that provides a single point of authentication and access for Apache™ Hadoop® services in a cluster. The goal of the project is to simplify Hadoop security, both for users who access cluster data and execute jobs and for operators who control access and manage the cluster. Knox runs as a server (or a cluster of servers) that serves one or more Hadoop clusters.
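In practice this means cluster services are reached through one gateway URL. The sketch below lists an HDFS directory via WebHDFS proxied through Knox, assuming the default topology name "default", the default gateway port 8443, and placeholder demo credentials; TLS trust is left to the JVM's default trust store.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class KnoxWebHdfsDemo {
    public static void main(String[] args) throws Exception {
        // All Hadoop REST traffic funnels through the single Knox endpoint.
        URL url = new URL("https://knox-host:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder().encodeToString("guest:guest-password".getBytes());
        conn.setRequestProperty("Authorization", "Basic " + auth);  // Knox authenticates here
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // JSON FileStatuses for /tmp
            }
        }
    }
}
```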