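I need to get familiar with HDInsight for work, so I am taking notes here. The official documentation is naturally the most authoritative source, so everything below is lifted from it: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-introduction/
Anyone working with big data knows Hadoop, so how does HDInsight relate to it? HDInsight is Microsoft's Azure-based software architecture, used mainly for big data analysis and management, and it runs the Hadoop distribution from HDP (Hortonworks Data Platform). One thing to note: when we say "Hadoop" we usually mean the whole Hadoop ecosystem, including Storm, HBase, and so on, not just the little elephant.
HDInsight can be thought of as an implementation of Apache Hadoop on Microsoft Azure. It includes the corresponding Storm, HBase, Pig, Hive, Sqoop, Oozie, Ambari, and so on, and of course it also integrates with Microsoft's own Excel, SSAS, and SSRS.
HDInsight supports two operating systems, Linux and Microsoft's own Windows. The main differences are: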
CATEGORY | HADOOP ON LINUX | HADOOP ON WINDOWS |
Cluster OS | Ubuntu 12.04 Long Term Support (LTS) | Windows Server 2012 R2 |
Cluster Type | Hadoop | Hadoop, HBase, Storm |
Deployment | Azure Management Portal, Azure CLI, Azure PowerShell | Azure Management Portal, Azure CLI, Azure PowerShell, HDInsight .NET SDK |
Cluster UI | Ambari | Cluster Dashboard |
Remote Access | Secure Shell (SSH) | Remote Desktop Protocol (RDP) |
Some basic concepts and definitions
Hadoop (the "Query" workload): Provides reliable data storage with HDFS, and a simple MapReduce programming model to process and analyze data in parallel.
HBase (the "NoSQL" workload): A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data - potentially billions of rows times millions of columns. See Overview of HBase on HDInsight.
Apache Storm (the "Stream" workload): A distributed, real-time computation system for processing large streams of data fast. Storm is offered as a managed cluster in HDInsight. See Analyze real-time sensor data using Storm and Hadoop.
Ambari: Cluster provisioning, management, and monitoring.
Avro (Microsoft .NET Library for Avro): Data serialization for the Microsoft .NET environment.
Hive & HCatalog: Structured Query Language (SQL)-like querying, and a table and storage management layer.
Mahout: Machine learning.
MapReduce and YARN: Distributed processing and resource management.
Oozie: Workflow management.
Phoenix: Relational database layer over HBase.
Pig: Simpler scripting for MapReduce transformations.
Sqoop: Data import and export.
Tez: Allows data-intensive processes to run efficiently at scale.
ZooKeeper: Coordination of processes in distributed systems.
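To make the MapReduce programming model from the list above a bit more concrete, here is a minimal single-process sketch of the classic word count. It is not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs sequentially, whereas a real cluster spreads the map and reduce calls across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values collected for one key into a total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # Hypothetical driver: on a cluster the framework schedules these phases
    # in parallel; here they simply run one after another.
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data on Azure", "big clusters on Azure"]
    print(word_count(sample))
```

The point of the split is that `map_phase` and `reduce_phase` only ever see one line or one key at a time, which is what lets a framework run them in parallel.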
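HBase comes in two flavors. One is Apache HBase: open source, NoSQL, built on Hadoop and modeled after Google's BigTable, with good support for accessing massive amounts of unstructured and semi-structured data. The other is HDInsight HBase, Microsoft's own offering, where the data is stored directly in Azure Blob storage.
HBase data can be managed from the hbase shell with the create, get, put, and scan commands; scan reads data from multiple rows. There is also a REST-based C# API you can call.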
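As a rough mental model of what create, put, get, and scan do, here is a toy in-memory table. This is only a sketch and not the HBase API: `ToyTable` and its methods are hypothetical names, the table lives in a Python dict, and it ignores versions, timestamps, and regions. It keeps just the basics: cells are addressed by row key plus `family:qualifier`, and scan walks rows in sorted key order with an inclusive start and an exclusive stop.

```python
from typing import Dict, Iterator, Optional, Tuple

Row = Dict[str, str]  # maps "family:qualifier" to a cell value


class ToyTable:
    """Hypothetical stand-in for an HBase table, for illustration only."""

    def __init__(self, name: str, *families: str) -> None:
        # "create": a table is declared with its column families up front.
        self.name = name
        self.families = set(families)
        self.rows: Dict[str, Row] = {}

    def put(self, row_key: str, column: str, value: str) -> None:
        # "put": write one cell; the column family must already exist.
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key: str) -> Row:
        # "get": read all cells of a single row (empty if the row is absent).
        return dict(self.rows.get(row_key, {}))

    def scan(self, start: str = "", stop: Optional[str] = None) -> Iterator[Tuple[str, Row]]:
        # "scan": read multiple rows in sorted row-key order,
        # from start (inclusive) up to stop (exclusive).
        for key in sorted(self.rows):
            if key >= start and (stop is None or key < stop):
                yield key, dict(self.rows[key])


if __name__ == "__main__":
    contacts = ToyTable("contacts", "personal", "office")
    contacts.put("1000", "personal:name", "Alice")
    contacts.put("1000", "office:phone", "555-0100")
    contacts.put("1001", "personal:name", "Bob")
    print(contacts.get("1000"))
    for row_key, cells in contacts.scan():
        print(row_key, cells)
```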
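HBase use cases
The original motivation was Google's own web search: when you search for "The Three-Body Problem", it returns every page that contains the term. Beyond that, the use cases also include:
According to the official site, Storm is a distributed, fault-tolerant, open-source computation system that lets you process data in real time with Hadoop.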
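Storm describes a computation as a topology of spouts (stream sources) and bolts (processing steps). The sketch below imitates that shape with plain Python generators so the tuple-at-a-time flow is easier to picture. It is not the Storm API and has none of its distribution or fault tolerance: the function names, the sensor readings, and the threshold are all invented for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

Reading = Tuple[str, float]  # (sensor_id, temperature), a made-up tuple shape


def sensor_spout(readings: Iterable[Reading]) -> Iterator[Reading]:
    """Spout: feeds tuples into the pipeline one at a time."""
    for reading in readings:
        yield reading


def threshold_bolt(stream: Iterable[Reading], limit: float) -> Iterator[Reading]:
    """Bolt: passes through only the readings above the limit."""
    for sensor_id, temperature in stream:
        if temperature > limit:
            yield sensor_id, temperature


def count_bolt(stream: Iterable[Reading]) -> Iterator[Dict[str, int]]:
    """Bolt: emits updated per-sensor alert counts after every incoming tuple."""
    counts: Counter = Counter()
    for sensor_id, _ in stream:
        counts[sensor_id] += 1
        yield dict(counts)


if __name__ == "__main__":
    data = [("a", 70.0), ("b", 95.5), ("a", 99.1), ("b", 97.0)]
    topology = count_bolt(threshold_bolt(sensor_spout(data), limit=90.0))
    for snapshot in topology:
        print(snapshot)
```

Each stage handles one tuple as soon as it arrives instead of waiting for the whole input, which is the main contrast with a batch job.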
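Storm in HDInsight has the following features:
Real-time processing scenarios
Apache Spark is an open-source parallel processing framework that supports in-memory big data analytics.
Use cases: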