[Original] Big Data Basics: Kudu (1): Introduction, Installation, Usage

Date: 2019-11-21


Kudu 1.7

Official site: https://kudu.apache.org/

I. Introduction

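Kudu touches on many familiar concepts: distributed storage (as in HDFS), a consensus algorithm (as in ZooKeeper), tables (as in Hive), tablets (comparable to Hive table partitions), columnar storage (as in Parquet), and both sequential and random reads (as in HBase). Kudu therefore looks like a lightweight HDFS + ZooKeeper + Hive + Parquet + HBase. Beyond that, Kudu has its own strength: fast writes combined with fast reads, which makes Kudu + Impala a good fit for OLAP workloads, and for time-series workloads in particular.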

A new addition to the open source Apache Hadoop ecosystem, Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data.

Kudu is a strong complement to the Hadoop ecosystem: it lets the Hadoop storage layer support fast analytics on rapidly changing data.

Streamlined Architecture
Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer. As a new complement to HDFS and Apache HBase, Kudu gives architects the flexibility to address a wider variety of use cases without exotic workarounds.

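Kudu combines fast inserts and updates with efficient columnar scans, so real-time analytics can run directly on a single storage layer, which simplifies the traditional technology stack.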

Faster Analytics
Kudu is specifically designed for use cases that require fast analytics on fast (rapidly changing) data. Engineered to take advantage of next-generation hardware and in-memory processing, Kudu lowers query latency significantly for Apache Impala (incubating) and Apache Spark (initially, with other execution engines to come).

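Kudu is designed for fast analytics on rapidly changing data. By taking advantage of next-generation hardware and in-memory processing, it lowers query latency for Impala and Spark.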

Kudu is a columnar storage manager developed for the Apache Hadoop platform. Kudu shares the common technical properties of Hadoop ecosystem applications: it runs on commodity hardware, is horizontally scalable, and supports highly available operation.

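Kudu is a columnar storage layer for the Hadoop platform and inherits the technical traits of the Hadoop ecosystem: commodity hardware, horizontal scalability, and high availability.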

Kudu’s design sets it apart. Some of Kudu’s benefits include:

- Fast processing of OLAP workloads.
- Integration with MapReduce, Spark and other Hadoop ecosystem components.
- Tight integration with Apache Impala, making it a good, mutable alternative to using HDFS with Apache Parquet.
- Strong but flexible consistency model, allowing you to choose consistency requirements on a per-request basis, including the option for strict-serializable consistency.
- Strong performance for running sequential and random workloads simultaneously.
- Easy to administer and manage with Cloudera Manager.
- High availability. Tablet Servers and Masters use the Raft Consensus Algorithm, which ensures that as long as more than half the total number of replicas is available, the tablet is available for reads and writes. For instance, if 2 out of 3 replicas or 3 out of 5 replicas are available, the tablet is available. Reads can be serviced by read-only follower tablets, even in the event of a leader tablet failure.
- Structured data model.

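In short, Kudu offers fast OLAP, integration with other Hadoop ecosystem components (such as Spark), integration with Impala, fast sequential and random reads, configurable consistency, high availability, and a structured data model.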

By combining all of these properties, Kudu targets support for families of applications that are difficult or impossible to implement on current generation Hadoop storage technologies. A few examples of applications for which Kudu is a great solution are:

- Reporting applications where newly-arrived data needs to be immediately available for end users
- Time-series applications that must simultaneously support:
  - queries across large amounts of historic data
  - granular queries about an individual entity that must return very quickly
- Applications that use predictive models to make real-time decisions with periodic refreshes of the predictive model based on all historic data

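With these properties combined, Kudu makes possible some scenarios that traditional Hadoop storage technologies struggle to handle, for example reporting systems over rapidly changing data, time-series applications, and real-time decision systems.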

Kudu architecture

Concepts

Table

A table is where your data is stored in Kudu. A table has a schema and a totally ordered primary key. A table is split into segments called tablets.

A table is similar to a table in Hive or HBase. It has a schema and a primary key, and it can be split into multiple tablets.

Tablet

A tablet is a contiguous segment of a table, similar to a partition in other data storage engines or relational databases. A given tablet is replicated on multiple tablet servers, and at any given point in time, one of these replicas is considered the leader tablet. Any replica can service reads, and writes require consensus among the set of tablet servers serving the tablet.

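A tablet is similar to a partition in Hive or a region in HBase. Each tablet is replicated across several tablet servers, and one of the replicas acts as the leader. Any replica can serve reads, but only the leader accepts writes, and each write is agreed on through the Raft consensus algorithm.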

Tablet Server

A tablet server stores and serves tablets to clients. For a given tablet, one tablet server acts as a leader, and the others act as follower replicas of that tablet. Only leaders service write requests, while leaders or followers each service read requests. Leaders are elected using Raft Consensus Algorithm. One tablet server can serve multiple tablets, and one tablet can be served by multiple tablet servers.

A tablet server is similar to a region server in HBase: it stores tablets and responds to client requests. One tablet server holds multiple tablets.
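
The leader and follower roles described above can be pictured with a small model. This is an illustrative sketch only, not Kudu's client API: the `Tablet` class, its fields, and the server names `ts1` to `ts3` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Tablet:
    """Hypothetical model of one tablet and the tablet servers holding its replicas."""
    tablet_id: str
    replicas: list   # names of the tablet servers that hold a replica
    leader: str      # the replica currently acting as leader

    def servers_for_write(self):
        # Only the leader replica accepts write requests.
        return [self.leader]

    def servers_for_read(self):
        # The leader or any follower may serve a read.
        return list(self.replicas)


tablet = Tablet("tablet-0", ["ts1", "ts2", "ts3"], leader="ts2")
print("writes go to:", tablet.servers_for_write())
print("reads may go to:", tablet.servers_for_read())
```

In a real cluster the leader can change after an election, which is why the sketch stores it as a field rather than a fixed position in the replica list.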

Catalog Table

The catalog table is the central location for metadata of Kudu. It stores information about tables and tablets. The catalog table may not be read or written directly. Instead, it is accessible only via metadata operations exposed in the client API.
The catalog table stores two categories of metadata: Tables & Tablets

The catalog table stores Kudu's metadata (similar to the metadata kept by Hive and HBase). It holds two categories of metadata: tables and tablets.

Master

The master keeps track of all the tablets, tablet servers, the Catalog Table, and other metadata related to the cluster. At a given point in time, there can only be one acting master (the leader). If the current leader disappears, a new master is elected using Raft Consensus Algorithm.
The master also coordinates metadata operations for clients. For example, when creating a new table, the client internally sends the request to the master. The master writes the metadata for the new table into the catalog table, and coordinates the process of creating tablets on the tablet servers.
All the master’s data is stored in a tablet, which can be replicated to all the other candidate masters.
Tablet servers heartbeat to the master at a set interval (the default is once per second).

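The master is similar to the HDFS NameNode or the HBase master. It keeps track of all tablets, tablet servers, the catalog table, and other cluster metadata. At any given time the cluster has only one acting master (the leader master). If the leader master fails, a new one is elected through the Raft algorithm.
All master data is stored in a single tablet, which is replicated to all the other candidate masters.
Tablet servers send heartbeats to the master at a regular interval.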
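
To make the heartbeat idea concrete, here is a minimal sketch of how a master-like component could record heartbeats and notice a tablet server that has gone quiet. It is not Kudu's implementation: the `HeartbeatTracker` class and the 5-second timeout are assumptions chosen purely for illustration.

```python
class HeartbeatTracker:
    """Hypothetical sketch: remember when each tablet server last reported in."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s   # assumed value, not a Kudu default
        self.last_seen = {}          # tablet server name -> time of last heartbeat

    def heartbeat(self, tserver, now):
        # Record the time of the most recent heartbeat from this server.
        self.last_seen[tserver] = now

    def stale(self, now):
        # Servers whose last heartbeat is older than the timeout.
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)


tracker = HeartbeatTracker(timeout_s=5.0)
tracker.heartbeat("ts1", now=0.0)
tracker.heartbeat("ts2", now=0.0)
tracker.heartbeat("ts1", now=4.0)   # ts1 keeps reporting, ts2 goes quiet
print(tracker.stale(now=6.0))
```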

Raft Consensus Algorithm

Kudu uses the Raft consensus algorithm as a means to guarantee fault-tolerance and consistency, both for regular tablets and for master data. Through Raft, multiple replicas of a tablet elect a leader, which is responsible for accepting and replicating writes to follower replicas. Once a write is persisted in a majority of replicas it is acknowledged to the client. A given group of N replicas (usually 3 or 5) is able to accept writes with at most (N - 1)/2 faulty replicas.

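Kudu uses the Raft consensus algorithm to guarantee fault tolerance and consistency for both tablet data and master data. Raft plays a role comparable to the ZAB protocol in ZooKeeper. See: https://raft.github.io/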
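
The majority rule quoted above, at most (N - 1)/2 faulty replicas out of N, is plain integer arithmetic. The two helper functions below are illustrative names, not part of any Kudu API.

```python
def majority(n_replicas):
    """Smallest number of replicas that forms a majority."""
    return n_replicas // 2 + 1


def max_faulty(n_replicas):
    """Largest number of failed replicas that still leaves a majority."""
    return (n_replicas - 1) // 2


for n in (3, 5):
    print(f"{n} replicas: majority={majority(n)}, tolerated failures={max_faulty(n)}")
```

With 3 replicas a write needs 2 acknowledgements and one failure is tolerated; with 5 replicas it needs 3 and two failures are tolerated, which matches the "2 out of 3 or 3 out of 5" wording in the Kudu documentation.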

Logical Replication

Kudu replicates operations, not on-disk data. This is referred to as logical replication, as opposed to physical replication.

Kudu uses logical replication: it replicates operations rather than on-disk data.
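
A rough way to picture logical replication: every replica receives the same ordered list of operations and applies it to its own local storage, so replicas agree on the rows even if their on-disk layout differs. The operation format and the `apply_ops` function below are assumptions for illustration, not Kudu's actual log format.

```python
def apply_ops(ops):
    """Apply an ordered log of (op, key, value) operations to an empty row store."""
    rows = {}
    for op, key, value in ops:
        if op in ("insert", "update"):
            rows[key] = value
        elif op == "delete":
            rows.pop(key, None)
    return rows


# The same operation log is shipped to every replica.
op_log = [
    ("insert", 1, "a"),
    ("insert", 2, "b"),
    ("update", 1, "c"),
    ("delete", 2, None),
]

leader_rows = apply_ops(op_log)
follower_rows = apply_ops(op_log)
print(leader_rows == follower_rows, leader_rows)
```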

II. Installation

1 Install the NTP service

# vi /etc/ntp.conf
# service ntpd start
# ntpstat

See: https://www.cnblogs.com/quchunhui/p/7658853.html
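
Kudu relies on synchronized clocks, so it is worth checking the result of `ntpstat` before starting any daemon. The sketch below wraps that check in Python. The exit-code meanings follow the `ntpstat` man page as I understand it (0 synchronised, 1 not synchronised, 2 state indeterminate); the function names are hypothetical.

```python
import subprocess

# Exit codes as documented for ntpstat; treat the wording as an assumption.
NTPSTAT_MEANING = {
    0: "clock is synchronised",
    1: "clock is not synchronised",
    2: "clock state is indeterminate (is ntpd running?)",
}


def describe_ntpstat(returncode):
    """Translate an ntpstat exit code into a readable message."""
    return NTPSTAT_MEANING.get(returncode, "unexpected exit code")


def check_clock():
    """Run ntpstat on this host and describe the result (requires ntpstat installed)."""
    result = subprocess.run(["ntpstat"], capture_output=True, text=True)
    return describe_ntpstat(result.returncode)
```

Calling `check_clock()` on a host where `ntpd` has just been started may report "not synchronised" for a few minutes until the first sync completes.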

2 Add the repo

# cat /etc/yum.repos.d/cdh.repo

[cloudera-cdh5]
# Packages for Cloudera's Distribution for Hadoop, Version 5, on RedHat or CentOS 7 x86_64
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=https://archive.cloudera.com/cdh5/redhat/7/x86_64/cdh/5/
gpgkey=https://archive.cloudera.com/cdh5/redhat/7/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck=1

No version is pinned here, so the latest available version is installed by default.
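
Before running `yum install`, a quick sanity check of the repo file can catch a missing key. The sketch below parses the same content with Python's `configparser`, which reads the INI-style layout used by yum repo files. This is only a convenience check under that assumption; yum's own parser is the authority.

```python
import configparser

# Same content as the repo file shown above.
REPO_TEXT = """\
[cloudera-cdh5]
# Packages for Cloudera's Distribution for Hadoop, Version 5, on RedHat or CentOS 7 x86_64
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=https://archive.cloudera.com/cdh5/redhat/7/x86_64/cdh/5/
gpgkey=https://archive.cloudera.com/cdh5/redhat/7/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck=1
"""

parser = configparser.ConfigParser(interpolation=None)
parser.read_string(REPO_TEXT)
repo = parser["cloudera-cdh5"]

# Report any key the installation steps depend on that is absent.
missing = [key for key in ("name", "baseurl", "gpgkey", "gpgcheck") if key not in repo]
print("missing keys:", missing)
print("gpgcheck enabled:", repo.getint("gpgcheck") == 1)
```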

3 Install the master

# yum install kudu kudu-master kudu-client0 kudu-client-devel

Configuration file

/etc/kudu/conf/master.gflagfile

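You can change the data paths here. To run multiple masters, also set the flag below. Raft needs a majority, so an odd number of masters (for example 3) is the usual choice.

--master_addresses=$master1,$master2,$master3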

Start the service (repeat on each master host if you run several masters)

# service kudu-master start

4 Install the tserver

# yum install kudu kudu-tserver kudu-client0 kudu-client-devel

Configuration file

/etc/kudu/conf/tserver.gflagfile

Set the master address. Several masters can be listed, separated by commas.

--tserver_master_addrs=$master_server:7051
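
When there are several masters, the two address flags are easy to get out of sync between hosts. The helpers below build both flag lines from one list of hostnames. The function names and the hostnames `m1` to `m3` are hypothetical; port 7051 is taken from the example above.

```python
def master_flag(masters):
    """Flag line for master.gflagfile listing every master."""
    return "--master_addresses=" + ",".join(masters)


def tserver_flag(masters, port=7051):
    """Flag line for tserver.gflagfile pointing at every master's RPC port."""
    return "--tserver_master_addrs=" + ",".join(f"{m}:{port}" for m in masters)


masters = ["m1", "m2", "m3"]   # placeholder hostnames
print(master_flag(masters))
print(tserver_flag(masters))
```

The printed lines can be pasted into `/etc/kudu/conf/master.gflagfile` and `/etc/kudu/conf/tserver.gflagfile` respectively.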

Start the service

# service kudu-tserver start

PS: you can also download the RPMs manually: https://archive.cloudera.com/cdh5/redhat/7/x86_64/cdh/5/RPMS/x86_64/

kudu-1.7.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el7.x86_64.rpm
kudu-client-devel-1.7.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el7.x86_64.rpm
kudu-client0-1.7.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el7.x86_64.rpm
kudu-master-1.7.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el7.x86_64.rpm
kudu-tserver-1.7.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el7.x86_64.rpm

III. Usage

1 Cluster operations

Check the overall health of the cluster

# sudo -u kudu kudu cluster ksck $master

Check master status or flags

# sudo -u kudu kudu master status localhost

# sudo -u kudu kudu master get_flags localhost

Check tserver status or flags

# sudo -u kudu kudu tserver status localhost

# sudo -u kudu kudu tserver get_flags localhost
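
For scripted health checks it can help to wrap these commands. The sketch below only assembles the argument list and hands it to `subprocess.run`. The helper names are hypothetical, and it assumes the `sudo -u kudu kudu ...` form used for `ksck` above works on your hosts.

```python
import subprocess


def kudu_argv(*args, run_as="kudu"):
    """Build the argument list for running a kudu CLI subcommand as another user."""
    return ["sudo", "-u", run_as, "kudu", *args]


def run_kudu(*args):
    """Run a kudu CLI subcommand and return the completed process (needs a Kudu host)."""
    return subprocess.run(kudu_argv(*args), capture_output=True, text=True)


# Building the command does not require Kudu to be installed.
print(" ".join(kudu_argv("master", "status", "localhost")))
```

On a Kudu host, `run_kudu("cluster", "ksck", "localhost").returncode` can then be checked by a monitoring script.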

2 Data operations

Read and write data through impala-shell

[$impala_server:21000] >
CREATE TABLE impala.test_kudu (
id INT,
name STRING,
PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU
TBLPROPERTIES ('kudu.master_addresses'='$kudu_master:7051');
[$impala_server:21000] > select * from test_kudu;
Query: select * from test_kudu
Query submitted at: 2019-01-21 12:53:04 (Coordinator: http://$impala_server:25000)
Query progress can be monitored at: http://$impala_server:25000/query_plan?query_id=e345f450c0dca86a:4769860f00000000
+----+-------+
| id | name  |
+----+-------+
| 1  | test  |
+----+-------+
Fetched 1 row(s) in 0.13s
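
`PARTITION BY HASH (id) PARTITIONS 4` spreads rows across four tablets by hashing the primary key. The sketch below shows the general idea only: `zlib.crc32` is a stand-in, since Kudu uses its own hash function, so the bucket numbers here will not match a real cluster.

```python
import zlib


def bucket_for(key, num_buckets=4):
    """Map a key to one of num_buckets partitions (stand-in hash, not Kudu's)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


# Group a few sample ids by the bucket they land in.
buckets = {}
for row_id in range(1, 9):
    buckets.setdefault(bucket_for(row_id), []).append(row_id)

for bucket, ids in sorted(buckets.items()):
    print(f"bucket {bucket}: ids {ids}")
```

The same key always lands in the same bucket, which lets a lookup by `id` go straight to one tablet, while consecutive ids are scattered, which spreads write load.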

The newly created table is now visible in Kudu:

1) Command line

# kudu -h

Usage: /usr/lib/kudu/bin/kudu <command> [<args>]

<command> can be one of the following:
         cluster   Operate on a Kudu cluster
              fs   Operate on a local Kudu filesystem
   local_replica   Operate on local tablet replicas via the local filesystem
          master   Operate on a Kudu Master
             pbc   Operate on PBC (protobuf container) files
            perf   Measure the performance of a Kudu cluster
  remote_replica   Operate on remote tablet replicas on a Kudu Tablet Server
           table   Operate on Kudu tables
          tablet   Operate on remote Kudu tablets
            test   Various test actions
         tserver   Operate on a Kudu Tablet Server
             wal   Operate on WAL (write-ahead log) files