[Original] Big Data Basics: Alluxio (1) Introduction, Installation, and Usage

Date: 2019-11-21

Alluxio 1.8.1

Official site: http://www.alluxio.org/

I. Introduction

Open Source Memory Speed Virtual Distributed Storage
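Alluxio, formerly Tachyon, enables any application to interact with any data from any storage system at memory speed.

Alluxio is an open-source virtual distributed storage system that offers memory-speed access. Formerly called Tachyon, it lets applications access data in any storage system as if that data were in memory.

1 Advantages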

Storage Unification and Abstraction

Alluxio unifies data access to different systems, and seamlessly bridges computation frameworks and underlying storage.

Remote Data Acceleration

Decouple compute and storage without any loss in performance.

Compute and storage are separated without losing performance.

2 Deployment architecture

Alluxio can be divided into three components: masters, workers, and clients. A typical setup consists of a single leading master, multiple standby masters, and multiple workers. The master and worker processes constitute the Alluxio servers, which are the components a system administrator would maintain. The clients are used to communicate with the Alluxio servers by applications such as Spark or MapReduce jobs, Alluxio command-line, or the FUSE layer.

The Alluxio servers consist of masters and workers. When there are multiple masters, only one of them is the leading master; the rest are standby masters.

3 Roles

Master

The Alluxio master service can be deployed as one leading master and several standby masters for fault tolerance. When the leading master goes down, a standby master is elected to become the new leading master.

1）Leading Master

Only one master process can be the leading master in an Alluxio cluster. The leading master is responsible for managing the global metadata of the system. This includes file system metadata (e.g. the file system inode tree), block metadata (e.g. block locations), and worker capacity metadata (free and used space). Alluxio clients interact with the leading master to read or modify this metadata. All workers periodically send heartbeat information to the leading master to maintain their participation in the cluster. The leading master does not initiate communication with other components; it only responds to requests via RPC services. The leading master records all file system transactions to a distributed persistent storage to allow for recovery of master state information; the set of records is referred to as the journal.

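An Alluxio cluster has only one leading master, which manages all metadata: file system metadata, block metadata, and worker metadata. Workers periodically send heartbeats to the leading master, and the leading master records every file system operation in the journal.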

2）Standby Masters

Standby masters read journals written by the leading master to keep their own copies of the master state up-to-date. They also write journal checkpoints for faster recovery in the future. They do not process any requests from other Alluxio components.

Standby masters keep reading the leading master's journal so that their own state stays in sync.
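
The journal mechanism can be illustrated with a toy model. This is a minimal sketch and not Alluxio's implementation: the names Journal, Master, apply, and catch_up are hypothetical, and the journal is an in-memory list rather than distributed persistent storage.

```python
from dataclasses import dataclass, field


@dataclass
class Journal:
    """Append-only log of file system operations (in memory for this sketch)."""
    entries: list = field(default_factory=list)

    def append(self, op, path):
        self.entries.append((op, path))


@dataclass
class Master:
    """Toy master whose whole state is the set of known paths."""
    paths: set = field(default_factory=set)
    applied: int = 0  # number of journal entries already applied

    def apply(self, op, path):
        if op == "create":
            self.paths.add(path)
        elif op == "delete":
            self.paths.discard(path)

    def catch_up(self, journal):
        """Standby behaviour: replay only the entries not applied yet."""
        for op, path in journal.entries[self.applied:]:
            self.apply(op, path)
        self.applied = len(journal.entries)


journal = Journal()
leading, standby = Master(), Master()

# The leading master applies each operation and records it in the journal.
for op, path in [("create", "/LICENSE"), ("create", "/tmp.log"), ("delete", "/tmp.log")]:
    leading.apply(op, path)
    journal.append(op, path)
leading.applied = len(journal.entries)

# The standby never serves requests; it only replays the journal.
standby.catch_up(journal)
print(standby.paths == leading.paths)  # True
```

Because the standby already holds the replayed state, it could take over as leading master without rebuilding metadata from scratch.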

Worker

Alluxio workers are responsible for managing user-configurable local resources allocated to Alluxio (e.g. memory, SSDs, HDDs). Alluxio workers store data as blocks and serve client requests that read or write data by reading or creating new blocks within their local resources. Workers are only responsible for managing blocks; the actual mapping from files to blocks is only stored by the master.

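Workers manage local resources such as memory and SSDs. They store data as blocks and serve clients' read and write requests. The actual mapping between files and blocks is kept by the master.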

Because RAM usually offers limited capacity, blocks in a worker can be evicted when space is full. Workers employ eviction policies to decide which data to keep in the Alluxio space.
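
As an illustration of eviction, here is a minimal sketch of a least-recently-used (LRU) policy. LRU is only one possible policy and this is not Alluxio's evictor code; the class LRUBlockStore and its methods are hypothetical names chosen for the example.

```python
from collections import OrderedDict


class LRUBlockStore:
    """Toy block store that evicts the least recently used block when full."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.blocks = OrderedDict()  # block_id -> size, least recent first

    def read(self, block_id):
        """Mark a block as recently used; return True on a hit."""
        if block_id not in self.blocks:
            return False
        self.blocks.move_to_end(block_id)
        return True

    def write(self, block_id, size):
        """Store a block, evicting old blocks until it fits; return evicted ids."""
        if size > self.capacity:
            raise ValueError("block larger than total capacity")
        if block_id in self.blocks:
            self.used -= self.blocks.pop(block_id)
        evicted = []
        while self.used + size > self.capacity:
            old_id, old_size = self.blocks.popitem(last=False)
            self.used -= old_size
            evicted.append(old_id)
        self.blocks[block_id] = size
        self.used += size
        return evicted


store = LRUBlockStore(capacity_bytes=300)
store.write("b1", 100)
store.write("b2", 100)
store.write("b3", 100)
store.read("b1")               # b1 becomes the most recently used block
print(store.write("b4", 100))  # ['b2']
```

Reading b1 before the fourth write is what saves it: b2 is then the oldest block and is the one evicted.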

Client

The Alluxio client provides users a gateway to interact with the Alluxio servers. It initiates communication with the leading master to carry out metadata operations and with workers to read and write data that is stored in Alluxio.

The client first requests metadata from the leading master, then sends read and write requests to the workers.
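
The division of labour between the three roles can be sketched as follows. This is a toy model under simplifying assumptions (no RPC, no failures, everything in one process); ToyMaster, ToyWorker, and read_file are hypothetical names, not Alluxio APIs.

```python
class ToyMaster:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self):
        self.file_blocks = {}     # path -> [block_id, ...]
        self.block_location = {}  # block_id -> worker name

    def get_block_locations(self, path):
        return [(b, self.block_location[b]) for b in self.file_blocks[path]]


class ToyWorker:
    """Holds block data only; it knows nothing about files."""

    def __init__(self):
        self.blocks = {}          # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(path, master, workers):
    """Client side: one metadata request, then one read per block."""
    data = b""
    for block_id, worker_name in master.get_block_locations(path):
        data += workers[worker_name].read_block(block_id)
    return data


master = ToyMaster()
workers = {"w1": ToyWorker(), "w2": ToyWorker()}
workers["w1"].blocks[1] = b"hello "
workers["w2"].blocks[2] = b"alluxio"
master.file_blocks["/greeting"] = [1, 2]
master.block_location = {1: "w1", 2: "w2"}

print(read_file("/greeting", master, workers))  # b'hello alluxio'
```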

II. Installation

1 Download

$ wget http://downloads.alluxio.org/downloads/files//1.8.1/alluxio-1.8.1-hadoop-2.6-bin.tar.gz
$ tar xvf alluxio-1.8.1-hadoop-2.6-bin.tar.gz
$ cd alluxio-1.8.1-hadoop-2.6

2 Configure SSH login to the local machine

That is, make sure ssh localhost works.
For details, see: http://www.javashuo.com/article/p-rndbchju-bd.html

3 Configuration

$ cp conf/alluxio-site.properties.template conf/alluxio-site.properties
$ vi conf/alluxio-site.properties
alluxio.master.hostname=localhost

4 Initialization

$ ./bin/alluxio validateEnv local
$ ./bin/alluxio format

5 Start

$ ./bin/alluxio-start.sh local SudoMount

If you see this error:

Formatting RamFS: /mnt/ramdisk (44849277610)
ERROR: mkdir /mnt/ramdisk failed

you need to grant sudo privileges:

# visudo -f /etc/sudoers
$user ALL=(ALL) NOPASSWD: /bin/mount * /mnt/ramdisk, /bin/umount * /mnt/ramdisk, /bin/mkdir * /mnt/ramdisk, /bin/chmod * /mnt/ramdisk

III. Usage

1 Command line

File system operations

$ ./bin/alluxio fs
$ ./bin/alluxio fs ls /
$ ./bin/alluxio fs copyFromLocal LICENSE /LICENSE
$ ./bin/alluxio fs cat /LICENSE

These look very similar to the hdfs commands.

Admin operations

$ bin/alluxio fsadmin report

Alluxio cluster summary:
    Master Address: localhost/127.0.0.1:19998
    Web Port: 19999
    Rpc Port: 19998
    Started: 01-24-2019 10:28:59:433
    Uptime: 0 day(s), 1 hour(s), 24 minute(s), and 42 second(s)
    Version: 1.8.1
    Safe Mode: false
    Zookeeper Enabled: false
    Live Workers: 1
    Lost Workers: 0
    Total Capacity: 10.00GB
        Tier: MEM Size: 10.00GB
    Used Capacity: 9.36GB
        Tier: MEM Size: 9.36GB
    Free Capacity: 651.55MB
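
For monitoring, the capacity lines of this report can be turned into a utilisation figure. The following is a minimal sketch that parses the text shown above; the helper names to_bytes and capacity_summary are hypothetical, and treating GB/MB as powers of 1024 is an assumption (it does not affect the percentage).

```python
import re

UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def to_bytes(text):
    """Convert a size such as '651.55MB' to a number of bytes."""
    match = re.fullmatch(r"([\d.]+)\s*([KMGT]?B)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def capacity_summary(report):
    """Extract the Total/Used/Free Capacity lines from the report text."""
    summary = {}
    for line in report.splitlines():
        key, _, value = line.strip().partition(":")
        if key in ("Total Capacity", "Used Capacity", "Free Capacity"):
            summary[key] = to_bytes(value)
    return summary


report = """
    Total Capacity: 10.00GB
        Tier: MEM Size: 10.00GB
    Used Capacity: 9.36GB
        Tier: MEM Size: 9.36GB
    Free Capacity: 651.55MB
"""
summary = capacity_summary(report)
used_pct = 100 * summary["Used Capacity"] / summary["Total Capacity"]
print(f"{used_pct:.1f}% used")  # 93.6% used
```

In practice the report text would come from running bin/alluxio fsadmin report, for example through subprocess.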

View metrics

$ curl http://$master:19999/metrics/json
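
The same endpoint can be read from a script. The sketch below fetches the JSON and flattens it into slash-separated keys so that individual values are easy to filter. The function names are hypothetical, and the sample payload is invented purely for illustration; the real metric names and nesting depend on the Alluxio version.

```python
import json
from urllib.request import urlopen


def fetch_metrics(master, port=19999, timeout=5):
    """GET http://<master>:<port>/metrics/json and decode the JSON body."""
    with urlopen(f"http://{master}:{port}/metrics/json", timeout=timeout) as resp:
        return json.load(resp)


def flatten(node, prefix=""):
    """Flatten nested dicts into {'a/b/c': leaf} entries."""
    if not isinstance(node, dict):
        return {prefix: node}
    flat = {}
    for key, value in node.items():
        flat.update(flatten(value, f"{prefix}/{key}" if prefix else key))
    return flat


# Hypothetical payload for illustration only; real metric names will differ.
sample = {"gauges": {"example.CapacityUsed": {"value": 42}}, "version": "x"}
for name, value in flatten(sample).items():
    print(name, value)
```

Against a running cluster, flatten(fetch_metrics("localhost")) would give the live values, assuming the web port is the 19999 shown in the report above.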

2 UFS（Under File Storage）

UFS=LocalFileSystem

1 Default configuration

$ cat conf/alluxio-site.properties
alluxio.underfs.address=${alluxio.work.dir}/underFSStorage
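
The value above refers to another property through ${...}. To show how such a reference resolves, here is a minimal sketch of a properties parser with variable expansion. It is not Alluxio's configuration loader; parse_properties and resolve are hypothetical helpers, and the /opt path is a made-up value for alluxio.work.dir.

```python
import re


def parse_properties(text):
    """Parse 'key=value' lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


def resolve(props, key):
    """Expand ${other.key} references in props[key] (no cycle detection)."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: resolve(props, m.group(1)), props[key])


text = """
# alluxio.work.dir is set by hand here purely for the example
alluxio.work.dir=/opt/alluxio-1.8.1-hadoop-2.6
alluxio.underfs.address=${alluxio.work.dir}/underFSStorage
"""
props = parse_properties(text)
print(resolve(props, "alluxio.underfs.address"))
# /opt/alluxio-1.8.1-hadoop-2.6/underFSStorage
```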

2 Example commands

$ ls ./underFSStorage/
$ ./bin/alluxio fs persist /LICENSE
$ ls ./underFSStorage
LICENSE

With the default configuration, Alluxio uses the local file system as its under file storage (UFS). The default path for the UFS is ./underFSStorage.

Alluxio is currently writing data only into Alluxio space, not to the UFS. Configure Alluxio to persist the file from Alluxio space to the UFS by using the persist command.

By default Alluxio uses the local file system as its UFS, and a file is persisted to the UFS only after the persist command has been run.

UFS=HDFS

1 Configuration

$ cat conf/alluxio-site.properties
alluxio.underfs.address=hdfs://<NAMENODE>:<PORT>/alluxio/data

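If you want to accelerate all data on HDFS while keeping paths unchanged, you can set this to the HDFS root directory.

2 Configure Hadoop

1）Symbolic links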

$ ln -s $HADOOP_CONF_DIR/core-site.xml conf/core-site.xml
$ ln -s $HADOOP_CONF_DIR/hdfs-site.xml conf/hdfs-site.xml

Copy hdfs-site.xml and core-site.xml from your Hadoop installation into ${ALLUXIO_HOME}/conf, or create symbolic links to them there.

2）Configure the paths directly

alluxio.underfs.hdfs.configuration=/path/to/hdfs/conf/core-site.xml:/path/to/hdfs/conf/hdfs-site.xml

3 Commands

$ bin/alluxio fs ls /

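You can now see the directories under the configured HDFS path (all of HDFS if the root directory was configured).

4 File mapping

At this point, accessing

alluxio://$alluxio_server:19998/test.log

reaches the file in the underlying storage at

hdfs://$namenode_server/alluxio/data/test.log

Note: $alluxio_server and the port have to be specified here, which is a single point of failure. Deploying in HA mode later resolves this.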
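
The mapping can be written down as a small function. This is a minimal sketch that assumes a single UFS configured at the Alluxio root, as in the configuration above, and ignores any additional mount points; to_ufs_uri is a hypothetical helper and the host names are placeholders.

```python
from urllib.parse import urlparse


def to_ufs_uri(alluxio_uri, ufs_root):
    """Map an alluxio:// URI to the corresponding URI under the UFS root."""
    parsed = urlparse(alluxio_uri)
    if parsed.scheme != "alluxio":
        raise ValueError(f"not an alluxio URI: {alluxio_uri}")
    return ufs_root.rstrip("/") + "/" + parsed.path.lstrip("/")


# Host names below are placeholders.
print(to_ufs_uri("alluxio://alluxio-host:19998/test.log",
                 "hdfs://namenode-host/alluxio/data"))
# hdfs://namenode-host/alluxio/data/test.log
```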

3 Access from Spark

1 Preparation (choose one of the two)

1）Configuration

spark.driver.extraClassPath /<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar
spark.executor.extraClassPath /<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar

This Alluxio client jar file can be found at /<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar

2）Copy the jar

$ cp client/alluxio-1.8.1-client.jar $SPARK_HOME/jars/

2 Access

$ spark-shell
scala> val s = sc.textFile("alluxio://localhost:19998/derby.log")
s: org.apache.spark.rdd.RDD[String] = alluxio://localhost:19998/derby.log MapPartitionsRDD[1] at textFile at <console>:24

scala> s.foreach(println)
----------------------------------------------------------------
Thu Jan 10 11:05:45 CST 2019:

Reference: http://www.alluxio.org/docs/1.8/en/compute/Spark.html

4 Access from Hive

Copy the jar

$ cp client/alluxio-1.8.1-client.jar $HIVE_HOME/lib/

$ cp client/alluxio-1.8.1-client.jar $HADOOP_HOME/share/hadoop/common/lib/

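Restart the metastore and hiveserver2.

5 Deployment modes

1 Cluster HA deployment

That is, multiple workers + multiple masters + ZooKeeper.

1 Make the cluster servers reachable from one another via SSH

Same as above.

2 Configuration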

$ cat conf/alluxio-site.properties
#alluxio.master.hostname=<MASTER_HOSTNAME>
alluxio.zookeeper.enabled=true
alluxio.zookeeper.address=<ZOOKEEPER_ADDRESS>
alluxio.master.journal.folder=hdfs://$namenode_server/alluxio/journal/
alluxio.worker.memory.size=20GB

Sync the configuration to all servers in the cluster.

3 Configure masters and workers