Learning GlusterFS (Part 3)

Date: 2019-11-21

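GlusterFS (GNU cluster file system) was founded by Anand Babu Periasamy with the goal of replacing the open-source Lustre and the commercial product GPFS. What GlusterFS is:

cloud storage;

a distributed file system (POSIX-compatible);

elasticity (adapts flexibly to growth and reduction; volumes and users can be added or deleted without disruption);

a decentralized architecture with no metadata server; eliminating metadata improves file access speed;

scale-out (capacity and performance), high performance, and high availability; scales linearly in multiple dimensions (performance, capacity) over aggregated resources;

a clustered NAS storage system;

runs on heterogeneous, standard commodity hardware and InfiniBand;

resource pooling (aggregated storage and memory);

a single global namespace;

automatic replication and automatic self-healing;

easy to deploy and use: simplicity (ease of management, no complex kernel patches, runs in user space).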

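GlusterFS is an open-source distributed file system with strong scale-out capability. It supports several petabytes of storage capacity and thousands of clients, aggregates physically distributed storage resources over TCP/IP or InfiniBand RDMA, and manages data under a single global namespace. Its stackable user-space design delivers good performance for a variety of data workloads.

Strengths: no metadata server, elastic hashing, scale-out; high performance, PB-level capacity, GB-level throughput, clusters of hundreds of nodes; modular, stackable user-space design; high availability with automatic replication and self-healing; well suited to large-file storage.

Weaknesses: poor small-file performance; poor OPS (operations per second); low storage utilization with replication (the available schemes being HA replication and erasure coding).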

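The project started in 2006:

2006-2009: GlusterFS v1.0-3.0 (distributed file system, self-healing, synchronous replicas, striping, elastic hash algorithm);

2010: GlusterFS v3.1 (elastic cloud capability);

2011: GlusterFS v3.2 (remote replication, monitoring, quota; acquired by Red Hat for USD 136 million);

2012: GlusterFS v3.3 (object storage, HDFS compatibility, proactive self-healing, fine-grained locking, replication optimizations);

2013: GlusterFS v3.4 (libgfapi, quorum mechanism, virtual machine storage optimizations, synchronous replication optimizations, POSIX ACL support).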

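Advantages of the GlusterFS architecture: software-defined, decentralized, single global namespace, high performance, high availability, stackable user-space design, elastic scale-out, high-speed network communication, automatic data repair.

GlusterFS performance record: 32 GB/s (server side: 64 bricks with ib-verbs transport; client side: cluster of 220 servers).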

http://blog.csdn.net/liuaigui/

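Use cases:

structured and semi-structured data; unstructured data storage (files); archiving and disaster recovery; virtual machine storage; cloud storage; content cloud; big data.

Solutions:

media/CDN; backup, archiving, and disaster recovery; large-scale data sharing; user home directories; high-performance computing; cloud storage.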

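GlusterFS elastic volume management:

Elastic hash algorithm: there is no centralized metadata service, which removes a performance bottleneck and improves reliability. A 32-bit hash value is computed with the Davies-Meyer algorithm, with the file name as input. The hash value selects a subvolume (storage server) in the cluster, which locates the file, and data access then goes to that subvolume. For example, with the hash space split evenly: brick1 (00000000-3FFFFFFF), brick2 (40000000-7FFFFFFF), brick3 (80000000-BFFFFFFF).

Files are located with a hash algorithm (based on path and file name; DHT, distributed hash table, consistent hashing).

Elastic volume management: files are stored in logical volumes; logical volumes are carved out of the physical storage pool; logical volumes can be expanded and shrunk online.

DHT is the foundation of GlusterFS elastic scaling; it determines the mapping between a target hash and a brick.

After adding a node, data redistribution is minimized: the placement of old data stays unchanged and new data is distributed across all nodes. Running a rebalance (do this outside peak access hours) redistributes the existing data.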
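The range lookup described above can be sketched in a few lines of bash. This is a minimal illustration, assuming the 32-bit hash space is split evenly across the bricks; the function names are hypothetical, and `cksum` (CRC32) is only a stand-in for the Davies-Meyer hash that GlusterFS actually uses, so the brick chosen here will not match a real cluster.

```shell
#!/usr/bin/env bash
# Map a 32-bit hash value to one of N bricks that split the hash space evenly.
brick_for_hash() {
  local hash=$1 bricks=$2
  # (hash * bricks) / 2^32 is the zero-based range index (bash uses 64-bit integers)
  echo $(( (hash * bricks) >> 32 ))
}

# Stand-in hash: CRC32 from cksum, NOT the Davies-Meyer hash GlusterFS uses.
name_hash() {
  printf '%s' "$1" | cksum | cut -d' ' -f1
}

# Combine both steps: file name -> hash -> brick index.
brick_for_name() {
  brick_for_hash "$(name_hash "$1")" "$2"
}

brick_for_hash $(( 0x3FFFFFFF )) 4   # last hash of the first range, prints 0
brick_for_hash $(( 0x40000000 )) 4   # first hash of the second range, prints 1
```

Because the location is computed from the name alone, any client can find a file without asking a metadata server, which is the point of the design.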

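GlusterFS overall architecture:

Stackable software architecture:

Single global namespace (the distributed file system virtualizes physically scattered storage resources into one unified storage pool):

No centralized metadata service:

Basic concepts: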

brick（a file system mountpoint; a unit of storage used as a glusterfs building block）；

translator（logic between the bits and the global namespace; layered to provide glusterfs functionality）；

volume（bricks combined and passed through translators）；

node/peer（server running the gluster daemon and sharing volumes）；

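GlusterFS volume types (basic volumes and composite volumes):

Basic volumes:

Distributed volume (hash volume): files are spread over all bricks by the hash algorithm; file-level RAID 0; no fault tolerance.

Replicated volume (commonly used in production): files are synchronously replicated to multiple bricks; file-level RAID 1; fault tolerant; write performance drops and read performance improves.

Striped volume (not recommended): a single file is spread across multiple bricks, which supports very large files; similar to RAID 0, using round-robin (rr); typically used in HPC (high performance computing) for very large files (single files > 10G) and in high-concurrency environments (many users accessing the same file at once).

Composite volumes:

Distributed replicated volume (commonly used in production): combines the properties of distributed and replicated volumes.

Distributed striped volume.

Replicated striped volume.

Distributed replicated striped volume.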
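As a rough worked example of how the volume type affects usable space, the sketch below computes raw usable capacity from the brick count, brick size, and replica count. The function name is hypothetical, all bricks are assumed to be the same size, and file system overhead is ignored; distributed and striped volumes use a replica count of 1.

```shell
#!/usr/bin/env bash
# usable_capacity_gb BRICKS BRICK_SIZE_GB [REPLICA]
# Each file is stored REPLICA times, so usable space is total space / REPLICA.
usable_capacity_gb() {
  local bricks=$1 size_gb=$2 replica=${3:-1}
  if (( bricks % replica != 0 )); then
    echo "brick count must be a multiple of the replica count" >&2
    return 1
  fi
  echo $(( bricks / replica * size_gb ))
}

usable_capacity_gb 2 5      # distributed volume, two 5 GB bricks: prints 10
usable_capacity_gb 2 5 2    # replicated volume, replica 2: prints 5
usable_capacity_gb 4 5 2    # distributed replicated, 2 x 2 bricks: prints 10
```

This is why replicated volumes are listed under the weaknesses for storage utilization: a replica count of 2 halves the usable space.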

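GlusterFS access interfaces:

FUSE architecture:

Gluster data flow:

FUSE write; libgfapi access:

libgfapi access:

Data self-healing:

Evolution: on-demand synchronous healing --> full manual scan --> concurrent automatic healing --> log-based healing.

Replicas of a mirrored file are kept consistent.

Trigger: when a file or directory is accessed.

Basis for the decision: extended attributes.

Split-brain: report an error or resolve it by rule.

Capacity load balancing:

After a rebalance the hash ranges are evenly redistributed; adding even one node changes the layout globally.

Goal: optimize data distribution while minimizing data migration.

Data migration is automated, intelligent, and parallel.

File rename:

fileA --> fileB: the original hash mapping no longer holds, and large files are hard to migrate in real time.

Symbolic link files are used heavily; they are resolved and redirected at access time.

Capacity-first placement:

Set a capacity threshold and prefer bricks with enough free capacity.

A symbolic link file is created on the hash-target brick and redirects at access time.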

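GlusterFS test methods (functional tests (broad and narrow sense), data consistency tests, POSIX semantics compatibility tests, deployment tests, availability tests, scalability tests, stability tests, stress tests, performance tests):

Functional tests: manual or scripted. For GlusterFS: creating, starting, stopping, and deleting volumes, setting options, and so on. For the file system: fstest (file control and operations), LTP (system API calls), locktest (locking).

Data consistency tests: check that the data read back matches the data written. Methods: MD5 checksums, diff, compiling a kernel, and so on.

POSIX semantics tests: PCTS, LTP.

Deployment tests: test how the system is deployed in different scenarios, including automated installation and configuration, cluster size, and network and storage configuration.

Availability tests: test the system's high availability, i.e. whether the system remains usable when some servers, disks, or networks in the cluster fail, and whether management stays simple and reliable. Feature points covered: replicas, self-healing, management services.

Scalability tests: test the elastic scaling feature, the performance impact after scaling, and linear scalability.

Stability tests: verify that the system and its features behave correctly over long run times, automated with LTP, iozone, and postmark.

Stress tests: verify system behaviour and resource consumption under heavy load, automated with iozone and postmark; monitor the system with top, iostat, sar, and similar tools.

Performance tests: measure performance under different loads, automated with iozone (bandwidth), postmark (ops), fio (iops), and dd. Key points: sequential read/write, random read/write, directory operations (create, delete, lookup, update), large numbers of small files, large files. Main metrics: iops (random read/write of small files) and bandwidth (continuous read/write of large files). Other metrics: CPU utilization, iowait.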

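dd (large files, sequential read/write, bandwidth, single process, temporary file, results recorded by hand, output cannot be redirected):

#dd if=/dev/zero of=/mnt/mountpoint/file bs=1M count=100 # (write)

#dd if=/mnt/mountpoint/file of=/dev/null bs=1M # (read)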

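iozone (sequential/random read/write, bandwidth, multiple processes, temporary files optionally kept, can record results in an Excel sheet automatically):

#iozone -t 1 -s 1g -r 128k -i 0 -i 1 -i 2 -R -b /result.xls -F /mnt/mountpoint/file

-t (number of processes);

-s (size of the test file);

-r (record/block size);

-i # (selects which tests to run);

-R (generate an Excel-style report);

-b (write the results to a binary Excel-compatible file);

-F (list of temporary test files);

-g (maximum test file size).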

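postmark (ops, metadata operations (create, read, write, append, delete), small files, single process, results can be redirected, leaves no temporary files, driven by a config file or the CLI):

Common parameters:

set size min_size max_size (lower and upper bounds of the file size)

set number XXX (number of simultaneous files)

set seed XXX (random number seed)

set transactions XXX (number of transactions)

set location (working directory; it must already exist; defaults to the current directory)

set subdirectories n (number of subdirectories under each working directory)

set read n (read block size)

set write n (write block size)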

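fio (iops, metadata operations (create, read, write, append, delete), small files, multiple processes, results can be redirected, leaves no temporary files, driven by a config file or the CLI):

Parameters:

filename=/tmp/file (test file name)

direct=1 (bypass the operating system's buffer cache during the test)

rw=randrw (mixed random read and write I/O)

bs=16k (block size of each I/O is 16k)

bsrange=512-2048 (as above, but gives a range of block sizes)

size=5g (test file size is 5g)

numjobs=30 (number of test threads)

runtime=1000 (run for 1000 s; if omitted, run until the whole file has been written)

ioengine=sync (use the sync I/O engine)

rwmixwrite=30 (in mixed read/write mode, writes make up 30%)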

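Other performance tests:

File system (make, mount, umount, remount);

copy, recopy, remove (large file, >= 4g);

extract, tar (Linux kernel source tree);

copy, recopy, remove (Linux kernel source tree);

list, find (Linux kernel source tree);

compile the Linux kernel;

create, copy, remove (directory with a huge number of files, >= 1000000).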

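File system categories:

Distributed file system (client/server architecture or network file system; data is not locally attached);

Cluster file system (a subset of distributed file systems; multiple nodes serve cooperatively, with no single point of failure);

Parallel file system (supports parallel applications such as MPI; concurrent read/write, all nodes can read and write the same file at the same time).

Products:

Commercial: EMC Isilon; IBM SONAS; HP X9000; Huawei OceanStor 9000; Blue Whale BWFS; Loongcun LoongStore.

Open source: Lustre; GlusterFS; Ceph; MooseFS; HDFS; FastDFS; TFS.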

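MooseFS:

MooseFS is a highly fault-tolerant distributed file system. It spreads data over several physical media while presenting a single access interface to users. High reliability (data can be stored in several different places); scalability (servers or disks can be added dynamically to increase capacity); high controllability (the retention interval for deleted files can be configured); traceability (file snapshots can be generated according to the operation performed on a file, read or write).

Lustre:

Lustre is an open-source distributed file system based on object storage that provides a POSIX-compatible file system interface. It currently supports up to 100,000 clients, 1,000 OSSs, and 2 MDS nodes. Experiments and deployments have shown that its performance and scalability are good. It also offers object-based intelligent storage, secure authentication, solid fault tolerance, and file locking. Sun described Lustre as the most scalable parallel file system in the world; 6 of the world's top 10 supercomputers and 40% of the top 100 use it.

Lustre components:

Metadata storage and management: the MDS manages metadata and provides a global namespace; clients read the metadata stored on the MDT through the MDS. Lustre can have 2 MDSs in an active-standby arrangement, so that when one MDS fails the other takes over. There can be only 1 MDT, which the MDSs share.

File data storage and management: the OSS provides I/O service and handles requests from the network; file data stored on the OSTs is reached through an OSS. One OSS serves 2-8 OSTs. File data on the OSTs is stored in stripes, and the stripes of a file can sit within one OSS or across several. One of Lustre's distinguishing features is that its data is held in object-based intelligent storage, unlike traditional block-based storage.

Lustre access entry point: the system is accessed through a client, which is any node that has mounted the Lustre file system. The client provides the interface between the Linux VFS and Lustre, and through it users can access and operate on files in Lustre.

Ceph:

Ceph is an open-source unified distributed storage platform for block, object, and file storage, a new-generation free-software distributed file system that Sage Weil designed for his doctoral thesis. In 2010 Linus Torvalds merged the Ceph client into the 2.6.34 kernel. Strengths: metadata cluster, dynamic metadata partitioning, intelligent object storage system, PB-level storage, high reliability, replication, automatic fault detection and repair, adapts to different application workloads, performs well for both large and small files. Weaknesses: data availability depends heavily on the underlying file system (btrfs), low storage utilization with replication, overly complex design and implementation, complex management, still immature, and not recommended for production.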

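Comparison of open-source parallel file systems (GlusterFS vs MooseFS vs Lustre vs Ceph):

Dimension	GlusterFS	MooseFS	Lustre	Ceph
Maturity	First GA release 1.2.3 in 2005, GA release 3.3.2 in 2013; mature system architecture and complete engineering code	First open-source release v1.5 in 2008, GA release v1.6.27 in 2013; stable, fairly mature open-source DFS	First release Lustre 1.0 in 2003, v2.4.0 in 2013; very mature, dominant share of the HPC field	v0.71 released in 2013 and added to the Linux kernel as an experimental feature; currently immature with many bugs; even the stable updates are experimental releases
Stability	Fairly stable, no major bugs, already used in production by many organizations	Fairly stable, no major bugs	Very stable, widely used in HPC	Core component RADOS is fairly stable; a stable release every 3 months; some companies use it in production
Complexity	Simple; no metadata service, user-space implementation, clear architecture, xlator tree structure	Simple; user-space implementation, small code base, highly modular	Fairly complex; depends on a kernel implementation	Fairly complex; implemented in C++, many features
Performance	Removes the metadata bottleneck; parallel data access	Single-point metadata bottleneck	High performance; excellent in HPC	Evenly distributed data; highly parallel
Scalability	Elastic hashing replaces the metadata service; linear scaling, easily scales to hundreds of PB; supports dynamic expansion	Storage servers can be added; MDS cannot	Highly scalable; capacity up to hundreds of PB; expands dynamically by adding OSSs without interrupting any operation	Highly scalable; supports 10-1000 servers and scaling from TB to PB; data is redistributed automatically when components are added or removed
Availability	No metadata service; data distribution offers three modes: AFR, DHT, stripe; supports automatic replication and self-healing	Metadata server plus log server protects the metadata server; metadata is held in memory at run time; replicas can be configured	Metadata cluster that can run active-standby; no replica design; OSSs can use shared storage for automatic failover	Metadata cluster, no single point of failure, multiple data replicas, automatic management and repair; monitors track the state of all nodes, and multiple monitors can be run for reliability
Manageability	Simple to deploy, easy to manage and maintain; uses an underlying file system (ext3/zfs); increases client load; provides management tools such as volume expansion, data load balancing, directory quotas, and monitoring	Simple to deploy; provides a web GUI for monitoring, metadata recovery, file recovery, recycle bin, snapshots	Complex to deploy, requires kernel upgrades and so on; provides management tools such as setting directory stripes	Fairly complex to deploy; provides tools to monitor and manage the cluster, including cluster state and component state
R&D cost	User-space implementation, modular stackable architecture	User-space implementation, small scale	High; kernel-space implementation with a large code base	Fairly high; large code base, many features
Suitability	Suited to file-oriented storage systems and large-file storage	Small clusters; metadata bottleneck, high memory consumption	Large files, HPC	
NAS compatibility	Supports NFS, CIFS, HTTP, FTP, and the native gluster protocol; POSIX-compatible	Supports CIFS and NFS; standard POSIX interface	Supports CIFS and NFS; standard POSIX interface	Supports CIFS and NFS; standard POSIX interface
Adoption rating	☆☆☆☆	☆☆☆	☆☆☆	☆☆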

##########################################################################################

Hands-on:

Prepare three virtual machines:

client (10.96.20.118/24, used to test mounting)

server1 (eth0: 10.96.20.113/24, eth1: 192.168.10.113/24, /dev/sdb (5G))

server2 (eth0: 10.96.20.114/24, eth1: 192.168.10.114/24, /dev/sdb (5G))

Contents of /etc/hosts:

10.96.20.113 server1

10.96.20.114 server2

10.96.20.118 client

Yum repository for installing the packages:

http://download.gluster.org/pub/gluster/glusterfs/3.6/LATEST/CentOS/glusterfs-epel.repo

Package location:

http://download.gluster.org/pub/gluster/glusterfs/<version>/

Test tools:

atop-1.27-2.el6.x86_64.rpm

fio-2.0.13-2.el6.x86_64.rpm

iperf-2.0.5-11.el6.x86_64.rpm

iozone-3-465.i386.rpm

[root@server1 ~]# cat /etc/redhat-release

Red Hat Enterprise Linux Server release 6.5(Santiago)

[root@server1 ~]# uname -rm

2.6.32-431.el6.x86_64 x86_64

[root@server1 ~]# cat /proc/sys/net/ipv4/ip_forward

[root@server1 ~]# mount | grep brick1

/dev/sdb on /brick1 type ext4 (rw)

[root@server1 ~]# df -h | grep brick1

/dev/sdb 5.0G 138M 4.6G 3% /brick1

[root@server2 ~]# mount | grep brick1

/dev/sdb on /brick1 type ext4 (rw)

[root@server2 ~]# df -h | grep brick1

/dev/sdb 5.0G 138M 4.6G 3% /brick1

Run on both server1 and server2:

[root@server1 ~]# wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-6.repo

[root@server1 ~]# vim /etc/yum.repos.d/CentOS-Base.repo

:%s/$releasever/6/g

[root@server1 ~]# yum -y install rpcbind libaio lvm2-devel # (install the dependencies from the CentOS, EPEL, or Aliyun yum repositories; these repositories only carry the client-side glusterfs packages, not glusterfs-server)

[root@server1 ~]# wget -P /etc/yum.repos.d/ http://download.gluster.org/pub/gluster/glusterfs/3.6/LATEST/CentOS/glusterfs-epel.repo

[root@server1 ~]# yum -y install glusterfs-server

Installing:

glusterfs-server x86_64 3.6.9-1.el6 glusterfs-epel 720 k

Installing for dependencies:

glusterfs x86_64 3.6.9-1.el6 glusterfs-epel 1.4 M

glusterfs-api x86_64 3.6.9-1.el6 glusterfs-epel 64 k

glusterfs-cli x86_64 3.6.9-1.el6 glusterfs-epel 143 k

glusterfs-fuse x86_64 3.6.9-1.el6 glusterfs-epel 93 k

glusterfs-libs x86_64 3.6.9-1.el6 glusterfs-epel 282 k

……

[root@server1 ~]# cd glusterfs/

[root@server1 glusterfs]# ll

total 1208

-rw-r--r--. 1 root root 108908 Jan 17 2014 atop-1.27-2.el6.x86_64.rpm

-rw-r--r--. 1 root root 232912 Dec 22 2015 fio-2.0.13-2.el6.x86_64.rpm

-rw-r--r--. 1 root root 833112 Sep 11 18:41 iozone-3-465.i386.rpm

-rw-r--r--. 1 root root 54380 Jan 3 2014 iperf-2.0.5-11.el6.x86_64.rpm

[root@server1 glusterfs]# rpm -ivh atop-1.27-2.el6.x86_64.rpm

[root@server1 glusterfs]# rpm -ivh fio-2.0.13-2.el6.x86_64.rpm

[root@server1 glusterfs]# rpm -ivh iozone-3-465.i386.rpm

[root@server1 glusterfs]# rpm -ivh iperf-2.0.5-11.el6.x86_64.rpm

[root@server1 ~]# service glusterd start

Starting glusterd: [ OK ]

server2：

[root@server2 ~]# service glusterd start

Starting glusterd: [ OK ]

server1：

[root@server1 ~]# gluster help # (the gluster command also has an interactive mode)

peer probe <HOSTNAME> - probe peer specified by <HOSTNAME> # (add a node to form the cluster; a hostname or an IP both work)

peer detach <HOSTNAME> [force] - detach peer specified by <HOSTNAME> # (remove a node)

peer status - list status of peers

volume info [all|<VOLNAME>] - list information of all volumes # (show volume information)

volume create <NEW-VOLNAME> [stripe <COUNT>] [replica <COUNT>] [disperse [<COUNT>]] [redundancy <COUNT>] [transport <tcp|rdma|tcp,rdma>] <NEW-BRICK>?<vg_name>... [force] - create a new volume of specified type with mentioned bricks # (create a volume)

volume delete <VOLNAME> - delete volume specified by <VOLNAME>

volume start <VOLNAME> [force] - start volume specified by <VOLNAME> # (start a volume)

volume stop <VOLNAME> [force] - stop volume specified by <VOLNAME>

volume add-brick <VOLNAME> [<stripe|replica> <COUNT>] <NEW-BRICK> ... [force] - add brick to volume <VOLNAME> # (add bricks)

volume remove-brick <VOLNAME> [replica <COUNT>] <BRICK> ... <start|stop|status|commit|force> - remove brick from volume <VOLNAME>

volume rebalance <VOLNAME> {{fix-layout start} | {start [force]|stop|status}} - rebalance operations

[root@server1 ~]# gluster peer probe server2

peer probe: success.

[root@server1 ~]# gluster peer status

Number of Peers: 1

Hostname: server2

Uuid: 4762db74-3ddc-483a-a510-5756d7402afb

State: Peer in Cluster (Connected)

server2：

[root@server2 ~]# gluster peer status

Number of Peers: 1

Hostname: server1

Uuid: b38bd899-6667-4253-9313-7538fcb5153f

State: Peer in Cluster (Connected)

server1：

[root@server1 ~]# gluster volume create testvol server1:/brick1/b1 server2:/brick1/b1 # (a distributed (hash) volume is created by default; to build it in two steps instead, first run #gluster volume create testvol server1:/brick1/b1, then add the second brick with #gluster volume add-brick testvol server2:/brick1/b1)

volume create: testvol: success: please start the volume to access data

[root@server1 ~]# gluster volume start testvol

volume start: testvol: success

[root@server1 ~]# gluster volume info

Volume Name: testvol

Type: Distribute

Volume ID: 095708cc-3520-49f7-89f8-070687c28245

Status: Started

Number of Bricks: 2

Transport-type: tcp

Bricks:

Brick1: server1:/brick1/b1

Brick2: server2:/brick1/b1

client (mount and use):

[root@client ~]# wget -P /etc/yum.repos.d/ http://download.gluster.org/pub/gluster/glusterfs/3.6/LATEST/CentOS/glusterfs-epel.repo

[root@client ~]# yum -y install glusterfs glusterfs-fuse # (the client only needs glusterfs, glusterfs-libs, and glusterfs-fuse)

Installing:

glusterfs x86_64 3.6.9-1.el6 glusterfs-epel 1.4 M

glusterfs-fuse x86_64 3.6.9-1.el6 glusterfs-epel 93 k

Installing for dependencies:

glusterfs-api x86_64 3.6.9-1.el6 glusterfs-epel 64 k

glusterfs-libs x86_64 3.6.9-1.el6 glusterfs-epel 282 k

……

[root@client ~]# mount -t glusterfs server1:/testvol /mnt/glusterfs/

[root@client ~]# mount | grep gluster

server1:/testvol on /mnt/glusterfs type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)

[root@client ~]# df -h | grep glusterfs # (total capacity is the sum of server1 and server2)

server1:/testvol 9.9G 277M 9.1G 3% /mnt/glusterfs

[root@client ~]# cd /mnt/glusterfs/

[root@client glusterfs]# for i in `seq 1 50` ; do touch test$i.txt ; done

server1:

[root@server1 ~]# ls /brick1/b1

test10.txt test17.txt test1.txt test24.txt test27.txt test30.txt test32.txt test35.txt test38.txt test43.txt test4.txt

test16.txt test18.txt test22.txt test26.txt test29.txt test31.txt test34.txt test37.txt test3.txt test46.txt test7.txt

[root@server1 ~]# ll /brick1/b1 | wc -l

server2:

[root@server2 ~]# ls /brick1/b1

test11.txt test14.txt test20.txt test25.txt test33.txt test40.txt test44.txt test48.txt test5.txt test9.txt

test12.txt test15.txt test21.txt test28.txt test36.txt test41.txt test45.txt test49.txt test6.txt

test13.txt test19.txt test23.txt test2.txt test39.txt test42.txt test47.txt test50.txt test8.txt

[root@server2 mnt]# ll /brick1/b1/ | wc -l
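Note that `ll | wc -l` also counts the leading "total" line that `ls -l` prints, so it reports one more than the number of entries. A minimal sketch of a more direct count is shown below; the function name is hypothetical, and it counts only regular files directly under the brick directory, so the hidden .glusterfs directory is not included.

```shell
#!/usr/bin/env bash
# Count regular files directly under a brick directory.
# Directories such as .glusterfs are skipped because of -type f.
count_brick_files() {
  find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Example (run on each server): count_brick_files /brick1/b1
```

Running it on both servers and adding the two counts should give the number of files created from the client, since a distributed volume stores each file on exactly one brick.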

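Summary:

Configuration: /var/lib/glusterd/* (releases before 3.3 used /etc/glusterd/*); logs: /var/log/glusterfs/* (I = info, E = error).

#gluster peer probe server2 # (form the cluster; run it on one node only. To add a new node, run it on any node already in the cluster. A hostname or an IP both work; hostnames must resolve via /etc/hosts)

#gluster peer probe server3

#gluster peer status

#gluster volume create testvol server1:/brick1/b1 server2:/brick1/b1 # (create a volume; run on one node only; a distributed (hash) volume by default. A replicated volume is created the same way with #gluster volume create testvol replica 2 server1:/brick1/b2 server2:/brick1/b2; use stripe 2 for a striped volume)

#gluster volume start testvol

#gluster volume info

#mount -t glusterfs server1:/testvol /mnt/glusterfs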

Deleting a volume:

#gluster volume stop testvol

#gluster volume delete testvol # (the underlying brick contents remain after the volume is deleted)

#gluster volume info

#rm -rf /brick1/b1

#rm -rf /brick1/b2

#rm -rf /brick1/b3

Removing a machine from the cluster:

#gluster peer detach IP|HOSTNAME

Adding machines to the cluster:

#gluster peer probe server11

#gluster peer probe server12

#gluster peer status

#gluster volume add-brick testvol server11:/brick1/b1 server12:/brick1/b1

#gluster volume rebalance testvol start # (rebalance; do this outside peak access hours. It has two steps: fix-layout first reassigns the hash ranges, then the data is redistributed; #gluster volume rebalance <VOLNAME> {{fix-layout start} | {start [force]|stop|status}} - rebalance operations)

#gluster volume rebalance testvol status
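The expansion steps above can be collected into a small helper that only prints the command sequence for review, so nothing is run against the cluster by accident. The function name `expansion_plan` is hypothetical; the subcommands are the ones listed in the `gluster help` output earlier. In practice it is safer to wait until the fix-layout step reports completion in the status output before starting the data migration.

```shell
#!/usr/bin/env bash
# Print (do not execute) the commands for adding bricks and rebalancing.
# Usage: expansion_plan VOLNAME BRICK [BRICK...]
expansion_plan() {
  local vol=$1
  shift
  echo "gluster volume add-brick $vol $*"
  echo "gluster volume rebalance $vol fix-layout start"   # step 1: reassign hash ranges
  echo "gluster volume rebalance $vol start"              # step 2: migrate existing data
  echo "gluster volume rebalance $vol status"             # check progress
}

expansion_plan testvol server11:/brick1/b1 server12:/brick1/b1
```

Printing the plan first makes it easy to check brick paths and to schedule the rebalance outside peak hours.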

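Syncing volume information (done on a replicated volume):

#gluster volume sync server1 [all|VOLUME] # (if the volume information on server2 is faulty, sync it from server1; all syncs every volume; if only one volume is affected, give its VOLNAME)

Repairing disk data (server1 went down while in use; replace its brick with one on server2 and resynchronize the data):

#gluster volume replace-brick testvol server1:/brick1/b1 server2:/brick1/b1 commit force

#gluster volume heal testvol full

When the data in a replicated volume is inconsistent (fix: walk the tree and access every file to trigger self-healing):

#find /mnt/glusterfs -type f -print0 | xargs -0 head -c 1

When one brick of a replicated volume is damaged:

#getfattr -d -m . -e hex /brick1/b1 # (inspect the extended attributes on a healthy node; note the following three attributes and paste their values in place of the ellipses)

#setfattr -n trusted.gfid -v 0x000…… /brick1/b1

#setfattr -n trusted.glusterfs.dht -v 0x000…… /brick1/b1

#setfattr -n trusted.glusterfs.volume-id -v 0x000…… /brick1/b1

#getfattr -d -m . -e hex /brick1/b1

#service glusterd restart # (restart only on the faulty node)

#ps aux | grep gluster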

Volume options (#gluster volume set <VOLNAME> <KEY> <VALUE> - set options for volume <VOLNAME>):

<KEY> <VALUE> can be:

auth.reject <IP> # (IP access control; default is allow all)

auth.allow <IP>

cluster.min-free-disk <percentage> # (free disk space threshold; default 10%)

cluster.stripe-block-size <NUM> # (stripe size; default 128KB)

network.frame-timeout <0-1800> # (request timeout; default 1800s)

network.ping-timeout <0-42> # (client wait time; default 42s)

nfs.disable <off|on> # (disable the NFS service; the default, off, means NFS is enabled)

performance.io-thread-count <0-65> # (number of I/O threads; default 16)

performance.cache-refresh-timeout <0-61> # (cache revalidation interval; default 1s)

performance.cache-size <NUM> # (read cache size; default 32MB)
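As an illustration of the `gluster volume set` syntax, the sketch below turns a list of KEY=VALUE pairs into the corresponding commands and prints them. The helper name `volume_set_cmds` is hypothetical, and the values shown (the client IP from this setup and the defaults listed above) are examples only.

```shell
#!/usr/bin/env bash
# Print one "gluster volume set" command per KEY=VALUE argument.
# Usage: volume_set_cmds VOLNAME KEY=VALUE [KEY=VALUE...]
volume_set_cmds() {
  local vol=$1 pair
  shift
  for pair in "$@"; do
    # Split at the first '=': key is the part before it, value the part after
    echo "gluster volume set $vol ${pair%%=*} ${pair#*=}"
  done
}

volume_set_cmds testvol auth.allow=10.96.20.118 cluster.min-free-disk=10% performance.cache-size=32MB
```

The printed commands can be reviewed and then run on any node of the cluster; `gluster volume info` lists the options that have been changed from their defaults.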

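Network configuration tests:

IP check (#ip addr; #ifconfig);

Gateway test (#ip route show; #route -n);

DNS test (#cat /etc/resolv.conf; #nslookup);

Connectivity (#ping IP);

Network performance (on the server side run #iperf -s; on the client side run #iperf -c SERVER_IP [-P #], where -P, --parallel sets the number of threads).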

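Checking gluster's own configuration:

#gluster peer status # (cluster status)

#gluster volume info # (volume configuration)

#gluster volume status # (volume status)

#gluster volume profile testvol start|info

Performance tests (basic performance, bandwidth, iops, ops, system monitoring):

Basic performance:

#dd if=/dev/zero of=dd.dat bs=1M count=1k

#dd if=dd.dat of=/dev/null bs=1M count=1k

Bandwidth test:

iozone is a widely used standard file system benchmark. It generates and measures a variety of operations, including read, write, re-read, re-write, read backwards, read strided, fread, fwrite, random read, pread, mmap, aio_read, and aio_write. iozone has been ported to many architectures and operating systems and is widely used for testing, analysing, and evaluating file system performance.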

[root@server1 ~]# /opt/iozone/bin/iozone -h

-r # record size in Kb

or -r #k .. size in kB

or -r #m .. size in MB

or -r #g .. size in GB

-s # file size in Kb

or -s #k .. size in kB

or -s #m .. size in MB

or -s #g .. size in GB

-t # Number of threads or processes to use in throughput test

-i # Test to run (0=write/rewrite, 1=read/re-read, 2=random-read/write

3=Read-backwards, 4=Re-write-record, 5=stride-read, 6=fwrite/re-fwrite

7=fread/Re-fread, 8=random_mix, 9=pwrite/Re-pwrite, 10=pread/Re-pread

11=pwritev/Re-pwritev, 12=preadv/Re-preadv)

[root@server1 ~]# /opt/iozone/bin/iozone -r 1m -s 128m -t 4 -i 0 -i 1

……

Record Size 1024 kB

File size set to 131072 kB

Command line used: /opt/iozone/bin/iozone -r 1m -s 128m -t 4 -i 0 -i 1

Output is in kBytes/sec

Time Resolution = 0.000001 seconds.

Processor cache size set to 1024 kBytes.

Processor cache line size set to 32 bytes.

File stride size set to 17 * record size.

Throughput test with 4 processes

Each process writes a 131072 kByte file in 1024 kByte records

Children see throughput for 4 initial writers = 49269.04 kB/sec

Parent sees throughput for 4 initial writers = 41259.88 kB/sec

Min throughput per process = 9069.08 kB/sec

Max throughput per process = 14695.71 kB/sec

Avg throughput per process = 12317.26 kB/sec

Min xfer = 80896.00 kB

……
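The figures in this report can be cross-checked with a little arithmetic: the aggregate throughput seen by the children, divided by the number of processes, should equal the reported per-process average, and dividing by 1024 converts kB/sec to MB/sec. The helper names below are hypothetical, and the sketch assumes awk is available.

```shell
#!/usr/bin/env bash
# Convert a throughput in kB/sec to MB/sec (two decimals).
kb_to_mb() {
  awk -v kb="$1" 'BEGIN { printf "%.2f\n", kb / 1024 }'
}

# Average throughput per process from the aggregate figure.
avg_per_process() {
  awk -v total="$1" -v n="$2" 'BEGIN { printf "%.2f\n", total / n }'
}

kb_to_mb 49269.04            # aggregate initial-write throughput: prints 48.11
avg_per_process 49269.04 4   # matches the reported average: prints 12317.26
```

So the four writers together achieved roughly 48 MB/sec in this run.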

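iops test:

fio is a standard I/O benchmarking and hardware stress verification tool. It supports 13 different types of I/O engine (sync, mmap, libaio, posixaio, SG v3, splice, null, network, syslet, guasi, solarisaio, and others), I/O priorities (for newer Linux kernels), rate I/O, forked or threaded jobs, and more. fio can test both block devices and file systems and is widely used for benchmarking, QA, and verification. It supports Linux, FreeBSD, NetBSD, OS X, OpenSolaris, AIX, HP-UX, Windows, and other operating systems.

A SATA disk typically delivers about 80 iops.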

[root@server1 ~]# vim fio.conf

[global]

ioengine=libaio

direct=1

thread=1

norandommap=1

randrepeat=0

filename=/mnt/fio.dat

size=100m

[rr]

stonewall

group_reporting

bs=4k

rw=randread

numjobs=8

iodepth=4

[root@server1 ~]# fio fio.conf

rr: (g=0): rw=randread,bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=4

...

rr: (g=0): rw=randread,bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=4

fio-2.0.13

Starting 8 threads

rr: Laying out IO file(s) (1 file(s) /100MB)

Jobs: 1 (f=1): [_____r__] [99.9% done] [721K/0K/0K /s] [180 /0 /0 iops] [eta 00m:01s]

rr: (groupid=0, jobs=8): err= 0: pid=11470: Tue Sep 13 00:44:36 2016

read : io=819296KB, bw=1082.4KB/s, iops=270 , runt=756956msec

slat (usec): min=1 , max=35298 , avg=39.61, stdev=169.88

clat (usec): min=2 , max=990254 , avg=117927.97, stdev=92945.30

lat (usec): min=277 , max=990259 , avg=117967.87, stdev=92940.88

clat percentiles (msec):

| 1.00th=[ 5], 5.00th=[ 13], 10.00th=[ 20], 20.00th=[ 32],

| 30.00th=[ 48], 40.00th=[ 72], 50.00th=[ 98], 60.00th=[ 126],

| 70.00th=[ 159], 80.00th=[ 198],90.00th=[ 251], 95.00th=[ 293],

| 99.00th=[ 379], 99.50th=[ 416], 99.90th=[ 545], 99.95th=[ 603],

| 99.99th=[ 750]

bw (KB/s) : min= 46, max= 625, per=12.53%, avg=135.60, stdev=28.57

lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 100=0.01%, 250=0.01%

lat (usec) : 500=0.03%, 750=0.02%, 1000=0.02%

lat (msec) : 2=0.10%, 4=0.67%, 10=2.56%, 20=6.94%, 50=20.90%

lat (msec) : 100=19.70%, 250=39.06%, 500=9.84%, 750=0.16%, 1000=0.01%

cpu : usr=0.00%, sys=0.10%, ctx=175327, majf=18446744073709551560, minf=18446744073709449653

IO depths : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%

submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%

complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%

issued : total=r=204824/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):

READ: io=819296KB, aggrb=1082KB/s, minb=1082KB/s, maxb=1082KB/s, mint=756956msec, maxt=756956msec

Disk stats (read/write):

sda: ios=204915/36, merge=35/13, ticks=24169476/9352004,in_queue=33521518, util=100.00%
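The headline numbers in this report follow from the totals: iops is the number of issued I/Os divided by the run time, and bandwidth is the amount of data transferred divided by the run time. A minimal sketch with hypothetical helper names, using integer arithmetic (results are truncated, not rounded):

```shell
#!/usr/bin/env bash
# Rate per second from a total count and a run time in milliseconds.
# Works for I/O counts (gives iops) and for KB transferred (gives KB/s).
rate_per_sec() {
  local total=$1 runtime_ms=$2
  echo $(( total * 1000 / runtime_ms ))
}

rate_per_sec 204824 756956   # issued reads / runt: prints 270 (iops)
rate_per_sec 819296 756956   # io in KB / runt: prints 1082 (KB/s)
```

Both values agree with the iops=270 and aggrb=1082KB/s lines above, and 270 iops at a 4 KB block size is about 1082 KB/s, which ties the two figures together.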

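ops test:

#yum -y install gcc

#gcc -o postmark postmark-1.52.c # (postmark is distributed as a .c file and must be compiled with gcc)

#cp postmark /usr/bin/

postmark was developed by NetApp, the well-known NAS vendor, to test the back-end storage performance of its products. It is mainly used to test file system performance for mail or e-commerce systems, workloads that access large numbers of small files very frequently. postmark works by creating a pool of test files; the number of files and their minimum and maximum sizes are configurable, and the total amount of data is fixed. Once the pool is created, postmark runs a series of transactions against it. Based on statistics from real applications, each transaction consists of one create or delete operation and one read or append operation. Because the file system's caching policy can affect performance in some cases, postmark lets you change the ratio of create/delete and read/append operations to offset that effect. When the transactions are finished, postmark deletes the file pool, ends the test, and prints the results. postmark picks the files to operate on with random numbers, which brings the test closer to real workloads. The most important figures in the output are the total test time, the average number of transactions completed per second, the average number of files created and deleted per second during transaction processing, and the average read and write transfer rates.

#vim postmark.conf # (this example uses 10000 files in 100 directories, 100 files per directory; by default a report file is generated in the current path)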

set size 1k

set number 10000

set location /mnt/

set subdirectories 100

set read 1k

set write 1k

run 60

show

quit

#postmark postmark.conf # (postmark also has an interactive mode)

System monitoring: