Reference articles:
https://ieevee.com/tech/2018/05/16/k8s-rbd.html
https://zhangchenchen.github.io/2017/11/17/kubernetes-integrate-with-ceph/
https://docs.openshift.com/container-platform/3.5/install_config/storage_examples/ceph_rbd_dynamic_example.html
https://jimmysong.io/kubernetes-handbook/practice/using-ceph-for-persistent-storage.html
Thanks to the authors above for the technical references. I have organized them here and implemented both a multi-master database cluster and a master-slave database setup backed by Ceph RBD. The following configuration is for testing only and must not be used in production.
Persistent storage in Kubernetes falls into the following main categories:
volume: a volume is mounted directly on a pod; every other storage component in Kubernetes reaches a pod through a volume. A volume has a type attribute, which determines what kind of storage is mounted — common examples are emptyDir, hostPath, nfs, rbd, and the persistentVolumeClaim type discussed below. Unlike a Docker volume, whose lifecycle is tied tightly to the container, the lifecycle here depends on the type: an emptyDir volume behaves like Docker's — when the pod dies the volume disappears with it — while the other types are persistent storage. See the Volumes documentation for details; a minimal emptyDir sketch follows these two descriptions.
Persistent Volumes: as the name implies, this component supports persistent storage. It abstracts both the provider of the backend storage (the volume type above) and its consumer (the pod that uses it) through two concepts: PersistentVolume and PersistentVolumeClaim. A PersistentVolume (PV) is a piece of storage offered by the backend — in Ceph RBD terms, an image. A PersistentVolumeClaim (PVC) can be seen as a user's request for a PV; the PVC is bound to a PV, and a pod that mounts the PVC in its volumes thereby mounts the corresponding PV.
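As a quick illustration of the emptyDir volume type mentioned above, here is a minimal sketch (the names are made up for illustration) of a Pod whose scratch data lives and dies with the Pod:

apiVersion: v1
kind: Pod
metadata:
  name: emptydir-demo            # hypothetical name, for illustration only
spec:
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: cache                # scratch space shared for the Pod's lifetime
      mountPath: /cache
  volumes:
  - name: cache
    emptyDir: {}                 # removed together with the Pod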
Add the Ceph yum repository:
[Ceph]
name=Ceph packages for $basearch
baseurl=https://mirrors.aliyun.com/ceph/rpm-mimic/el7/$basearch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc

[Ceph-noarch]
name=Ceph noarch packages
baseurl=https://mirrors.aliyun.com/ceph/rpm-mimic/el7/noarch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc

[ceph-source]
name=Ceph source packages
baseurl=https://mirrors.aliyun.com/ceph/rpm-mimic/el7/SRPMS
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
Install ceph-common:
yum install ceph-common -y
If dependency errors occur during installation, they can be resolved as follows:
yum install -y yum-utils && \
yum-config-manager --add-repo https://dl.fedoraproject.org/pub/epel/7/x86_64/ && \
yum install --nogpgcheck -y epel-release && \
rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7 && \
rm -f /etc/yum.repos.d/dl.fedoraproject.org*

yum -y install ceph-common
Copy the Ceph configuration files to each Kubernetes node:
[root@ceph-1 ~]# scp -r /etc/ceph k8s-node:/etc/
First, use a simple volume to check that the cluster environment works. In real applications, data that must be kept permanently should not rely on a plain volume.
When creating a new image, some unsupported features need to be disabled:
rbd create foobar -s 1024 -p k8s
rbd feature disable k8s/foobar object-map fast-diff deep-flatten
View the image information:
# rbd info k8s/foobar
rbd image 'foobar':
        size 1 GiB in 256 objects
        order 22 (4 MiB objects)
        id: ad9b6b8b4567
        block_name_prefix: rbd_data.ad9b6b8b4567
        format: 2
        features: layering, exclusive-lock
        op_features:
        flags:
        create_timestamp: Tue Apr 23 17:37:39 2019
Here the Ceph admin keyring file is used as the authentication credential:
# cat test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rbd
spec:
  containers:
  - image: nginx
    name: rbd-rw
    volumeMounts:
    - name: rbdpd
      mountPath: /mnt
  volumes:
  - name: rbdpd
    rbd:
      monitors:
      - '192.168.20.41:6789'
      pool: k8s
      image: foobar
      fsType: xfs
      readOnly: false
      user: admin
      keyring: /etc/ceph/ceph.client.admin.keyring
To keep data after the pod is deleted, we need the PV (PersistentVolume) and PVC (PersistentVolumeClaim) approach.
rbd create -s 1024 k8s/pv
rbd feature disable k8s/pv object-map fast-diff deep-flatten
View the image information:
# rbd info k8s/pv
rbd image 'pv':
        size 1 GiB in 256 objects
        order 22 (4 MiB objects)
        id: adaa6b8b4567
        block_name_prefix: rbd_data.adaa6b8b4567
        format: 2
        features: layering, exclusive-lock
        op_features:
        flags:
        create_timestamp: Tue Apr 23 19:09:58 2019
grep key /etc/ceph/ceph.client.admin.keyring |awk '{printf "%s", $NF}'|base64
apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
type: "kubernetes.io/rbd"
data:
  key: QVFBbk1MaGNBV2laSGhBQUVOQThRWGZyQ3haRkJDNlJaWTNJY1E9PQ==
---
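As an alternative to base64-encoding the key by hand (not part of the original write-up, but using standard commands), the same Secret can be created directly from the Ceph admin key; kubectl handles the encoding:

kubectl create secret generic ceph-secret --type="kubernetes.io/rbd" \
  --from-literal=key="$(ceph auth get-key client.admin)"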
# cat ceph-rbd-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ceph-rbd-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteOnce
  rbd:
    monitors:
    - '192.168.20.41:6789'
    pool: k8s
    image: pv
    user: admin
    secretRef:
      name: ceph-secret
    fsType: xfs
    readOnly: false
  persistentVolumeReclaimPolicy: Recycle

# cat ceph-rbd-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-rbd-pv-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
# cat test3-pvc.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rbd-nginx
spec:
  containers:
  - image: nginx
    name: rbd-rw
    volumeMounts:
    - name: rbd-pvc
      mountPath: /mnt
  volumes:
  - name: rbd-pvc
    persistentVolumeClaim:
      claimName: ceph-rbd-pv-claim
簡單來講,storage配置了要訪問ceph RBD的IP/Port、用戶名、keyring、pool,等信息,咱們不須要提早建立image;當用戶建立一個PVC時,k8s查找是否有符合PVC請求的storage class類型,若是有,則依次執行以下操做:
With this approach the administrator only has to create the StorageClass; users can handle the rest themselves. To keep storage from being exhausted, a ResourceQuota can be configured.
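For example, a minimal sketch of such a ResourceQuota, capping how much dynamically provisioned storage one namespace may claim (the name and limits are illustrative, not from the original article):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-quota                     # hypothetical name
  namespace: default
spec:
  hard:
    persistentvolumeclaims: "10"          # at most 10 PVCs in the namespace
    requests.storage: "20Gi"              # total requested capacity across all PVCs
    fast.storageclass.storage.k8s.io/requests.storage: "10Gi"   # cap for the "fast" class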
Whenever a pod needs a volume, it simply declares one through a PVC, and a persistent volume matching the request is provisioned on demand.
# cat storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: kubernetes.io/rbd
parameters:
  monitors: 192.168.20.41:6789
  adminId: admin
  adminSecretName: ceph-secret
  pool: k8s
  userId: admin
  userSecretName: ceph-secret
  fsType: xfs
  imageFormat: "2"
  imageFeatures: "layering"
RBD supports only ReadWriteOnce and ReadOnlyMany, not ReadWriteMany. The difference between these modes is whether different nodes may mount the volume at the same time; on a single node, even a ReadWriteOnce volume can be mounted by two containers simultaneously.
When creating the application, the PVC and the pod are created together and linked through storageClassName: the PVC must set its storageClassName to the name of the StorageClass created above (fast).
# cat pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: rbd-pvc-pod-pvc
spec:
  accessModes:
  - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: fast
Create the pod:
# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: rbd-pvc-pod
  name: ceph-rbd-sc-pod1
spec:
  containers:
  - name: ceph-rbd-sc-nginx
    image: nginx
    volumeMounts:
    - name: ceph-rbd-vol1
      mountPath: /mnt
      readOnly: false
  volumes:
  - name: ceph-rbd-vol1
    persistentVolumeClaim:
      claimName: rbd-pvc-pod-pvc
When using a StorageClass, besides declaring the persistent volume through a standalone PVC, you can also declare it with volumeClaimTemplates (the storage section of a StatefulSet). When several replicas are involved, a StatefulSet can be used:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: nginx
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "fast"
      resources:
        requests:
          storage: 1Gi
Note that a Deployment should not be used here. With a single replica a Deployment still works, just like a plain Pod, but with more than one replica only one Pod starts and the others stay in ContainerCreating; describing them after a while shows they are waiting for the volume indefinitely, because the ReadWriteOnce volume is already attached elsewhere.
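For reference, a minimal sketch of that anti-pattern — a Deployment with more than one replica sharing the single ReadWriteOnce PVC created above (the Deployment name is made up for illustration); once the replicas are scheduled to different nodes, only the first can attach the RBD image:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: rbd-deploy-bad               # illustrative name only
spec:
  replicas: 2                        # >1 replica: the second Pod stays in ContainerCreating
  selector:
    matchLabels:
      app: rbd-deploy-bad
  template:
    metadata:
      labels:
        app: rbd-deploy-bad
    spec:
      containers:
      - name: nginx
        image: nginx
        volumeMounts:
        - name: data
          mountPath: /mnt
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: rbd-pvc-pod-pvc   # the RWO claim created earlier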
Official documentation: https://kubernetes.io/docs/tasks/run-application/run-replicated-stateful-application/
A StatefulSet (called PetSet before 1.5) sits at the same level as Deployment and ReplicaSet. Deployments and ReplicaSets are designed for stateless services; a StatefulSet addresses stateful ones. Its typical use cases are stable, unique network identifiers, stable persistent storage, and ordered deployment, scaling and deletion.
Given these use cases, a StatefulSet is particularly suitable for database clusters such as MySQL and Redis. Correspondingly, a StatefulSet deployment has three parts: a headless Service providing the stable network identity, the StatefulSet itself, and volumeClaimTemplates providing stable per-pod storage.
If the Ceph secret has already been created in the Kubernetes cluster, this step can be skipped.
Generate a base64-encoded key:
grep key /etc/ceph/ceph.client.admin.keyring |awk '{printf "%s", $NF}'|base64
Create a Secret from the generated key:
apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
  namespace: galera
type: "kubernetes.io/rbd"
data:
  key: QVFBbk1MaGNBV2laSGhBQUVOQThRWGZyQ3haRkJDNlJaWTNJY1E9PQ==
---
# cat storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: kubernetes.io/rbd
parameters:
  monitors: 192.168.20.41:6789,192.168.20.42:6789,192.168.20.43:6789
  adminId: admin
  adminSecretName: ceph-secret
  pool: k8s
  userId: admin
  userSecretName: ceph-secret
  fsType: xfs
  imageFormat: "2"
  imageFeatures: "layering"
galera-service.yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
  name: galera
  namespace: galera
  labels:
    app: mysql
spec:
  ports:
  - port: 3306
    name: mysql
  # *.galear.default.svc.cluster.local
  clusterIP: None
  selector:
    app: mysql
The v1 StatefulSet is used here; v1 is the current stable version. Compared with the earlier beta versions, v1 requires the spec.selector.matchLabels field, which must match spec.template.metadata.labels.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
  namespace: galera
spec:
  selector:
    matchLabels:
      app: mysql
  serviceName: "galera"
  replicas: 3
  template:
    metadata:
      labels:
        app: mysql
    spec:
      initContainers:
      - name: install
        image: mirrorgooglecontainers/galera-install:0.1
        imagePullPolicy: Always
        args:
        - "--work-dir=/work-dir"
        volumeMounts:
        - name: workdir
          mountPath: "/work-dir"
        - name: config
          mountPath: "/etc/mysql"
      - name: bootstrap
        image: debian:jessie
        command:
        - "/work-dir/peer-finder"
        args:
        - -on-start="/work-dir/on-start.sh"
        - "-service=galera"
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        volumeMounts:
        - name: workdir
          mountPath: "/work-dir"
        - name: config
          mountPath: "/etc/mysql"
      containers:
      - name: mysql
        image: mirrorgooglecontainers/mysql-galera:e2e
        ports:
        - containerPort: 3306
          name: mysql
        - containerPort: 4444
          name: sst
        - containerPort: 4567
          name: replication
        - containerPort: 4568
          name: ist
        args:
        - --defaults-file=/etc/mysql/my-galera.cnf
        - --user=root
        readinessProbe:
          # TODO: If docker exec is buggy just use gcr.io/google_containers/mysql-healthz:1.0
          exec:
            command:
            - sh
            - -c
            - "mysql -u root -e 'show databases;'"
          initialDelaySeconds: 15
          timeoutSeconds: 5
          successThreshold: 2
        volumeMounts:
        - name: datadir
          mountPath: /var/lib/
        - name: config
          mountPath: /etc/mysql
      volumes:
      - name: config
        emptyDir: {}
      - name: workdir
        emptyDir: {}
  volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.beta.kubernetes.io/storage-class: "fast"
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
Check that the pods are running normally:
[root@master-1 ~]# kubectl get pod -n galera
NAME      READY   STATUS    RESTARTS   AGE
mysql-0   1/1     Running   0          48m
mysql-1   1/1     Running   0          43m
mysql-2   1/1     Running   0          38m
Confirm that the database cluster has formed:
[root@master-1 ~]# kubectl exec mysql-1 -n galera -- mysql -uroot -e 'show status like "wsrep_cluster_size";'
Variable_name         Value
wsrep_cluster_size    3
Check the PVC bindings:
[root@master-1 mysql-cluster]# kubectl get pvc -l app=mysql -n galera
NAME              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
datadir-mysql-0   Bound    pvc-6e5a1c45-666b-11e9-ad20-000c29016590   1Gi        RWO            fast           3d20h
datadir-mysql-1   Bound    pvc-25683cfd-666c-11e9-ad20-000c29016590   1Gi        RWO            fast           3d20h
datadir-mysql-2   Bound    pvc-c024b422-666c-11e9-ad20-000c29016590   1Gi        RWO            fast           3d20h
Test the database:
kubectl exec mysql-2 -n galera -- mysql -uroot -e 'CREATE DATABASE demo; CREATE TABLE demo.messages (message VARCHAR(250)); INSERT INTO demo.messages VALUES ("hello");'
View the data:
# kubectl run mysql-client --image=mysql:5.7 -i -t --rm --restart=Never -- mysql -h 10.2.58.7 -e "SELECT * FROM demo.messages"
If you don't see a command prompt, try pressing enter.
+---------+
| message |
+---------+
| hello   |
+---------+
pod "mysql-client" deleted
For pods to query the database from one another, a Service has to be defined. Here we define a Service for connecting to MySQL:
apiVersion: v1
kind: Service
metadata:
  name: mysql-read
  namespace: galera
  labels:
    app: mysql
spec:
  ports:
  - name: mysql
    port: 3306
  selector:
    app: mysql
Access the database through a pod:
# kubectl run mysql-client --image=mysql:5.7 -i -t --rm --restart=Never -- mysql -h mysql-read.galera -e "SELECT * FROM demo.messages"
+---------+
| message |
+---------+
| hello   |
+---------+
pod "mysql-client" deleted
Create a kube pool in the Ceph cluster to serve as the storage pool for the database:
[root@ceph-1 ~]# ceph osd pool create kube 128
pool 'kube' created
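Depending on the Ceph release (Luminous/Mimic and later), a newly created pool may also need to be initialized for RBD use before images can be provisioned in it; a hedged example of the usual commands:

rbd pool init kube
# or, equivalently, tag the pool with the rbd application:
ceph osd pool application enable kube rbd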
Define a new StorageClass:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mysql
provisioner: kubernetes.io/rbd
parameters:
  monitors: 192.168.20.41:6789,192.168.20.42:6789,192.168.20.43:6789
  adminId: admin
  adminSecretName: ceph-secret
  pool: kube
  userId: admin
  userSecretName: ceph-secret
  fsType: xfs
  imageFormat: "2"
  imageFeatures: "layering"
Since a StatefulSet is used to deploy the master-slave databases, a headless Service is needed, plus a Service for reads:
# Headless service for stable DNS entries of StatefulSet members.
apiVersion: v1
kind: Service
metadata:
  name: mysql
  labels:
    app: mysql
spec:
  ports:
  - name: mysql
    port: 3306
  clusterIP: None
  selector:
    app: mysql
---
# Client service for connecting to any MySQL instance for reads.
# For writes, you must instead connect to the master: mysql-0.mysql.
apiVersion: v1
kind: Service
metadata:
  name: mysql-read
  labels:
    app: mysql
spec:
  ports:
  - name: mysql
    port: 3306
  selector:
    app: mysql
Because master-slave replication is used, the master and the slaves must each get the corresponding configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql
  labels:
    app: mysql
data:
  master.cnf: |
    # Apply this config only on the master.
    [mysqld]
    log-bin
  slave.cnf: |
    # Apply this config only on slaves.
    [mysqld]
    super-read-only
The StatefulSet below uses the StorageClass for RBD-backed storage and an xtrabackup image for data synchronization:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  selector:
    matchLabels:
      app: mysql
  serviceName: mysql
  replicas: 3
  template:
    metadata:
      labels:
        app: mysql
    spec:
      initContainers:
      - name: init-mysql
        image: mysql:5.7
        command:
        - bash
        - "-c"
        - |
          set -ex
          # Generate mysql server-id from pod ordinal index.
          [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
          ordinal=${BASH_REMATCH[1]}
          echo [mysqld] > /mnt/conf.d/server-id.cnf
          # Add an offset to avoid reserved server-id=0 value.
          echo server-id=$((100 + $ordinal)) >> /mnt/conf.d/server-id.cnf
          # Copy appropriate conf.d files from config-map to emptyDir.
          if [[ $ordinal -eq 0 ]]; then
            cp /mnt/config-map/master.cnf /mnt/conf.d/
          else
            cp /mnt/config-map/slave.cnf /mnt/conf.d/
          fi
        volumeMounts:
        - name: conf
          mountPath: /mnt/conf.d
        - name: config-map
          mountPath: /mnt/config-map
      - name: clone-mysql
        image: tangup/xtrabackup:1.0
        command:
        - bash
        - "-c"
        - |
          set -ex
          # Skip the clone if data already exists.
          [[ -d /var/lib/mysql/mysql ]] && exit 0
          # Skip the clone on master (ordinal index 0).
          [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
          ordinal=${BASH_REMATCH[1]}
          [[ $ordinal -eq 0 ]] && exit 0
          # Clone data from previous peer.
          ncat --recv-only mysql-$(($ordinal-1)).mysql 3307 | xbstream -x -C /var/lib/mysql
          # Prepare the backup.
          xtrabackup --prepare --target-dir=/var/lib/mysql
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
          subPath: mysql
        - name: conf
          mountPath: /etc/mysql/conf.d
      containers:
      - name: mysql
        image: mysql:5.7
        env:
        - name: MYSQL_ALLOW_EMPTY_PASSWORD
          value: "1"
        ports:
        - name: mysql
          containerPort: 3306
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
          subPath: mysql
        - name: conf
          mountPath: /etc/mysql/conf.d
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
        livenessProbe:
          exec:
            command: ["mysqladmin", "ping"]
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
        readinessProbe:
          exec:
            # Check we can execute queries over TCP (skip-networking is off).
            command: ["mysql", "-h", "127.0.0.1", "-e", "SELECT 1"]
          initialDelaySeconds: 5
          periodSeconds: 2
          timeoutSeconds: 1
      - name: xtrabackup
        image: tangup/xtrabackup:1.0
        ports:
        - name: xtrabackup
          containerPort: 3307
        command:
        - bash
        - "-c"
        - |
          set -ex
          cd /var/lib/mysql

          # Determine binlog position of cloned data, if any.
          if [[ -f xtrabackup_slave_info ]]; then
            # XtraBackup already generated a partial "CHANGE MASTER TO" query
            # because we're cloning from an existing slave.
            mv xtrabackup_slave_info change_master_to.sql.in
            # Ignore xtrabackup_binlog_info in this case (it's useless).
            rm -f xtrabackup_binlog_info
          elif [[ -f xtrabackup_binlog_info ]]; then
            # We're cloning directly from master. Parse binlog position.
            [[ `cat xtrabackup_binlog_info` =~ ^(.*?)[[:space:]]+(.*?)$ ]] || exit 1
            rm xtrabackup_binlog_info
            echo "CHANGE MASTER TO MASTER_LOG_FILE='${BASH_REMATCH[1]}',\
                  MASTER_LOG_POS=${BASH_REMATCH[2]}" > change_master_to.sql.in
          fi

          # Check if we need to complete a clone by starting replication.
          if [[ -f change_master_to.sql.in ]]; then
            echo "Waiting for mysqld to be ready (accepting connections)"
            until mysql -h 127.0.0.1 -e "SELECT 1"; do sleep 1; done

            echo "Initializing replication from clone position"
            # In case of container restart, attempt this at-most-once.
            mv change_master_to.sql.in change_master_to.sql.orig
            mysql -h 127.0.0.1 <<EOF
          $(<change_master_to.sql.orig),
            MASTER_HOST='mysql-0.mysql',
            MASTER_USER='root',
            MASTER_PASSWORD='',
            MASTER_CONNECT_RETRY=10;
          START SLAVE;
          EOF
          fi

          # Start a server to send backups when requested by peers.
          exec ncat --listen --keep-open --send-only --max-conns=1 3307 -c \
            "xtrabackup --backup --slave-info --stream=xbstream --host=127.0.0.1 --user=root"
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
          subPath: mysql
        - name: conf
          mountPath: /etc/mysql/conf.d
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
      volumes:
      - name: conf
        emptyDir: {}
      - name: config-map
        configMap:
          name: mysql
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "mysql"
      resources:
        requests:
          storage: 1Gi
Check the pods:
[root@master-1 ~]# kubectl get po
NAME      READY   STATUS    RESTARTS   AGE
mysql-0   2/2     Running   2          110m
mysql-1   2/2     Running   0          109m
mysql-2   2/2     Running   0          16m
And the PVCs:
[root@master-1 ~]# kubectl get pvc |grep mysql|grep -v fast
data-mysql-0   Bound    pvc-3737108a-6a2a-11e9-ac56-000c296b46ac   1Gi   RWO   mysql   5h43m
data-mysql-1   Bound    pvc-279bdca0-6a4a-11e9-ac56-000c296b46ac   1Gi   RWO   mysql   114m
data-mysql-2   Bound    pvc-fbe153bc-6a52-11e9-ac56-000c296b46ac   1Gi   RWO   mysql   51m
Images created automatically on the Ceph cluster:
[root@ceph-1 ~]# rbd list kube
kubernetes-dynamic-pvc-2ee47370-6a4a-11e9-bb82-000c296b46ac
kubernetes-dynamic-pvc-39a42869-6a2a-11e9-bb82-000c296b46ac
kubernetes-dynamic-pvc-fbead120-6a52-11e9-bb82-000c296b46ac
Write data to the master. With a headless Service, a pod can be reached directly by the stable DNS name podname.headlessname; here mysql-0 is reached as mysql-0.mysql:
kubectl run mysql-client --image=mysql:5.7 -i --rm --restart=Never --\
  mysql -h mysql-0.mysql <<EOF
CREATE DATABASE test;
CREATE TABLE test.messages (message VARCHAR(250));
INSERT INTO test.messages VALUES ('hello');
EOF
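If you want to confirm that these per-pod DNS names resolve as expected, a quick check (assuming a busybox image is available in the cluster; this command is not from the original article) might look like:

kubectl run dns-test --image=busybox:1.28 -i -t --rm --restart=Never -- nslookup mysql-0.mysql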
Read the data back through mysql-read:
# kubectl run mysql-client --image=mysql:5.7 -i -t --rm --restart=Never -- mysql -h mysql-read -e "SELECT * FROM test.messages"
+---------+
| message |
+---------+
| hello   |
+---------+
The following command runs a loop that shows which server mysql-read is connected to at any given moment:
kubectl run mysql-client-loop --image=mysql:5.7 -i -t --rm --restart=Never --\
  bash -ic "while sleep 1; do mysql -h mysql-read -e 'SELECT @@server_id,NOW()'; done"
+-------------+---------------------+
| @@server_id | NOW()               |
+-------------+---------------------+
|         102 | 2019-04-28 20:24:11 |
+-------------+---------------------+
+-------------+---------------------+
| @@server_id | NOW()               |
+-------------+---------------------+
|         101 | 2019-04-28 20:27:35 |
+-------------+---------------------+
+-------------+---------------------+
| @@server_id | NOW()               |
+-------------+---------------------+
|         100 | 2019-04-28 20:18:38 |
+-------------+---------------------+