1. Confirm that the server OS version is Ubuntu 16.04.2 (required on every machine)
For pre-installation preparation, see the official guide: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions
for i in xsgpu81 xsgpu82 xsgpu83 xsgpu84 xsgpu85; do qssh root@$i 'cat /etc/issue;uname -r';done
Ubuntu 16.04.2 LTS \n \l
4.4.0-62-generic
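The loop above prints the release string for each host, which then has to be compared by eye. A minimal sketch of an automated check follows. The helper name check_release is hypothetical (it is not part of the original procedure), and it assumes the /etc/issue text arrives on standard input.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the /etc/issue text on stdin
# reports the expected Ubuntu release (default: 16.04.2).
check_release() {
    local expected="${1:-16.04.2}"
    # The trailing space keeps "16.04.2" from matching e.g. "16.04.20".
    grep -qF "Ubuntu ${expected} "
}
```

It could be combined with the loop as qssh root@$i 'cat /etc/issue' | check_release || echo "$i: unexpected release".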
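2. Download and install the NVIDIA driver
If the machine is not clean (GPU-related software was installed on it before), you may need to run service lightdm stop first.
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/375.26/NVIDIA-Linux-x86_64-375.26.run
root@xsgpu81:~# sudo sh NVIDIA-Linux-x86_64-375.26.run
Answer the installer prompts in order: Accept, OK, OK, OK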
3. Install CUDA
wget http://ogo0b6qe6.bkt.clouddn.com/cuda_8.0.61_375.26_linux.run
chmod +x cuda_8.0.61_375.26_linux.run
sudo sh cuda_8.0.61_375.26_linux.run --silent
echo 'export PATH=/usr/local/cuda-8.0/bin:$PATH' >> /root/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH' >> /root/.bashrc
source /root/.bashrc
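Running the step above a second time appends the same two lines to /root/.bashrc again. One way to avoid that is sketched below. The helper name append_once is hypothetical, and the sketch assumes an exact whole-line match is enough to detect an earlier run.

```shell
#!/bin/bash
# Hypothetical helper: append a line to a file only if that exact line
# is not already present, so repeated runs do not duplicate entries.
append_once() {
    local line="$1" file="$2"
    touch "$file"
    # -x matches whole lines, -F treats the line as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

For example: append_once 'export PATH=/usr/local/cuda-8.0/bin:$PATH' /root/.bashrc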
4. Copy the test binary
qscp NVIDIA_CUDA-8.0_Samples/0_Simple/vectorAdd/vectorAdd root@xsgpu81:/root/
root@xsgpu81:~# ./vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
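When the test is run on several machines, the "Test PASSED" line can be checked by a script instead of by reading the output. A small sketch follows. The helper name vectoradd_passed is hypothetical, and it only assumes that a successful run prints the line shown above.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the sample output on stdin
# contains the "Test PASSED" line printed by vectorAdd.
vectoradd_passed() {
    grep -q 'Test PASSED'
}
```

Usage could look like ./vectorAdd | vectoradd_passed && echo "GPU OK".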
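Manually deploying a mesos-agent node with GPU devices
Follow the standard procedure to deploy mesos-agent and the other base services (boots-docker, consul, logbeat) on the GPU machine.
Manual procedure: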
Stop the mesos-agent service on the GPU machine: supervisorctl stop mesos-agent
Clean up the mesos-agent work_dir:
rm -rf $(cat /home/qboxserver/mesos-agent/current/conf/mesos-agent/work_dir)
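Go to the mesos-agent configuration directory, /home/qboxserver/mesos-agent/current/conf/mesos-agent, and update the configuration:
Get the number and model of the GPU devices with nvidia-smi -L; the number of GPUs listed is the total device count.
Write the device model to the attributes file: echo "NETWORK:BRIDGE;GPU_MODEL:$MODEL" > attributes
Add the isolation setting: echo "cgroups/devices,gpu/nvidia" > isolation
List the usable GPU device indices: echo "0, 1, …, <device count> - 1" > nvidia_gpu_devices
Add the GPU resource to resources: {"name":"gpus","type":"SCALAR","scalar":{"value":<device count>}}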
Go to /home/qboxserver/mesos-agent/current/libexec/mesos and replace the executor:
Keep the original executor: mv mesos-docker-executor mesos-docker-executor.cpp
Download the GPU executor:
wget http://ogo0b6qe6.bkt.clouddn.com/mesos-docker-executor-2017-11-18
mv mesos-docker-executor-2017-11-18 mesos-docker-executor; chown qboxserver.qboxserver mesos-docker-executor
cp mesos-docker-executor.go mesos-docker-executor
Install nvidia-docker-plugin
cd /home/qboxserver && mkdir nvidia-docker
cd /home/qboxserver/nvidia-docker
wget http://ogo0b6qe6.bkt.clouddn.com/nvidia-docker-plugin.2016-11-22-20-45-30.tar.gz
tar zxf nvidia-docker-plugin.2016-11-22-20-45-30.tar.gz
ln -s 2016-11-22-20-45-30 current
./current/bin/start.sh
Check the GPU device information: curl -s http://localhost:3476/v1.0/gpu/info
Start the mesos-agent service
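The nvidia_gpu_devices list and the gpus resource entry in the procedure above both follow from the device count, so they can be generated rather than typed. A minimal sketch, assuming GNU seq (as on Ubuntu); the helper names gpu_device_list and gpus_resource are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print "0,1,...,N-1" for a device count N.
gpu_device_list() {
    local count="$1"
    seq -s, 0 $((count - 1))
}

# Hypothetical helper: print the gpus resource entry for N devices.
gpus_resource() {
    local count="$1"
    printf '{"name":"gpus","type":"SCALAR","scalar":{"value":%d}}\n' "$count"
}
```

With count=$(nvidia-smi -L | wc -l), running gpu_device_list "$count" > nvidia_gpu_devices writes the index list, and gpus_resource "$count" prints the entry to merge into resources.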
Upgrading the GPU driver (trying an apt-get install of the driver)
apt-get purge 'nvidia*'
add-apt-repository ppa:graphics-drivers
apt-get update
apt-get install nvidia-<version>
reboot
Install the matching cadvisor
cd /home/qboxserver/boots-cadvisor/current/bin && \
mv cadvisor cadvisor.bak && \
wget http://ogo0b6qe6.bkt.clouddn.com/cadvisor && \
chmod +x cadvisor && \
chown qboxserver:qboxserver cadvisor && \
./start.sh
Background reading:
http://www.linuxandubuntu.com/home/how-to-install-latest-nvidia-drivers-in-linux
http://mesos.apache.org/documentation/latest/gpu-support/
https://github.com/NVIDIA/nvidia-docker/wiki
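Seven new GPU compute nodes brought online in the xs region
Driver upgrade steps:
Some services hold the GPU; stop them before upgrading.
wget http://us.download.nvidia.com/tesla/396.44/NVIDIA-Linux-x86_64-396.44.run
sh NVIDIA-Linux-x86_64-396.44.run --silent
After it finishes:
Run nvidia-smi to check that the installation succeeded
Reboot the machine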
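The steps above say that services holding the GPU must be stopped first but do not say how to find them. One possible approach, sketched here as an assumption and not as the original procedure, is to ask nvidia-smi for the running compute processes and extract their PIDs. The helper name gpu_pids is hypothetical. The list may not cover every process that has the device open, so lsof /dev/nvidia* is worth checking as well.

```shell
#!/bin/bash
# Hypothetical helper: read "pid, process_name" CSV rows on stdin
# (the format produced by
#   nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader)
# and print one PID per line.
gpu_pids() {
    awk -F', *' 'NF >= 2 { print $1 }'
}
```

For example: nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader | gpu_pids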
Upgrade example:
1. Check the current version
root@xsgpu9:~# nvidia-smi
NVIDIA-SMI 375.26
2. Check which modules are in use
root@xsgpu9:~# lsmod | grep -i nvidia
nvidia_drm 53248 0
nvidia_modeset 790528 1 nvidia_drm
nvidia 11943936 1 nvidia_modeset
drm_kms_helper 143360 2 ast,nvidia_drm
drm 360448 5 ast,ttm,drm_kms_helper,nvidia_drm
3. Unload the related modules
modprobe -r nvidia_drm nvidia_modeset nvidia
4. Download the new version
root@xsgpu9:~# wget http://us.download.nvidia.com/tesla/396.44/NVIDIA-Linux-x86_64-396.44.run
5. Install the new version
sh NVIDIA-Linux-x86_64-396.44.run --silent
6. Check the new version
nvidia-smi
| NVIDIA-SMI 396.44 Driver Version: 396.44 |
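Steps 2 and 6 of the example can be checked by script. The sketch below parses the same text shown above: lsmod output to list the NVIDIA modules that are still loaded, and the nvidia-smi header to extract the driver version. Both helper names are hypothetical, and the version pattern assumes the header contains "Driver Version: <number>" as in the example.

```shell
#!/bin/bash
# Hypothetical helper: from lsmod output on stdin, print the names of
# loaded modules whose name starts with "nvidia".
loaded_nvidia_modules() {
    awk '$1 ~ /^nvidia/ { print $1 }'
}

# Hypothetical helper: from nvidia-smi output on stdin, print the
# first driver version found (e.g. 396.44).
driver_version() {
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}
```

After modprobe -r, lsmod | loaded_nvidia_modules should print nothing. After the install, nvidia-smi | driver_version should print the new version.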
On xs311 the NVIDIA driver had been installed with apt-get; to remove it: apt-get --purge remove 'nvidia-*'
dora internal compute --> dora internal compute GPU, issue log:
root@jjh1569:/var/log# cat /home/qboxserver/mesos-agent/current/conf/mesos-agent/attributes
NETWORK:HOST
Change it to:
NETWORK:HOST;GPU_MODEL:QSV
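Then restart the dockerd and mesos-agent services.
mesos-agent failed to start.
The cause was a configuration mismatch: on restart, mesos-agent tries to recover its previous state and reconnect, and recovery fails when the new configuration differs from the saved one.
Delete the work directory, /disk1/mesos:
root@jjh1569:/var/log# cd /home/qboxserver/mesos-agent/current/conf/mesos-agent/
root@jjh1569:/home/qboxserver/mesos-agent/current/conf/mesos-agent# cat work_dir
/disk1/mesos
Then run: rm -rf /disk1/mesos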
root@jjh1569:/var/log# less syslog
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: I1024 18:55:11.627574 9662 slave.cpp:519] Agent resources: cpus(*):7; mem(*):12288; disk(*):445440; ports(*):[10000-20000]
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: I1024 18:55:11.627622 9662 slave.cpp:527] Agent attributes: [ NETWORK=HOST, GPU_MODEL=QSV ]
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: I1024 18:55:11.627645 9662 slave.cpp:532] Agent hostname: 10.20.78.29
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: I1024 18:55:11.630751 9660 state.cpp:57] Recovering state from '/disk1/mesos/meta'
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: Failed to perform recovery: Incompatible agent info detected.
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: Old agent info:
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: attributes {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: name: "NETWORK"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: type: TEXT
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: text {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: value: "HOST"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: }
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: }
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: New agent info:
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: attributes {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: name: "NETWORK"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: type: TEXT
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: text {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: value: "HOST"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: }
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: }
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: attributes {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: name: "GPU_MODEL"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: type: TEXT
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: text {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: value: "QSV" # the newly added part
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: }
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: }
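In the log above, the difference between the old and new agent info is easy to miss among the nested blocks. The sketch below lists the attribute names that appear under each heading so that the two sets can be compared at a glance. The helper name agent_info_attrs is hypothetical, and it assumes the log layout shown here; in a fuller log it would also print any other name: fields that follow those headings, such as resource names.

```shell
#!/bin/bash
# Hypothetical helper: read mesos-agent log text on stdin and print
# "old NAME" / "new NAME" for every name: "..." field that follows the
# "Old agent info:" and "New agent info:" headings.
agent_info_attrs() {
    awk '
        /Old agent info:/ { section = "old" }
        /New agent info:/ { section = "new" }
        section != "" && /name: "/ {
            match($0, /name: "[^"]*"/)
            # Skip the 7 characters of the prefix and drop the closing quote.
            print section, substr($0, RSTART + 7, RLENGTH - 8)
        }
    '
}
```

Run as agent_info_attrs < /var/log/syslog; a name that appears only with the "new" prefix is the attribute that broke recovery.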
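Then update attributes and resources (QSV is a custom GPU type and gpus is the number of GPUs; adjust both to match the machine).
Restart the dockerd and mesos-agent services again (if startup fails, delete the work_dir directory /disk1/mesos and restart mesos-agent).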
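Deleting the work directory relies on the path stored in the work_dir file, and rm -rf on an empty or wrong path is risky. A guarded version is sketched below. The helper name clean_work_dir is hypothetical, and the choice to refuse empty, relative, and root paths is an assumption about what counts as safe here.

```shell
#!/bin/bash
# Hypothetical helper: remove the directory named in a work_dir file,
# refusing to run rm -rf on an empty, relative, or root path.
clean_work_dir() {
    local conf="$1" dir
    dir=$(cat "$conf" 2>/dev/null) || return 1
    case "$dir" in
        ""|/|[!/]*)
            echo "refusing to remove '$dir'" >&2
            return 1
            ;;
    esac
    rm -rf -- "$dir"
}
```

For example: clean_work_dir /home/qboxserver/mesos-agent/current/conf/mesos-agent/work_dir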
#!/bin/bash
if grep -q QSV /home/qboxserver/mesos-agent/current/conf/mesos-agent/attributes
then echo "QSV already exists"
else
sed -i "s/NETWORK:HOST/NETWORK:HOST;GPU_MODEL:QSV/g" /home/qboxserver/mesos-agent/current/conf/mesos-agent/attributes
fi
# Overwrite /home/qboxserver/mesos-agent/current/conf/mesos-agent/resources so that it holds exactly one JSON array
cat << EOF > /home/qboxserver/mesos-agent/current/conf/mesos-agent/resources
[
{
"name": "cpus",
"type": "SCALAR",
"scalar": {
"value": 7
}
},
{
"name": "mem",
"type": "SCALAR",
"scalar": {
"value": 14336
}
},
{
"name" : "disk",
"type" : "SCALAR",
"scalar" : { "value" : 20480 }
},
{
"name": "ports",
"type": "RANGES",
"ranges": {
"range": [
{
"begin": 10000,
"end": 20000
}
]
}
},
{
"name": "gpus",
"type": "SCALAR",
"scalar": {
"value": 1
}
},
{
"name": "gpuset",
"type": "SET",
"set": {
"item": ["0"]
}
}
]
EOF
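The attributes check at the top of the script above is tied to the QSV model name and to the NETWORK:HOST value. A more general variant is sketched below: it appends GPU_MODEL:<model> to the first line of the attributes file unless a GPU_MODEL entry already exists. The helper name add_gpu_model is hypothetical, and the sketch assumes GNU sed and a model string without "/" characters.

```shell
#!/bin/bash
# Hypothetical helper: add ";GPU_MODEL:<model>" to the first line of an
# attributes file, unless a GPU_MODEL entry is already present.
add_gpu_model() {
    local file="$1" model="$2"
    if grep -q 'GPU_MODEL:' "$file"; then
        return 0
    fi
    sed -i "1s/\$/;GPU_MODEL:${model}/" "$file"
}
```

For example: add_gpu_model /home/qboxserver/mesos-agent/current/conf/mesos-agent/attributes QSV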
GPU plugin scripts:
root@xs313:~# cat /tmp/gpu.sh
#!/bin/bash
# usage: sets up the GPU-related configuration on a dora GPU machine
supervisorctl stop mesos-agent
supervisorctl stop boots-cadvisor
supervisorctl stop dockerd
# Install the custom cadvisor
cd /home/qboxserver/boots-cadvisor/current/bin
mv cadvisor cadvisor.bak
wget http://ogo0b6qe6.bkt.clouddn.com/cadvisor
chmod +x cadvisor
chown qboxserver:qboxserver cadvisor
# Install the custom mesos-docker-executor
cd /home/qboxserver/mesos-agent/current/libexec/mesos
wget http://ogo0b6qe6.bkt.clouddn.com/mesos-docker-executor-2018-09-10-15-05-00
mv mesos-docker-executor mesos-docker-executor.bak
mv mesos-docker-executor-2018-09-10-15-05-00 mesos-docker-executor
chown qboxserver:qboxserver mesos-docker-executor
chmod +x mesos-docker-executor
# mesos-agent parameters
# Part 1: update attributes (MODEL is the fourth field of the first nvidia-smi -L line, e.g. P4)
MODEL=$(nvidia-smi -L | cut -d" " -f4 | xargs | cut -d" " -f1)
sed -i "s/NETWORK:HOST/NETWORK:HOST;GPU_MODEL:${MODEL}/g" /home/qboxserver/mesos-agent/current/conf/mesos-agent/attributes
nvidia-smi -L
# Part 2: add isolation
echo "cgroups/devices,gpu/nvidia" > /home/qboxserver/mesos-agent/current/conf/mesos-agent/isolation
# Part 3: add nvidia_gpu_devices
echo "0,1,2,3,4,5,6,7" > /home/qboxserver/mesos-agent/current/conf/mesos-agent/nvidia_gpu_devices
# Part 4: add resources (drop the last two lines of the file, then append the GPU entries)
for i in `seq 2`; do sed -i '$d' /home/qboxserver/mesos-agent/current/conf/mesos-agent/resources ; done
cat << EOF >> /home/qboxserver/mesos-agent/current/conf/mesos-agent/resources
},
{
"name": "gpus",
"type": "SCALAR",
"scalar": {
"value": 8
}
},
{
"name": "gpuset",
"type": "SET",
"set": {
"item": ["0", "1", "2", "3", "4", "5", "6", "7"]
}
}
]
EOF
# Install nvidia-docker-plugin
cd /home/qboxserver && mkdir nvidia-docker && cd nvidia-docker
wget http://ogo0b6qe6.bkt.clouddn.com/nvidia-docker-plugin.2016-11-22-20-45-30.tar.gz
tar zxf nvidia-docker-plugin.2016-11-22-20-45-30.tar.gz
ln -s 2016-11-22-20-45-30 current
./current/bin/start.sh
# Finally, bring the node back online
rm -rf $(cat /home/qboxserver/mesos-agent/current/conf/mesos-agent/work_dir)
supervisorctl start dockerd
supervisorctl start mesos-agent
supervisorctl start boots-cadvisor
Checking the NVIDIA graphics driver
dora currently uses two GPU types, K80 and P4. To see which one a machine has, run nvidia-smi -L:
root@xs991:~# nvidia-smi -L
GPU 0: Tesla P4 (UUID: GPU-50850be7-c49e-4693-e20e-a677d2adeb82)
GPU 1: Tesla P4 (UUID: GPU-22e9fbe2-9170-4548-c301-579b786858b6)
GPU 2: Tesla P4 (UUID: GPU-c8132e0e-c8a4-defc-fea3-01b5c930667e)
GPU 3: Tesla P4 (UUID: GPU-762546f1-0b48-c963-954e-fa74b4f7e76f)
GPU 4: Tesla P4 (UUID: GPU-2fdb3d5e-dd66-1f6d-a814-5265df4fa1f4)
GPU 5: Tesla P4 (UUID: GPU-a4011f72-78c2-ab13-c6b8-3e58e9093773)
GPU 6: Tesla P4 (UUID: GPU-84d2bbd4-c3e0-d7ed-6628-5528878de6ea)
GPU 7: Tesla P4 (UUID: GPU-fa3933c0-3cb3-4e8c-a84a-75342a15cc24)
root@xs313:~# nvidia-smi -L
GPU 0: Tesla K80 (UUID: GPU-a457c419-bcfd-538b-d993-e443d28dcd24)
GPU 1: Tesla K80 (UUID: GPU-07f9795d-3917-b804-a6c5-621e27c239f8)
GPU 2: Tesla K80 (UUID: GPU-78197899-b007-1e74-29a8-3f27958e7d28)
GPU 3: Tesla K80 (UUID: GPU-d594f478-261b-e139-b87f-cf1d7b076f42)
GPU 4: Tesla K80 (UUID: GPU-8df7cf81-e51a-3a88-a4b8-6075d18a9365)
GPU 5: Tesla K80 (UUID: GPU-c9931f33-32c0-da73-aa8f-6109989b129c)
GPU 6: Tesla K80 (UUID: GPU-0830ceaa-f860-b717-67ac-e4e7fec25a26)
GPU 7: Tesla K80 (UUID: GPU-9b509b1c-a186-cf05-8aa3-4ba73aed1eb1)
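The listing above carries both values the deployment needs: the number of GPUs and their model. A small sketch that extracts them from nvidia-smi -L output follows. The helper names gpu_count and gpu_model are hypothetical, and the patterns assume the "GPU <n>: <name> (UUID: ...)" line format shown above. Note that gpu_model prints the full name (for example "Tesla P4"), whereas the deployment script uses only the fourth field ("P4").

```shell
#!/bin/bash
# Hypothetical helper: count the GPU lines in nvidia-smi -L output on stdin.
gpu_count() {
    grep -c '^GPU [0-9][0-9]*:'
}

# Hypothetical helper: print the model name of the first GPU listed.
gpu_model() {
    sed -n 's/^GPU [0-9][0-9]*: \(.*\) (UUID: .*/\1/p' | head -n 1
}
```

For example: nvidia-smi -L | gpu_count and nvidia-smi -L | gpu_model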
The graphics hardware comes in two kinds: NVIDIA cards and Intel integrated graphics
root@xsgpu81:~# lspci | grep -i nvidia
04:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
05:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
08:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
09:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
84:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
85:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
88:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
89:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
qboxserver@jjh1569:~$ lspci | grep -i vga
00:13.0 Non-VGA unclassified device: Intel Corporation Sunrise Point-H Integrated Sensor Hub (rev 31)
07:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)