This article covers how Mesos supports GPUs in its native Mesos containerizer and in the Docker containerizer, and how GPU scheduling can be implemented in a framework built on top of Mesos, using capos as the example (capos is Hulu's internal resource scheduling platform, refer to https://www.cnblogs.com/yanghuahui/p/9304302.html).
When the Mesos slave starts, it initializes the containerizer's resources, including cpu/mem/gpu and so on. This step is shared by the Mesos containerizer and the Docker containerizer:
void Slave::initialize()
{
  ...
  Try<Resources> resources = Containerizer::resources(flags);
  ...
}
Then, in src/slave/containerizer/containerizer.cpp, the allocator logic is invoked according to the mesos-slave/agent startup flags:
Try<Resources> Containerizer::resources(const Flags& flags)
{
  ...
  // GPU resource.
  Try<Resources> gpus = NvidiaGpuAllocator::resources(flags);

  if (gpus.isError()) {
    return Error("Failed to obtain GPU resources: " + gpus.error());
  }

  // When adding in the GPU resources, make sure that we filter out
  // the existing GPU resources (if any) so that we do not double
  // allocate GPUs.
  resources = gpus.get() + resources.filter(
      [](const Resource& resource) {
        return resource.name() != "gpus";
      });
  ...
}
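To make the merge concrete, here is a toy illustration (plain C++, not Mesos code; a std::map stands in for the Resources class) of why the pre-existing "gpus" entry from --resources is filtered out before the allocator's GPUs are added back: without the filter the two entries would be counted twice.

#include <iostream>
#include <map>
#include <string>

int main()
{
  // Resources parsed from --resources, already carrying a "gpus" entry.
  std::map<std::string, double> resources = {
      {"cpus", 8}, {"mem", 16384}, {"gpus", 2}};

  // GPUs reported by the GPU allocator (the authoritative entry).
  std::map<std::string, double> gpus = {{"gpus", 2}};

  // Mirror `gpus.get() + resources.filter(...)`: drop the old "gpus" entry,
  // keep everything else, then add the allocator's GPUs. Adding the two
  // maps without the filter would yield gpus:4, a double allocation.
  std::map<std::string, double> merged = gpus;
  for (const auto& entry : resources) {
    if (entry.first != "gpus") {
      merged.insert(entry);
    }
  }

  for (const auto& entry : merged) {
    std::cout << entry.first << ":" << entry.second << std::endl;
  }
  return 0;
}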
src/slave/containerizer/mesos/isolators/gpu/allocator.cpp uses NVIDIA's GPU management library NVML, combined with the startup flags, to report the GPU resources on this machine for later scheduling:
// To determine the proper number of GPU resources to return, we
// need to check both --resources and --nvidia_gpu_devices.
// There are two cases to consider:
//
//   (1) --resources includes "gpus" and --nvidia_gpu_devices is set.
//       The number of GPUs in --resources must equal the number of
//       GPUs within --nvidia_gpu_resources.
//
//   (2) --resources does not include "gpus" and --nvidia_gpu_devices
//       is not specified. Here we auto-discover GPUs using the
//       NVIDIA management Library (NVML). We special case specifying
//       `gpus:0` explicitly to not perform auto-discovery.
//
static Try<Resources> enumerateGpuResources(const Flags& flags)
{
  ...
}
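For reference, a minimal standalone sketch (not Mesos code; it calls NVML directly and assumes a build linked against -lnvidia-ml) of the kind of auto-discovery that case (2) relies on:

#include <iostream>
#include <nvml.h>

int main()
{
  if (nvmlInit() != NVML_SUCCESS) {
    std::cerr << "Failed to initialize NVML" << std::endl;
    return 1;
  }

  unsigned int count = 0;
  if (nvmlDeviceGetCount(&count) == NVML_SUCCESS) {
    // The agent would advertise this count as "gpus:<count>" unless
    // --resources or --nvidia_gpu_devices overrides it.
    std::cout << "gpus:" << count << std::endl;
  }

  nvmlShutdown();
  return 0;
}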
Because a GPU resource has to be bound to a concrete GPU card number, GPUs are represented in the scheduling data structures as a set<Gpu>. allocator.cpp provides the allocate and deallocate implementations:
Future<Nothing> allocate(const set<Gpu>& gpus)
{
  set<Gpu> allocation = available & gpus;

  if (allocation.size() < gpus.size()) {
    return Failure(stringify(gpus - allocation) + " are not available");
  }

  available = available - allocation;
  allocated = allocated | allocation;

  return Nothing();
}

Future<Nothing> deallocate(const set<Gpu>& gpus)
{
  set<Gpu> deallocation = allocated & gpus;

  if (deallocation.size() < gpus.size()) {
    return Failure(stringify(gpus - deallocation) + " are not allocated");
  }

  allocated = allocated - deallocation;
  available = available | deallocation;

  return Nothing();
}
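The &, | and - operators above are set intersection, union and difference on std::set (Mesos pulls these operator overloads in from its stout utility headers). A small standalone illustration of the allocate path, with plain ints standing in for Gpu:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <set>

int main()
{
  std::set<int> available = {0, 1, 2, 3};
  std::set<int> allocated;

  // A request for two specific GPUs.
  std::set<int> request = {1, 3};

  // allocation = available & request (intersection).
  std::set<int> allocation;
  std::set_intersection(available.begin(), available.end(),
                        request.begin(), request.end(),
                        std::inserter(allocation, allocation.begin()));

  if (allocation.size() < request.size()) {
    std::cout << "some requested GPUs are not available" << std::endl;
    return 1;
  }

  // available = available - allocation; allocated = allocated | allocation.
  for (int gpu : allocation) {
    available.erase(gpu);
    allocated.insert(gpu);
  }

  std::cout << "allocated: " << allocation.size()
            << ", still free: " << available.size() << std::endl;
  return 0;
}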
When this is wrapped for the layer above, however, the containerizer only has to specify how many GPUs to allocate:
Future<set<Gpu>> NvidiaGpuAllocator::allocate(size_t count)
{
  // Need to disambiguate for the compiler.
  Future<set<Gpu>> (NvidiaGpuAllocatorProcess::*allocate)(size_t) =
    &NvidiaGpuAllocatorProcess::allocate;

  return process::dispatch(data->process, allocate, count);
}
deallocate, by contrast, still has to name explicitly which GPUs to release:
Future<Nothing> NvidiaGpuAllocator::deallocate(const set<Gpu>& gpus)
{
  return process::dispatch(
      data->process,
      &NvidiaGpuAllocatorProcess::deallocate,
      gpus);
}
If the job runs under the Docker containerizer, the GPU allocation logic can be seen in src/slave/containerizer/docker.cpp:
Future<Nothing> DockerContainerizerProcess::allocateNvidiaGpus(
    const ContainerID& containerId,
    const size_t count)
{
  if (!nvidia.isSome()) {
    return Failure("Attempted to allocate GPUs"
                   " without Nvidia libraries available");
  }

  if (!containers_.contains(containerId)) {
    return Failure("Container is already destroyed");
  }

  return nvidia->allocator.allocate(count)
    .then(defer(
        self(),
        &Self::_allocateNvidiaGpus,
        containerId,
        lambda::_1));
}
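What the allocation ultimately enables is exposing the chosen cards to the container. A rough, hypothetical sketch of that step (assumed device naming and flow, not the Mesos source): the Nvidia control devices plus one /dev/nvidia<N> per allocated GPU get passed to docker run.

#include <iostream>
#include <set>
#include <string>
#include <vector>

int main()
{
  // Minor numbers of the GPUs the allocator handed back for this container.
  std::set<int> allocated = {0, 2};

  // Control devices every GPU container needs, plus one device per GPU.
  std::vector<std::string> devices = {"/dev/nvidiactl", "/dev/nvidia-uvm"};
  for (int minor : allocated) {
    devices.push_back("/dev/nvidia" + std::to_string(minor));
  }

  std::string cmd = "docker run";
  for (const std::string& device : devices) {
    cmd += " --device=" + device;
  }

  std::cout << cmd << " ..." << std::endl;
  return 0;
}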
In summary: GPU resources are discovered and registered when the slave starts; when a containerizer is launched it obtains GPUs from the GPU set pool maintained by the slave, and then starts the job.
So how does capos do it? capos is Hulu's internal resource scheduling platform (refer to https://www.cnblogs.com/yanghuahui/p/9304302.html) and implements its own capos containerizer for Mesos. Our approach is: when the Mesos slave registers, GPU resources are discovered either explicitly through flags or by auto-detection, and the Mesos agent is started with the GPUs expressed as a range in --resources. The GPUs in a resource offer therefore look to capos like a range, which can be scheduled much the same way as port resources. The capos containerizer then binds one or more GPUs to the Docker nvidia runtime according to the GPU range chosen by the scheduler, which completes the GPU scheduling feature. A hypothetical sketch of this range-to-device translation follows.
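A minimal, hypothetical sketch of that translation (assumed names and range format, not capos source): parse a port-style GPU range such as "[0-1]" and turn it into the device list handed to the Docker nvidia runtime, here via the NVIDIA_VISIBLE_DEVICES environment variable.

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Parse a range string like "[0-1]" into the GPU indices it covers.
// The "[a-b]" form mirrors how Mesos renders port-style range resources.
std::vector<int> parseGpuRange(const std::string& range)
{
  int begin = 0, end = 0;
  char lbracket, dash, rbracket;
  std::istringstream in(range);
  in >> lbracket >> begin >> dash >> end >> rbracket;

  std::vector<int> gpus;
  for (int i = begin; i <= end; ++i) {
    gpus.push_back(i);
  }
  return gpus;
}

int main()
{
  // Suppose the scheduler assigned the range [0-1] (GPUs 0 and 1) to a task.
  std::vector<int> gpus = parseGpuRange("[0-1]");

  // Build the value for NVIDIA_VISIBLE_DEVICES, e.g. "0,1"; with the nvidia
  // runtime this makes exactly those devices visible inside the container.
  std::string devices;
  for (size_t i = 0; i < gpus.size(); ++i) {
    if (i > 0) {
      devices += ",";
    }
    devices += std::to_string(gpus[i]);
  }

  std::cout << "docker run --runtime=nvidia"
            << " -e NVIDIA_VISIBLE_DEVICES=" << devices << " ..." << std::endl;
  return 0;
}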