
[miyamoto] Reading notes #8

Open
DaisukeMiyamoto opened this issue Aug 3, 2019 · 5 comments

Goal: dig into the EFA (Elastic Fabric Adapter) implementation.

The EFA driver is provided as a provider inside libfabric.
NCCL itself does not use libfabric by default, so aws-ofi-nccl (https://github.com/aws/aws-ofi-nccl) is provided as an NCCL plug-in.

Question: what exactly is an NCCL plug-in, and how is it implemented?

DaisukeMiyamoto commented Aug 3, 2019

The mapping is done here.
If we implement this set of functions, can the device side be swapped out?

https://github.com/aws/aws-ofi-nccl/blob/master/src/nccl_ofi_net.c#L1450

const ncclNet_t NCCL_PLUGIN_SYMBOL = {
	.name = "AWS Libfabric",
	.init = ofi_init,
	.devices = ofi_devices,
	.pciPath = ofi_pciPath,
	.ptrSupport = ofi_ptrSupport,
	.listen = ofi_listen,
	.connect = ofi_connect,
	.accept = ofi_accept,
	.regMr = ofi_regMr,
	.deregMr = ofi_deregMr,
	.isend = ofi_isend,
	.irecv = ofi_irecv,
	.flush = ofi_flush,
	.test = ofi_test,
	.closeSend = ofi_closeSend,
	.closeRecv = ofi_closeRecv,
	.closeListen = ofi_closeListen,
};
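
Presumably NCCL picks this table up by dlopening the plug-in shared library (libnccl-net.so) and resolving NCCL_PLUGIN_SYMBOL. A minimal sketch of that lookup, written from my own understanding rather than copied from NCCL source:

/* Sketch of the plug-in discovery as I understand it: NCCL dlopens
 * libnccl-net.so and resolves the versioned symbol (ncclNetPlugin_v2 here),
 * which should be the filled-in ncclNet_t table exported by the plug-in. */
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
	void *lib = dlopen("libnccl-net.so", RTLD_NOW | RTLD_LOCAL);
	if (lib == NULL) {
		/* No plug-in installed: NCCL falls back to its built-in transports. */
		printf("no net plug-in found\n");
		return 0;
	}
	void *table = dlsym(lib, "ncclNetPlugin_v2");
	printf("ncclNetPlugin_v2 %s\n", table != NULL ? "found" : "not found");
	dlclose(lib);
	return 0;
}

(Build with -ldl; this only checks whether the symbol is visible, it does not call into the table.)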

On the NCCL side, it's around here:

https://github.com/NVIDIA/nccl/blob/9db4b1d801624a00591b7aafd426d6dd23547443/ext-net/dummy/plugin.c

Roughly, these seem to correspond to:

  • net_ib.cc
  • net_socket.cc
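
To get a feel for the shape, here is a minimal "do nothing" plug-in in the style of ext-net/dummy/plugin.c. The stub_* names are my own placeholders, and I am assuming the nccl_net.h header from NCCL's ext-net directory provides the types:

#include "nccl_net.h"

/* Every entry just returns an error (or a trivial success), purely to show
 * what a plug-in has to provide. Not anything from aws-ofi-nccl. */
static ncclResult_t stub_init(ncclDebugLogger_t logFunction) { return ncclSuccess; }
static ncclResult_t stub_devices(int* ndev) { *ndev = 0; return ncclSuccess; }
static ncclResult_t stub_pciPath(int dev, char** path) { return ncclInternalError; }
static ncclResult_t stub_ptrSupport(int dev, int* supportedTypes) { return ncclInternalError; }
static ncclResult_t stub_listen(int dev, void* handle, void** listenComm) { return ncclInternalError; }
static ncclResult_t stub_connect(int dev, void* handle, void** sendComm) { return ncclInternalError; }
static ncclResult_t stub_accept(void* listenComm, void** recvComm) { return ncclInternalError; }
static ncclResult_t stub_regMr(void* comm, void* data, int size, int type, void** mhandle) { return ncclInternalError; }
static ncclResult_t stub_deregMr(void* comm, void* mhandle) { return ncclInternalError; }
static ncclResult_t stub_isend(void* sendComm, void* data, int size, void* mhandle, void** request) { return ncclInternalError; }
static ncclResult_t stub_irecv(void* recvComm, void* data, int size, void* mhandle, void** request) { return ncclInternalError; }
static ncclResult_t stub_flush(void* recvComm, void* data, int size, void* mhandle) { return ncclInternalError; }
static ncclResult_t stub_test(void* request, int* done, int* size) { return ncclInternalError; }
static ncclResult_t stub_closeSend(void* sendComm) { return ncclInternalError; }
static ncclResult_t stub_closeRecv(void* recvComm) { return ncclInternalError; }
static ncclResult_t stub_closeListen(void* listenComm) { return ncclInternalError; }

const ncclNet_t NCCL_PLUGIN_SYMBOL = {
	.name = "Stub",
	.init = stub_init,
	.devices = stub_devices,
	.pciPath = stub_pciPath,
	.ptrSupport = stub_ptrSupport,
	.listen = stub_listen,
	.connect = stub_connect,
	.accept = stub_accept,
	.regMr = stub_regMr,
	.deregMr = stub_deregMr,
	.isend = stub_isend,
	.irecv = stub_irecv,
	.flush = stub_flush,
	.test = stub_test,
	.closeSend = stub_closeSend,
	.closeRecv = stub_closeRecv,
	.closeListen = stub_closeListen,
};

If I read the loading path right, compiling this into a shared library named libnccl-net.so and putting it on the library search path should be enough for NCCL to pick it up in place of net_ib.cc / net_socket.cc.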

@DaisukeMiyamoto commented:

How rings are built

nccl/src/misc/rings.cc

Lines 67 to 101 in 0ceaec9

/*
* Ring creation algorithm
*
* First, we establish hierarchical coordinates depending on the way ranks can
* communicate. After fillCoords, we have for each rank a unique 3-int array
* { node, pci_domain, rank } corresponding to the three transports :
* { 2[NET], 1[SHM], 0[P2P] }.
* Also, we renumber ranks (to indexes) based on their growing coordinates.
*
* Then, we ask transports to connect groups together. We start with net, then
* shm, then p2p. We maintain two arrays, prev and next, where values are equal
* to -1 when ranks are not yet connected, and a rank otherwise. We never
* connect ranks outside our group, meaning that on 4 nodes of 2 sockets of 4
* ranks, if we are rank 13, we should see something like (provided we have a
* single net interface, hence a single ring) :
*
* Connecting all nodes <13>
* 2[NET] : prev 31 -1 -1 -1 -1 -1 -1 -1 7 -1 -1 -1 -1 -1 -1 -1 15 -1 -1 -1 -1 -1 -1 -1 23 -1 -1 -1 -1 -1 -1 -1
* next -1 -1 -1 -1 -1 -1 -1 8 -1 -1 -1 -1 -1 -1 -1 16 -1 -1 -1 -1 -1 -1 -1 24 -1 -1 -1 -1 -1 -1 -1 0
*
* Connecting P2P domains with shared memory <13>
* 1[SHM] : prev 31 -1 -1 -1 -1 -1 -1 -1 7 -1 -1 -1 11 -1 -1 -1 15 -1 -1 -1 -1 -1 -1 -1 23 -1 -1 -1 -1 -1 -1 -1
* next -1 -1 -1 -1 -1 -1 -1 8 -1 -1 -1 12 -1 -1 -1 16 -1 -1 -1 -1 -1 -1 -1 24 -1 -1 -1 -1 -1 -1 -1 0
*
* Connecting ranks (only inside the P2P domain) <13>
* 0[P2P] : prev 31 -1 -1 -1 -1 -1 -1 -1 7 -1 -1 -1 11 12 13 14 15 -1 -1 -1 -1 -1 -1 -1 23 -1 -1 -1 -1 -1 -1 -1
* next -1 -1 -1 -1 -1 -1 -1 8 -1 -1 -1 12 13 14 15 16 -1 -1 -1 -1 -1 -1 -1 24 -1 -1 -1 -1 -1 -1 -1 0
*
* Hence, when we ask a transport to connect groups, we provide it with a subview of the ranks (except for net
* which always sees the full world). That way, P2P can bruteforce all combinations inside the node without
* risking to explode in terms of combinations, and we scale better.
*
* Finally, we loop over Network scores to try to create rings with high scores (=locality) and decrease until
* we get at least one ring.
*/
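
To convince myself of how prev/next get filled in, here is a toy re-creation (my own code, not NCCL's) of the 4 nodes x 2 sockets x 4 ranks example above, seen from rank 13:

/* Toy sketch of the hierarchical prev/next filling described in the comment
 * above. -1 means "not connected yet"; each level only links the boundaries
 * it is responsible for, and SHM/P2P only touch our own node/socket. */
#include <stdio.h>

#define NRANKS 32
#define RANKS_PER_NODE 8
#define RANKS_PER_SOCKET 4
#define MY_RANK 13

static void show(const char *level, int *prev, int *next) {
	printf("%s\n prev:", level);
	for (int i = 0; i < NRANKS; i++) printf(" %d", prev[i]);
	printf("\n next:");
	for (int i = 0; i < NRANKS; i++) printf(" %d", next[i]);
	printf("\n");
}

int main(void) {
	int prev[NRANKS], next[NRANKS];
	for (int i = 0; i < NRANKS; i++) prev[i] = next[i] = -1;

	/* 2[NET]: link the last rank of each node to the first rank of the next node. */
	for (int n = 0; n < NRANKS / RANKS_PER_NODE; n++) {
		int last = n * RANKS_PER_NODE + RANKS_PER_NODE - 1;
		int firstOfNext = (last + 1) % NRANKS;
		next[last] = firstOfNext;
		prev[firstOfNext] = last;
	}
	show("2[NET]", prev, next);

	/* 1[SHM]: inside our own node only, link the two sockets (P2P domains). */
	int nodeBase = (MY_RANK / RANKS_PER_NODE) * RANKS_PER_NODE;
	next[nodeBase + RANKS_PER_SOCKET - 1] = nodeBase + RANKS_PER_SOCKET;
	prev[nodeBase + RANKS_PER_SOCKET] = nodeBase + RANKS_PER_SOCKET - 1;
	show("1[SHM]", prev, next);

	/* 0[P2P]: inside our own socket only, chain the ranks. */
	int sockBase = (MY_RANK / RANKS_PER_SOCKET) * RANKS_PER_SOCKET;
	for (int r = sockBase; r < sockBase + RANKS_PER_SOCKET - 1; r++) {
		next[r] = r + 1;
		prev[r + 1] = r;
	}
	show("0[P2P]", prev, next);
	return 0;
}

The printed arrays reproduce the <13> example in the comment, which matches my reading that net always sees the whole world while shm/p2p only get a subview of the ranks.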

DaisukeMiyamoto commented Aug 3, 2019

Back to the plug-in: in the end, net_ib.cc and net_socket.cc also build a struct of type ncclNet_t.

ncclNet_t ncclNetIb = {
  "IB",
  ncclIbInit,
  ncclIbDevices,
  ncclIbPciPath,
  ncclIbPtrSupport,
  ncclIbListen,
  ncclIbConnect,
  ncclIbAccept,
  ncclIbRegMr,
  ncclIbDeregMr,
  ncclIbIsend,
  ncclIbIrecv,
  ncclIbFlush,
  ncclIbTest,
  ncclIbCloseSend,
  ncclIbCloseRecv,
  ncclIbCloseListen
};

typedef struct {
  // Name of the network (mainly for logs)
  const char* name;
  // Initialize the network.
  ncclResult_t (*init)(ncclDebugLogger_t logFunction);
  // Return the number of adapters.
  ncclResult_t (*devices)(int* ndev);
  // Return the device path in /sys. NCCL will call free on this path.
  ncclResult_t (*pciPath)(int dev, char** path);
  // Return whether this device supports host pointers and/or CUDA pointers
  // as data from the current GPU. Supported types should be composed with
  // NCCL_PTR_HOST and NCCL_PTR_CUDA.
  ncclResult_t (*ptrSupport)(int dev, int* supportedTypes);
  // Create a receiving object and provide a handle to connect to it. The
  // handle can be up to NCCL_NET_HANDLE_MAXSIZE bytes and will be exchanged
  // between ranks to create a connection.
  ncclResult_t (*listen)(int dev, void* handle, void** listenComm);
  // Connect to a handle and return a sending comm object for that peer.
  ncclResult_t (*connect)(int dev, void* handle, void** sendComm);
  // Finalize connection establishment after remote peer has called connectHandle
  ncclResult_t (*accept)(void* listenComm, void** recvComm);
  // Register/Deregister memory. Comm can be either a sendComm or a recvComm.
  // Type is either NCCL_PTR_HOST or NCCL_PTR_CUDA.
  ncclResult_t (*regMr)(void* comm, void* data, int size, int type, void** mhandle);
  ncclResult_t (*deregMr)(void* comm, void* mhandle);
  // Asynchronous send to a peer.
  // May return request == NULL if the call cannot be performed (or would block)
  ncclResult_t (*isend)(void* sendComm, void* data, int size, void* mhandle, void** request);
  // Asynchronous recv from a peer.
  // May return request == NULL if the call cannot be performed (or would block)
  ncclResult_t (*irecv)(void* recvComm, void* data, int size, void* mhandle, void** request);
  // Perform a flush/fence to make sure all data received with NCCL_PTR_CUDA is
  // visible to the GPU
  ncclResult_t (*flush)(void* recvComm, void* data, int size, void* mhandle);
  // Test whether a request is complete. If size is not NULL, it returns the
  // number of bytes sent/received.
  ncclResult_t (*test)(void* request, int* done, int* size);
  // Close and free send/recv comm objects
  ncclResult_t (*closeSend)(void* sendComm);
  ncclResult_t (*closeRecv)(void* recvComm);
  ncclResult_t (*closeListen)(void* listenComm);
} ncclNet_v2_t;
typedef ncclNet_v2_t ncclNet_t;
#define NCCL_PLUGIN_SYMBOL ncclNetPlugin_v2
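
Since NCCL_PLUGIN_SYMBOL expands to ncclNetPlugin_v2, the table aws-ofi-nccl defines above ends up exported under that versioned symbol name, which (assuming my reading of the loading path above is right) is exactly what NCCL would look up after dlopening libnccl-net.so.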

@DaisukeMiyamoto commented:

GPU Direct support

nccl/src/transport/net.cc

Lines 249 to 282 in 0ceaec9

static ncclResult_t netGetGdrSupport(int dev, int read, int* useGdr) {
  *useGdr = 0;
  int cudaDev, nvmlDev;
  CUDACHECK(cudaGetDevice(&cudaDev));
  NCCLCHECK(getNvmlDevice(cudaDev, &nvmlDev))
  if (read) { // For reads (sends) only enable under certain conditions
    int gdrReadParam = ncclParamNetGdrRead();
    if (gdrReadParam == 0) return ncclSuccess;
    if (gdrReadParam < 0) {
      int nvlink;
      NCCLCHECK(ncclNvlinkGpu(&nvlink));
      if (!nvlink) return ncclSuccess;
    }
  }
  // Check if we are close enough that it makes sense to enable GDR
  int netGdrLevel = ncclParamNetGdrLevel();
  short distance;
  NCCLCHECK(netDistance(cudaDev, dev, &distance));
  if (distance >= netGdrLevel) {
    INFO(NCCL_NET,"NET/%s : GPU Direct RDMA Disabled for GPU %d[%d] / HCA %d (distance %d >= %d)", ncclNetName(), cudaDev, nvmlDev, dev, distance, netGdrLevel);
    return ncclSuccess;
  }
  // Finally, check if the NIC supports it
  int flags;
  NCCLCHECK(ncclNetPtrSupport(dev, &flags));
  if ((flags & NCCL_PTR_CUDA) == 0) return ncclSuccess;
  *useGdr = 1;
  INFO(NCCL_NET,"NET/%s : GPU Direct RDMA Enabled for GPU %d[%d] / HCA %d (distance %d < %d), read %d", ncclNetName(), cudaDev, nvmlDev, dev, distance, netGdrLevel, read);
  return ncclSuccess;
}

ncclResult_t ncclSocketPtrSupport(int dev, int* supportedTypes) {
  *supportedTypes = NCCL_PTR_HOST;
  return ncclSuccess;
}

ncclResult_t ncclIbPtrSupport(int dev, int* supportedTypes) {
  *supportedTypes = NCCL_PTR_HOST;
  int cudaDev, nvmlDev;
  CUDACHECK(cudaGetDevice(&cudaDev));
  NCCLCHECK(getNvmlDevice(cudaDev, &nvmlDev))
  if (ncclIbGdrSupport(dev) != ncclSuccess) {
    INFO(NCCL_NET,"NET/IB : GPU Direct RDMA Disabled for GPU %d[%d] / HCA %d '%s' (no module or not supported by GPU)", cudaDev, nvmlDev, dev, ncclIbDevs[dev].devName);
    return ncclSuccess;
  }
  *supportedTypes |= NCCL_PTR_CUDA;
  return ncclSuccess;
}

  • aws-ofi-nccl

https://github.com/aws/aws-ofi-nccl/blob/master/src/nccl_ofi_net.c#L770

static ncclResult_t ofi_ptrSupport(int dev, int *supportedTypes)
{
	*supportedTypes = NCCL_PTR_HOST;
	return ncclSuccess;
}
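
So with this version of aws-ofi-nccl, netGetGdrSupport can never set useGdr to 1: ofi_ptrSupport only ever advertises NCCL_PTR_HOST, so the (flags & NCCL_PTR_CUDA) check fails and GPU Direct RDMA stays disabled when going through the plug-in. Purely as a thought experiment (this is not aws-ofi-nccl code, and ofi_gdr_capable() is a made-up helper), a GDR-capable version would have to mirror the IB transport and look roughly like:

/* Hypothetical sketch only: ofi_gdr_capable() does not exist; a real
 * implementation would need to check whether the libfabric provider can
 * handle CUDA device memory for this device. */
static int ofi_gdr_capable(int dev)
{
	return 0; /* placeholder */
}

static ncclResult_t ofi_ptrSupport_gdr_sketch(int dev, int *supportedTypes)
{
	*supportedTypes = NCCL_PTR_HOST;
	if (ofi_gdr_capable(dev))
		*supportedTypes |= NCCL_PTR_CUDA;
	return ncclSuccess;
}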
