
[miyamoto] Reading notes #8

Open
DaisukeMiyamoto opened this issue Aug 3, 2019 · 5 comments

Goal: dig into the EFA (Elastic Fabric Adapter) implementation.

The EFA driver is provided as a provider inside libfabric.
NCCL itself does not use libfabric by default, so aws-ofi-nccl (https://github.com/aws/aws-ofi-nccl) is provided as an NCCL plug-in.

Question: what exactly is an NCCL plug-in, and how is it implemented?

DaisukeMiyamoto commented Aug 3, 2019

The mapping is done here.
If we implement this set of functions, can the device side be swapped out?

https://github.com/aws/aws-ofi-nccl/blob/master/src/nccl_ofi_net.c#L1450

const ncclNet_t NCCL_PLUGIN_SYMBOL = {
	.name = "AWS Libfabric",
	.init = ofi_init,
	.devices = ofi_devices,
	.pciPath = ofi_pciPath,
	.ptrSupport = ofi_ptrSupport,
	.listen = ofi_listen,
	.connect = ofi_connect,
	.accept = ofi_accept,
	.regMr = ofi_regMr,
	.deregMr = ofi_deregMr,
	.isend = ofi_isend,
	.irecv = ofi_irecv,
	.flush = ofi_flush,
	.test = ofi_test,
	.closeSend = ofi_closeSend,
	.closeRecv = ofi_closeRecv,
	.closeListen = ofi_closeListen,
};
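
Presumably NCCL picks this table up by dlopening the plug-in shared library (libnccl-net.so) and resolving NCCL_PLUGIN_SYMBOL. A minimal sketch of that lookup, written from my own understanding rather than copied from NCCL source:

/* Sketch of the plug-in discovery as I understand it: NCCL dlopens
 * libnccl-net.so and resolves the versioned symbol (ncclNetPlugin_v2 here),
 * which should be the filled-in ncclNet_t table exported by the plug-in. */
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
	void *lib = dlopen("libnccl-net.so", RTLD_NOW | RTLD_LOCAL);
	if (lib == NULL) {
		/* No plug-in installed: NCCL falls back to its built-in transports. */
		printf("no net plug-in found\n");
		return 0;
	}
	void *table = dlsym(lib, "ncclNetPlugin_v2");
	printf("ncclNetPlugin_v2 %s\n", table != NULL ? "found" : "not found");
	dlclose(lib);
	return 0;
}

(Build with -ldl; this only checks whether the symbol is visible, it does not call into the table.)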

On the NCCL side, it's around here:

https://github.com/NVIDIA/nccl/blob/9db4b1d801624a00591b7aafd426d6dd23547443/ext-net/dummy/plugin.c

Roughly, these seem to correspond to:

  • net_ib.cc
  • net_socket.cc
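
To get a feel for the shape, here is a minimal "do nothing" plug-in in the style of ext-net/dummy/plugin.c. The stub_* names are my own placeholders, and I am assuming the nccl_net.h header from NCCL's ext-net directory provides the types:

#include "nccl_net.h"

/* Every entry just returns an error (or a trivial success), purely to show
 * what a plug-in has to provide. Not anything from aws-ofi-nccl. */
static ncclResult_t stub_init(ncclDebugLogger_t logFunction) { return ncclSuccess; }
static ncclResult_t stub_devices(int* ndev) { *ndev = 0; return ncclSuccess; }
static ncclResult_t stub_pciPath(int dev, char** path) { return ncclInternalError; }
static ncclResult_t stub_ptrSupport(int dev, int* supportedTypes) { return ncclInternalError; }
static ncclResult_t stub_listen(int dev, void* handle, void** listenComm) { return ncclInternalError; }
static ncclResult_t stub_connect(int dev, void* handle, void** sendComm) { return ncclInternalError; }
static ncclResult_t stub_accept(void* listenComm, void** recvComm) { return ncclInternalError; }
static ncclResult_t stub_regMr(void* comm, void* data, int size, int type, void** mhandle) { return ncclInternalError; }
static ncclResult_t stub_deregMr(void* comm, void* mhandle) { return ncclInternalError; }
static ncclResult_t stub_isend(void* sendComm, void* data, int size, void* mhandle, void** request) { return ncclInternalError; }
static ncclResult_t stub_irecv(void* recvComm, void* data, int size, void* mhandle, void** request) { return ncclInternalError; }
static ncclResult_t stub_flush(void* recvComm, void* data, int size, void* mhandle) { return ncclInternalError; }
static ncclResult_t stub_test(void* request, int* done, int* size) { return ncclInternalError; }
static ncclResult_t stub_closeSend(void* sendComm) { return ncclInternalError; }
static ncclResult_t stub_closeRecv(void* recvComm) { return ncclInternalError; }
static ncclResult_t stub_closeListen(void* listenComm) { return ncclInternalError; }

const ncclNet_t NCCL_PLUGIN_SYMBOL = {
	.name = "Stub",
	.init = stub_init,
	.devices = stub_devices,
	.pciPath = stub_pciPath,
	.ptrSupport = stub_ptrSupport,
	.listen = stub_listen,
	.connect = stub_connect,
	.accept = stub_accept,
	.regMr = stub_regMr,
	.deregMr = stub_deregMr,
	.isend = stub_isend,
	.irecv = stub_irecv,
	.flush = stub_flush,
	.test = stub_test,
	.closeSend = stub_closeSend,
	.closeRecv = stub_closeRecv,
	.closeListen = stub_closeListen,
};

If I read the loading path right, compiling this into a shared library named libnccl-net.so and putting it on the library search path should be enough for NCCL to pick it up in place of net_ib.cc / net_socket.cc.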

@DaisukeMiyamoto commented:

How rings are built

nccl/src/misc/rings.cc

Lines 67 to 101 in 0ceaec9

/*
* Ring creation algorithm
*
* First, we establish hierarchical coordinates depending on the way ranks can
* communicate. After fillCoords, we have for each rank a unique 3-int array
* { node, pci_domain, rank } corresponding to the three transports :
* { 2[NET], 1[SHM], 0[P2P] }.
* Also, we renumber ranks (to indexes) based on their growing coordinates.
*
* Then, we ask transports to connect groups together. We start with net, then
* shm, then p2p. We maintain two arrays, prev and next, where values are equal
* to -1 when ranks are not yet connected, and a rank otherwise. We never
* connect ranks outside our group, meaning that on 4 nodes of 2 sockets of 4
* ranks, if we are rank 13, we should see something like (provided we have a
* single net interface, hence a single ring) :
*
* Connecting all nodes <13>
* 2[NET] : prev 31 -1 -1 -1 -1 -1 -1 -1 7 -1 -1 -1 -1 -1 -1 -1 15 -1 -1 -1 -1 -1 -1 -1 23 -1 -1 -1 -1 -1 -1 -1
* next -1 -1 -1 -1 -1 -1 -1 8 -1 -1 -1 -1 -1 -1 -1 16 -1 -1 -1 -1 -1 -1 -1 24 -1 -1 -1 -1 -1 -1 -1 0
*
* Connecting P2P domains with shared memory <13>
* 1[SHM] : prev 31 -1 -1 -1 -1 -1 -1 -1 7 -1 -1 -1 11 -1 -1 -1 15 -1 -1 -1 -1 -1 -1 -1 23 -1 -1 -1 -1 -1 -1 -1
* next -1 -1 -1 -1 -1 -1 -1 8 -1 -1 -1 12 -1 -1 -1 16 -1 -1 -1 -1 -1 -1 -1 24 -1 -1 -1 -1 -1 -1 -1 0
*
* Connecting ranks (only inside the P2P domain) <13>
* 0[P2P] : prev 31 -1 -1 -1 -1 -1 -1 -1 7 -1 -1 -1 11 12 13 14 15 -1 -1 -1 -1 -1 -1 -1 23 -1 -1 -1 -1 -1 -1 -1
* next -1 -1 -1 -1 -1 -1 -1 8 -1 -1 -1 12 13 14 15 16 -1 -1 -1 -1 -1 -1 -1 24 -1 -1 -1 -1 -1 -1 -1 0
*
* Hence, when we ask a transport to connect groups, we provide it with a subview of the ranks (except for net
* which always sees the full world). That way, P2P can bruteforce all combinations inside the node without
* risking to explode in terms of combinations, and we scale better.
*
* Finally, we loop over Network scores to try to create rings with high scores (=locality) and decrease until
* we get at least one ring.
*/
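
To convince myself of how prev/next get filled in, here is a toy re-creation (my own code, not NCCL's) of the 4 nodes x 2 sockets x 4 ranks example above, seen from rank 13:

/* Toy sketch of the hierarchical prev/next filling described in the comment
 * above. -1 means "not connected yet"; each level only links the boundaries
 * it is responsible for, and SHM/P2P only touch our own node/socket. */
#include <stdio.h>

#define NRANKS 32
#define RANKS_PER_NODE 8
#define RANKS_PER_SOCKET 4
#define MY_RANK 13

static void show(const char *level, int *prev, int *next) {
	printf("%s\n prev:", level);
	for (int i = 0; i < NRANKS; i++) printf(" %d", prev[i]);
	printf("\n next:");
	for (int i = 0; i < NRANKS; i++) printf(" %d", next[i]);
	printf("\n");
}

int main(void) {
	int prev[NRANKS], next[NRANKS];
	for (int i = 0; i < NRANKS; i++) prev[i] = next[i] = -1;

	/* 2[NET]: link the last rank of each node to the first rank of the next node. */
	for (int n = 0; n < NRANKS / RANKS_PER_NODE; n++) {
		int last = n * RANKS_PER_NODE + RANKS_PER_NODE - 1;
		int firstOfNext = (last + 1) % NRANKS;
		next[last] = firstOfNext;
		prev[firstOfNext] = last;
	}
	show("2[NET]", prev, next);

	/* 1[SHM]: inside our own node only, link the two sockets (P2P domains). */
	int nodeBase = (MY_RANK / RANKS_PER_NODE) * RANKS_PER_NODE;
	next[nodeBase + RANKS_PER_SOCKET - 1] = nodeBase + RANKS_PER_SOCKET;
	prev[nodeBase + RANKS_PER_SOCKET] = nodeBase + RANKS_PER_SOCKET - 1;
	show("1[SHM]", prev, next);

	/* 0[P2P]: inside our own socket only, chain the ranks. */
	int sockBase = (MY_RANK / RANKS_PER_SOCKET) * RANKS_PER_SOCKET;
	for (int r = sockBase; r < sockBase + RANKS_PER_SOCKET - 1; r++) {
		next[r] = r + 1;
		prev[r + 1] = r;
	}
	show("0[P2P]", prev, next);
	return 0;
}

The printed arrays reproduce the <13> example in the comment, which matches my reading that net always sees the whole world while shm/p2p only get a subview of the ranks.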

DaisukeMiyamoto commented Aug 3, 2019

Back to the plug-in: in the end, net_ib.cc and net_socket.cc also build a struct of type ncclNet_t.

ncclNet_t ncclNetIb = {
  "IB",
  ncclIbInit,
  ncclIbDevices,
  ncclIbPciPath,
  ncclIbPtrSupport,
  ncclIbListen,
  ncclIbConnect,
  ncclIbAccept,
  ncclIbRegMr,
  ncclIbDeregMr,
  ncclIbIsend,
  ncclIbIrecv,
  ncclIbFlush,
  ncclIbTest,
  ncclIbCloseSend,
  ncclIbCloseRecv,
  ncclIbCloseListen
};

typedef struct {
  // Name of the network (mainly for logs)
  const char* name;
  // Initialize the network.
  ncclResult_t (*init)(ncclDebugLogger_t logFunction);
  // Return the number of adapters.
  ncclResult_t (*devices)(int* ndev);
  // Return the device path in /sys. NCCL will call free on this path.
  ncclResult_t (*pciPath)(int dev, char** path);
  // Return whether this device supports host pointers and/or CUDA pointers
  // as data from the current GPU. Supported types should be composed with
  // NCCL_PTR_HOST and NCCL_PTR_CUDA.
  ncclResult_t (*ptrSupport)(int dev, int* supportedTypes);
  // Create a receiving object and provide a handle to connect to it. The
  // handle can be up to NCCL_NET_HANDLE_MAXSIZE bytes and will be exchanged
  // between ranks to create a connection.
  ncclResult_t (*listen)(int dev, void* handle, void** listenComm);
  // Connect to a handle and return a sending comm object for that peer.
  ncclResult_t (*connect)(int dev, void* handle, void** sendComm);
  // Finalize connection establishment after remote peer has called connectHandle
  ncclResult_t (*accept)(void* listenComm, void** recvComm);
  // Register/Deregister memory. Comm can be either a sendComm or a recvComm.
  // Type is either NCCL_PTR_HOST or NCCL_PTR_CUDA.
  ncclResult_t (*regMr)(void* comm, void* data, int size, int type, void** mhandle);
  ncclResult_t (*deregMr)(void* comm, void* mhandle);
  // Asynchronous send to a peer.
  // May return request == NULL if the call cannot be performed (or would block)
  ncclResult_t (*isend)(void* sendComm, void* data, int size, void* mhandle, void** request);
  // Asynchronous recv from a peer.
  // May return request == NULL if the call cannot be performed (or would block)
  ncclResult_t (*irecv)(void* recvComm, void* data, int size, void* mhandle, void** request);
  // Perform a flush/fence to make sure all data received with NCCL_PTR_CUDA is
  // visible to the GPU
  ncclResult_t (*flush)(void* recvComm, void* data, int size, void* mhandle);
  // Test whether a request is complete. If size is not NULL, it returns the
  // number of bytes sent/received.
  ncclResult_t (*test)(void* request, int* done, int* size);
  // Close and free send/recv comm objects
  ncclResult_t (*closeSend)(void* sendComm);
  ncclResult_t (*closeRecv)(void* recvComm);
  ncclResult_t (*closeListen)(void* listenComm);
} ncclNet_v2_t;
typedef ncclNet_v2_t ncclNet_t;
#define NCCL_PLUGIN_SYMBOL ncclNetPlugin_v2
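
Since NCCL_PLUGIN_SYMBOL expands to ncclNetPlugin_v2, the table aws-ofi-nccl defines above ends up exported under that versioned symbol name, which (assuming my reading of the loading path above is right) is exactly what NCCL would look up after dlopening libnccl-net.so.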

@DaisukeMiyamoto commented:

GPU Direct support

nccl/src/transport/net.cc

Lines 249 to 282 in 0ceaec9

static ncclResult_t netGetGdrSupport(int dev, int read, int* useGdr) {
  *useGdr = 0;
  int cudaDev, nvmlDev;
  CUDACHECK(cudaGetDevice(&cudaDev));
  NCCLCHECK(getNvmlDevice(cudaDev, &nvmlDev))
  if (read) { // For reads (sends) only enable under certain conditions
    int gdrReadParam = ncclParamNetGdrRead();
    if (gdrReadParam == 0) return ncclSuccess;
    if (gdrReadParam < 0) {
      int nvlink;
      NCCLCHECK(ncclNvlinkGpu(&nvlink));
      if (!nvlink) return ncclSuccess;
    }
  }
  // Check if we are close enough that it makes sense to enable GDR
  int netGdrLevel = ncclParamNetGdrLevel();
  short distance;
  NCCLCHECK(netDistance(cudaDev, dev, &distance));
  if (distance >= netGdrLevel) {
    INFO(NCCL_NET,"NET/%s : GPU Direct RDMA Disabled for GPU %d[%d] / HCA %d (distance %d >= %d)", ncclNetName(), cudaDev, nvmlDev, dev, distance, netGdrLevel);
    return ncclSuccess;
  }
  // Finally, check if the NIC supports it
  int flags;
  NCCLCHECK(ncclNetPtrSupport(dev, &flags));
  if ((flags & NCCL_PTR_CUDA) == 0) return ncclSuccess;
  *useGdr = 1;
  INFO(NCCL_NET,"NET/%s : GPU Direct RDMA Enabled for GPU %d[%d] / HCA %d (distance %d < %d), read %d", ncclNetName(), cudaDev, nvmlDev, dev, distance, netGdrLevel, read);
  return ncclSuccess;
}

ncclResult_t ncclSocketPtrSupport(int dev, int* supportedTypes) {
  *supportedTypes = NCCL_PTR_HOST;
  return ncclSuccess;
}

ncclResult_t ncclIbPtrSupport(int dev, int* supportedTypes) {
  *supportedTypes = NCCL_PTR_HOST;
  int cudaDev, nvmlDev;
  CUDACHECK(cudaGetDevice(&cudaDev));
  NCCLCHECK(getNvmlDevice(cudaDev, &nvmlDev))
  if (ncclIbGdrSupport(dev) != ncclSuccess) {
    INFO(NCCL_NET,"NET/IB : GPU Direct RDMA Disabled for GPU %d[%d] / HCA %d '%s' (no module or not supported by GPU)", cudaDev, nvmlDev, dev, ncclIbDevs[dev].devName);
    return ncclSuccess;
  }
  *supportedTypes |= NCCL_PTR_CUDA;
  return ncclSuccess;
}

  • aws-ofi-nccl

https://github.com/aws/aws-ofi-nccl/blob/master/src/nccl_ofi_net.c#L770

static ncclResult_t ofi_ptrSupport(int dev, int *supportedTypes)
{
	*supportedTypes = NCCL_PTR_HOST;
	return ncclSuccess;
}
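
So with this version of aws-ofi-nccl, netGetGdrSupport can never set useGdr to 1: ofi_ptrSupport only ever advertises NCCL_PTR_HOST, so the (flags & NCCL_PTR_CUDA) check fails and GPU Direct RDMA stays disabled when going through the plug-in. Purely as a thought experiment (this is not aws-ofi-nccl code, and ofi_gdr_capable() is a made-up helper), a GDR-capable version would have to mirror the IB transport and look roughly like:

/* Hypothetical sketch only: ofi_gdr_capable() does not exist; a real
 * implementation would need to check whether the libfabric provider can
 * handle CUDA device memory for this device. */
static int ofi_gdr_capable(int dev)
{
	return 0; /* placeholder */
}

static ncclResult_t ofi_ptrSupport_gdr_sketch(int dev, int *supportedTypes)
{
	*supportedTypes = NCCL_PTR_HOST;
	if (ofi_gdr_capable(dev))
		*supportedTypes |= NCCL_PTR_CUDA;
	return ncclSuccess;
}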
