JAX Distributed Demo

This code demonstrates using JAX in an idiomatic distributed fashion at NERSC.

It starts a job spanning several nodes, with one process per GPU on each node. Each process generates some local data, small enough to fit on its GPU, and the data is then aggregated into a larger sharded array. We run a computation over that sharded data and return.
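The flow described above can be sketched roughly as follows. This is a minimal illustration rather than the contents of distributed.py: the shard size (ROWS_PER_DEVICE, COLS), the mesh axis name "data", the fill values, and the helper name build_global_array are all assumptions made for the example. In a real multi-process Slurm job, jax.distributed.initialize() would need to be called first; it is left out here so the sketch also runs in a single process.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical shard dimensions, chosen only for illustration.
ROWS_PER_DEVICE = 4
COLS = 8


def build_global_array():
    """Assemble one global sharded array from per-device local shards."""
    devices = jax.devices()              # every device in the job
    local_devices = jax.local_devices()  # devices owned by this process

    # 1-D mesh over all devices; rows of the global array are split along "data".
    mesh = Mesh(np.array(devices), axis_names=("data",))
    sharding = NamedSharding(mesh, P("data"))
    global_shape = (ROWS_PER_DEVICE * len(devices), COLS)

    # Each process creates one shard per local device, placed on that device.
    local_shards = [
        jax.device_put(jnp.full((ROWS_PER_DEVICE, COLS), float(d.id + 1)), d)
        for d in local_devices
    ]

    # Stitch the single-device shards into one logical, sharded array.
    return jax.make_array_from_single_device_arrays(
        global_shape, sharding, local_shards
    )


global_array = build_global_array()

# A jitted computation over the sharded array; JAX handles cross-device reduction.
total = jax.jit(jnp.sum)(global_array)
print(global_array.shape, float(total))
```

Each process passes only the shards it owns, and the sharding tells JAX how those shards map onto the global shape.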

The code is mainly inspired by this tutorial.

Usage

To run, submit the container.slurm script from the folder that contains the source (you will need to change the account name):

sbatch container.slurm

Notice that the script loads an NVIDIA JAX container with GPU enabled instead of using modules. This ensures that we have a working installation of JAX that is compatible with distributed computation (something that can be complicated to set up from scratch).

See the NERSC Shifter documentation for further information on container usage (such as using the mpich module to enable MPI use).

Files

container.slurm contains our Slurm script,
distributed.py contains the demo Python script (based on make_array_from_single_device_arrays),
distributed_local_to_global.py contains an alternative implementation (based on multihost_utils.host_local_array_to_global_array),
output.out contains typical output.
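For comparison, the multihost_utils route mentioned above might look like the sketch below. It is not the contents of distributed_local_to_global.py: the shapes, the "data" axis name, and the fill values are assumptions, and jax.distributed.initialize() is again omitted so the example runs in a single process. Here each process hands over one host-local array covering all of its devices, instead of building one shard per device by hand.

```python
import jax
import numpy as np
from jax.experimental import multihost_utils
from jax.sharding import Mesh, PartitionSpec as P

# Hypothetical sizes, chosen only for illustration.
ROWS_PER_DEVICE = 4
COLS = 8

# 1-D mesh over every device in the job, with an assumed axis name "data".
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

# Host-local data: enough rows for all devices attached to this process.
local_rows = ROWS_PER_DEVICE * jax.local_device_count()
local_data = np.full(
    (local_rows, COLS), float(jax.process_index()), dtype=np.float32
)

# Interpret the host-local array as this process's slice of a global array
# sharded along "data".
global_array = multihost_utils.host_local_array_to_global_array(
    local_data, mesh, P("data")
)
print(global_array.shape)
```

With a single process the global shape equals the local shape; with several processes the leading dimension grows with the total device count.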
