Merge pull request #10 from aishjayashankar/aishwarya/add-help-for-first-run

Add helper notes for first experiment run
tremblerz authored Jul 12, 2024
2 parents 055264f + 3c5713a commit 8862ce0
Showing 3 changed files with 21 additions and 5 deletions.
Binary file added resources/images/TensorboardSample.png
23 changes: 18 additions & 5 deletions src/README.md
@@ -1,15 +1,28 @@
### Running the code
Let's say you want to run the model training of 3 nodes on a machine. That means there will be 4 nodes in total because there is 1 more node in addition to the clients --- server.
The whole point of this project is to eventually transition to a distributed system where each node can be a separate machine and a server is simply another node. But for now, this is how things are done.
- You can do execute the 3 node simulation by running the following command:
+
+ You can execute the 3 node simulation by running the following command:<br>
`mpirun -np 4 -host localhost:11 python main.py`
+
+ Depending on the environment you're running the experiment on, you may have to update the config files based on the number of GPUs available. Refer to the [Config file](#config-file) section for more information.
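
The process-count rule described above (one MPI process per client, plus one for the server) can be sketched as a small helper. This is an illustration only and not part of the commit: `build_mpirun_command` and its parameters are hypothetical names, and reading the `:11` suffix as a slot count is an assumption based on Open MPI's `-host` syntax.

```python
import shlex


def build_mpirun_command(num_clients: int, host: str = "localhost",
                         slots: int = 11, script: str = "main.py") -> str:
    """Build the mpirun command line for `num_clients` clients plus one server.

    Hypothetical helper: the repository itself only documents the final command.
    """
    if num_clients < 1:
        raise ValueError("at least one client is required")
    num_processes = num_clients + 1  # one extra process for the server node
    args = ["mpirun", "-np", str(num_processes),
            "-host", f"{host}:{slots}", "python", script]
    return shlex.join(args)


print(build_mpirun_command(3))
```

For three clients this reproduces the command quoted in the README.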


### Config file
- The config file is the most important file when running the code. Always be sure of what config you are using. Our `main.py` file uses `non_iid_clients.py` by default and that file has multiple configurations with one of it assigned as a default. We have intentionally kept configuration files as a python file which is typically a big red flag in software engineering. But we did this because it enables plenty of quick automations and flexibility. Be very careful with the config file because it is easy to overlook some of the configurations such as device ids, number of clients etc.
+ The config file is the most important file when running the code. Always be sure of what config you are using. Our `main.py` file uses a combination of `algo_config.py` and `sys_config.py` by default. These files can have multiple configurations defined within them and the provision to select one as default.
+
+ We have intentionally kept configuration files as a python file which is typically a big red flag in software engineering. But we did this because it enables plenty of quick automations and flexibility. Be very careful with the config file because it is easy to overlook some of the configurations such as device ids, number of clients etc.
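
To illustrate the idea of several configurations living in one Python file with one selected as the default, here is a minimal sketch. It is not the repository's actual config: `single_gpu`, `multi_gpu`, `current_config`, `check_config` and the `num_clients` key are hypothetical names, and only `seed` and `device_ids` are taken from `sys_config.py`. The check targets the kind of oversight the README warns about, such as a `device_ids` mapping that does not match the number of clients.

```python
from typing import Any, Dict, List

# Two hypothetical configurations defined side by side in one module.
single_gpu: Dict[str, Any] = {
    "seed": 2,
    "num_clients": 3,  # hypothetical key, used only for this sketch
    "device_ids": {"node_0": [0], "node_1": [0], "node_2": [0], "node_3": [0]},
}
multi_gpu: Dict[str, Any] = {
    **single_gpu,
    "device_ids": {"node_0": [0], "node_1": [1], "node_2": [1], "node_3": [2]},
}

# The configuration selected as the default (assumed convention).
current_config = single_gpu


def check_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    # node_0 is the server, so clients occupy node_1 .. node_<num_clients>.
    expected = {f"node_{i}" for i in range(config["num_clients"] + 1)}
    actual = set(config["device_ids"])
    problems = []
    missing = sorted(expected - actual)
    extra = sorted(actual - expected)
    if missing:
        problems.append(f"missing device_ids for: {', '.join(missing)}")
    if extra:
        problems.append(f"unexpected device_ids entries: {', '.join(extra)}")
    return problems


print(check_config(current_config))
```

Running a check like this before launching a job is one way to catch a mismatched node count early; whether the project does anything similar is not stated in the source.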

- ### Reproducability
+ ### Reproducibility
One of the awesome things about this project is that whenever you run an experiment, all the source code, logs, and model weights are saved in a separate folder. This is done to ensure that you can reproduce the results by looking at the code that was responsible for the results. The naming of the folder is based on the keys inside the config file. That also means you can not run the same experiment again without renaming/deleting the previous experimental run. The code automatically asks you to press `r` to remove and create a new folder. Be careful you are not overwriting someone else's results.
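
The folder-per-experiment behaviour described above can be sketched as follows. This is an assumed reconstruction, not the project's code: `experiment_dir`, `prepare_dir`, the `key_value` naming scheme and the prompt text are all hypothetical, and the README only states that the folder name derives from config keys and that pressing `r` removes and recreates an existing folder.

```python
import shutil
from pathlib import Path
from typing import Any, Callable, Dict, Sequence


def experiment_dir(config: Dict[str, Any], keys: Sequence[str],
                   root: str = "expt_dump") -> Path:
    """Derive a folder name from selected config keys (hypothetical scheme)."""
    name = "_".join(f"{key}_{config[key]}" for key in keys)
    return Path(root) / name


def prepare_dir(path: Path, confirm: Callable[[str], str] = input) -> Path:
    """Create the experiment folder, asking before replacing an existing one."""
    if path.exists():
        answer = confirm(f"{path} already exists. Press r to remove it: ")
        if answer.strip().lower() != "r":
            raise FileExistsError(str(path))
        shutil.rmtree(path)  # discards the previous run's code, logs, weights
    path.mkdir(parents=True)
    return path
```

Passing `confirm` as a parameter keeps the prompt testable; a call such as `prepare_dir(experiment_dir(cfg, ["seed", "samples_per_user"]))` would prompt on the console only when the folder already exists.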

### Logging
- We log the results in the console and also in a log file that captures the same information. We also log a few metrics for the tensorboard. The tensorboard logs can be viewed by running tensorboard as follows:
- `tensorboard --logdir=expt_dump/ --host 0.0.0.0`. Assuming `expt_dump` is the folder where the experiment logs are stored.
+ We log the results in the console and also in a log file that captures the same information. We also log a few metrics for the tensorboard.
+
+ The tensorboard logs can be viewed by running tensorboard as follows:<br>
+ `tensorboard --logdir=expt_dump/ --host 0.0.0.0`<br>
+ Assuming `expt_dump` is the folder where the experiment logs are stored.
+
+ After a successful run with 50 epochs, the Tensorboard experiment log should look something like below:
+
+ <img src="../resources/images/TensorboardSample.png" width=50% height=50%>
3 changes: 3 additions & 0 deletions src/configs/sys_config.py
@@ -9,6 +9,9 @@
"dpath": "./datasets/imgs/cifar10/",
"seed": 2,
# node_0 is a server currently
+ # The device_ids dictionary depicts the GPUs on which the nodes reside.
+ # For a single-GPU environment, the config will look as follows (as it follows a 0-based indexing):
+ # "device_ids": {"node_0": [0], "node_1": [0],"node_2": [0], "node_3": [0]}
"device_ids": {"node_0": [5], "node_1": [5],"node_2": [5], "node_3": [2]},
"samples_per_user": 500, #TODO: To model scenarios where different users have different number of samples
# we need to make this a dictionary with user_id as key and number of samples as value
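
As a sketch of how a `device_ids` mapping like the one above could be generated rather than typed by hand, the helper below spreads nodes over the available GPUs in round-robin order. It is illustrative only and not part of this commit: `make_device_ids` is a hypothetical name, and the round-robin placement is an assumption, since the config shown assigns GPUs manually.

```python
from typing import Dict, List


def make_device_ids(num_clients: int, gpu_ids: List[int]) -> Dict[str, List[int]]:
    """Map node_0 (the server) and each client node to one GPU, round-robin."""
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    num_nodes = num_clients + 1  # node_0 is the server
    return {
        f"node_{i}": [gpu_ids[i % len(gpu_ids)]]
        for i in range(num_nodes)
    }


# Single-GPU environment with 0-based indexing, as in the comment above.
print(make_device_ids(3, [0]))
```

With a single GPU this yields the same mapping as the commented example in `sys_config.py`; with more GPUs the nodes are distributed across them.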