# Parallelization for standalone mode

To help developers quickly verify their algorithms, we designed and implemented `StandaloneMultiGPURunner`, built on torch distributed data parallel (DDP). The new runner makes better use of the computing resources of multiple GPUs and accelerates training in the standalone mode of FederatedScope.

## When to use

Use `StandaloneMultiGPURunner` when your machine has **multiple GPUs (>=2)** and you need quick verification in **standalone mode**.
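
As a quick way to check the GPU requirement above, the following minimal sketch counts the visible GPUs. The helper name `count_gpus` is hypothetical and not part of FederatedScope; it assumes PyTorch is the backend and falls back to 0 when `torch` is not installed.

```python
def count_gpus() -> int:
    """Return the number of CUDA devices visible to PyTorch (0 if torch is missing)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


num_gpus = count_gpus()
# The multi-GPU runner is only worthwhile with at least two GPUs.
print(f"visible GPUs: {num_gpus}, multi-GPU runner applicable: {num_gpus >= 2}")
```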

## Configuration

Add the `federate.process_num` item to the configuration file to parallelize training.

> Note: `federate.process_num` only takes effect when `use_gpu=True`, `backend='torch'`, `federate.mode='standalone'` and `federate.share_local_model=False`. Its value must not exceed the number of GPUs.

```yaml
use_gpu: True
backend: 'torch'
device: 0
early_stop:
  patience: 5
seed: 12345
federate:
  mode: standalone
  client_num: 100
  total_round_num: 20
  sample_client_rate: 0.2
  share_local_model: False
  process_num: 4 # run 4 processes simultaneously
...
```
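
The conditions in the note can be expressed as a small pre-flight check. This is a sketch under stated assumptions, not FederatedScope's own validation code: the function `process_num_takes_effect` is hypothetical, and the configuration is represented as a plain dictionary mirroring the YAML keys.

```python
def process_num_takes_effect(cfg: dict, num_gpus: int) -> bool:
    """Return True if the documented conditions for `federate.process_num` hold."""
    fed = cfg.get("federate", {})
    process_num = fed.get("process_num", 1)
    return (
        cfg.get("use_gpu") is True
        and cfg.get("backend") == "torch"
        and fed.get("mode") == "standalone"
        and fed.get("share_local_model") is False
        # process_num must not exceed the number of GPUs
        and 1 <= process_num <= num_gpus
    )


# Dictionary form of the relevant keys from the YAML example (assumed layout).
cfg = {
    "use_gpu": True,
    "backend": "torch",
    "federate": {
        "mode": "standalone",
        "share_local_model": False,
        "process_num": 4,
    },
}

print(process_num_takes_effect(cfg, num_gpus=8))  # True: 4 processes fit on 8 GPUs
print(process_num_takes_effect(cfg, num_gpus=2))  # False: 4 processes exceed 2 GPUs
```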

## Use cases

Here we give an example to demonstrate the efficiency of `StandaloneMultiGPURunner` compared to `StandaloneRunner`. The configuration file and experiment result are listed below.
The experiment result shows that, with 8 GPUs, the total running time of `StandaloneMultiGPURunner` is only about 1/3 of that of `StandaloneRunner`.

```yaml
use_gpu: True
device: 0
early_stop:
  patience: 5
seed: 12345
federate:
  mode: standalone
  client_num: 100
  total_round_num: 10
  sample_client_rate: 0.4
  share_local_model: False
  # use StandaloneMultiGPURunner with 8 GPUs
  process_num: 8
  # use StandaloneRunner
  # process_num: 1

data:
  root: data/
  type: femnist
  splits: [0.6,0.2,0.2]
  batch_size: 10
  subsample: 0.05
  num_workers: 0
  transform: [['ToTensor'], ['Normalize', {'mean': [0.1307], 'std': [0.3081]}]]
model:
  type: convnet2
  hidden: 2048
  out_channels: 62
train:
  local_update_steps: 1
  batch_or_epoch: epoch
  optimizer:
    lr: 0.01
    weight_decay: 0.0
grad:
  grad_clip: 5.0
criterion:
  type: CrossEntropyLoss
trainer:
  type: cvtrainer
eval:
  freq: 10
  metrics: ['acc', 'correct']
```

| | StandaloneMultiGPURunner | StandaloneRunner |
| :---: | :---: | :---: |
| Total running time (minutes) | 0.2406 | 0.7292 |
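
To make the "about 1/3" figure concrete, the short calculation below derives the speedup from the two timings in the table. The per-GPU efficiency line uses the common definition (speedup divided by the number of GPUs), which is an added illustration rather than a metric reported in the experiment.

```python
# Timings from the table above, in minutes.
multi_gpu_minutes = 0.2406  # StandaloneMultiGPURunner, 8 GPUs
single_minutes = 0.7292     # StandaloneRunner
num_gpus = 8

speedup = single_minutes / multi_gpu_minutes
fraction = multi_gpu_minutes / single_minutes
efficiency = speedup / num_gpus  # assumed definition: speedup per GPU

print(f"speedup: {speedup:.2f}x")                 # speedup: 3.03x
print(f"fraction of baseline: {fraction:.2f}")    # fraction of baseline: 0.33
print(f"per-GPU efficiency: {efficiency:.2f}")    # per-GPU efficiency: 0.38
```

The speedup is well below 8x, so in this small experiment a sizeable share of the running time is presumably spent on work that does not scale with the number of GPUs.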