
# Parallelization for standalone mode

To help developers quickly verify their algorithms, we designed and implemented `StandaloneMultiGPURunner` using PyTorch distributed data parallel (DDP). The new runner makes better use of the computing resources of multiple GPUs and accelerates training in the standalone mode of FederatedScope.

## When to use
Use `StandaloneMultiGPURunner` when your machine has **multiple GPUs (>=2)** and you need quick verification in **standalone mode**.


## Configuration

Add the `federate.process_num` item to the configuration file to parallelize training.

> Note: `federate.process_num` only takes effect when `use_gpu=True`, `backend='torch'`, `federate.mode='standalone'` and `federate.share_local_model=False`. Its value must not exceed the number of GPUs.

```yaml
use_gpu: True
backend: 'torch'
device: 0
early_stop:
  patience: 5
seed: 12345
federate:
  mode: standalone
  client_num: 100
  total_round_num: 20
  sample_client_rate: 0.2
  share_local_model: False
  process_num: 4 # run 4 processes simultaneously
...
```
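The conditions in the note above can be expressed as a small pre-flight check. The following is a minimal sketch, not part of FederatedScope: the helper `process_num_takes_effect`, the plain-dict config layout, and the explicit `num_gpus` argument are all hypothetical and only mirror the YAML keys shown above. Treating `process_num: 1` as falling back to `StandaloneRunner` follows the comment in the use-case configuration.

```python
def process_num_takes_effect(cfg: dict, num_gpus: int) -> bool:
    """Hypothetical helper: check the preconditions listed in the note.

    `cfg` is a plain dict mirroring the YAML keys; `num_gpus` is passed in
    explicitly so the check runs without a GPU or PyTorch installed.
    """
    federate = cfg.get("federate", {})
    process_num = federate.get("process_num", 1)

    preconditions = (
        cfg.get("use_gpu") is True
        and cfg.get("backend") == "torch"
        and federate.get("mode") == "standalone"
        and federate.get("share_local_model") is False
    )
    if not preconditions:
        # process_num is ignored unless all four settings match.
        return False

    if process_num > num_gpus:
        raise ValueError(
            f"federate.process_num={process_num} exceeds the "
            f"{num_gpus} available GPU(s)"
        )

    # Assumption: a value of 1 means the single-process runner is used.
    return process_num > 1


example_cfg = {
    "use_gpu": True,
    "backend": "torch",
    "federate": {
        "mode": "standalone",
        "share_local_model": False,
        "process_num": 4,
    },
}

print(process_num_takes_effect(example_cfg, num_gpus=4))
```

On a real machine, the GPU count could be taken from `torch.cuda.device_count()` instead of being passed by hand.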

## Use cases

The following example demonstrates the efficiency of `StandaloneMultiGPURunner` compared to `StandaloneRunner`. The configuration file and experiment result are listed below.
The result shows that, with 8 GPUs, the total running time of `StandaloneMultiGPURunner` is only about 1/3 of that of `StandaloneRunner`.

```yaml
use_gpu: True
device: 0
early_stop:
  patience: 5
seed: 12345
federate:
  mode: standalone
  client_num: 100
  total_round_num: 10
  sample_client_rate: 0.4
  share_local_model: False
  # use StandaloneMultiGPURunner with 8 GPUs
  process_num: 8
  # use StandaloneRunner
  # process_num: 1

data:
  root: data/
  type: femnist
  splits: [0.6,0.2,0.2]
  batch_size: 10
  subsample: 0.05
  num_workers: 0
  transform: [['ToTensor'], ['Normalize', {'mean': [0.1307], 'std': [0.3081]}]]
model:
  type: convnet2
  hidden: 2048
  out_channels: 62
train:
  local_update_steps: 1
  batch_or_epoch: epoch
  optimizer:
    lr: 0.01
    weight_decay: 0.0
grad:
  grad_clip: 5.0
criterion:
  type: CrossEntropyLoss
trainer:
  type: cvtrainer
eval:
  freq: 10
  metrics: ['acc', 'correct']
```

| Metric | StandaloneMultiGPURunner | StandaloneRunner |
| :---: | :---: | :---: |
| Total running time (minutes) | 0.2406 | 0.7292 |
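
As a quick check of the "about 1/3" figure, the speedup and per-GPU efficiency can be computed directly from the two running times in the table. This is only arithmetic on the reported numbers; the variable names are illustrative, and the efficiency figure assumes the single-process baseline used one GPU.

```python
# Running times in minutes, taken from the table above.
multi_gpu_minutes = 0.2406   # StandaloneMultiGPURunner, process_num: 8
single_minutes = 0.7292      # StandaloneRunner, process_num: 1
num_gpus = 8

speedup = single_minutes / multi_gpu_minutes
time_ratio = multi_gpu_minutes / single_minutes
efficiency = speedup / num_gpus  # fraction of ideal linear scaling

print(f"speedup: {speedup:.2f}x")
print(f"time ratio: {time_ratio:.2f}")
print(f"parallel efficiency: {efficiency:.0%}")
```

The speedup is roughly 3x rather than 8x, so scaling in this short run (10 rounds, subsampled FEMNIST) is well below linear.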