Trying PyTorch DDP
DDP = DistributedDataParallel
DP = DataParallel
I am a bit late to the party (PyTorch now even has its own Redis-like key-value store, its own RPC framework and numerous other bells and whistles, likely targeted at enterprises with over 9000 GPUs), but let me write down my first impressions here.
Until now I was usually able to optimize my models and code so as not to require 4+ GPUs (DDP becomes essential at 4-5+ GPUs; for 2-3 it does not really matter and DP just works; for 4 it is arguable):
- Docs are detailed, simple and clean
- Examples in the docs ... are just too plain, but there are guides now, which are also a bit simplistic
- The best way to start is to find some high-quality boilerplate. There is lots of shitty boilerplate written around 2018 - PyTorch has evolved and polished its interfaces since, so look out for fresh boilerplate (check the last update date and cross-reference the API invocations)
- Looks like DDP is not the most popular feature, but I did not really face the issues everyone claimed to face (hangs and freezes, failure to kill the processes gracefully)

Turning Your DP Script into DDP
- Your code has to be properly structured and refactored; then migrating to DDP becomes a weekend project at most
- You need to understand the concepts of rank, world size, communication backend and gradient synchronization (see the sketch after this list)
- These concepts are finally covered in the docs
- Use the NCCL backend for distributed GPU training and the Gloo backend for distributed CPU training
- You need to pass an is_leader param to your logging functions to suppress logging and checkpointing on non-master processes (rank > 0); each process holds an almost identical copy of the model anyway
- Do not forget to use barrier() to avoid hangs and for more transparent syncing
- You need to rewrite your main function to accept rank and args
- You need to spawn several processes using the provided utils and set up inter-process communication, i.e. something like:
import torch.distributed as dist
import torch.multiprocessing as mp

def setup_distributed(rank, args):
    # assumes MASTER_ADDR / MASTER_PORT are set in the environment
    dist.init_process_group("nccl", rank=rank, world_size=args.ddp.world_size)

def spawn_main(main, args):
    # one process per GPU; each process calls main(rank, args)
    mp.spawn(main, args=(args,), nprocs=args.ddp.world_size, join=True)
- I am still not exactly sure why, but the best boilerplate does .to(device, non_blocking=True) instead of .to(device)
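To make the bullets above concrete, here is a rough sketch of what main(rank, args) can look like after the migration; build_model, make_loader, loss_fn and log_metrics are made-up placeholders for your own code, and args.ddp.world_size mirrors the config used above. Treat it as an illustration of the rank / is_leader / barrier mechanics, not as canonical boilerplate.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main(rank, args):
    setup_distributed(rank, args)             # init_process_group, as defined above
    is_leader = rank == 0                     # only rank 0 logs and checkpoints

    torch.cuda.set_device(rank)
    device = torch.device("cuda", rank)

    # build_model() is a placeholder for your own model factory
    model = DDP(build_model().to(device), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    for x, y in make_loader(rank, args):      # your per-rank data loader
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)

        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                       # gradient all-reduce happens here
        optimizer.step()

        if is_leader:
            log_metrics(loss.item())          # suppressed on rank > 0

    dist.barrier()                            # wait for every rank before saving
    if is_leader:
        torch.save(model.module.state_dict(), "checkpoint.pt")
    dist.destroy_process_group()
```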
Is it faster?
In my case, technically yes (but this has nothing to do with the reasons people usually use DDP). In the general case it simply solves the bottleneck issues that arise from having 6-8+ GPUs.
So you should optimize, refactor and profile your code first, and only then, if you see some unsolvable issues or you really need over 9000 GPUs, switch to DDP.
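As for the non_blocking=True bullet above: as far as I understand, .to(device, non_blocking=True) only really overlaps the host-to-device copy with computation when the batch sits in pinned (page-locked) memory, which is why it usually comes paired with pin_memory=True in the DataLoader. A toy sketch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda", 0)
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

# pin_memory=True keeps batches in page-locked host RAM, so the
# non_blocking copies below can overlap with GPU computation
loader = DataLoader(dataset, batch_size=64, num_workers=2, pin_memory=True)

for x, y in loader:
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    # ... forward / backward as usual
```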
Is It Worth it?
100% for 6-8 GPUs.
It depends for 2-5 GPUs.
If your code is properly written, then there is little difference for 2-4 GPUs.

Major Design Drawbacks
DDP implies at least 1 GPU per process. You can have more than one GPU per process, but you cannot share one GPU between two processes.
To do so, you would need an Ampere GPU with Multi-Instance GPU (MIG) support, and it is still not clear whether the 3090 or Quadro GPUs will have it.
(I hope team Red will catch up here as well soon!)

Going Deeper
For now I opted for just splicing my train datasets into N parts, as easily as dataset[rank::world_size]. You can also use the provided `key-value` stores for more advanced syncing, but in that case you really have to take care of the seeds for your random number generators (and it can also double the memory footprint).
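A sketch of what that splicing plus per-rank seeding can look like (shard_dataset and seed_everything are made-up helper names); the built-in torch.utils.data.distributed.DistributedSampler does roughly the same job if you prefer not to slice by hand:

```python
import random
import numpy as np
import torch
from torch.utils.data import Subset

def shard_dataset(dataset, rank, world_size):
    # same idea as dataset[rank::world_size]: each rank sees its own slice
    return Subset(dataset, list(range(rank, len(dataset), world_size)))

def seed_everything(rank, base_seed=42):
    # offset the seed by rank so shuffling / augmentations differ per process;
    # DDP broadcasts rank 0's weights at construction, so the model init still matches
    random.seed(base_seed + rank)
    np.random.seed(base_seed + rank)
    torch.manual_seed(base_seed + rank)
```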
Also trying their RPC framework would be nice, but too much work for me.