Some Additional Thoughts on DDP
The DDP docs say that you cannot run multiple DDP processes on one GPU (otherwise you would have to use their RPC framework, which is a bit too much hassle and complication, at least for me personally for now!).
Turns out you can. But the speed-up was negligible in my case:
- GPU utilization went from 70-80% with 1 process per GPU to 90-100% with 2;
- Total epoch time decreased by 3-5%;
- Interestingly, I compared 2 DDP workers on 2 GPUs vs 4 DDP workers on 2 GPUs vs 3 DDP workers on 2 GPUs (1 on master, 2 on the other GPU), and 3 workers were much slower, so it is probably a compute bottleneck, not a communication bottleneck (we will see with Ampere GPUs!);
- Following advice from Nvidia, I also tried MPS (which is supposed to help several processes run smoothly on one GPU), but I just could not make it work with DDP: it failed with cryptic errors, at first after torch.cuda.empty_cache() and then just randomly. Sad times;
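For reference, a minimal sketch of what "multiple DDP processes per GPU" means in practice: just map more than one rank to the same device index (rank % num_gpus). All names here (worker, rank_to_device) and the toy model are my own illustration, not from the DDP docs; this assumes a single node with CUDA and NCCL available.

```python
# Sketch: oversubscribing GPUs with DDP by assigning several ranks per device.
# Assumes a single machine with `num_gpus` CUDA devices and NCCL installed.
import os


def rank_to_device(rank: int, num_gpus: int) -> int:
    """Map a DDP rank to a GPU index, allowing several ranks per GPU."""
    return rank % num_gpus


def worker(rank: int, world_size: int, num_gpus: int):
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # With world_size > num_gpus, some ranks share a physical GPU.
    device = rank_to_device(rank, num_gpus)
    torch.cuda.set_device(device)

    model = torch.nn.Linear(10, 10).to(device)
    ddp_model = DDP(model, device_ids=[device])  # replica pinned to its (shared) GPU
    # ... regular training loop with ddp_model ...
    dist.destroy_process_group()


if __name__ == "__main__":
    import torch.multiprocessing as mp

    num_gpus = 2
    world_size = 4  # 2 DDP processes per GPU
    mp.spawn(worker, args=(world_size, num_gpus), nprocs=world_size)
```

Nothing else changes versus standard single-process-per-GPU DDP; the two replicas on one device simply contend for its compute, which matches the numbers above (utilization up, epoch time barely down).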
#deep_learning