Spark in me - Internet, data science, math, deep learning, philosophy

@snakers4

All this - lost like tears in rain.
Data science, ML, a bit of philosophy and math. No bs.
Our website
- http://spark-in.me
Our chat
- https://t.me/joinchat/Bv9tjkH9JHbxiV5hr91a0w
DS courses review
- http://goo.gl/5VGU5A
- https://goo.gl/YzVUKf
Channel geo and language
Russia, Russian
Category
Technology


Added to index
09.05.2017 23:31
2 441 subscribers
~954 reach per post
~574 daily reach
~5 posts / week
39.1% ERR
5.13 citation index
Reposts and channel mentions
15 channel mentions
31 post mentions
61 reposts
Transhumanism_in_our_hearts
TranshumBlock
TranshumBlock
TranshumBlock
Data Science by ODS.ai
Just links
Neural Networks Engineering
Блог Шмакова
Блог Шмакова
Just links
Just links
Data Science by ODS.ai
Data Science by ODS.ai
Just links
Data Science by ODS.ai
Data Science by ODS.ai
Data Science by ODS.ai
Нейронач
Just links
Нейронач
Physics Blues
Нейронач
Just links
Just links
Just links
Админим с Буквой
Нейронач
Нейронач
RE:post
Нейронач
Just links
Нейронач
Нейронач
Нейронач
Админим с Буквой
Just links
Anscombe's Quartet
Админим с Буквой
Админим с Буквой
Food-stained hoodie
Channels cited by @snakers4
Silero API news
Silero API news
Profunctor Jobs
NVIDIA Inception
Data Science by ODS.ai
Profunctor Jobs
Data Science by ODS.ai
DL in NLP
Data Kitchen
DL in NLP
DL in NLP
Just links
Hacker News
Data Science by ODS.ai
Silero API news
Silero API news
Silero API news
Silero API news
Silero API news
Silero API news
Silero API news
Silero API news
Silero API news
Матчасть
addmeto
OpenAI
Just links
Just links
DL in NLP
Just links
Just links
Just links
Just links
Neural Networks Engineering
Just links
Just links
Neural Networks Engineering
Neural Networks Engineering
Just links
Just links
Just links
Админим с Буквой
Вастрик.Пынь
Bird Born
Loss function porn
Latest posts
Translate into English?
Poll
  • Yes
  • No
  • Google Translate works fine
110 votes
Trying Out New Ampere GPUs and MIG (RU)

Playing with the new Ampere-based GPUs from Nvidia and trying out MIG

https://habr.com/ru/post/530986/

Please like / share / repost!

#hardware
#deep_learning
First Experience With A100 GPUs

(0)
Under 100% load they are indeed 15-20 degrees cooler, i.e. 60 - 70C (similar to 3090).

(1)
./gpu_burn 120

- 1080 Ti 8,000-8,500
- Titan X (Maxwell) ~4,300
- 3090 (Ampere) ~16,500
- A100 (w/o MIG) ~16,700 Gflop/s

./gpu-burn -tc 120

- 3090 (Ampere) ~38,500
- A100 (w/o MIG) ~81,500 Gflop/s
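
To see a similar FP32 vs tensor-core gap from PyTorch itself, a plain matmul timing is enough; this is a rough sanity-check sketch (sizes and iteration counts are arbitrary), not the gpu-burn methodology:

import time
import torch


def bench_matmul(n=8192, iters=20, dtype=torch.float32, allow_tf32=False):
    # on Ampere, allow_tf32=True routes FP32 matmuls through the tensor cores
    torch.backends.cuda.matmul.allow_tf32 = allow_tf32
    a = torch.randn(n, n, device='cuda', dtype=dtype)
    b = torch.randn(n, n, device='cuda', dtype=dtype)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize()
    return 2 * n ** 3 * iters / (time.time() - start) / 1e9   # rough Gflop/s


print('FP32 :', bench_matmul(allow_tf32=False))
print('TF32 :', bench_matmul(allow_tf32=True))
print('FP16 :', bench_matmul(dtype=torch.float16))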

(2)
Using MIG is kind of straightforward, but obviously it does not work properly with gpu-burn out of the box.

Obviously, the most interesting thing is to test 2-, 3- and 7-instance MIG setups against 2x 3090 / 1080 Ti / Titan X.
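
For reference, a single MIG slice can be targeted from PyTorch via CUDA_VISIBLE_DEVICES; the identifier below is a placeholder (the real ones are printed by nvidia-smi -L), so treat this as a sketch rather than a recipe:

import os

# placeholder MIG device identifier, list the real ones with `nvidia-smi -L`
os.environ['CUDA_VISIBLE_DEVICES'] = 'MIG-GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/1/0'

import torch

print(torch.cuda.device_count())      # 1 - only the selected MIG slice is visible
print(torch.cuda.get_device_name(0))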

#deep_learning
2020 DS / ML Digest 13

Highlights:

- Silero models now include an experimental Ukrainian model
- CV inference 101
- High-Resolution 3D Human Digitization
- Background Features in Google Meet
- How to Build an Open-Domain Question Answering System?
- A case for … Keeping encryption elitist
- Objectron dataset
- See the above posts about 3090 ... and hopefully new posts comparing Titan X / 1080 Ti / 3090 / A100 =)

Please like / share / repost!

https://spark-in.me/post/2020_ds_ml_digest_13

#digest
Some More Observations About 3090

- torch.cuda.empty_cache() does not seem to do anything for networks with variable depth / sequence length / girth (see the memory-stats sketch after this list)

- DDP + AMP ... seems 3x slower instead of 2x faster (lol) for some networks; we are still looking for the cause

- For some networks, 2x speed bump using AMP out of the box

- Now DDP prevents me from using 2 processes on 1 GPU, failing with:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729096996/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

- Looks like they are much more efficient at parallelizing and keeping utilization high (80-100%); the same networks train ~2-3x faster compared to Titan X (Maxwell) and 1080 Ti without any tweaks to the code

- The same networks use more memory on the 3090 than on the 1080 Ti (?)

- I was kind of afraid that these cards would be under-utilized (50%), but they are just faster. Magic
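
As for empty_cache(), here is a minimal illustration of what it can and cannot release: referenced tensors stay allocated, only the cached-but-free blocks go back to the driver. A generic sketch, not the networks from the post:

import torch

x = [torch.randn(1024, 1024, device='cuda') for _ in range(100)]   # ~400 MB allocated

print(torch.cuda.memory_allocated() // 2**20, 'MiB allocated')
print(torch.cuda.memory_reserved() // 2**20, 'MiB reserved (allocator cache)')

del x                          # tensors go back to the caching allocator, not to the driver
torch.cuda.empty_cache()       # only now the cached blocks are actually released

print(torch.cuda.memory_allocated() // 2**20, 'MiB allocated')
print(torch.cuda.memory_reserved() // 2**20, 'MiB reserved')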


#deep_learning
Has anyone used this - https://t.me/snakers4/2351 - and is it worth an update (also plz comment)?
Poll
  • Yes
  • Just dockerfiles
  • No
  • What is this?
56 votes
Update

After migrating to CUDA 11 and CUDNN 8, now:

./gpu_burn 120

- 1080 Ti 8,000-8,500
- Titan X (Maxwell) ~4,300
- 3090 (Ampere) ~16,500

./gpu-burn -tc 120

- 3090 (Ampere) ~38,500

Magic
First Experience With 3090 GPUs

(0)
Under 100% load they are indeed 15-20 degrees cooler.

(1)
Lol, gpu-burn shows strange results with the default settings - about half the Gflops of a 1080 Ti

./gpu_burn 600:

- 1080 Ti 8,000-8,500
- Titan X (Maxwell) ~4300
- 3090 (Ampere) ~3000

./gpu-burn -tc 600
- 3090 (Ampere) ~3000

Idk, maybe it's me, maybe it's gpu-burn; need to test on real tasks!

PS
I had an old image, maybe bumping CUDA / CUDNN will help.

#deep_learning
#интересно
Turns out, iPavlov is no more...
https://www.facebook.com/olga.kairova/posts/10157719960593034
Some Additional Thoughts on DDP

The DDP docs say that you cannot use multiple DDP processes on one GPU (otherwise you would have to use their RPC framework, which is a bit too much hassle and complication, at least for me personally for now!).

Turns out you can (a rough sketch of the rank-to-device mapping is below). But the speed-up was negligible in my case:

- GPU utilization went from 70-80% with 1 process per GPU to 90-100%;
- Total epoch time decreased by 3-5%;
- Interestingly, I tried 2 DDP workers on 2 GPUs vs 4 DDP workers on 2 GPUs and 3 DDP workers on 2 GPUs (1 on the master GPU, 2 on the other one), and 3 workers were much slower, so it is probably a compute bottleneck, not a communication bottleneck (we will see with Ampere GPUs!);
- Following advice from Nvidia, I also tried MPS (which is supposed to help several processes run smoothly on one GPU), but I just could not make it work with DDP; it failed with cryptic errors, at first after cuda.empty_cache() and then just randomly. Sad times;
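
For reference, the "several workers per GPU" layout above boils down to mapping ranks onto devices modulo the GPU count; a minimal sketch (the wrapper is an assumption, not the actual training code):

import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_model(model, rank):
    # e.g. 4 DDP workers over 2 GPUs: ranks 0 / 2 share cuda:0, ranks 1 / 3 share cuda:1
    device = rank % torch.cuda.device_count()
    torch.cuda.set_device(device)
    return DDP(model.to(device), device_ids=[device])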

#deep_learning
Reposted from: Silero API news
2020-11-03 [Experimental] Ukrainian Model V1 Released

- An experimental model
- Trained from a small community contributed corpus
- New: full model size reduced to 85 MB
- New: quantized model is only 25 MB
- No TF or ONNX models
- Will be re-released as a model fine-tuned from a larger Russian corpus upon the V3 release

https://github.com/snakers4/silero-models
Reposted from: Silero API news
Silero Models EN V2 Released

Almost forgot to announce it!

- New EN V2 model - https://github.com/snakers4/silero-models/issues/20#issuecomment-720932378
- Quality benchmarks - https://github.com/snakers4/silero-models/wiki/Quality-Benchmarks#en-v2

A minor release, i.e. other models are not affected.

The English model was made much more robust to certain dialects and should generalize much better overall.
Trying PyTorch DDP Again

Just a quick note. DDP expects a gradient / backward pass on each worker (or on none of them); otherwise it hangs.

So do not forget to use grad scaler with native PyTorch AMP.

In my particular case, DDP worked well with AMP, but when I added the grad scaler it stopped exploding / de-syncing and started converging even faster. If only I had GPUs with FP16 support =)

I guess nice work, Nvidia?
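
For reference, a minimal sketch of native AMP with the grad scaler (dummy model and data, not the actual training code; with DDP the model would additionally be wrapped in DistributedDataParallel):

import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(512, 10).cuda()                    # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()
scaler = GradScaler()

for _ in range(100):
    x = torch.randn(32, 512, device='cuda')                # dummy batch
    y = torch.randint(0, 10, (32,), device='cuda')
    optimizer.zero_grad()
    with autocast():                                       # mixed-precision forward pass
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()                          # scale to avoid FP16 grad underflow
    scaler.step(optimizer)                                 # unscales grads, skips step on inf / nan
    scaler.update()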

#deep_learning
Torch Dataloader With Workers Leaking RAM

Everyone has faced this issue with HUGE datasets. It is just because of Python itself. If you have faced it, you know what I am talking about.

I do not claim this to be a definitive solution, but it worked for me.

import time
import torch
import random
import string
from multiprocessing import Manager
from torch.utils.data import Dataset, DataLoader


def id_gen(size=6,
           chars=string.ascii_uppercase):
    return ''.join(random.choice(chars)
                   for _ in range(size))


class DataIter(Dataset):
    def __init__(self):
        m = Manager()
        self.data = m.dict({i: {'key': random.random(),
                                'path': id_gen(size=10)}
                            for i in range(1000000)})

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        data = self.data[idx]
        return torch.tensor(data['key']), data['path']


train_data = DataIter()

train_loader = DataLoader(train_data,
                          batch_size=60,
                          shuffle=False,
                          drop_last=False,
                          pin_memory=False,
                          num_workers=10)

tic = time.time()

for i, item in enumerate(train_loader):
    if (i + 1) % 1000 == 0:
        toc = time.time()
        print(f"Time for 1000 batches in {toc - tic} s")
        tic = time.time()

Be careful with the Manager dict though. Although it behaves like a dict, iterating over its keys is slow, because every access involves some inter-process communication overhead.

If you just need the whole dict, it has methods to fetch it as one big object in a single call, which is fast (see the sketch below).
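
For example, a small sketch assuming the DataIter instance from the snippet above (copy() and items() are standard DictProxy methods):

# one inter-process round-trip instead of one per key
local_data = train_data.data.copy()       # plain dict snapshot of the Manager dict
for key, value in local_data.items():     # iterate locally, no IPC per item
    pass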

#pytorch
#deep_learning
Trying PyTorch DDP

DDP = DistributedDataParallel
DP = DataParallel

I am a bit late to the party (PyTorch now even has its own "redis" key-value DB analog, its own RPC framework and numerous bells and whistles ... likely targeted at enterprises with over 9000 GPUs), but let me write down my first impressions here.

I was usually able to optimize my models and code so that they did not require 4+ GPUs (DDP becomes essential beyond 4-5 GPUs; for 2-3 it does not really matter and DP just works; for 4 it is arguable):

- Docs are detailed, simple and clean

- Examples in the docs ... are just too plain, but there are guides now, which are also a bit simplistic

- The best way to start is to find some high-quality boilerplate. There is a lot of shitty boilerplate written in 2018 - PyTorch has evolved and polished its interfaces since, so look out for fresh boilerplate (check the last update and cross-reference the API invocations)

- Looks like DDP is not the most popular feature, but I did not really face the issues everyone claimed to face (hangs and freezes, failure to kill the processes gracefully)

Turning Your DP Script into a DDP

- Your code has to be properly structured and refactored - then migrating to DDP becomes a weekend project tops

- You need to understand the concepts of rank, world size, communication backend, gradient synchronization

- They finally included it in the docs - use the NCCL backend for distributed GPU training and the Gloo backend for distributed CPU training

- You need to pass an is_leader param to your logging functions to suppress some logging and checkpointing on non-master processes (rank > 0); each process has an almost exact copy of the model anyway

- Do not forget to use barrier() to avoid hangs and for more transparent syncing (see the small sketch after this list)

- You need to rewrite your main function to accept rank and args

- You need to spawn several processes using the provided utils and set up the process communication, i.e. something like:

import torch
import torch.distributed as dist


def setup_distributed(rank, args):
    dist.init_process_group(backend=args.ddp.dist_backend,
                            rank=rank,
                            init_method=args.ddp.dist_url,
                            world_size=args.ddp.world_size)


def spawn_main(main, args):
    if args.ddp.enabled:
        torch.multiprocessing.spawn(
            main, args=(args,), nprocs=args.ddp.world_size, join=True
        )
    else:
        main(0, args)

- I am still not exactly sure why, but the best boilerplate does .to(device, non_blocking=True) instead of .to(device)
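
As for barrier(), a tiny sketch of the typical pattern, assuming the process group from setup_distributed() above is already initialized; the helper and cache path are made up:

import torch.distributed as dist


def maybe_prepare_data(rank, cache_path='data.cache'):
    # only rank 0 builds the cache, all other ranks wait at the barrier
    if rank == 0:
        with open(cache_path, 'w') as f:   # stand-in for real preprocessing
            f.write('preprocessed')
    dist.barrier()                         # sync before any rank reads the cache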

Is it faster?

In my case, technically yes (but it has nothing to do with the reasons why people usually use DDP). In the general case, it just solves the bottleneck issues that arise when you have 6-8+ GPUs.

So optimize, refactor and profile your code first, and only then, if you still see some unsolvable issues or you need over 9000 GPUs, switch to DDP.

Is It Worth it?

100% for 6-8 GPUs.
It depends for 2-5 GPUs.
If your code is properly written, then there is little difference for 2-4 GPUs.

Major Design Drawbacks

DDP implies 1 GPU (at least) per process.
You can have 1+ GPUs per process.
You cannot share 1 GPU between 2 processes.
To do so, you would need an Ampere GPU with MIG (multi-instance GPU), but it is still not clear whether the 3090 or Quadro GPUs will have it.

(I hope team Red will catch up here as well soon!)

Going Deeper

For now I opted for just slicing my train datasets into N parts, as easily as dataset[rank::world_size] (a small sketch below), but you can use the provided `key-value` stores for some advanced syncing; in that case you would really have to care about the seeds for the random number generators (and double the memory footprint).
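
A minimal sketch of that per-rank slicing (the helper name is made up; torch.utils.data.distributed.DistributedSampler does roughly the same thing, with per-epoch shuffling built in):

from torch.utils.data import Subset


def shard_dataset(dataset, rank, world_size):
    # every process gets every world_size-th sample, i.e. dataset[rank::world_size]
    return Subset(dataset, list(range(rank, len(dataset), world_size)))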

Also trying their RPC framework would be nice, but too much work for me.

#deep_learning
#pytorch
Spark in me - Internet, data science, math, deep learning, philosophy
PyTorch NLP best practices Very simple ideas, actually. (1) Multi GPU parallelization and FP16 training Do not bother reinventing the wheel. Just use nvidia's apex, DistributedDataParallel, DataParallel. Best examples [here](https://github.com/huggingface/pytorch-pretrained-BERT). (2) Put as much as possible INSIDE of the model Implement as much of your logic as possible inside of nn.module. Why? So that you can seamlessly use all the abstractions from (1) with ease. Also models are more abstract and reusable in general. (3) Why have a separate train/val loop? PyTorch 0.4 introduced context handlers. You can simplify your train / val / test loops, and merge them into one simple function. context = torch.no_grad() if loop_type=='Val' else torch.enable_grad() if loop_type=='Train': model.train() elif loop_type=='Val': model.eval() with context: for i, (some_tensor) in enumerate(tqdm(train_loader)): # do your stuff here pass (4) EmbeddingBag Use EmbeddingBag layer for…
https://youtu.be/CyfBEULRprM
Radeon 6000, Ryzen 5000 and other October news | InfoCAST #037
In this episode: new products from AMD - new GPUs and CPUs, images of Intel CPUs on LGA 1700, VIA transferred its x86 CPU license, Intel is getting rid of its memory manufacturing, news on Windows 10X, Windows on ARM and 64-bit application support, expected Windows 10 features.
0:00 Intro
0:17 Ryzen 5000 series
1:17 On the fate of the Ryzen 4000 series
2:37 What about Intel now?
3:54 Lakefield and Windows 10X
4:32 Developments around the ARM architecture
4:55 64-bit support on Windows 10 on ARM
5:10 New Apple products on ARM processors
5:40 New Windows 10 features in upcoming updates
6:25 On LGA 1700
7:24 Intel is getting rid of its SSD business
7:58 VIA transferred its license for x86-compatible CPUs
8:48 On the Radeon 6000 series
9:13 Overview of the three new cards
9:51 On the new cards' memory subsystem
10:11 On rendering and output latency
10:21 On the new features of the cards
10:56 Power consumption and efficiency of the cards
11:10 My distrust of AMD's official numbers
14:43 Summary on the cards
15:54…