Spark in me - Internet, data science, math, deep learning, philosophy

@snakers4

All this - lost like tears in rain.
Data science, ML, a bit of philosophy and math. No bs.
Our website
- http://spark-in.me
Our chat
- https://t.me/joinchat/Bv9tjkH9JHbxiV5hr91a0w
DS courses review
- http://goo.gl/5VGU5A
- https://goo.gl/YzVUKf
Channel geo and language: Russia, Russian
Category: Technology
Added to index: 09.05.2017 23:31
1,824 subscribers
~1.1k reach per post
~373 daily reach
~3 posts / week
61.8% ERR
1.8 citation index
Reposts and channel mentions
8 channel mentions
30 post mentions
45 reposts
Нейронач
Just links
Нейронач
Physics Blues
Нейронач
Just links
Just links
Just links
Админим с Буквой
Main ML_KZ
Нейронач
Нейронач
RE:post
Нейронач
Just links
Нейронач
Нейронач
Нейронач
Админим с Буквой
Just links
Anscombe's Quartet
Админим с Буквой
Админим с Буквой
Food-stained hoodie
Anscombe's Quartet
Блог Шмакова
Anscombe's Quartet
Anscombe's Quartet
Machinelearning
Отраженный свет
Dato ML
Dato ML
DeepLearning ru
Dato ML
Dato ML
Dato ML
Channels cited by @snakers4
Just links
Just links
Just links
Just links
Neural Networks Engineering
Just links
Just links
Neural Networks Engineering
Neural Networks Engineering
Just links
Just links
Just links
Админим с Буквой
Вастрик.Пынь
Bird Born
Loss function porn
Just links
Just links
Админим с Буквой
Bird Born
NVIDIA
Админим с Буквой
Just links
Just links
Just links
Just links
Админим с Буквой
Hacker News
Just links
Админим с Буквой
NVIDIA
Just links
Hacker News
Just links
Админим с Буквой
Just links
Hacker News
Roem.ru
2ch/Двач
Админим с Буквой
Админим с Буквой
Админим с Буквой
Админим с Буквой
Админим с Буквой
Data Science
Just links
Linuxgram 🐧
addmeto
Arseniy's channel
Arseniy's channel
Latest posts
Repost from: dilyara_tchk
Jetsons are... not cool

The case
For those of you who might think Jetsons are cool... This is a pure rant, so be prepared or skip the post altogether.
Well, maybe they seemed cool until you finally tried to use them for edge computing. I don't mean a pet project in your garage like tracking your cat or chickens, I mean real applications: dealing with no network connection whatsoever, where any support work results in huge expenses. It's not comparable to sending humans into space, but mistakes are still more expensive than in your average case. I would describe dealing with Jetsons as pure frustration, simply because expectations do not meet reality. You expect a good, solid product well suited for its purpose. What you get is "nope, not even close".

Installation
From the very moment of installing the OS and SDK, the product feels just raw. The only one that is simple enough is the Nano, but we did not try it in production, so not much info to share. The TX2 and Xavier are a pain in the ass though. It took me five tries to install everything, and then halfway through it turned out we needed a sixth. I still wonder how you deal with updates. Not to mention that you need a dev account and a host Ubuntu machine (well, this one is tolerable) to install everything. After I went through registration, I didn't get the verification email, and then they blocked me, saying "check your email for unblocking instructions". They blocked me because I had not verified my email and tried logging in too many times. I got the verification email a couple of hours later, but never the one with unblocking instructions. So I don't have a dev account) Luckily, my colleague had one.

Production behavior
So far so good: I never went back to dealing with these machines, but our engineers did) To save you time: they still encounter surprises. The TX2 was rejected because the Xavier has fewer issues, but it is still a pain in the ass too. Even when everything works on a test stand, nothing works on site. Common troubles are related to power supply and autostart. There are also troubles finding good alternatives to common libraries and frameworks (or trying to fix and tweak existing ones) that have no Jetson version, sometimes because of the architecture.

Alternatives?
Why did we try the Jetson at all? Because an industrial PC with a 1060 costs $5k-$8k and takes a good 10 weeks to ship to Russia. In the limited time window we had, the Jetson dev kits seemed like a possible alternative. But the industrial version turned out to cost $5k with the same good 10 weeks to ship. Blegh.

Is it hopeless?
I hope they'll make Jetsons a really good product eventually, it just takes time. For now, they are definitely not one. Having qualified engineers deal with this box will ease some of the pain, I guess (not for the engineers though), but that's just ridiculous.
2019 DS / ML digest 15

Link

Highlights of the week(s):

- Facebook's upcoming deep fake detection challenge;
- Lyft competition on Kaggle;
- Waymo open-sources its data;
- Cool ways to deal with imbalanced data and noisy data;

#digest
#deep_learning
Repost from: Just links
https://arxiv.org/abs/1901.05555
https://github.com/vandit15/Class-balanced-loss-pytorch
Support Open STT

Now you can support Open STT on our github page via opencollective!
https://github.com/snakers4/open_stt

Opencollective seems to be the best GitHub-supported platform for now.

#dataset
Now they stack ... normalization!

Tough to choose between BN / LN / IN?
Now a stacked version with attention exists!
https://github.com/switchablenorms/Switchable-Normalization

Also, their 1D implementation does not work, but you can hack their 2D (actually BxCxHxW) layer to work with 1D (actually BxCxW) data =)
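A minimal sketch of that hack, assuming the repo's SwitchNorm2d class (the import path below is illustrative and may differ depending on how you vendor the code):

import torch
import torch.nn as nn
from switchable_norm import SwitchNorm2d  # hypothetical import path

class SwitchNorm1dHack(nn.Module):
    # run B x C x W data through the 2D layer by adding a dummy H dimension
    def __init__(self, num_features):
        super().__init__()
        self.sn = SwitchNorm2d(num_features)

    def forward(self, x):        # x: B x C x W
        x = x.unsqueeze(2)       # -> B x C x 1 x W
        x = self.sn(x)
        return x.squeeze(2)      # -> B x C x W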

#deep_learning
ML without train / val split

Yeah, I am not crazy. But this probably applies only to NLP.
Sometimes you just need your pipeline to be flexible enough to work with any possible "in the wild" data.

A cool and weird trick - if you can make your dataset so large that your model just MUST generalize to work on it, then you do not need a validation set.

If you sample data randomly and your data generator is good enough, each new batch is just random and can serve as validation.
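A toy sketch of the idea (model, loss_fn, optimizer and an effectively infinite infinite_loader are assumed placeholders): since the model has never seen the incoming batch, measuring the loss on it before the gradient step gives you a running "validation" signal for free.

import torch

for x, y in infinite_loader:
    with torch.no_grad():
        val_loss = loss_fn(model(x), y)  # fresh random batch = validation
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()      # now train on the same batch
    optimizer.step()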

#deep_learning
Poor man's ensembling techniques

So you want to improve your model's performance a bit.
Ensembling helps. But as is... it's useful only in Kaggle competitions, where people stack over9000 networks trained on 100MB of data.

But for real-life usage / production there exist ensembling techniques that do not require a significant increase in computation cost (!).
None of this is mainstream yet, but it may work on your dataset!
Especially if your task is easy and the dataset is small.

- SWA (proven to work, usually used as a last stage of training; see the sketch after this list);
- Lookahead optimizer (kind of new, not thoroughly tested);
- Multi-Sample Dropout (seems like a cheap ensemble, should work for classification);

Applicability will vary with your task.
Plain vanilla classification can use all of these, s2s networks probably only partially.
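A minimal SWA sketch, assuming torch.optim.swa_utils (shipped in newer PyTorch releases; the original torchcontrib package worked similarly). model, train_loader and loss_fn are placeholders:

import torch
from torch.optim.swa_utils import AveragedModel, SWALR

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
swa_model = AveragedModel(model)             # keeps a running average of weights
swa_scheduler = SWALR(optimizer, swa_lr=5e-3)

for epoch in range(100):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    if epoch >= 75:                          # the "last stage" mentioned above
        swa_model.update_parameters(model)
        swa_scheduler.step()

# recompute BatchNorm statistics for the averaged weights
torch.optim.swa_utils.update_bn(train_loader, swa_model)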

#data_science
#deep_learning
2019 DS / ML digest 14

Link

Highlights of the week(s):

- FAIR embraces embedding bags for misspellings;
- A new version of Adam - RAdam. But on the only real test the author ran (ImageNet), SGD is still better;
- Yet another LSTM replacement - SRU. Like QRNN, it requires additional dependencies;

#digest
#deep_learning
Sampler - visualization for any shell command

A cool mix of glances and Prometheus
https://github.com/sqshq/sampler

#linux
My foray into the STT Dark Forest

My tongue-in-cheek article on ML in general, and on how to make your STT model train 3-4x faster with 4-5x fewer weights at the same quality

https://spark-in.me/post/stt-dark-forest

#data_science
#deep_learning
#stt
Extreme NLP network miniaturization

Tried some plain RNNs on a custom in-the-wild NER task.
The dataset is huge - literally infinite, but manually generated to mimic in-the-wild data.

I use EmbeddingBag + 1m n-grams (an optimal cut-off). On NER / classification it is a handy trick that makes your pipeline totally misprint / error / OOV agnostic. FAIR themselves arrived at the same trick. Very cool! Just add PyTorch and you are golden.

What is interesting:
- The model works with embedding sizes 300, 100, 50 and even 5! 5 is dangerously close to OHE, but doing OHE on 1m n-grams kind of does not make sense;
- The model works with various hidden sizes;
- Naturally, all of the models run very fast on CPU, and the smallest model is also very light in terms of weights;
- The only difference is convergence time. It roughly scales with the log of model size, i.e. the model with embedding size 5 takes 5-7x longer to converge than the one with 50. I wonder what happens with an embedding size of 1?;

As an added bonus, you can just store such a miniature model in git without LFS.
What was that about training transformers on US$250k worth of compute credits, you say?)
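A minimal sketch of the trick, assuming fastText-style hashing of character n-grams into 1m buckets (all names and sizes here are illustrative):

import torch
import torch.nn as nn

NUM_BUCKETS = 1_000_000   # ~1m n-grams, as above
EMB_DIM = 50

def ngram_ids(word, n=3):
    # hash character n-grams into a fixed bucket range; a misprint only
    # changes a few n-grams, so the representation degrades gracefully
    # (note: Python's hash is salted per process; use a stable hash in production)
    grams = [word[i:i + n] for i in range(max(1, len(word) - n + 1))]
    return [hash(g) % NUM_BUCKETS for g in grams]

emb = nn.EmbeddingBag(NUM_BUCKETS, EMB_DIM, mode='mean')

words = ['hello', 'wrold']   # note the misprint
flat, offsets = [], []
for w in words:
    offsets.append(len(flat))
    flat.extend(ngram_ids(w))

vectors = emb(torch.tensor(flat), torch.tensor(offsets))
print(vectors.shape)  # torch.Size([2, 50]) - one vector per word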

#nlp
#data_science
#deep_learning
PyTorch 1.2 release

Link

Key features:
- Tensorboard logging is now out of beta;
- They continue improving JIT and ONNX;
- nn.Transformer is a layer now (see the sketch after this list);
- Looks like SyncBn is also more or less stable;
- nn.Embedding: support float16 embeddings on CUDA;
- AdamW;
- Numpy compatibility;
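A minimal sketch of the new layer with default settings (shapes are sequence-first; the sizes below are illustrative):

import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8)  # defaults to 6 encoder / 6 decoder layers
src = torch.rand(10, 32, 512)  # (source length, batch, d_model)
tgt = torch.rand(20, 32, 512)  # (target length, batch, d_model)
out = model(src, tgt)          # -> (20, 32, 512)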

#deep_learning
Using a public Dockerhub account for your private small-scale deploy

Also a lifehack - you can just use Dockerhub for your private stuff, just separate the public part from the private part.
Push the public part (i.e. libraries and frameworks) to Dockerhub.

Your private Dockerfile will then be something like:
# public base image with libraries and frameworks
FROM your_user/your_repo:latest

# copy only the private application code on top
COPY your_app_folder your_app_folder
COPY app.py app.py

EXPOSE 8000

CMD ["python3", "app.py"]
Managing your DS / ML environment neatly and in style

If you need a sophisticated environment for DS / ML / DL, then using a set of Docker images may be a good idea.
You can also tap into a vast community of amazing and well-maintained Dockerhub repositories (e.g. nvidia, pytorch).

But what if you have to do this for several people? And use it with a proper IDE via ssh?
Well-known features of Docker include copy-on-write and user "forwarding". If you approach this naively, each user will store their own images, which takes quite some space.
You also have to make your ssh daemon work inside the container as a second service.

So I solved these "challenges" and created 2 public layers so far:
- Basic DS / ML layer - FROM aveysov/ml_images:layer-0 - from dockerfile;
- DS / ML libraries - FROM aveysov/ml_images:layer-1 - from dockerfile;

Your final dockerfile may look something like this, just pulling from one of those layers.
Note that when building this, you will need to pass your UID as a variable, e.g.:

docker build --build-arg NB_UID=1000 -t av_final_layer -f Layer_final.dockerfile .

When launched, this starts a notebook with extensions. You can also just exec into the container itself to run scripts, or use an ssh daemon inside (do not forget to add your ssh key and run service ssh start).


#deep_learning
#data_science
2019 DS / ML digest 13

Link

Highlights of the week(s):
- A 10x faster STT network?
- Train on 1/2 of the test resolution - a new down-to-earth SOTA approach to image classification? Old news!;
- A new workhorse light-weight network - MixNet?

#digest
#deep_learning
An ideal remote IDE?

Joking?
No, it looks like VSCode recently got its remote development extensions (insiders-build only a couple of months ago) working just right.

I tried the Remote-SSH extension and it looks quite polished. No more syncing large data folders or spending hours loading all your python dependencies locally.

The problem? It took me an hour just to open an ssh session properly under Windows (permissions and Linux folder path substitution are hell on Windows). Once I opened it, it worked like a charm.

So for now (this is personal) the best tools in my opinion are:
- Notebooks - for exploration and testing;
- VSCode - for the codebase;
- Atom - for local scripts;

#data_science
If you know how to add your python kernel to Theia - please ping me)
Full IDE in a browser?

Almost)

You all know all the pros and cons of:
- IDEs (PyCharm);
- Advanced text editors (Atom, Sublime Text);
- Interactive environments (notebook / lab, Atom + Hydrogen);

I personally dislike local IDEs. Not because connecting to a remote machine / kernel / interpreter is a chore - setting that up is easy - but because constantly thinking about what is synced and what is not is just pain. Also, when your daily-driver machine runs Windows, using the Linux subsystem all the time with Windows paths is just pain. (I also dislike bulky interfaces, but that is just a habit and it depends.)

But what if I told you there is a third option? =)
If you work as a team on a remote machine / set of machines?

TLDR - you can run a modern web "IDE" (something between Atom and a real IDE - less bulky, but with fewer functions) in a browser.
Now you can just run it with one command.

Pros:
- It is open source (though shipped as part of some enterprise packages like Eclipse Che);
- Pre-built images are available;
- It is extensible - new modules get released - you can build them yourself or just find a build;
- It has extensive linting and a python language server (just the standard library though);
- It has full-text search... kind of;
- Follow-definition works in your code;
- Docstrings and auto-complete work for your modules and the standard library (not for your packages);

Looks cool af!
If they ship a build with a remote python kernel, then it will be a perfect option for teams!

I hope it will not follow the path taken by another similar crowd-favourite web editor (it was purchased by Amazon).

Links
- Website;
- Pre-built apps for python;
- Language server they are using;

#data_science