Towards End-to-end ASR
Towards End-to-end ASR - an internal (?) presentation by Google https://drive.google.com/file/d/1Rpob1-C223L9UWTiLJ6_Dy12mTA3YyTn/view
This is such a huge corpus of work. Interesting conclusions:
- Google records your voice (at least from Google Assistant; unclear whether they also tap their "Phone" app) and uses this data for their models. Surprise, surprise!
- Obviously Google is pushing towards end-to-end ASR within one NN on a mobile device for a number of reasons:
(i) easier packaging
(ii) no requirement to run a large LM alongside the model
(iii) Google has a lot of data (end-to-end models mostly suffer from lack of data)
- 120MB total system size on the mobile device. This means AM + LM, which in this case is a single quantized RNN-T model (~4x compression, float32 => int8; a quantization sketch follows this list)
- They also write that hybrid systems with LM fusion / rescoring perform better overall (see the toy rescoring example after this list)
- The "best" cited solutions are not end-to-end, though
- Finally understood why they were pushing their RNN-T models instead of ~10x more frugal alternatives: old and well-optimized layers, hacks to speed up inference, unlimited resources, and better quality at the same training step. Also, LSTMs are known to be able to stand in for LMs
- Google also knows about the "Time Reduction Layer", but it looks like using it within an RNN is a bit painful - a lot of fiddling in the model logic (see the sketch after this list)
- Looks like, given unlimited resources, data, and compute, the easiest solution is to train large LSTMs in an end-to-end fashion (I also noticed that LSTMs have higher quality at the same step, but MUCH weaker speed and convergence overall in terms of time-to-accuracy), optimize them heavily, quantize, and deploy
- Sharing AMs / LMs for different dialects kind of works (maybe in terms of time-to-accuracy?), but direct tuning is better
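To make the ~4x figure from the 120MB bullet concrete, here is a minimal sketch of float32 => int8 post-training quantization using PyTorch's dynamic quantization. The encoder architecture and all sizes below are made up; only the quantization flow itself is the point:

```python
import os
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for an RNN-T-style LSTM encoder (all sizes are made up)."""
    def __init__(self, feats=240, hidden=1024, layers=5, vocab=4096):
        super().__init__()
        self.lstm = nn.LSTM(feats, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(hidden, vocab)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.proj(out)

def size_mb(model, path="tmp_size_check.pt"):
    torch.save(model.state_dict(), path)
    mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return mb

fp32 = TinyEncoder()
# Post-training dynamic quantization: weights are stored as int8,
# activations are quantized on the fly at inference time.
int8 = torch.quantization.quantize_dynamic(
    fp32, {nn.LSTM, nn.Linear}, dtype=torch.qint8
)
print(f"fp32: {size_mb(fp32):.0f} MB -> int8: {size_mb(int8):.0f} MB")  # ~4x
```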
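And a toy illustration of the "LM fusion / rescoring" point: in shallow fusion the model's score is interpolated with an external LM score in log space, score = log P_AM + lam * log P_LM. The hypotheses, scores, and stub LM below are all fabricated; lam would be tuned on a dev set:

```python
# Toy n-best rescoring with an external LM.
hyps = [
    ("turn of the lights", -4.0),  # (text, AM log-prob): acoustically best
    ("turn on the lights", -4.2),
]

def lm_logprob(text: str) -> float:
    # Placeholder for a real LM (n-gram or neural).
    return -3.0 if " of the lights" in text else -1.0

lam = 0.5  # fusion weight, a tuned hyper-parameter
best = max(hyps, key=lambda h: h[1] + lam * lm_logprob(h[0]))
print(best[0])  # the LM pulls the beam towards the grammatical hypothesis
```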
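As for the "Time Reduction Layer": the common variant just stacks every N consecutive frames along the feature axis, so the recurrent layers above run on a shorter sequence (as in pyramidal LAS-style encoders). A minimal sketch with made-up shapes; the "fiddling" is in keeping sequence lengths and states consistent around it:

```python
import torch
import torch.nn as nn

class TimeReduction(nn.Module):
    """Concatenate every `factor` consecutive frames along the feature axis,
    shrinking the time dimension by `factor`."""
    def __init__(self, factor: int = 2):
        super().__init__()
        self.factor = factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, f = x.shape
        t = (t // self.factor) * self.factor      # drop trailing frames
        return x[:, :t, :].reshape(b, t // self.factor, f * self.factor)

x = torch.randn(8, 100, 80)                       # (batch, frames, features)
reduce = TimeReduction(factor=2)
lstm = nn.LSTM(input_size=80 * 2, hidden_size=256, batch_first=True)
out, _ = lstm(reduce(x))
print(out.shape)                                  # torch.Size([8, 50, 256])
```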
But is a full 100% end-to-end pipeline feasible at any scale below Google?
Probably not. Unless you are Facebook.
Having a fully end-to-end pipeline will have OOV problems (even with BPE / word-piece tokens) and other issues - like bias towards the domains where you have audio. It will certainly NOT generalize to unseen new words and pronunciations (see the toy tokenizer example below).
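A toy illustration of the OOV point with sentencepiece (the corpus and the test word are fabricated; real models are trained on far more text): a word-piece model never hard-fails on an unseen word, but it shreds it into small pieces for which the acoustic model has seen no audio:

```python
import tempfile
import sentencepiece as spm

# Train a tiny BPE model on a fabricated three-line "corpus".
corpus = ["turn on the lights", "what is the weather", "set a timer"]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("\n".join(corpus))

spm.SentencePieceTrainer.train(
    input=f.name, model_prefix="toy_bpe", vocab_size=30, model_type="bpe")
sp = spm.SentencePieceProcessor(model_file="toy_bpe.model")

# "thermostat" never occurs in the corpus: no hard OOV, but it falls apart
# into tiny pieces the acoustic model has no evidence for.
print(sp.encode("thermostat", out_type=str))
```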
Meh.
But can you have extremely small mobile models?
Yes and no. Our latest small AM is targeting 200MB before quantization and probably 50MB after. Current production model is around 90MB (after quantization).
But can it serve instead of an LM?
Technically yes, but quality will suffer. Unlike Google, we do not have unlimited data, compute, and low-level engineers. On the other hand, fully neural post-processing / decoding w/o huge transformer-like models is more than feasible. So we will see =)