I know this issue is a bit late this week. But hey, better late than never!👻 I am glad with how it turned out, and as the saying goes -
If it's worth doing, it's worth doing right!
In this issue, I talk about how Roblox managed to speed up BERT inference by 30x, why and how automatic mixed precision is useful, and how ONNX Runtime is going to speed up inference and put Transformer models in production.
Hope you enjoy it!✌🏻
NLP models keep getting larger, and in turn their inference times are growing beyond what production can tolerate. It has become necessary to hunt for viable ways to deploy these power-hungry models. Today we are going to look at some resources/techniques that come in handy for exactly this purpose.
Scaling BERT to serve 1+ billion daily requests on CPU! (Roblox)👾
Roblox managed to serve BERT at inference time on multi-core CPUs - in a low-latency, high-throughput environment - by implementing the following approaches -
BERT Baseline (330 ms/inference): A HuggingFace implementation of BERT with inputs zero-padded to a fixed length of 128 tokens.
DistilBERT (171 ms/inference): A distilled, and thus smaller, model enabled faster inference and training with a minimal drop in F1 score.
Dynamic shape input (69 ms/inference): Since their use case meant an input batch size of 1 was ideal, they stopped zero-padding altogether. The smaller inputs improved both throughput and latency.
Quantization (10 ms/inference): A specific technique called Dynamic Quantization was leveraged to shrink the weights. The crux of this technique is representing 32-bit floating point weights as 8-bit integers.
This technique quantizes weights AFTER training, as opposed to quantizing during training (which is called Quantization-Aware Training). See the sketch just below this list for what it looks like in PyTorch.
Caching: They observed a 40% cache hit rate in production when they cached 1 million entries in process memory. This doubled their throughput.
This allowed them to achieve 25,000 inferences/second (over 1 billion inferences/day) at a latency of under 20 ms!
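If you're curious what post-training dynamic quantization looks like in code, here is a minimal PyTorch sketch - the model name and example text are just placeholders, not Roblox's actual pipeline - combined with the no-padding, batch-size-1 input trick:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder model - not Roblox's actual classifier
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Post-training dynamic quantization: fp32 Linear weights -> int8
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Dynamic shape input: batch size of 1, no zero-padding to 128 tokens
inputs = tokenizer("is this game fun to play?", return_tensors="pt")

with torch.no_grad():
    logits = quantized_model(**inputs)[0]
print(logits.argmax(dim=-1))
```

Only the Linear layers are quantized here: their weights are converted to int8 ahead of time while activations are quantized on the fly at inference, which is why no calibration pass is needed and why it works so well on CPU.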
Automatic Mixed Precision Training🧐
This concept is similar to the quantization technique above, except that it applies during training. The feature is landing in the upcoming PyTorch 1.6 release (the amp module - automatic mixed precision), according to the author of this post, which gives deep insight into its usefulness.
Precision: For the unfamiliar, mixed precision training is the technique of using lower-precision types (e.g. in PyTorch, using fp16 instead of the default fp32). Modulo hardware support, this means significantly faster training (since there's fewer bits to manipulate when you're doing math) [Reddit]
Automatic: PyTorch automatically determines which model operations are safe in half-precision and which are not, using a set of casting rules that the PyTorch dev team has worked out to be reliable. That means you don't have to go through the process of figuring out what you can halve and what you can't - the framework does it for you! [Reddit]
Mixed: Some vector operations are safe in fp16, while others are kept in fp32. This mixed usage of dtypes is why the technique is called mixed precision.
One caveat is that mixed precision training pays off only on GPUs with Tensor Cores, like the V100 (5120 CUDA cores, 640 Tensor Cores) or the T4 (2560 CUDA cores, 320 Tensor Cores), which are readily available in the cloud. That being said, it achieves two major things - roughly a 50-60% improvement in training time and a smaller GPU memory footprint - which enables faster training and faster research iteration!
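To make this concrete, here's a minimal sketch of what a training step looks like with the torch.cuda.amp API (autocast + GradScaler) coming in 1.6 - the model, data, and hyperparameters below are just placeholders:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

# Placeholder model, optimizer and data - only the AMP plumbing matters here
model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = GradScaler()  # scales the loss to avoid fp16 gradient underflow

for step in range(100):
    inputs = torch.randn(32, 512, device="cuda")
    targets = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad()
    # autocast picks fp16 or fp32 per-op based on PyTorch's casting rules
    with autocast():
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales grads, then optimizer.step()
    scaler.update()                # adjusts the scale factor for next step
```

The GradScaler is there because fp16 gradients can underflow to zero; scaling the loss up before backward and unscaling before the optimizer step keeps small gradients representable.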
ONNX Runtime (Microsoft)🤯
ONNX Runtime, an open-source library from Microsoft, announced a partnership with HuggingFace in May to accelerate Transformers and thus improve both inference and training time!
In general, ONNX Runtime helps accelerate PyTorch and TensorFlow models in production, on both GPU and CPU. It is used by Microsoft itself in some of its most popular products -
What really excites me -
ONNX Runtime is used in products and services handling over 20 billion inferences each day. ONNX Runtime has optimizations for transformer models with up to 17x speedup.
This makes it ideal, given that most of these power-hungry models - however appealing - never see the light of production.
This partnership means HuggingFace is actively working towards an easy integration of ONNX Runtime into our development cycles, spoon-feeding us all the good stuff!
Looking at the Transformers Jupyter notebook showcasing the ONNX integration, it seems straightforward and easy to use. I know what I'm doing this weekend!😉
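For a taste, here's roughly the flow the notebook walks through - export a model to an ONNX graph with the convert helper in transformers, then run it with an ONNX Runtime InferenceSession on CPU. The model name, text, and output path below are just placeholders:

```python
from pathlib import Path
from transformers import AutoTokenizer
from transformers.convert_graph_to_onnx import convert
from onnxruntime import InferenceSession, SessionOptions

# Placeholder model name and output path
model_name = "bert-base-cased"
onnx_path = Path("onnx/bert-base-cased.onnx")

# Export the PyTorch model to an ONNX graph (opset 11)
convert(framework="pt", model=model_name, output=onnx_path, opset=11)

# Load the graph with ONNX Runtime and run inference on CPU
session = InferenceSession(str(onnx_path), SessionOptions(),
                           providers=["CPUExecutionProvider"])
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokens = tokenizer("Hello, ONNX Runtime!", return_tensors="np")
outputs = session.run(None, dict(tokens))
print(outputs[0].shape)  # last hidden state
```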
If you would like a copy of this issue in your mailbox every week, consider subscribing 🤗 You can unsubscribe just as easily if you don't dig it!