Microsoft AI Releases “DeepSpeed Compression”: A Composable Python-Based Library for Extreme Compression and Zero-Cost Quantization to Reduce Deep Learning Model Size and Accelerate Inference

Deep learning and AI research are being revolutionized by large-scale models, which have led to significant advances in many areas, including multilingual translation, creative text generation, and language understanding. Despite their impressive capabilities, however, the sheer size of these models introduces latency and cost constraints that make them difficult to deploy in applications. Microsoft AI’s DeepSpeed team has been investigating both system optimization and model compression to address these deployment challenges.

The DeepSpeed inference system, previously released as part of Microsoft’s AI at Scale initiative, applies a variety of optimizations to speed up model inference, such as highly optimized CUDA kernels and inference-adapted parallelism. These enhancements improve the efficiency of the inference system while leaving the model itself untouched: accuracy, model size, and computational load stay the same. In other words, the amount of work is unchanged, but it is executed faster and with higher throughput. Compression algorithms, by contrast, show great promise for reducing model size and inference computation directly. By representing and executing DNN models in a compressed format, they cut the work required for inference with minimal or no loss in accuracy.

System optimizations and model compression are therefore complementary and can be combined to reduce inference latency and cost multiplicatively. This motivation to bring together the best of both worlds led to the development of DeepSpeed Compression: a composable library that couples state-of-the-art compression techniques with highly efficient system optimizations to shrink DL models and speed up inference while keeping compression costs significantly lower.

Despite various attempts to reduce model size and inference computation, applying existing compression approaches to large-scale models still faces several practical difficulties. The first is that achieving a high compression ratio requires a complicated pipeline. Several solutions have been proposed to combat the optimization complexity and accuracy degradation of compressing large models, but best practices for high compression ratios, such as aggressive quantization and layer reduction, have not been thoroughly studied, and compressing large models with current techniques carries substantial training costs. Another issue is the lack of specialized system optimizations for compressed models: existing solutions often focus on reducing theoretical computational overhead, whereas the best inference latency reductions usually come from system optimizations tailored to the compressed model. Finally, current methodologies limit composability between different compression algorithms and system optimizations. DeepSpeed Compression overcomes these issues by adopting an end-to-end approach that runs compressed models on a highly optimized inference engine. The library also bundles many state-of-the-art compression techniques that can be composed with system optimizations, providing the best of both worlds through an efficient and simple pipeline for DL model inference. With DeepSpeed Compression, model size can be reduced 32x with almost no loss of accuracy, and 50x while retaining 97% of the accuracy. The two main methods used are layer reduction and extreme quantization.
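As a rough, hypothetical illustration of where such ratios come from (this is not DeepSpeed’s actual implementation; the function name and the layer-count remark below are purely illustrative), the sketch binarizes a weight matrix with a per-row scaling factor, the standard trick behind extreme 1-bit quantization, and computes the resulting storage ratio relative to FP32 weights:

```python
import torch

def binarize(weight: torch.Tensor):
    """Extreme (1-bit) quantization: keep only the sign of each weight plus
    one floating-point scaling factor per output row to preserve magnitude."""
    scale = weight.abs().mean(dim=1, keepdim=True)  # per-row scale
    signs = torch.sign(weight)                      # values in {-1, 0, +1}
    return signs, scale

# Toy example: one linear layer's weight matrix.
w = torch.randn(1024, 1024)
signs, scale = binarize(w)
w_hat = signs * scale                               # dequantized approximation

# Storage cost: 1 bit per weight plus one FP32 scale per row,
# versus 32 bits per weight for the original FP32 tensor.
fp32_bits = w.numel() * 32
binary_bits = w.numel() * 1 + scale.numel() * 32
print(f"compression ratio ~ {fp32_bits / binary_bits:.1f}x")        # close to 32x
print(f"mean reconstruction error: {(w - w_hat).abs().mean():.4f}")
# Removing, say, half of the transformer layers (layer reduction) on top of
# 1-bit weights is what pushes the combined ratio toward the reported 50x.
```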

The scarcity of training resources makes large-scale transformer models notoriously difficult to quantize. The researchers therefore also proposed ZeroQuant, which quantizes large-scale models with minimal fine-tuning cost. Its first component is a fine-grained, hardware-friendly quantization scheme that lets researchers quantize weights and activations to low-bit values while still enabling fast inference. The second component, a layer-by-layer knowledge distillation pipeline, refines the quantized model and closes the accuracy gap introduced by low-precision quantization. Although only recently released, DeepSpeed Compression has already been used to optimize several important open-source models and Microsoft production workloads, greatly reducing latency and cost, and it is broadly applicable to a range of NLP and CV tasks. The core components of Microsoft AI’s DeepSpeed Compression have now been made publicly available. These include the Compression Composer, which supports compression techniques for NLP and computer vision models such as lightweight layer reduction, pre-training and task-specific knowledge distillation, head pruning, row pruning, and channel pruning. The team also intends to add further capabilities to the library, such as specialized kernels for compressed models and an optimization module that automatically selects the most efficient compression schemes.
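To make the two ZeroQuant ingredients more concrete, here is a minimal sketch of the underlying ideas, assuming group-wise symmetric INT8 quantization for weights, per-token dynamic quantization for activations, and a per-layer MSE distillation step. The function names, the group size of 128, and the bit width are illustrative assumptions, not the library’s actual API or kernels:

```python
import torch

def quantize_weights_groupwise(w: torch.Tensor, bits: int = 8, group_size: int = 128):
    """Group-wise symmetric quantization: every `group_size` consecutive weights
    share one scale, which tracks local value ranges far better than a single
    per-tensor scale. Assumes w.numel() is a multiple of group_size."""
    qmax = 2 ** (bits - 1) - 1
    groups = w.reshape(-1, group_size)                         # [n_groups, group_size]
    scale = groups.abs().max(dim=1, keepdim=True).values / qmax
    scale = scale.clamp(min=1e-8)                              # avoid division by zero
    q = torch.clamp(torch.round(groups / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale                                            # int8 values + FP scales

def quantize_activations_tokenwise(x: torch.Tensor, bits: int = 8):
    """Token-wise dynamic quantization: each token (row) of an activation
    matrix [tokens, hidden] gets its own scale, computed on the fly."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor, shape):
    """Reconstruct an approximate floating-point tensor for computation."""
    return (q.float() * scale).reshape(shape)

def layerwise_distill_step(fp_layer, q_layer, x, optimizer):
    """One step of layer-by-layer distillation: the quantized layer is nudged to
    match its own full-precision teacher layer's output, one layer at a time,
    so the whole model never has to be fine-tuned end to end."""
    with torch.no_grad():
        target = fp_layer(x)                                   # teacher output for this layer
    loss = torch.nn.functional.mse_loss(q_layer(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In DeepSpeed Compression itself, such techniques are selected and composed through the library’s configuration rather than hand-written as above.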

This Article is written as a research summary article by Marktechpost Staff based on the research article 'DeepSpeed Compression: A composable library for extreme compression and zero-cost quantization'. All Credit For This Research Goes To Researchers on This Project. Check out the blog, GitHub, and website.

