What is DeepSpeed for Generative AI?
DeepSpeed is an open-source (Apache 2.0 license) library that optimizes training and inference for foundation models. It is a lightweight wrapper around PyTorch and optimizes for both speed and scale.
Training optimization using DeepSpeed
DeepSpeed optimizes training by managing distributed training, mixed precision, gradient accumulation, and checkpointing. Some of its features are:
- It can train models with up to 13 billion parameters on a single GPU.
- It implements the Zero Redundancy Optimizer (ZeRO), which partitions optimizer states, gradients, and parameters across devices to eliminate memory redundancy in distributed training.
- It supports combinations of data, model, and pipeline parallelism, which it calls 3D parallelism.
- It increases communication efficiency using compressed optimizers such as 1-bit Adam (Adam with 1-bit gradient compression), 0/1 Adam, and 1-bit LAMB, which reduce the communication volume between workers.
- It includes a library called Data Efficiency, which improves training efficiency and model quality by making better use of data, through two techniques: curriculum learning and random layerwise token dropping (random-LTD).
- It supports long sequence lengths using sparse attention kernels.
- It improves training efficiency by using large-batch optimizers such as LAMB.
- It enables distributed training with mixed precision.
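Several of the features above (ZeRO, mixed precision, gradient accumulation, 1-bit Adam) are enabled declaratively through DeepSpeed's JSON configuration file. The sketch below is a minimal example; the batch size, learning rate, and `freeze_step` values are illustrative placeholders, not recommendations, and exact option names can vary between DeepSpeed versions.

```json
{
  "train_batch_size": 32,
  "gradient_accumulation_steps": 4,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 2 },
  "optimizer": {
    "type": "OneBitAdam",
    "params": { "lr": 1e-4, "freeze_step": 1000 }
  }
}
```

A config like this is typically passed to `deepspeed.initialize()` or supplied on the command line when launching with the `deepspeed` runner, so the same training script can switch between ZeRO stages or optimizers without code changes.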
Inference Optimization using DeepSpeed
There are two main challenges in inference: latency and cost. DeepSpeed has the following features to optimize inference:
- Splitting inference across multiple GPUs and selecting the best parallelism strategy for multi-GPU inference.
- Increasing efficiency per GPU using:
- Deep fusion: combining multiple operations into a single kernel, which cuts kernel-launch overhead and memory traffic.
- Novel kernel scheduling: at small batch sizes, kernel invocation overhead dominates and general matrix multiplication (GEMM) libraries are not tuned for small shapes; DeepSpeed provides custom kernels that address both problems.
- The DeepSpeed quantization toolkit reduces inference cost and contains:
- Different quantization schemes for parameters and activations.
- Specialized INT8 inference kernels.
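To make the quantization idea concrete, here is a minimal pure-Python sketch of symmetric per-tensor INT8 quantization, the basic scheme such kernels build on. This is an illustration of the technique, not DeepSpeed's implementation, which uses optimized GPU kernels.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats onto the integer
    range [-127, 127] using a single scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the INT8 representation."""
    return [x * scale for x in q]

weights = [0.5, -1.0, 0.25, 0.9]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)

# The round-trip error is bounded by half a quantization step.
assert max(abs(a - b) for a, b in zip(weights, recovered)) <= scale / 2 + 1e-9
```

Storing INT8 values instead of 32-bit floats cuts weight memory by 4x, which is where much of the inference cost saving comes from; the specialized kernels then compute directly on the INT8 data.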
DeepSpeed also contains a component known as the compression composer. It offers multiple compression methods, such as quantization, head/row/channel pruning, knowledge distillation, and layer reduction, and provides an API to combine these methods in various combinations.
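Of the methods listed above, knowledge distillation is the easiest to sketch in isolation: a smaller student model is trained to match the teacher model's softened output distribution. The pure-Python sketch below shows the core loss computation; it is a conceptual illustration under standard distillation assumptions, not DeepSpeed's API.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature softens the
    distribution, exposing the teacher's 'dark knowledge' about
    relative class similarities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the softened student distribution to the
    softened teacher distribution; zero when they match exactly."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher_logits = [3.0, 1.0, 0.2]
student_logits = [2.5, 1.2, 0.3]
loss = distillation_loss(student_logits, teacher_logits)
assert loss >= 0.0
```

In practice this term is combined with the ordinary cross-entropy loss on the true labels, and the composer's value is that distillation can be stacked with pruning or layer reduction in a single compression pipeline.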