DeepSpeed-MII

Note

This project is under active development.

Introducing MII, an open-source Python library designed by DeepSpeed to democratize powerful model inference with a focus on high throughput, low latency, and cost-effectiveness.

MII v0.1 introduced several features as part of our DeepSpeed-FastGen release, including blocked KV-caching, continuous batching, Dynamic SplitFuse, tensor parallelism, and high-performance CUDA kernels, to support fast, high-throughput text generation with LLMs. The latest version of MII delivers up to 2.5 times higher effective throughput compared to leading systems such as vLLM. For detailed performance results, please see our DeepSpeed-FastGen release blog and the latest DeepSpeed-FastGen blog.

MII-Legacy

We first announced MII in 2022. Since then, MII has undergone a large refactoring effort to add support for DeepSpeed-FastGen. MII-Legacy, which covers all prior releases up to v0.0.9, supports inference for a wide variety of language model tasks. MII-Legacy also supports accelerating text-to-image models such as Stable Diffusion. For more details on our previous releases, please see our legacy APIs.
