FastGen Quick Start Guide

This guide aims to get you up and running quickly with DeepSpeed-MII and DeepSpeed-FastGen.

Requirements

1 or more NVIDIA GPUs with >=sm_80 compute capability (e.g., A100, A6000)
PyTorch installed in your local Python environment
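
To confirm the GPU requirement before installing, a small check along the following lines may help. This is a minimal sketch, not part of DeepSpeed-MII: `meets_requirement` and `check_local_gpus` are hypothetical helper names. It relies on PyTorch's `torch.cuda.get_device_capability`, which reports compute capability as a `(major, minor)` tuple, so sm_80 corresponds to `(8, 0)`.

```python
def meets_requirement(capability, minimum=(8, 0)):
    """Return True if a (major, minor) compute capability is at least sm_80."""
    return tuple(capability) >= tuple(minimum)


def check_local_gpus():
    """Return (index, capability, ok) for each visible CUDA GPU.

    Returns an empty list if PyTorch is missing or no CUDA device is visible.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    results = []
    for index in range(torch.cuda.device_count()):
        capability = torch.cuda.get_device_capability(index)
        results.append((index, capability, meets_requirement(capability)))
    return results


# Print one line per GPU; prints nothing when no GPU is found.
for index, capability, ok in check_local_gpus():
    status = "OK" if ok else "below sm_80"
    print(f"GPU {index}: sm_{capability[0]}{capability[1]} ({status})")
```

If nothing is printed, either PyTorch is not installed in the active environment or no CUDA device is visible to it.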

Install

Install the latest version of DeepSpeed-MII with the following:

(.venv) $ pip install -U deepspeed-mii

Run a Non-Persistent Pipeline

A pipeline provides a non-persistent instance of the model for running inference. When the script running this code exits, the model is destroyed as well. The pipeline is ideal for quick tests or for cases where peak performance is not required.

Copy the following code block into an example.py file on your local machine. Run it with deepspeed --num_gpus <num of GPUs> example.py.

import mii
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
for r in response:
    print(r.generated_text)

Note

Depending on your internet connection, the download of model weights could take a few minutes. If you wish to try a smaller model, replace mistralai/Mistral-7B-v0.1 with facebook/opt-125m in the above code.

If the code successfully runs, you should see the generated text printed in your terminal.
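
If you expect to switch between the two models mentioned in the note, one option is to read the model name from the environment instead of editing the script each time. The following is a sketch under stated assumptions: `QUICKSTART_MODEL` is a hypothetical variable name chosen for this example (MII itself does not read it), and `choose_model` and `main` are illustrative helpers. The `mii.pipeline` call and the response handling are the same as in the example above.

```python
import os

DEFAULT_MODEL = "mistralai/Mistral-7B-v0.1"


def choose_model(environ=os.environ):
    """Pick the model name from the hypothetical QUICKSTART_MODEL variable."""
    return environ.get("QUICKSTART_MODEL", DEFAULT_MODEL)


def main():
    # Imported here so choose_model can be tried without MII installed.
    import mii

    pipe = mii.pipeline(choose_model())
    response = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
    for r in response:
        print(r.generated_text)
```

To use this as example.py, add a final line that calls main(), then set QUICKSTART_MODEL=facebook/opt-125m in your shell before launching to try the smaller model. This assumes the launched process inherits your shell environment, which is the usual case for a local run.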

Run a Persistent Deployment

In contrast to the pipeline, deployments create a server process that persists beyond the execution of the Python script. These deployments are intended for production use cases and allow multiple clients to connect while providing the best performance from DeepSpeed-FastGen.

Copy the following code block into a serve.py file on your local machine. Run it with python serve.py.

import mii
mii.serve("mistralai/Mistral-7B-v0.1")

You should see logging messages indicating the server is starting and a final log message of server has started on ports [50051].

Now copy the following code block into a client.py file on your local machine. Run it with python client.py.

import mii
client = mii.client("mistralai/Mistral-7B-v0.1")
response = client(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
for r in response:
    print(r.generated_text)

If the code successfully runs, you should see the generated text printed in your terminal. You can run this client script as many times (and from as many different processes) as you like, and the model deployment will remain active.

Finally, copy the following code block into a terminate.py file on your local machine. Run it with python terminate.py.

import mii
client = mii.client("mistralai/Mistral-7B-v0.1")
client.terminate_server()

This will shut down the model deployment and free GPU memory.
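
When a deployment is only needed for a single batch of requests, the client and terminate steps can be combined so that shutdown happens even if the request raises an error. The following is a minimal sketch, assuming a deployment has already been started with serve.py. `query_then_terminate` is a hypothetical helper, and its `mii_module` parameter exists only so that a stand-in object can be substituted for testing. The only MII calls used are `mii.client`, the client call itself, and `terminate_server`, all taken from the examples above.

```python
def query_then_terminate(model_name, prompts, max_new_tokens=128, mii_module=None):
    """Query a running deployment, then shut it down even if the query fails."""
    if mii_module is None:
        # Real library; requires a deployment started beforehand.
        import mii as mii_module

    client = mii_module.client(model_name)
    try:
        response = client(prompts, max_new_tokens=max_new_tokens)
        return [r.generated_text for r in response]
    finally:
        # Runs on success and on error, so GPU memory is always released.
        client.terminate_server()
```

For example, query_then_terminate("mistralai/Mistral-7B-v0.1", ["DeepSpeed is", "Seattle is"]) returns the generated texts as a list and leaves no server process behind. Because it ends the deployment, this pattern is not suitable when other clients are still connected.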