Persistent Deployments

A persistent model deployment can be created with the mii.serve() API. This stands up a gRPC server and returns a MIIClient object that can be used to send generation requests to the inference server. The inference server persists after the Python script exits and keeps running until it is explicitly terminated.

To connect to an existing deployment, the mii.client() API is used. This will connect with an existing gRPC server and return a MIIClient object.

MIIClient

class mii.backend.client.MIIClient(mii_config, host='localhost')

Client for sending generation requests to a persistent deployment created with mii.serve(). Use mii.client() to create an instance of this class.

Parameters:

mii_config (MIIConfig) – MII config for the persistent deployment to connect with.
host (str (default: 'localhost')) – hostname where the persistent deployment is running.

__call__(*args, **kwargs)

All args and kwargs get passed directly to generate().

Return type:

List[Response]

Returns:

A list of Response objects containing the generated text for all prompts.

generate(prompts, streaming_fn=None, **generate_kwargs)

Generates text for the given prompts.

Parameters:

prompts (Union[str, List[str]]) – The string or list of strings used as prompts for generation.
streaming_fn (Optional[Callable] (default: None)) – Streaming support is currently a WIP.
**generate_kwargs (Dict) – Generation keywords. A full list can be found in GenerateParamsConfig.

Return type:

List[Response]

Returns:

A list of Response objects containing the generated text for all prompts.

terminate_server()

Terminates the persistent deployment server. This can be called from any client.

Return type:

None

MIIClient is a callable class that provides a simplified interface for generating text from prompt inputs on a persistent model deployment. To create a persistent deployment, you only need to provide the HuggingFace model name (or the path to a locally stored model) to the mii.serve() API. DeepSpeed-MII will automatically load the model weights, create an inference engine, stand up a gRPC server, and return the callable client. An example is provided below:

import mii
client = mii.serve("mistralai/Mistral-7B-v0.1")
response = client(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(response)

Because the deployment is persistent, this server will continue running until it is explicitly shut down. This allows users to connect to a deployment from other processes using the mii.client() API:

import mii
client = mii.client("mistralai/Mistral-7B-v0.1")
response = client(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(response)

When a server needs to be shut down, this can be done from any client object:

import mii
client = mii.client("mistralai/Mistral-7B-v0.1")
client.terminate_server()
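
For short-lived jobs, one way to make sure the server is always shut down is to wrap the requests in try/finally. The sketch below is illustrative only: run_and_shutdown and _StubClient are hypothetical names, and the stub merely stands in for a real MIIClient so the pattern can be run without a GPU. With a real deployment, the object returned by mii.serve() or mii.client() would be passed instead.

```python
class _StubClient:
    """Hypothetical stand-in for MIIClient, used only for a dry run."""

    def __init__(self):
        self.terminated = False

    def __call__(self, prompts, **generate_kwargs):
        # A real client returns Response objects; the stub returns placeholders.
        return [f"<generated text for {p!r}>" for p in prompts]

    def terminate_server(self):
        self.terminated = True


def run_and_shutdown(client, prompts, **generate_kwargs):
    """Send one batch of prompts, then terminate the deployment even on error."""
    try:
        return client(prompts, **generate_kwargs)
    finally:
        # Runs whether or not the generation call raised an exception.
        client.terminate_server()


stub = _StubClient()
responses = run_and_shutdown(stub, ["DeepSpeed is", "Seattle is"], max_new_tokens=128)
```

Since terminate_server() stops the deployment for every connected client, this pattern is probably best reserved for the process that owns the deployment.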

Deployment Configuration

While we prioritize offering a simple interface for loading models into production-ready persistent deployments, we also provide many configuration options for our persistent deployment.

Any of the fields in ModelConfig and MIIConfig can be passed as keyword arguments or in respective model_config and mii_config dictionaries to the mii.serve() API. Please see Model Configuration and MII Server Configuration for more information.
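
As an illustration of the two calling styles, the sketch below expresses the same settings once as flat keyword arguments and once grouped in a model_config dictionary. It assumes that tensor_parallel and replica_num are ModelConfig fields (both are accepted as mii.serve() keyword arguments). The launch helper is hypothetical and is not called here, because doing so requires GPUs and an MII installation.

```python
MODEL = "mistralai/Mistral-7B-v0.1"

# Style 1: config fields passed directly as keyword arguments.
flat_kwargs = {"tensor_parallel": 2, "replica_num": 2}

# Style 2: the same fields grouped in a model_config dictionary.
# Assumption: both fields belong to ModelConfig rather than MIIConfig.
grouped_kwargs = {"model_config": {"tensor_parallel": 2, "replica_num": 2}}


def launch(model_name, serve_kwargs):
    """Hypothetical helper: start a persistent deployment with the given settings."""
    import mii  # imported lazily so the sketch can be loaded without MII installed

    return mii.serve(model_name, **serve_kwargs)


# Either call would be expected to configure the deployment the same way:
#   client = launch(MODEL, flat_kwargs)
#   client = launch(MODEL, grouped_kwargs)
```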

Generate Options

Text-generation behavior using the callable MIIClient class can be customized with several keyword arguments. A full list of the available options can be found in GenerateParamsConfig.

The generate options affect only the prompt(s) passed in a given call to the client. For example, the generation length can be set on a per-call basis, overriding the default max_length:

response_long = client(prompt, max_length=1024)
response_short = client(prompt, max_length=128)
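
Because options apply per call, an application can keep its own defaults and override them for individual requests. The following is a minimal sketch of that idea, not part of the MII API: generate_with_defaults, DEFAULTS, and _RecordingClient are hypothetical names, and the stub only records the keyword arguments it receives so the merge logic can be checked without a running deployment.

```python
# Application-level defaults (hypothetical); max_new_tokens is a generate option.
DEFAULTS = {"max_new_tokens": 128}


def generate_with_defaults(client, prompts, **overrides):
    """Call the client with DEFAULTS, letting per-call overrides take precedence."""
    options = {**DEFAULTS, **overrides}
    return client(prompts, **options)


class _RecordingClient:
    """Hypothetical stand-in for MIIClient that records generate kwargs."""

    def __init__(self):
        self.calls = []

    def __call__(self, prompts, **generate_kwargs):
        self.calls.append(generate_kwargs)
        return []


stub = _RecordingClient()
generate_with_defaults(stub, "DeepSpeed is")                     # uses the default
generate_with_defaults(stub, "DeepSpeed is", max_new_tokens=16)  # per-call override
```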

Model Parallelism

Our persistent deployment supports splitting models across multiple GPUs using tensor parallelism. To enable model parallelism, pass the tensor_parallel argument to mii.serve():

client = mii.serve("mistralai/Mistral-7B-v0.1", tensor_parallel=2)

Model Replicas

The persistent deployment can also create multiple model replicas. Passing the replica_num argument to mii.serve() enables this feature:

client = mii.serve("mistralai/Mistral-7B-v0.1", replica_num=2)

With multiple model replicas, incoming client requests are forwarded to the replicas in round-robin order by an intermediate load-balancer process. For example, if 4 requests with ids 0, 1, 2, 3 are sent to the persistent deployment with 2 replicas, then replica 0 will process requests 0 and 2 while replica 1 will process requests 1 and 3.
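
The dispatch order in this example can be reproduced with a few lines of plain Python. This is only a model of round-robin assignment, not MII's load-balancer code; assign_round_robin is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import cycle


def assign_round_robin(request_ids, replica_num):
    """Map each replica index to the request ids it would receive, in arrival order."""
    replicas = cycle(range(replica_num))  # 0, 1, ..., replica_num - 1, 0, 1, ...
    assignment = defaultdict(list)
    # zip stops when request_ids is exhausted, so the infinite cycle is safe.
    for request_id, replica in zip(request_ids, replicas):
        assignment[replica].append(request_id)
    return dict(assignment)


assignment = assign_round_robin([0, 1, 2, 3], replica_num=2)
print(assignment)  # {0: [0, 2], 1: [1, 3]}
```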

Model replicas also compose with model parallelism. For example, on a system with 4 GPUs total, 2 replicas can be created, each split across 2 GPUs:

client = mii.serve("mistralai/Mistral-7B-v0.1", replica_num=2, tensor_parallel=2)
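
A quick sanity check before launching such a deployment is to compare the GPUs it needs with the GPUs available. The sketch below assumes that each replica occupies its own tensor_parallel GPUs, so the total is the product of the two settings, as in the 2 x 2 = 4 example above. required_gpus and fits_on_system are hypothetical helpers, not part of MII.

```python
def required_gpus(replica_num, tensor_parallel):
    """GPUs needed, assuming each replica uses tensor_parallel GPUs of its own."""
    return replica_num * tensor_parallel


def fits_on_system(replica_num, tensor_parallel, available_gpus):
    """Return True if the requested layout fits on the available GPUs."""
    return required_gpus(replica_num, tensor_parallel) <= available_gpus


needed = required_gpus(replica_num=2, tensor_parallel=2)
print(needed)  # 4
```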