Persistent Deployments
A persistent model deployment can be created with the mii.serve() API. This
stands up a gRPC server and returns a MIIClient object that can be used to send generation
requests to the inference server. The inference server persists after the
Python script exits and keeps running until it is explicitly terminated.
To connect to an existing deployment, the mii.client() API is used. This
will connect with an existing gRPC server and return a MIIClient object.
MIIClient
- class mii.backend.client.MIIClient(mii_config, host='localhost')
Client for sending generation requests to a persistent deployment created with
mii.serve(). Use mii.client() to create an instance of this class.
- Parameters:
- __call__(*args, **kwargs)
All args and kwargs get passed directly to
generate().
MIIClient is a callable class that
provides a simplified interface for generating text for prompt inputs on a
persistent model deployment. To create a persistent deployment, you only need to
provide the HuggingFace model name (or path to a locally stored model) to the
mii.serve() API. DeepSpeed-MII will automatically load the model weights,
create an inference engine, stand up a gRPC server, and return the callable
client. An example is provided below:
import mii
client = mii.serve("mistralai/Mistral-7B-v0.1")
response = client(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(response)
Because the deployment is persistent, this server will continue running until it
is explicitly shut down. This allows users to connect to a deployment from other
processes using the mii.client() API:
import mii
client = mii.client("mistralai/Mistral-7B-v0.1")
response = client(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(response)
When a server needs to be shut down, this can be done from any client object:
import mii
client = mii.client("mistralai/Mistral-7B-v0.1")
client.terminate_server()
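Because the server outlives the script that started it, a forgotten deployment can keep holding GPU memory. The following is a minimal sketch of one way to guarantee shutdown, not part of the MII API: managed_deployment is a hypothetical helper, and StubClient is a stand-in for MIIClient so the example runs without a GPU. The only MIIClient behavior it relies on is the terminate_server() method shown above.

```python
from contextlib import contextmanager


@contextmanager
def managed_deployment(client):
    """Hypothetical helper: yield the client, then always shut the server down."""
    try:
        yield client
    finally:
        # Runs on normal exit and on exceptions alike.
        client.terminate_server()


class StubClient:
    """Stand-in for MIIClient (assumption) so the sketch runs without a GPU."""

    def __init__(self):
        self.terminated = False

    def __call__(self, prompts, **kwargs):
        # A real client would return generated text; this just echoes the prompts.
        return [f"{prompt} ..." for prompt in prompts]

    def terminate_server(self):
        self.terminated = True


stub = StubClient()
with managed_deployment(stub) as client:
    client(["DeepSpeed is"], max_new_tokens=8)
print(stub.terminated)
```

With a real deployment, the intended use would be to pass the object returned by mii.serve() or mii.client() in place of the stub.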
Deployment Configuration
While we prioritize a simple interface for loading models into production-ready persistent deployments, we also provide many configuration options.
Any of the fields in ModelConfig and
MIIConfig can be passed as keyword
arguments or in the respective model_config and mii_config
dictionaries to the mii.serve() API. Please see Model
Configuration and MII Server Configuration for more information.
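The two call styles described above can be sketched as follows. This is an illustration under assumptions rather than a reference: split_config and serve_with_dicts are hypothetical helpers, and tensor_parallel and replica_num (both used elsewhere on this page) are assumed to be ModelConfig fields. The mii import is deferred so the dictionary-building part runs without DeepSpeed-MII installed.

```python
def split_config(tensor_parallel=1, replica_num=1):
    """Hypothetical helper: build the dictionary form of two keyword arguments."""
    # Assumption: both fields belong to ModelConfig, so they go in model_config.
    model_config = {"tensor_parallel": tensor_parallel, "replica_num": replica_num}
    return {"model_config": model_config}


def serve_with_dicts(model_name, **fields):
    """Hypothetical wrapper: start a deployment using the dictionary form."""
    import mii  # deferred import; only needed when a server is actually started

    # Intended to be equivalent to mii.serve(model_name, **fields).
    return mii.serve(model_name, **split_config(**fields))


print(split_config(tensor_parallel=2, replica_num=2))
```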
Generate Options
Text-generation behavior using the callable MIIClient class can be customized with several keyword
arguments. A full list of the available options can be found in
GenerateParamsConfig.
The generate options affect only the prompt(s) passed in a given call to the client.
For example, the generation length can be controlled on a per-prompt basis,
overriding the default max_length:
response_long = client(prompt, max_length=1024)
response_short = client(prompt, max_length=128)
Model Parallelism
Our persistent deployment supports splitting models across multiple GPUs using
tensor parallelism. To enable model parallelism, pass the tensor_parallel
argument to mii.serve():
client = mii.serve("mistralai/Mistral-7B-v0.1", tensor_parallel=2)
Model Replicas
The persistent deployment can also create multiple model replicas. Passing the
replica_num argument to mii.serve() enables this feature:
client = mii.serve("mistralai/Mistral-7B-v0.1", replica_num=2)
With multiple model replicas, the incoming requests from clients will be
forwarded to the replicas in round-robin order by an intermediate
load-balancer process. For example, if 4 requests with ids 0, 1, 2, 3 are
sent to the persistent deployment, then replica 0 will process requests
0 and 2 while replica 1 will process requests 1 and 3.
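The request-to-replica mapping in this example can be reproduced with a few lines of plain Python. This is only a sketch of round-robin assignment, not the load balancer's actual implementation, and the function name round_robin is hypothetical.

```python
from collections import defaultdict


def round_robin(request_ids, replica_num):
    """Assign requests to replicas in arrival order, cycling through replicas."""
    assignment = defaultdict(list)
    for position, request_id in enumerate(request_ids):
        # The n-th request to arrive goes to replica n modulo replica_num.
        assignment[position % replica_num].append(request_id)
    return dict(assignment)


print(round_robin([0, 1, 2, 3], replica_num=2))  # {0: [0, 2], 1: [1, 3]}
```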
Model replicas also compose with model parallelism. For example, on a system with 4 GPUs total, 2 replicas can be created, each split across 2 GPUs:
client = mii.serve("mistralai/Mistral-7B-v0.1", replica_num=2, tensor_parallel=2)
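The GPU budget for a combined configuration is replica_num multiplied by tensor_parallel. The sketch below checks that budget and prints one possible layout. The contiguous device numbering is an assumption made for illustration (the actual device mapping chosen by the deployment may differ), and gpu_layout is a hypothetical helper.

```python
def gpu_layout(replica_num, tensor_parallel, gpus_available):
    """Return one possible GPU assignment: a list of device ids per replica."""
    needed = replica_num * tensor_parallel
    if needed > gpus_available:
        raise ValueError(f"need {needed} GPUs but only {gpus_available} are available")
    # Assumption: each replica takes a contiguous block of tensor_parallel devices.
    return [
        list(range(replica * tensor_parallel, (replica + 1) * tensor_parallel))
        for replica in range(replica_num)
    ]


print(gpu_layout(replica_num=2, tensor_parallel=2, gpus_available=4))  # [[0, 1], [2, 3]]
```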