Persistent Deployments
======================

A persistent model deployment can be created with the :func:`mii.serve` API. This
stands up a gRPC server and returns a :class:`MIIClient <mii.backend.client.MIIClient>`
object that can be used to send generation requests to the inference server. The
inference server persists after the Python script exits and keeps running until
it is explicitly terminated.

To connect to an existing deployment, use the :func:`mii.client` API. This
connects to an existing gRPC server and returns a
:class:`MIIClient <mii.backend.client.MIIClient>` object.

MIIClient
---------

.. autoclass:: mii.backend.client.MIIClient

    .. automethod:: __call__

    .. automethod:: generate

    .. automethod:: terminate_server

:class:`MIIClient <mii.backend.client.MIIClient>` is a callable class that
provides a simplified interface for generating text for prompt inputs on a
persistent model deployment. To create a persistent deployment, you only need to
provide the HuggingFace model name (or the path to a locally stored model) to the
:func:`mii.serve` API. DeepSpeed-MII will automatically load the model weights,
create an inference engine, stand up a gRPC server, and return the callable
client. An example is provided below:

.. code-block:: python

    import mii
    client = mii.serve("mistralai/Mistral-7B-v0.1")
    response = client(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
    print(response)

Because the deployment is persistent, the server continues running until it is
explicitly shut down. This allows users to connect to a deployment from other
processes using the :func:`mii.client` API:

.. code-block:: python

    import mii
    client = mii.client("mistralai/Mistral-7B-v0.1")
    response = client(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
    print(response)

When a server needs to be shut down, this can be done from any client object:

.. code-block:: python

    import mii
    client = mii.client("mistralai/Mistral-7B-v0.1")
    client.terminate_server()

Deployment Configuration
------------------------

While we prioritize offering a simple interface for loading models into
production-ready persistent deployments, we also provide many configuration
options for our persistent deployment.

**Any of the fields in** :class:`ModelConfig` **and** :class:`MIIConfig` **can be
passed as keyword arguments or in respective** ``model_config`` **and**
``mii_config`` **dictionaries to the** :func:`mii.serve` **API. Please see**
:ref:`Model Configuration` **and** :ref:`MII Server Configuration` **for more
information.**

Generate Options
----------------

Text-generation behavior using the callable
:class:`MIIClient <mii.backend.client.MIIClient>` class can be customized with
several keyword arguments. A full list of the available options can be found in
:class:`GenerateParamsConfig`.

The generate options affect only the prompt(s) passed in a given call to the
client. For example, the generation length can be controlled on a per-prompt
basis, overriding the default ``max_length``:

.. code-block:: python

    # `client` is an MIIClient returned by mii.serve() or mii.client()
    prompt = "DeepSpeed is"
    response_long = client(prompt, max_length=1024)
    response_short = client(prompt, max_length=128)

.. _deployment_model_parallelism:

Model Parallelism
-----------------

Our persistent deployment supports splitting models across multiple GPUs using
tensor parallelism. To enable model parallelism, pass the ``tensor_parallel``
argument to :func:`mii.serve`:

.. code-block:: python

    client = mii.serve("mistralai/Mistral-7B-v0.1", tensor_parallel=2)

.. _deployment_model_replicas:

Model Replicas
--------------

The persistent deployment can also create multiple model replicas. Passing the
``replica_num`` argument to :func:`mii.serve` enables this feature:

.. code-block:: python

    client = mii.serve("mistralai/Mistral-7B-v0.1", replica_num=2)

With multiple model replicas, an intermediate load-balancer process forwards
incoming client requests to the replicas in round-robin order. For example, if 4
requests with ids ``0, 1, 2, 3`` are sent to the persistent deployment, then
``replica 0`` will process requests ``0`` and ``2`` while ``replica 1`` will
process requests ``1`` and ``3``.

Model replicas also compose with model parallelism. For example, 2 replicas can
be created, each split across 2 GPUs, on a system with 4 GPUs total:

.. code-block:: python

    client = mii.serve("mistralai/Mistral-7B-v0.1", replica_num=2, tensor_parallel=2)
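
The round-robin forwarding and the GPU count described in this section can be illustrated with a small standalone sketch. It does not call DeepSpeed-MII: ``assign_round_robin`` and ``gpus_required`` are hypothetical helper names introduced here for illustration only, and the actual load-balancer implementation may differ in its details.

```python
from itertools import cycle


def assign_round_robin(request_ids, replica_num):
    """Map each request id to a replica index in round-robin order.

    Hypothetical helper for illustration; not part of the mii API.
    """
    replicas = cycle(range(replica_num))
    assignment = {replica: [] for replica in range(replica_num)}
    for request_id in request_ids:
        # Each new request goes to the next replica in the cycle.
        assignment[next(replicas)].append(request_id)
    return assignment


def gpus_required(replica_num, tensor_parallel):
    """Each replica is one model copy split across `tensor_parallel` GPUs.

    Hypothetical helper for illustration; not part of the mii API.
    """
    return replica_num * tensor_parallel


assignment = assign_round_robin([0, 1, 2, 3], replica_num=2)
print(assignment)  # {0: [0, 2], 1: [1, 3]}
print(gpus_required(replica_num=2, tensor_parallel=2))  # 4
```

The first result matches the example above, where ``replica 0`` handles requests ``0`` and ``2`` and ``replica 1`` handles requests ``1`` and ``3``. The second matches the 4-GPU system used for 2 replicas with ``tensor_parallel=2``.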