Non-Persistent Pipelines

A non-persistent pipeline can be created with the mii.pipeline() API. This returns a non-persistent MIIPipeline object that is destroyed when the python script exits.

MIIPipeline

class mii.batching.ragged_batching.MIIPipeline(all_rank_output=False, *args, **kwargs)[source]

Pipeline class that inherits from RaggedBatchBase and provides functionality of ragged batching and dynamic splitfuse. This class is returned from mii.pipeline().

__call__(prompts, **generate_kwargs)[source]

Generates text for the given prompts

Parameters:

prompts (Union[str, List[str]]) – The string or list of strings used as prompts for generation.
**generate_kwargs – Generation keywords. A full list can be found in GenerateParamsConfig.

Return type:

List[Response]

Returns:

A list of Response objects containing the generated text for all prompts.

MIIPipeline is a callable class that provides a simplified interface for generating text for prompt inputs. To create a pipeline, you must only provide the HuggingFace model name (or path to a locally stored model) to the mii.pipeline() API. DeepSpeed-MII will automatically load the model weights, create an inference engine, and return the callable pipeline. A simple 4-line example is provided below:

import mii
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(response)

Pipeline Configuration

While we prioritize offering a simple interface to load models and run text-generation, we also provide many configuration options for users that want to customize the pipeline.

Any of the fields in ModelConfig can be passed as keyword arguments or in a model_config dictionary to the mii.pipeline() API. Please see Model Configuration for more information.

Generate Options

The text-generation of the callable MIIPipeline class can be modified with several keyword arguments. A full list of the available options can be found in GenerateParamsConfig.

The generate options affect only the prompt(s) passed in a given call to the pipeline. For example, you can control per-prompt generation length:

response_long = pipeline(prompt, max_length=1024)
response_short = pipeline(prompt, max_length=128)

Model Parallelism

Our pipeline object supports splitting models across multiple GPUs using tensor parallelism. You must use the deepspeed launcher to enable tennsor parallelism with the non-persistent pipeline, where the number of devices is controlled by the --num_gpus <int> option.

As an example, consider the following example.py python script:

# example.py
import mii
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")

To run this pipeline on a single GPU, use python or deepspeed --num_gpus 1:

(.venv) $ python example.py

To enable tensor parallelism across 2 GPUs, use deepspeed --num_gpus 2:

(.venv) $ deepspeed --num_gpus 2 example.py

Because the deepspeed launcher will run multiple processes of example.py, anything in the script will be executed by each process. For example, consider the following script:

# example.py
import os
import mii
local_rank = int(os.getenv("LOCAL_RANK", 0))
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe("DeepSpeed is", max_length=16)
print(f"rank {local_rank} response: {response}")

By default, the response is returned to only the rank 0 process. When run with deepspeed --num_gpus 2 example.py the following output is produced:

(.venv) $ deepspeed --num_gpus 2 example.py
rank 0 response: [a library for parallelizing and accelerating PyTorch.]
rank 1 response: []

This behavior can be changed by enabling all_rank_output when creating the pipeline (i.e., pipe = mii.pipeline("mistralai/Mistral-7B-v0.1", all_rank_output=True)):

(.venv) $ deepspeed --num_gpus 2 example.py
rank 0 response: [a library for parallelizing and accelerating PyTorch.]
rank 1 response: [a library for parallelizing and accelerating PyTorch.]