Non-Persistent Pipelines
A non-persistent pipeline can be created with the mii.pipeline() API. This
returns a non-persistent MIIPipeline object that is destroyed when the
Python script exits.
MIIPipeline
- class mii.batching.ragged_batching.MIIPipeline(all_rank_output=False, *args, **kwargs)
Pipeline class that inherits from RaggedBatchBase and provides functionality of ragged batching and dynamic splitfuse. This class is returned from mii.pipeline().
- __call__(prompts, **generate_kwargs)
Generates text for the given prompts.
- Parameters:
prompts (Union[str, List[str]]) – The string or list of strings used as prompts for generation.
**generate_kwargs – Generation keywords. A full list can be found in GenerateParamsConfig.
- Return type:
List[Response]
- Returns:
A list of Response objects containing the generated text for all prompts.
MIIPipeline is a callable
class that provides a simplified interface for generating text from prompt
inputs. To create a pipeline, you need only provide the HuggingFace model name
(or the path to a locally stored model) to the mii.pipeline() API.
DeepSpeed-MII will automatically load the model weights, create an inference
engine, and return the callable pipeline. A simple 4-line example is provided below:
import mii
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(response)
Pipeline Configuration
While we prioritize offering a simple interface for loading models and running text generation, we also provide many configuration options for users who want to customize the pipeline.
Any of the fields in ModelConfig can be
passed as keyword arguments or in a model_config dictionary to the
mii.pipeline() API. Please see Model Configuration for more information.
Generate Options
Text generation by the callable MIIPipeline class can be modified with several
keyword arguments. A full list of the available options can be found in
GenerateParamsConfig.
The generate options affect only the prompt(s) passed in a given call to the pipeline. For example, you can control the generation length separately for each call:
response_long = pipe(prompt, max_length=1024)
response_short = pipe(prompt, max_length=128)
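The per-call scoping of generate options can be illustrated with a minimal sketch. The helper run_with_lengths and the stand-in class EchoPipeline are hypothetical names introduced only for this illustration; EchoPipeline mimics the __call__(prompts, **generate_kwargs) signature so the sketch runs without a GPU, and it is not part of DeepSpeed-MII. In real use you would pass the object returned by mii.pipeline() instead.

```python
from typing import Any, Callable, Dict, List


def run_with_lengths(pipe: Callable[..., List[Any]], prompt: str,
                     lengths: List[int]) -> Dict[int, List[Any]]:
    """Call the pipeline once per length; each call carries its own max_length."""
    return {n: pipe(prompt, max_length=n) for n in lengths}


class EchoPipeline:
    """Hypothetical stand-in for MIIPipeline (assumption, not the MII class).

    It records the keyword arguments of every call so that we can confirm
    options from one call do not leak into the next.
    """

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def __call__(self, prompts, **generate_kwargs):
        self.calls.append(dict(generate_kwargs))
        # Accept a single string or a list of strings, like the documented API.
        if isinstance(prompts, str):
            prompts = [prompts]
        limit = generate_kwargs.get("max_length")
        return [f"{p} (max_length={limit})" for p in prompts]


if __name__ == "__main__":
    fake_pipe = EchoPipeline()
    results = run_with_lengths(fake_pipe, "DeepSpeed is", [128, 1024])
    for length, response in results.items():
        print(length, response)
```

Because nothing is stored on the pipeline between calls, the second call sees only max_length=1024 and is unaffected by the first.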
Model Parallelism
Our pipeline object supports splitting models across multiple GPUs using tensor
parallelism. You must use the deepspeed launcher to enable tensor parallelism
with the non-persistent pipeline, where the number of devices is controlled by
the --num_gpus <int> option.
As an example, consider the following example.py Python script:
# example.py
import mii
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
To run this pipeline on a single GPU, use python or deepspeed --num_gpus 1:
(.venv) $ python example.py
To enable tensor parallelism across 2 GPUs, use deepspeed --num_gpus 2:
(.venv) $ deepspeed --num_gpus 2 example.py
Because the deepspeed launcher will run multiple processes of
example.py, anything in the script will be executed by each process. For
example, consider the following script:
# example.py
import os
import mii
local_rank = int(os.getenv("LOCAL_RANK", 0))
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe("DeepSpeed is", max_length=16)
print(f"rank {local_rank} response: {response}")
By default, the response is returned only to the rank 0 process. When run
with deepspeed --num_gpus 2 example.py, the following output is produced:
(.venv) $ deepspeed --num_gpus 2 example.py
rank 0 response: [a library for parallelizing and accelerating PyTorch.]
rank 1 response: []
This behavior can be changed by enabling all_rank_output when creating the
pipeline (i.e., pipe = mii.pipeline("mistralai/Mistral-7B-v0.1",
all_rank_output=True)):
(.venv) $ deepspeed --num_gpus 2 example.py
rank 0 response: [a library for parallelizing and accelerating PyTorch.]
rank 1 response: [a library for parallelizing and accelerating PyTorch.]
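When all_rank_output is left at its default, the empty result on non-zero ranks can be kept out of the logs by guarding the print. The following is a minimal sketch, assuming the launcher sets the LOCAL_RANK environment variable as in the script above; the helper names get_local_rank and print_on_rank_zero are hypothetical and not part of DeepSpeed-MII.

```python
import os


def get_local_rank() -> int:
    """Read the rank from LOCAL_RANK, defaulting to 0 when run with plain python."""
    return int(os.getenv("LOCAL_RANK", 0))


def print_on_rank_zero(message: str) -> bool:
    """Print only on rank 0; return True if the message was printed."""
    if get_local_rank() == 0:
        print(message)
        return True
    return False


if __name__ == "__main__":
    # In example.py this would wrap the response returned by the pipeline.
    print_on_rank_zero(f"rank {get_local_rank()} is reporting the response")
```

With this guard, a two-GPU run emits a single line from rank 0 and nothing from rank 1.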