Non-Persistent Pipelines
========================

A non-persistent pipeline can be created with the :func:`mii.pipeline` API. This
returns a non-persistent :class:`MIIPipeline` object that is destroyed when the
Python script exits.

MIIPipeline
-----------

.. autoclass:: mii.batching.ragged_batching.MIIPipeline

    .. automethod:: __call__

:class:`MIIPipeline` is a callable class that provides a simplified interface for
generating text for prompt inputs. To create a pipeline, you only need to provide
the HuggingFace model name (or the path to a locally stored model) to the
:func:`mii.pipeline` API. DeepSpeed-MII will automatically load the model
weights, create an inference engine, and return the callable pipeline. A simple
4-line example is provided below:

.. code-block:: python

    import mii
    pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
    response = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
    print(response)

Pipeline Configuration
----------------------

While we prioritize offering a simple interface to load models and run
text generation, we also provide many configuration options for users who want
to customize the pipeline.

**Any of the fields in** :class:`ModelConfig` **can be passed as keyword
arguments or in a** ``model_config`` **dictionary to the** :func:`mii.pipeline`
**API. Please see** :ref:`Model Configuration` **for more information.**

Generate Options
----------------

The text generation of the callable :class:`MIIPipeline` class can be modified
with several keyword arguments. A full list of the available options can be
found in :class:`GenerateParamsConfig`.

The generate options affect only the prompt(s) passed in a given call to the
pipeline. For example, you can set a different generation length for each call:

.. code-block:: python

    # `pipe` is a pipeline created with mii.pipeline(); `prompt` is a str or a list of str
    response_long = pipe(prompt, max_length=1024)
    response_short = pipe(prompt, max_length=128)

.. _pipeline_model_parallelism:

Model Parallelism
-----------------

Our pipeline object supports splitting models across multiple GPUs using tensor
parallelism. You must use the ``deepspeed`` launcher to enable tensor
parallelism with the non-persistent pipeline, where the number of devices is
controlled by the ``--num_gpus`` option.

As an example, consider the following ``example.py`` Python script:

.. code-block:: python

    # example.py
    import mii
    pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")

To run this pipeline on a single GPU, use ``python`` or ``deepspeed --num_gpus 1``:

.. code-block:: console

    (.venv) $ python example.py

To enable tensor parallelism across 2 GPUs, use ``deepspeed --num_gpus 2``:

.. code-block:: console

    (.venv) $ deepspeed --num_gpus 2 example.py

Because the ``deepspeed`` launcher runs one process of ``example.py`` per GPU,
everything in the script is executed by every process. For example, consider the
following script:

.. code-block:: python

    # example.py
    import os
    import mii
    local_rank = int(os.getenv("LOCAL_RANK", 0))
    pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
    response = pipe("DeepSpeed is", max_length=16)
    print(f"rank {local_rank} response: {response}")

By default, the response is returned only to the rank 0 process. When run with
``deepspeed --num_gpus 2 example.py``, the following output is produced:

.. code-block:: console

    (.venv) $ deepspeed --num_gpus 2 example.py
    rank 0 response: [a library for parallelizing and accelerating PyTorch.]
    rank 1 response: []

This behavior can be changed by enabling ``all_rank_output`` when creating the
pipeline (i.e., ``pipe = mii.pipeline("mistralai/Mistral-7B-v0.1", all_rank_output=True)``):

.. code-block:: console

    (.venv) $ deepspeed --num_gpus 2 example.py
    rank 0 response: [a library for parallelizing and accelerating PyTorch.]
    rank 1 response: [a library for parallelizing and accelerating PyTorch.]
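
Because every process started by the ``deepspeed`` launcher executes the whole
script, it can be convenient to restrict printing or post-processing to a single
process. The following is a minimal sketch and not part of the MII API:
``is_main_rank`` and ``print_on_main_rank`` are hypothetical helper names, and
the sketch assumes, as in the script above, that the launcher sets the
``LOCAL_RANK`` environment variable and that a missing variable (plain
``python``) means rank 0.

.. code-block:: python

    import os


    def is_main_rank() -> bool:
        """Return True when this process is local rank 0 (hypothetical helper)."""
        # LOCAL_RANK is assumed to be set by the launcher; default to 0 otherwise.
        return int(os.getenv("LOCAL_RANK", "0")) == 0


    def print_on_main_rank(*args, **kwargs) -> bool:
        """Print only from local rank 0 and report whether anything was printed."""
        if is_main_rank():
            print(*args, **kwargs)
            return True
        return False

In ``example.py``, calling ``print_on_main_rank(response)`` instead of ``print``
would suppress the empty output from the other ranks on a single node. Note that
``LOCAL_RANK`` is per node, so a multi-node run would still print once per node.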