Configuration

The config classes described here are used to customize Non-Persistent Pipelines and Persistent Deployments.

Model Configuration

The ModelConfig is used to stand up a DeepSpeed inference engine and provides a large amount of control to users. This class is automatically generated from user-provided arguments to mii.pipeline() and mii.serve(). The fields can be provided in a model_config dictionary or as keyword arguments.

For example, to change the default max_length for token generation of a pipeline, the following are equivalent:

As a keyword argument:

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1", max_length=2048)

As a model_config dictionary:

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1", model_config={"max_length": 2048})

class mii.config.ModelConfig[source]

model_name_or_path: str [Required]: Name or path of the HuggingFace model to be deployed.

tokenizer: Union[str, MIITokenizerWrapper, None] = None: Tokenizer wrapped with MIITokenizerWrapper, or the name or path of the HuggingFace tokenizer to be used.

task: Optional[TaskType] = TaskType.TEXT_GENERATION: Name of the task to be performed by the model.

tensor_parallel: int = 1: Tensor parallelism to use for a model (i.e., how many GPUs to shard a model across). This defaults to the WORLD_SIZE environment variable, or a value of 1 if that variable is not set. This value is also propagated to the inference_engine_config.
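
The fallback described for tensor_parallel can be sketched as follows. This is a minimal illustration of the documented default, not MII's own code; the helper name default_tensor_parallel is hypothetical.

```python
import os


def default_tensor_parallel(env=None):
    """Return WORLD_SIZE as an int, or 1 when the variable is not set.

    Hypothetical helper: it only mirrors the default described in the text.
    """
    env = os.environ if env is None else env
    return int(env.get("WORLD_SIZE", "1"))


# With no WORLD_SIZE set, the model is not sharded.
print(default_tensor_parallel({}))                   # 1
# With WORLD_SIZE=4, the model would be sharded across 4 GPUs.
print(default_tensor_parallel({"WORLD_SIZE": "4"}))  # 4
```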

quantization_mode: Optional[str] = None

The quantization mode in string format. The supported modes are as follows:

'wf6af16': weight-only quantization with FP6 weights and FP16 activations.

inference_engine_config: RaggedInferenceEngineConfig = {}: DeepSpeed inference engine config. This is automatically generated, but you can provide a set of custom configs.

torch_dist_port: int = 29500: Torch distributed port to be used. This also serves as a base port when multiple replicas are deployed. For example, if there are 2 replicas, the first will use port 29500 and the second will use port 29600.
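
The base-port behavior can be illustrated with a short sketch. The stride of 100 between replicas is inferred from the 29500/29600 example above and is an assumption, as is the helper name replica_torch_dist_ports.

```python
def replica_torch_dist_ports(base_port=29500, replica_num=1, stride=100):
    """List the torch distributed port assumed for each replica.

    The stride of 100 is inferred from the example in the text,
    not taken from MII's source.
    """
    return [base_port + stride * i for i in range(replica_num)]


print(replica_torch_dist_ports(29500, replica_num=2))  # [29500, 29600]
```

When choosing a custom torch_dist_port, it is worth checking that none of the derived ports collides with another service on the host.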

zmq_port_number: int = 25555: Port number to use for the ZMQ communication (for broadcasting requests and responses among all ranks in ragged batching).

replica_num: int = 1

Number of model replicas. Enables easy data parallelism.

Constraints:

gt = 0

replica_configs: List[ReplicaConfig] = []: Configuration details for each replica. This will be automatically generated, but you can provide a set of custom configs.

device_map: Union[Literal['auto'], Dict[str, List[List[int]]]] = 'auto': GPU indices a model is deployed on. Note that CUDA_VISIBLE_DEVICES does not work with DeepSpeed-MII.
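
A custom device_map might look like the sketch below. It assumes that each key is a hostname and that each inner list holds the GPU indices for one replica on that host; this is one reading of the type signature, not something stated above. The function check_device_map and the hostname worker-0 are hypothetical.

```python
def check_device_map(device_map, tensor_parallel, replica_num):
    """Sanity-check a device_map under the assumed interpretation.

    Assumption: {hostname: [[gpu indices for one replica], ...]}.
    """
    replicas = [gpus for per_host in device_map.values() for gpus in per_host]
    if len(replicas) != replica_num:
        raise ValueError(f"expected {replica_num} replicas, got {len(replicas)}")
    for gpus in replicas:
        if len(gpus) != tensor_parallel:
            raise ValueError(f"replica {gpus} does not use {tensor_parallel} GPUs")
    for host, per_host in device_map.items():
        flat = [g for gpus in per_host for g in gpus]
        if len(flat) != len(set(flat)):
            raise ValueError(f"a GPU index is reused on host {host}")
    return True


# Two replicas on one host, each sharded across two GPUs.
device_map = {"worker-0": [[0, 1], [2, 3]]}
check_device_map(device_map, tensor_parallel=2, replica_num=2)
```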

max_length: Optional[int] = None: The maximum number of tokens DeepSpeed-Inference can work with, including the input and output tokens.

sync_debug: bool = False: Inserts additional synchronization points for debugging purposes.

profile_model_time: bool = False: Log performance information about model inference with very little overhead.

property provider: ModelProvider

MII Server Configuration

The MIIConfig is used to stand up a DeepSpeed-MII gRPC server and provides a large amount of control to users. This class is automatically generated from user-provided arguments to mii.serve(). The fields can be provided in a mii_config dictionary or as keyword arguments.

For example, to change the base port number used to communicate with a persistent deployment and the default max_length for token generation, the following are equivalent:

As keyword arguments:

client = mii.serve("mistralai/Mistral-7B-v0.1", port_number=50055, max_length=2048)

As model_config and mii_config dictionaries:

client = mii.serve("mistralai/Mistral-7B-v0.1", mii_config={"port_number": 50055}, model_config={"max_length": 2048})
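
One way to picture the equivalence is a routine that routes each keyword argument to the config whose field list contains it. The sketch below is purely illustrative and is not MII's implementation; split_kwargs and the two field sets (partial lists taken from this page) are hypothetical.

```python
# Partial field lists copied from this page; illustrative only.
MODEL_CONFIG_FIELDS = {"tokenizer", "task", "tensor_parallel", "replica_num",
                       "device_map", "max_length", "torch_dist_port"}
MII_CONFIG_FIELDS = {"deployment_name", "deployment_type", "port_number",
                     "enable_restful_api", "restful_api_port", "hostfile"}


def split_kwargs(kwargs):
    """Split flat keyword arguments into (model_config, mii_config) dicts."""
    model_config, mii_config = {}, {}
    for key, value in kwargs.items():
        if key in MODEL_CONFIG_FIELDS:
            model_config[key] = value
        elif key in MII_CONFIG_FIELDS:
            mii_config[key] = value
        else:
            raise ValueError(f"unknown config field: {key}")
    return model_config, mii_config


model_config, mii_config = split_kwargs({"port_number": 50055, "max_length": 2048})
print(model_config)  # {'max_length': 2048}
print(mii_config)    # {'port_number': 50055}
```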

class mii.config.MIIConfig[source]

deployment_name: str = '': Name of the deployment. Used as an identifier for obtaining an inference server client and posting queries. Automatically generated if it is not provided.

deployment_type: DeploymentType = DeploymentType.LOCAL: One of the values of the mii.DeploymentType enum. LOCAL uses a gRPC server to create a local deployment. AML generates the assets necessary to deploy on AML resources.

model_conf: ModelConfig [Required] (alias 'model_config'): Configuration for the deployed model(s).

port_number: int = 50050: Port number to use for the load balancer process.

enable_restful_api: bool = False: Enables a RESTful API that can be queried via the HTTP POST method.

restful_api_host: str = 'localhost': Hostname to use for the RESTful API.

restful_api_port: int = 51080: Port number to use for the RESTful API.

restful_processes: int = 32

Number of processes to use for the RESTful API.

Constraints:

ge = 1
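
With enable_restful_api set, a query could be built as in the sketch below. The URL path /mii/<deployment_name> and the "prompts" payload key are assumptions that should be checked against the installed MII version; build_restful_request and the deployment name mistral-deployment are hypothetical. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_restful_request(deployment_name, prompts, host="localhost",
                          port=51080, **generate_kwargs):
    """Build (but do not send) an HTTP POST request for the RESTful API.

    Assumed endpoint layout: http://<host>:<port>/mii/<deployment_name>.
    """
    url = f"http://{host}:{port}/mii/{deployment_name}"
    payload = {"prompts": prompts, **generate_kwargs}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_restful_request("mistral-deployment", ["DeepSpeed is"], max_length=128)
print(req.get_method(), req.full_url)
# Sending requires a running deployment:
# with urllib.request.urlopen(req) as response:
#     print(response.read().decode("utf-8"))
```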

hostfile: str = '/job/hostfile': DeepSpeed hostfile. Will be autogenerated if None is provided.
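
A DeepSpeed hostfile conventionally lists one host per line in the form "hostname slots=N", where N is the number of GPUs available on that host. The parser below is a minimal sketch of that format for illustration; parse_hostfile and the worker-1/worker-2 hostnames are hypothetical, and it is not the parser DeepSpeed itself uses.

```python
def parse_hostfile(text):
    """Parse 'hostname slots=N' lines into a {hostname: slot_count} dict."""
    resources = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        host, slots = line.split()
        key, value = slots.split("=")
        if key != "slots":
            raise ValueError(f"malformed hostfile line: {raw_line!r}")
        resources[host] = int(value)
    return resources


example = "worker-1 slots=4\nworker-2 slots=4\n"
print(parse_hostfile(example))  # {'worker-1': 4, 'worker-2': 4}
```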

version: int = 1: Version number to pass to AML deployments.

instance_type: str = 'Standard_NC12s_v3': AML instance type to use when creating AML deployment assets.

generate_replica_configs()[source]

Return type: None

Text-Generation Configuration

The GenerateParamsConfig is used to process user-provided keyword arguments passed to MIIPipeline and MIIClient when doing text-generation.

class mii.config.GenerateParamsConfig[source]

Options for changing text-generation behavior.

max_length: int = 1024: Maximum length of input_tokens + generated_tokens.

max_new_tokens: Optional[int] = None: Maximum number of new tokens generated. max_length takes precedence.

min_new_tokens: int = 0: Minimum number of new tokens generated.
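
The interaction between max_length and max_new_tokens can be sketched as a token budget. This is one reading of "max_length takes precedence", namely that the total length cap always applies and max_new_tokens can only tighten it; the helper new_token_budget is hypothetical and MII may resolve the two values differently.

```python
def new_token_budget(prompt_len, max_length=1024, max_new_tokens=None):
    """Return how many new tokens may be generated for a prompt.

    Assumption: max_length bounds input + output tokens, and
    max_new_tokens applies only within that bound.
    """
    budget = max(max_length - prompt_len, 0)
    if max_new_tokens is not None:
        budget = min(budget, max_new_tokens)
    return budget


print(new_token_budget(10, max_length=1024, max_new_tokens=128))    # 128
print(new_token_budget(1000, max_length=1024, max_new_tokens=128))  # 24
```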

stream: bool = False: Enable streaming output.

ignore_eos: bool = False: Ignore EoS token and continue generating text until we reach max_length or max_new_tokens.

return_full_text: bool = False: Prepends the input prompt to the generated text.

do_sample: bool = True: When False, use greedy decoding instead of sampling.

top_p: float = 0.9

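Top P (nucleus sampling) value: sampling is restricted to the smallest set of tokens whose cumulative probability reaches top_p.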

Constraints:

gt = 0
le = 1

top_k: Optional[int] = None

Top K value: sampling is restricted to the top_k most probable tokens.

Constraints:

gt = 0

temperature: Optional[float] = None

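Temperature value used to scale the token distribution before sampling. Values below 1 sharpen the distribution; values above 1 flatten it.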

Constraints:

gt = 0
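
The constraints listed for top_p, top_k, and temperature can be checked before a request is sent, as in the sketch below. It mirrors only the bounds documented above; validate_sampling_params is a hypothetical helper, not part of MII.

```python
def validate_sampling_params(top_p=0.9, top_k=None, temperature=None):
    """Raise ValueError if a sampling parameter violates its documented bound."""
    if not (0 < top_p <= 1):
        raise ValueError("top_p must satisfy 0 < top_p <= 1")
    if top_k is not None and top_k <= 0:
        raise ValueError("top_k must be greater than 0")
    if temperature is not None and temperature <= 0:
        raise ValueError("temperature must be greater than 0")
    return {"top_p": top_p, "top_k": top_k, "temperature": temperature}


params = validate_sampling_params(top_p=0.95, top_k=50, temperature=0.7)
print(params)  # {'top_p': 0.95, 'top_k': 50, 'temperature': 0.7}
```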

stop: List[str] = []: List of strings to stop generation at.