Configuration
The config classes described here are used to customize Non-Persistent Pipelines and Persistent Deployments.
Model Configuration
The ModelConfig is used to stand up a
DeepSpeed inference engine and provides a large amount of control to users. This
class is automatically generated from user-provided arguments to
mii.pipeline() and mii.serve(). The fields can be provided in a
model_config dictionary or as keyword arguments.
For example, to change the default max_length for token generation of a
pipeline, the following are equivalent:
As a keyword argument:
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1", max_length=2048)
As a model_config dictionary:
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1", model_config={"max_length": 2048})
- class mii.config.ModelConfig
-
model_name_or_path:
str[Required] Name or path of the HuggingFace model to be deployed.
-
tokenizer:
Union[str,MIITokenizerWrapper,None] = None Tokenizer wrapped with MIITokenizerWrapper, or the name or path of the HuggingFace tokenizer to be used.
-
tensor_parallel:
int= 1 Tensor parallelism to use for a model (i.e., how many GPUs to shard a model across). This defaults to the WORLD_SIZE environment variable, or a value of 1 if that variable is not set. This value is also propagated to the inference_engine_config.
-
quantization_mode:
Optional[str] = None The quantization mode in string format. The supported modes are as follows:
* 'wf6af16': weight-only quantization with FP6 weights and FP16 activations.
-
inference_engine_config:
RaggedInferenceEngineConfig= {} DeepSpeed inference engine config. This is automatically generated, but you can provide a set of custom configs.
-
torch_dist_port:
int= 29500 Torch distributed port to be used. This also serves as a base port when multiple replicas are deployed. For example, if there are 2 replicas, the first will use port 29500 and the second will use port 29600.
-
zmq_port_number:
int= 25555 Port number to use for the ZMQ communication (for broadcasting requests and responses among all ranks in ragged batching).
-
replica_configs:
List[ReplicaConfig] = [] Configuration details for each replica. This will be automatically generated, but you can provide a set of custom configs.
-
device_map:
Union[Literal['auto'],Dict[str,List[List[int]]]] = 'auto' GPU indices a model is deployed on. Note that CUDA_VISIBLE_DEVICES does not work with DeepSpeed-MII.
-
max_length:
Optional[int] = None The maximum number of tokens DeepSpeed-Inference can work with, including the input and output tokens.
-
profile_model_time:
bool= False Log performance information about model inference with very little overhead.
- property provider: ModelProvider
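The documented defaults for tensor_parallel and torch_dist_port can be sketched in plain Python. This is an illustration only, not MII's implementation: the helper names default_tensor_parallel and replica_torch_dist_ports are hypothetical, and the port stride of 100 is an assumption inferred solely from the 29500/29600 example above.

```python
import os


def default_tensor_parallel(env=None):
    """Return WORLD_SIZE as an int, or 1 when the variable is not set."""
    env = os.environ if env is None else env
    return int(env.get("WORLD_SIZE", "1"))


def replica_torch_dist_ports(num_replicas, base_port=29500, stride=100):
    """List one torch distributed port per replica, starting at base_port.

    The stride of 100 is an assumption taken from the documented example
    (first replica on 29500, second on 29600).
    """
    return [base_port + i * stride for i in range(num_replicas)]


print(default_tensor_parallel({"WORLD_SIZE": "4"}))
print(default_tensor_parallel({}))
print(replica_torch_dist_ports(2))
```

Passing a dictionary in place of os.environ keeps the sketch easy to check without touching the real environment.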
MII Server Configuration
The MIIConfig is used to stand up a
DeepSpeed-MII gRPC server and provide a large amount of
control to users. This class is automatically generated from user-provided
arguments to mii.serve(). The fields can be provided in a mii_config
dictionary or as keyword arguments.
For example, to change the base port number used to communicate with a
persistent deployment and the default max_length for token generation, the
following are equivalent:
As keyword arguments:
client = mii.serve("mistralai/Mistral-7B-v0.1", port_number=50055, max_length=2048)
As model_config and mii_config dictionaries:
client = mii.serve("mistralai/Mistral-7B-v0.1", mii_config={"port_number": 50055}, model_config={"max_length": 2048})
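To make the equivalence concrete, here is a minimal sketch of how flat keyword arguments could be routed into the two dictionaries. It is not MII's actual code: split_serve_kwargs is a hypothetical helper, the field sets are copied from this page only (the real classes may define more), and rejecting a field supplied both ways is an assumption, since this page does not say which value would win.

```python
# Field names copied from this page; the real config classes may define more.
MODEL_CONFIG_FIELDS = {
    "tokenizer", "tensor_parallel", "quantization_mode",
    "inference_engine_config", "torch_dist_port", "zmq_port_number",
    "replica_configs", "device_map", "max_length", "profile_model_time",
}
MII_CONFIG_FIELDS = {
    "deployment_name", "deployment_type", "enable_restful_api", "port_number",
}


def split_serve_kwargs(model_config=None, mii_config=None, **kwargs):
    """Route flat keyword arguments into model_config and mii_config dicts."""
    model_config = dict(model_config or {})
    mii_config = dict(mii_config or {})
    for key, value in kwargs.items():
        if key in MODEL_CONFIG_FIELDS:
            target = model_config
        elif key in MII_CONFIG_FIELDS:
            target = mii_config
        else:
            raise TypeError(f"unknown config field: {key!r}")
        if key in target:
            # Assumption: a field given both ways is treated as an error.
            raise ValueError(f"{key!r} given as keyword and in a dictionary")
        target[key] = value
    return model_config, mii_config


as_kwargs = split_serve_kwargs(port_number=50055, max_length=2048)
as_dicts = split_serve_kwargs(mii_config={"port_number": 50055},
                              model_config={"max_length": 2048})
print(as_kwargs == as_dicts)
```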
- class mii.config.MIIConfig
-
deployment_name:
str= '' Name of the deployment. Used as an identifier for obtaining an inference server client and posting queries. Automatically generated if it is not provided.
-
deployment_type:
DeploymentType= DeploymentType.LOCAL One of the enum mii.DeploymentTypes:
* LOCAL uses a gRPC server to create a local deployment.
* AML will generate the assets necessary to deploy on AML resources.
-
model_conf:
ModelConfig[Required] (alias 'model_config') Configuration for the deployed model(s).
-
enable_restful_api:
bool= False Enables a RESTful API that can be queried via the HTTP POST method.
Text-Generation Configuration
The GenerateParamsConfig is used to
process user-provided keyword arguments passed to MIIPipeline and MIIClient when doing text-generation.
- class mii.config.GenerateParamsConfig
Options for changing text-generation behavior.
-
max_new_tokens:
Optional[int] = None Maximum number of new tokens generated. max_length takes precedence.
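One possible reading of how max_length and max_new_tokens interact is sketched below. This is an assumption, not MII's implementation: resolve_max_new_tokens is a hypothetical helper, and it treats max_length as a hard cap on prompt tokens plus generated tokens, so it wins whenever the two limits conflict.

```python
def resolve_max_new_tokens(prompt_length, max_length=None, max_new_tokens=None):
    """Return how many new tokens may be generated, or None if unbounded.

    Assumption: max_length counts input and output tokens together, so the
    room left after the prompt caps any larger max_new_tokens request.
    """
    limits = []
    if max_length is not None:
        if prompt_length >= max_length:
            raise ValueError("prompt already fills max_length")
        limits.append(max_length - prompt_length)
    if max_new_tokens is not None:
        limits.append(max_new_tokens)
    return min(limits) if limits else None


# A 100-token prompt with max_length=2048 leaves room for 1948 new tokens,
# even though 4096 were requested.
print(resolve_max_new_tokens(100, max_length=2048, max_new_tokens=4096))
```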