Specialized Configs for Triton Backends

The Python API provides specialized configuration classes that help provide only available options for the given type of model.

model_navigator.triton.BaseSpecializedModelConfig `dataclass`

BaseSpecializedModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=lambda: [](), parameters=lambda: {}(), response_cache=False, warmup=lambda: {}(), inputs=lambda: [](), outputs=lambda: []())

Bases: ABC

Common fields for specialized model configs.

Read more in Triton Inference server documentation

Parameters:

max_batch_size (int, default: 4 ) –

The maximal batch size that would be handled by model.
batching (bool, default: True ) –

Flag to enable/disable batching for model.
default_model_filename (Optional[str], default: None ) –

Optional filename of the model file to use.
batcher (Union[DynamicBatcher, SequenceBatcher], default: DynamicBatcher() ) –

Configuration of Dynamic Batching for the model.
instance_groups (List[InstanceGroup], default: lambda: []() ) –

Instance groups configuration for multiple instances of the model
parameters (Dict[str, str], default: lambda: {}() ) –

Custom parameters for model or backend
response_cache (bool, default: False ) –

Flag to enable/disable response cache for the model
warmup (Dict[str, ModelWarmup], default: lambda: {}() ) –

Warmup configuration for model

backend `abstractmethod` `property`

backend

Backend property that has to be overridden by specialized configs.

custom_fields `property`

custom_fields

Custom fields that are configured as parameters.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/base_model_config.py

def __post_init__(self) -> None:
    """Validate the configuration for early error handling."""
    if self.batching and self.max_batch_size <= 0:
        raise ModelNavigatorWrongParameterError("The `max_batch_size` must be greater or equal to 1.")

    if type(self.batcher) not in [DynamicBatcher, SequenceBatcher]:
        raise ModelNavigatorWrongParameterError("Unsupported batcher type provided.")

    if self.backend != Backend.TensorRT and any(group.profile for group in self.instance_groups):
        raise ModelNavigatorWrongParameterError(
            "Invalid `profile` option. The value can be set only for `backend=Backend.TensorRT`"
        )

model_navigator.triton.ONNXModelConfig `dataclass`

ONNXModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=lambda: [](), parameters=lambda: {}(), response_cache=False, warmup=lambda: {}(), inputs=lambda: [](), outputs=lambda: [](), platform=None, optimization=None)

Bases: BaseSpecializedModelConfig

Specialized model config for ONNX backend supported model.

Parameters:

platform (Optional[Platform], default: None ) –

Override backend parameter with platform. Possible options: Platform.ONNXRuntimeONNX
optimization (Optional[ONNXOptimization], default: None ) –

Possible optimization for ONNX models

backend `property`

backend

Define backend value for config.

custom_fields `property`

custom_fields

Custom fields that are configured as parameters.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/onnx_model_config.py

def __post_init__(self):
    """Validate the configuration for early error handling."""
    super().__post_init__()
    if self.optimization and not isinstance(self.optimization, ONNXOptimization):
        raise ModelNavigatorWrongParameterError("Unsupported optimization type provided.")

    if self.platform and self.platform != Platform.ONNXRuntimeONNX:
        raise ModelNavigatorWrongParameterError(f"Unsupported platform provided. Use: {Platform.ONNXRuntimeONNX}.")

model_navigator.triton.ONNXOptimization `dataclass`

ONNXOptimization(accelerator)

ONNX possible optimizations.

Parameters:

accelerator (Union[OpenVINOAccelerator, TensorRTAccelerator]) –

Execution accelerator for model

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/onnx_model_config.py

def __post_init__(self):
    """Validate the configuration for early error handling."""
    if self.accelerator and type(self.accelerator) not in [OpenVINOAccelerator, TensorRTAccelerator]:
        raise ModelNavigatorWrongParameterError("Unsupported accelerator type provided.")

model_navigator.triton.PythonModelConfig `dataclass`

PythonModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=lambda: [](), parameters=lambda: {}(), response_cache=False, warmup=lambda: {}(), inputs=lambda: [](), outputs=lambda: []())

Bases: BaseSpecializedModelConfig

Specialized model config for Python backend supported model.

Parameters:

inputs (Sequence[InputTensorSpec], default: lambda: []() ) –

Required definition of model inputs
outputs (Sequence[OutputTensorSpec], default: lambda: []() ) –

Required definition of model outputs

backend `property`

backend

Define backend value for config.

custom_fields `property`

custom_fields

Custom fields that are configured as parameters.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/python_model_config.py

def __post_init__(self) -> None:
    """Validate the configuration for early error handling."""
    super().__post_init__()
    assert len(self.inputs) > 0, "Model inputs definition is required for Python backend."
    assert len(self.outputs) > 0, "Model outputs definition is required for Python backend."

model_navigator.triton.PyTorchModelConfig `dataclass`

PyTorchModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=lambda: [](), parameters=lambda: {}(), response_cache=False, warmup=lambda: {}(), inputs=lambda: [](), outputs=lambda: [](), platform=None)

Bases: BaseSpecializedModelConfig

Specialized model config for PyTorch backend supported model.

Parameters:

platform (Optional[Platform], default: None ) –

Override backend parameter with platform. Possible options: Platform.PyTorchLibtorch
inputs (Sequence[InputTensorSpec], default: lambda: []() ) –

Required definition of model inputs
outputs (Sequence[OutputTensorSpec], default: lambda: []() ) –

Required definition of model outputs

backend `property`

backend

Define backend value for config.

custom_fields `property`

custom_fields

Custom fields that are configured as parameters.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/pytorch_model_config.py

def __post_init__(self) -> None:
    """Validate the configuration for early error handling."""
    super().__post_init__()
    assert len(self.inputs) > 0, "Model inputs definition is required for PyTorch backend."
    assert len(self.outputs) > 0, "Model outputs definition is required for PyTorch backend."

    if self.platform and self.platform != Platform.PyTorchLibtorch:
        raise ModelNavigatorWrongParameterError(f"Unsupported platform provided. Use: {Platform.PyTorchLibtorch}.")

model_navigator.triton.TensorFlowModelConfig `dataclass`

TensorFlowModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=lambda: [](), parameters=lambda: {}(), response_cache=False, warmup=lambda: {}(), inputs=lambda: [](), outputs=lambda: [](), platform=None, optimization=None)

Bases: BaseSpecializedModelConfig

Specialized model config for TensorFlow backend supported model.

Parameters:

platform (Optional[Platform], default: None ) –

Override backend parameter with platform. Possible options: Platform.TensorFlowSavedModel, Platform.TensorFlowGraphDef
optimization (Optional[TensorFlowOptimization], default: None ) –

Possible optimization for TensorFlow models

backend `property`

backend

Define backend value for config.

custom_fields `property`

custom_fields

Custom fields that are configured as parameters.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/tensorflow_model_config.py

def __post_init__(self):
    """Validate the configuration for early error handling."""
    super().__post_init__()
    if self.optimization and not isinstance(self.optimization, TensorFlowOptimization):
        raise ModelNavigatorWrongParameterError("Unsupported optimization type provided.")

    platforms = [Platform.TensorFlowSavedModel, Platform.TensorFlowGraphDef]
    if self.platform and self.platform not in platforms:
        raise ModelNavigatorWrongParameterError(f"Unsupported platform provided. Use one of: {platforms}")

model_navigator.triton.TensorFlowOptimization `dataclass`

TensorFlowOptimization(accelerator)

TensorFlow possible optimizations.

Parameters:

accelerator (Union[AutoMixedPrecisionAccelerator, GPUIOAccelerator, TensorRTAccelerator]) –

Execution accelerator for model

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/tensorflow_model_config.py

def __post_init__(self):
    """Validate the configuration for early error handling."""
    if self.accelerator and type(self.accelerator) not in [
        AutoMixedPrecisionAccelerator,
        GPUIOAccelerator,
        TensorRTAccelerator,
    ]:
        raise ModelNavigatorWrongParameterError("Unsupported accelerator type provided.")

model_navigator.triton.TensorRTModelConfig `dataclass`

TensorRTModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=lambda: [](), parameters=lambda: {}(), response_cache=False, warmup=lambda: {}(), inputs=lambda: [](), outputs=lambda: [](), platform=None, optimization=None)

Bases: BaseSpecializedModelConfig

Specialized model config for TensorRT platform supported model.

Parameters:

platform (Optional[Platform], default: None ) –

Override backend parameter with platform. Possible options: Platform.TensorRTPlan
optimization (Optional[TensorRTOptimization], default: None ) –

Possible optimization for TensorRT models

backend `property`

backend

Define backend value for config.

custom_fields `property`

custom_fields

Custom fields that are configured as parameters.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/tensorrt_model_config.py

def __post_init__(self):
    """Validate the configuration for early error handling."""
    super().__post_init__()
    if self.optimization and not isinstance(self.optimization, TensorRTOptimization):
        raise ModelNavigatorWrongParameterError("Unsupported optimization type provided.")

    if self.platform and self.platform != Platform.TensorRTPlan:
        raise ModelNavigatorWrongParameterError(f"Unsupported platform provided. Use: {Platform.TensorRTPlan}.")

model_navigator.triton.TensorRTOptimization `dataclass`

TensorRTOptimization(cuda_graphs=False, gather_kernel_buffer_threshold=None, eager_batching=False)

TensorRT possible optimizations.

Parameters:

cuda_graphs (bool, default: False ) –

Use CUDA graphs API to capture model operations and execute them more efficiently.
gather_kernel_buffer_threshold (Optional[int], default: None ) –

The backend may use a gather kernel to gather input data if the device has direct access to the source buffer and the destination buffer.
eager_batching (bool, default: False ) –

Start preparing the next batch before the model instance is ready for the next inference.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/tensorrt_model_config.py

def __post_init__(self):
    """Validate the configuration for early error handling."""
    if not self.cuda_graphs and not self.gather_kernel_buffer_threshold and not self.eager_batching:
        raise ModelNavigatorWrongParameterError("At least one of the optimization options should be enabled.")

model_navigator.triton.TensorRTLLMModelConfig `dataclass`

TensorRTLLMModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=lambda: [](), parameters=lambda: {}(), response_cache=False, warmup=lambda: {}(), inputs=lambda: [](), outputs=lambda: [](), encoder_dir=None, max_beam_width=None, batching_strategy=BatchingStrategy.INFLIGHT, batch_scheduler_policy=BatchSchedulerPolicy.MAX_UTILIZATION, decoding_mode=None, gpu_device_ids=lambda: [](), gpu_weights_percent=None, kv_cache_config=None, peft_cache_config=None, enable_chunked_context=None, normalize_log_probs=None, cancellation_check_period_ms=None, stats_check_period_ms=None, request_stats_max_iterations=None, iter_stats_max_iterations=None, exclude_input_in_output=None, medusa_choices=None, _engine_dir=None)

Bases: BaseSpecializedModelConfig

Specialized model config for TensorRT-LLM platform supported model.

Adapted from TensorRT-LLM config: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxts

Relevant TensorRT-LLM classes: - ExecutorConfig: https://nvidia.github.io/TensorRT-LLM/_cpp_gen/executor.html#_CPPv4N12tensorrt_llm8executor14ExecutorConfigE - KVCacheConfig: https://nvidia.github.io/TensorRT-LLM/_cpp_gen/executor.html#_CPPv4N12tensorrt_llm8executor13KvCacheConfigE - PeftCacheConfig: https://nvidia.github.io/TensorRT-LLM/_cpp_gen/executor.html#_CPPv4N12tensorrt_llm8executor15PeftCacheConfigE

Parameters:

engine_dir –

Path to the TensorRT engine directory.
encoder_dir (Optional[Path], default: None ) –

Path to the encoder model directory.
max_beam_width (Optional[int], default: None ) –

Maximal size of each beam in beam search.
batching_strategy (BatchingStrategy, default: INFLIGHT ) –

Batching strategy for model.
batch_scheduler_policy (BatchSchedulerPolicy, default: MAX_UTILIZATION ) –

Batching scheduler policy for model.
decoding_mode (Optional[DecodingMode], default: None ) –

Decoding mode for model.
gpu_device_ids (List[int], default: lambda: []() ) –

List of GPU devices on which model is running.
gpu_weights_percent (Optional[float], default: None ) –

The percentage of GPU memory fraction that should be allocated for weights.
kv_cache_config (Optional[KVCacheConfig], default: None ) –

KV cache config for model.
peft_cache_config (Optional[PeftCacheConfig], default: None ) –

Peft cache config for model.
enable_chunked_context (Optional[bool], default: None ) –

Enable chunked context for model
normalize_log_probs (Optional[bool], default: None ) –

Controls if log probabilities should be normalized or not.
cancellation_check_period_ms (Optional[int], default: None ) –

The request cancellation period check in ms.
stats_check_period_ms (Optional[int], default: None ) –

The statistics checking period in ms.
request_stats_max_iterations (Optional[int], default: None ) –

Controls the maximum number of iterations for which to keep per-request statistics.
iter_stats_max_iterations (Optional[int], default: None ) –

Controls the maximum number of iterations for which to keep statistics.
exclude_input_in_output (Optional[bool], default: None ) –

Controls if output tokens in Result should include the input tokens. Default is false.
medusa_choices (Optional[Union[List[int], List[List[int]], List[Tuple[int]]]], default: None ) –

Medusa choices as in https://github.com/FasterDecoding/Medusa/blob/main/medusa/model/medusa_choices.py

backend `property`

backend

Define backend value for config.

custom_fields `property`

custom_fields

Custom fields that are configured as parameters.

engine_dir `property` `writable`

engine_dir

Engined directory path.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/tensorrt_llm_model_config.py

def __post_init__(self):
    """Validate the configuration for early error handling."""
    super().__post_init__()
    self._validate_config()
    self._initialize_instance_groups()
    self._initialize_inputs()
    self._initialize_outputs()
    self._initialize_parameters()

model_navigator.triton.BatchingStrategy

Bases: Enum

Define the supported batch strategies.

model_navigator.triton.BatchSchedulerPolicy

Bases: Enum

Define the supported batch scheduler policies.

model_navigator.triton.DecodingMode

Bases: Enum

Define the supported decoding modes.

model_navigator.triton.KVCacheConfig `dataclass`

KVCacheConfig(enable_block_reuse=None, max_tokens=None, sink_token_length=None, max_attention_window=None, free_gpu_memory_fraction=None, host_cache_size=None, onboard_blocks=None)

Configuration of KV cache in TRT-LLM.

More: https://nvidia.github.io/TensorRT-LLM/_cpp_gen/executor.html#_CPPv4N12tensorrt_llm8executor13KvCacheConfigE

Parameters:

enable_block_reuse (Optional[bool], default: None ) –

Controls if KV cache blocks can be reused for different requests.
max_tokens (Optional[int], default: None ) –

The maximum number of tokens that should be stored in the KV cache If both max_tokens and free_gpu_memory_fraction are specified, memory corresponding to the minimum will be allocated.
sink_token_length (Optional[int], default: None ) –

Number of sink tokens (tokens to always keep in attention window)
max_attention_window (Optional[int], default: None ) –

Size of the attention window for each sequence. Only the last max_attention_window tokens of each sequence will be stored in the KV cache.
free_gpu_memory_fraction (Optional[float], default: None ) –

The fraction of GPU memory fraction that should be allocated for the KV cache. Default is 90%. If both max_tokens and free_gpu_memory_fraction are specified, memory corresponding to the minimum will be allocated.
host_cache_size (Optional[int], default: None ) –

Size of secondary memory pool in bytes. Default is 0. Having a secondary memory pool increases KV cache block reuse potential.
onboard_blocks (Optional[int], default: None ) –

Controls whether offloaded blocks should be onboarded back into primary memory before being reused.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/tensorrt_llm_model_config.py

def __post_init__(self):
    """Validate the configuration for early error handling."""
    if self.max_tokens is not None and self.max_tokens <= 0:
        raise ModelNavigatorWrongParameterError("`max_tokens` must be greater than 0.")

    if self.sink_token_length is not None and self.sink_token_length <= 0:
        raise ModelNavigatorWrongParameterError("`sink_token_length` must be greater than 0.")

    if self.max_attention_window is not None and self.max_attention_window <= 0:
        raise ModelNavigatorWrongParameterError("`max_attention_window` must be greater than 0.")

    if self.free_gpu_memory_fraction is not None and (
        self.free_gpu_memory_fraction < 0.0 or self.free_gpu_memory_fraction > 1.0
    ):
        raise ModelNavigatorWrongParameterError("`free_gpu_memory_fraction` must be between 0.0 and 1.0.")

    if self.host_cache_size is not None and self.host_cache_size <= 0:
        raise ModelNavigatorWrongParameterError("`host_cache_size` must be greater than 0.")

    if self.onboard_blocks is not None and self.onboard_blocks <= 0:
        raise ModelNavigatorWrongParameterError("`onboard_blocks` must be greater than 0.")

as_parameters

as_parameters()

Convert dataclass to configuration flags passed to backend as parameters.

Source code in model_navigator/triton/specialized_configs/tensorrt_llm_model_config.py

def as_parameters(self):
    """Convert dataclass to configuration flags passed to backend as parameters."""
    data = {}
    for k, v in self.__dict__.items():
        if v is None:
            continue

        mapped_key = self._MAPPING.get(k, k)
        data[mapped_key] = v

    return data

model_navigator.triton.PeftCacheConfig `dataclass`

PeftCacheConfig(optimal_adapter_size=None, max_adapter_size=None, gpu_memory_fraction=None, host_memory_bytes=None)

Configuration of Peft Cache.

More: https://nvidia.github.io/TensorRT-LLM/_cpp_gen/executor.html#_CPPv4N12tensorrt_llm8executor15PeftCacheConfigE

Parameters:

optimal_adapter_size (Optional[int], default: None ) –
max_adapter_size (Optional[int], default: None ) –
gpu_memory_fraction (Optional[float], default: None ) –
host_memory_bytes (Optional[int], default: None ) –

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/tensorrt_llm_model_config.py

def __post_init__(self):
    """Validate the configuration for early error handling."""
    if self.optimal_adapter_size is not None and self.optimal_adapter_size <= 0:
        raise ModelNavigatorWrongParameterError("`optimal_adapter_size` must be greater than 0.")

    if self.max_adapter_size is not None and self.max_adapter_size <= 0:
        raise ModelNavigatorWrongParameterError("`max_adapter_size` must be greater than 0.")

    if self.max_adapter_size and self.optimal_adapter_size and self.max_adapter_size < self.optimal_adapter_size:
        raise ModelNavigatorWrongParameterError(
            "`max_adapter_size` must be greater than or equal to `optimal_adapter_size`."
        )

    if self.gpu_memory_fraction is not None and (self.gpu_memory_fraction < 0.0 or self.gpu_memory_fraction > 1.0):
        raise ModelNavigatorWrongParameterError("`gpu_memory_fraction` must be between 0.0 and 1.0.")

    if self.host_memory_bytes is not None and self.host_memory_bytes <= 0:
        raise ModelNavigatorWrongParameterError("`host_memory_bytes` must be greater than 0.")

as_parameters

as_parameters()

Convert dataclass to configuration flags passed to backend as parameters.

Source code in model_navigator/triton/specialized_configs/tensorrt_llm_model_config.py

def as_parameters(self):
    """Convert dataclass to configuration flags passed to backend as parameters."""
    data = {}
    for k, v in self.__dict__.items():
        if v is None:
            continue

        data[f"lora_cache_{k}"] = v

    return data

Specialized Configs for Triton Backends

model_navigator.triton.BaseSpecializedModelConfig dataclass

backend abstractmethod property

custom_fields property

__post_init__

model_navigator.triton.ONNXModelConfig dataclass

backend property

custom_fields property

__post_init__

model_navigator.triton.ONNXOptimization dataclass

__post_init__

model_navigator.triton.PythonModelConfig dataclass

backend property

custom_fields property

__post_init__

model_navigator.triton.PyTorchModelConfig dataclass

backend property

custom_fields property

__post_init__

model_navigator.triton.TensorFlowModelConfig dataclass

backend property

custom_fields property

__post_init__

model_navigator.triton.TensorFlowOptimization dataclass

__post_init__

model_navigator.triton.TensorRTModelConfig dataclass

backend property

custom_fields property

__post_init__

model_navigator.triton.TensorRTOptimization dataclass

__post_init__

model_navigator.triton.TensorRTLLMModelConfig dataclass

backend property

custom_fields property

engine_dir property writable

__post_init__

model_navigator.triton.BatchingStrategy

model_navigator.triton.BatchSchedulerPolicy

model_navigator.triton.DecodingMode

model_navigator.triton.KVCacheConfig dataclass

__post_init__

as_parameters

model_navigator.triton.PeftCacheConfig dataclass

__post_init__

as_parameters

model_navigator.triton.BaseSpecializedModelConfig `dataclass`

backend `abstractmethod` `property`

custom_fields `property`

model_navigator.triton.ONNXModelConfig `dataclass`

backend `property`

custom_fields `property`

model_navigator.triton.ONNXOptimization `dataclass`

model_navigator.triton.PythonModelConfig `dataclass`

backend `property`

custom_fields `property`

model_navigator.triton.PyTorchModelConfig `dataclass`

backend `property`

custom_fields `property`

model_navigator.triton.TensorFlowModelConfig `dataclass`

backend `property`

custom_fields `property`

model_navigator.triton.TensorFlowOptimization `dataclass`

model_navigator.triton.TensorRTModelConfig `dataclass`

backend `property`

custom_fields `property`

model_navigator.triton.TensorRTOptimization `dataclass`

model_navigator.triton.TensorRTLLMModelConfig `dataclass`

backend `property`

custom_fields `property`

engine_dir `property` `writable`

model_navigator.triton.KVCacheConfig `dataclass`

model_navigator.triton.PeftCacheConfig `dataclass`