Specialized Configs for Triton Backends

The Python API provides specialized configuration classes that expose only the options available for a given type of model.

model_navigator.triton.BaseSpecializedModelConfig dataclass

BaseSpecializedModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=[], parameters={}, response_cache=False, warmup={}, inputs=[], outputs=[])

Bases: ABC

Common fields for specialized model configs.

Read more in the Triton Inference Server documentation.

Parameters:

  • max_batch_size (int, default: 4 ) –

    The maximum batch size handled by the model.

  • batching (bool, default: True ) –

    Flag to enable/disable batching for the model.

  • default_model_filename (Optional[str], default: None ) –

    Optional filename of the model file to use.

  • batcher (Union[DynamicBatcher, SequenceBatcher], default: DynamicBatcher() ) –

    Configuration of the batcher (dynamic or sequence batching) for the model.

  • instance_groups (List[InstanceGroup], default: [] ) –

    Configuration of instance groups for running multiple instances of the model.

  • parameters (Dict[str, str], default: {} ) –

    Custom parameters for the model or backend.

  • response_cache (bool, default: False ) –

    Flag to enable/disable the response cache for the model.

  • warmup (Dict[str, ModelWarmup], default: {} ) –

    Warmup configuration for the model.
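
For illustration, the sketch below sets these common fields on a concrete subclass (ONNXModelConfig, documented later on this page); the values are arbitrary examples, not recommended defaults.

from model_navigator.triton import DynamicBatcher, ONNXModelConfig

# Common fields shared by every specialized config, set here on the ONNX variant.
config = ONNXModelConfig(
    max_batch_size=16,                           # maximum batch size handled by the model
    batching=True,                               # keep batching enabled
    default_model_filename="model.onnx",         # optional model filename
    batcher=DynamicBatcher(),                    # dynamic batching with default settings
    parameters={"custom_key": "custom_value"},   # custom backend parameters
    response_cache=False,
)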

backend abstractmethod property

backend

Backend property that has to be overridden by specialized configs.

custom_fields property

custom_fields

Custom fields that are configured as parameters.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/base_model_config.py
def __post_init__(self) -> None:
    """Validate the configuration for early error handling."""
    if self.batching and self.max_batch_size <= 0:
        raise ModelNavigatorWrongParameterError("The `max_batch_size` must be greater or equal to 1.")

    if type(self.batcher) not in [DynamicBatcher, SequenceBatcher]:
        raise ModelNavigatorWrongParameterError("Unsupported batcher type provided.")

    if self.backend != Backend.TensorRT and any(group.profile for group in self.instance_groups):
        raise ModelNavigatorWrongParameterError(
            "Invalid `profile` option. The value can be set only for `backend=Backend.TensorRT`"
        )
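
Because validation happens in __post_init__, invalid values are rejected as soon as the config object is created. A minimal sketch using ONNXModelConfig as the concrete class; the import path of ModelNavigatorWrongParameterError is assumed here:

from model_navigator.exceptions import ModelNavigatorWrongParameterError  # assumed import path
from model_navigator.triton import ONNXModelConfig

try:
    # batching is enabled by default, so max_batch_size must be >= 1
    ONNXModelConfig(max_batch_size=0)
except ModelNavigatorWrongParameterError as err:
    print(f"Invalid config rejected at construction time: {err}")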

model_navigator.triton.ONNXModelConfig dataclass

ONNXModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=[], parameters={}, response_cache=False, warmup={}, inputs=[], outputs=[], platform=None, optimization=None)

Bases: BaseSpecializedModelConfig

Specialized model config for models served with the ONNX backend.

Parameters:

  • platform (Optional[Platform], default: None ) –

    Override backend parameter with platform. Possible options: Platform.ONNXRuntimeONNX

  • optimization (Optional[ONNXOptimization], default: None ) –

Possible optimizations for ONNX models.
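
A hedged sketch of an ONNX config with a TensorRT accelerator attached via ONNXOptimization; calling TensorRTAccelerator() with default arguments is an assumption:

from model_navigator.triton import ONNXModelConfig, ONNXOptimization, TensorRTAccelerator

# TensorRTAccelerator() with defaults is assumed to be a valid accelerator configuration.
config = ONNXModelConfig(
    max_batch_size=8,
    optimization=ONNXOptimization(accelerator=TensorRTAccelerator()),
)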

backend property

backend

Define backend value for config.

custom_fields property

custom_fields

Custom fields that are configured as parameters.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/onnx_model_config.py
def __post_init__(self):
    """Validate the configuration for early error handling."""
    super().__post_init__()
    if self.optimization and not isinstance(self.optimization, ONNXOptimization):
        raise ModelNavigatorWrongParameterError("Unsupported optimization type provided.")

    if self.platform and self.platform != Platform.ONNXRuntimeONNX:
        raise ModelNavigatorWrongParameterError(f"Unsupported platform provided. Use: {Platform.ONNXRuntimeONNX}.")

model_navigator.triton.ONNXOptimization dataclass

ONNXOptimization(accelerator)

Possible optimizations for ONNX models.

Parameters:

  • accelerator (Union[OpenVINOAccelerator, TensorRTAccelerator]) –

    Accelerator configuration for the model.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/onnx_model_config.py
def __post_init__(self):
    """Validate the configuration for early error handling."""
    if self.accelerator and type(self.accelerator) not in [OpenVINOAccelerator, TensorRTAccelerator]:
        raise ModelNavigatorWrongParameterError("Unsupported accelerator type provided.")

model_navigator.triton.PythonModelConfig dataclass

PythonModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=[], parameters={}, response_cache=False, warmup={}, inputs=[], outputs=[])

Bases: BaseSpecializedModelConfig

Specialized model config for models served with the Python backend.

Parameters:

  • inputs (Sequence[InputTensorSpec], default: [] ) –

    Required definition of model inputs.

  • outputs (Sequence[OutputTensorSpec], default: [] ) –

    Required definition of model outputs.

backend property

backend

Define backend value for config.

custom_fields property

custom_fields

Custom fields that are configured as parameters.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/python_model_config.py
def __post_init__(self) -> None:
    """Validate the configuration for early error handling."""
    super().__post_init__()
    assert len(self.inputs) > 0, "Model inputs definition is required for Python backend."
    assert len(self.outputs) > 0, "Model outputs definition is required for Python backend."
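
Because the Python backend cannot infer tensor metadata, both inputs and outputs must be provided. The sketch below assumes InputTensorSpec and OutputTensorSpec accept name, shape, and NumPy dtype arguments:

import numpy as np

from model_navigator.triton import InputTensorSpec, OutputTensorSpec, PythonModelConfig

# Assumed tensor-spec signature: (name, shape, dtype); -1 marks a dynamic axis.
config = PythonModelConfig(
    inputs=[InputTensorSpec(name="INPUT_1", shape=(-1,), dtype=np.dtype("float32"))],
    outputs=[OutputTensorSpec(name="OUTPUT_1", shape=(-1,), dtype=np.dtype("float32"))],
)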

model_navigator.triton.PyTorchModelConfig dataclass

PyTorchModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=[], parameters={}, response_cache=False, warmup={}, inputs=[], outputs=[], platform=None)

Bases: BaseSpecializedModelConfig

Specialized model config for models served with the PyTorch backend.

Parameters:

  • platform (Optional[Platform], default: None ) –

    Override backend parameter with platform. Possible options: Platform.PyTorchLibtorch

  • inputs (Sequence[InputTensorSpec], default: [] ) –

    Required definition of model inputs.

  • outputs (Sequence[OutputTensorSpec], default: [] ) –

    Required definition of model outputs.
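
For example, a sketch of a PyTorch config with an explicit platform override; the InputTensorSpec/OutputTensorSpec signature (name, shape, dtype) is an assumption, and the tensor names and shapes are illustrative:

import numpy as np

from model_navigator.triton import (
    InputTensorSpec,
    OutputTensorSpec,
    Platform,
    PyTorchModelConfig,
)

config = PyTorchModelConfig(
    platform=Platform.PyTorchLibtorch,  # optional override of the backend field
    inputs=[InputTensorSpec(name="input__0", shape=(-1, 3, 224, 224), dtype=np.dtype("float32"))],
    outputs=[OutputTensorSpec(name="output__0", shape=(-1, 1000), dtype=np.dtype("float32"))],
)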

backend property

backend

Define backend value for config.

custom_fields property

custom_fields

Custom fields that are configured as parameters.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/pytorch_model_config.py
def __post_init__(self) -> None:
    """Validate the configuration for early error handling."""
    super().__post_init__()
    assert len(self.inputs) > 0, "Model inputs definition is required for PyTorch backend."
    assert len(self.outputs) > 0, "Model outputs definition is required for PyTorch backend."

    if self.platform and self.platform != Platform.PyTorchLibtorch:
        raise ModelNavigatorWrongParameterError(f"Unsupported platform provided. Use: {Platform.PyTorchLibtorch}.")

model_navigator.triton.TensorFlowModelConfig dataclass

TensorFlowModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=[], parameters={}, response_cache=False, warmup={}, inputs=[], outputs=[], platform=None, optimization=None)

Bases: BaseSpecializedModelConfig

Specialized model config for models served with the TensorFlow backend.

Parameters:

  • platform (Optional[Platform], default: None ) –

    Override backend parameter with platform. Possible options: Platform.TensorFlowSavedModel, Platform.TensorFlowGraphDef

  • optimization (Optional[TensorFlowOptimization], default: None ) –

Possible optimizations for TensorFlow models.
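
A hedged sketch of a TensorFlow SavedModel config with automatic mixed precision; constructing AutoMixedPrecisionAccelerator() without arguments is an assumption:

from model_navigator.triton import (
    AutoMixedPrecisionAccelerator,
    Platform,
    TensorFlowModelConfig,
    TensorFlowOptimization,
)

config = TensorFlowModelConfig(
    platform=Platform.TensorFlowSavedModel,
    optimization=TensorFlowOptimization(accelerator=AutoMixedPrecisionAccelerator()),
)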

backend property

backend

Define backend value for config.

custom_fields property

custom_fields

Custom fields that are configured as parameters.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/tensorflow_model_config.py
def __post_init__(self):
    """Validate the configuration for early error handling."""
    super().__post_init__()
    if self.optimization and not isinstance(self.optimization, TensorFlowOptimization):
        raise ModelNavigatorWrongParameterError("Unsupported optimization type provided.")

    platforms = [Platform.TensorFlowSavedModel, Platform.TensorFlowGraphDef]
    if self.platform and self.platform not in platforms:
        raise ModelNavigatorWrongParameterError(f"Unsupported platform provided. Use one of: {platforms}")

model_navigator.triton.TensorFlowOptimization dataclass

TensorFlowOptimization(accelerator)

Possible optimizations for TensorFlow models.

Parameters:

  • accelerator (Union[AutoMixedPrecisionAccelerator, GPUIOAccelerator, TensorRTAccelerator]) –

    Accelerator configuration for the model.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/tensorflow_model_config.py
def __post_init__(self):
    """Validate the configuration for early error handling."""
    if self.accelerator and type(self.accelerator) not in [
        AutoMixedPrecisionAccelerator,
        GPUIOAccelerator,
        TensorRTAccelerator,
    ]:
        raise ModelNavigatorWrongParameterError("Unsupported accelerator type provided.")

model_navigator.triton.TensorRTModelConfig dataclass

TensorRTModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=[], parameters={}, response_cache=False, warmup={}, inputs=[], outputs=[], platform=None, optimization=None)

Bases: BaseSpecializedModelConfig

Specialized model config for models served on the TensorRT platform.

Parameters:

  • platform (Optional[Platform], default: None ) –

    Override backend parameter with platform. Possible options: Platform.TensorRTPlan

  • optimization (Optional[TensorRTOptimization], default: None ) –

Possible optimizations for TensorRT models.
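
For example, a minimal config that enables CUDA graphs through TensorRTOptimization:

from model_navigator.triton import TensorRTModelConfig, TensorRTOptimization

config = TensorRTModelConfig(
    max_batch_size=32,
    # At least one TensorRTOptimization option must be enabled.
    optimization=TensorRTOptimization(cuda_graphs=True),
)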

backend property

backend

Define backend value for config.

custom_fields property

custom_fields

Custom fields that are configured as parameters.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/tensorrt_model_config.py
def __post_init__(self):
    """Validate the configuration for early error handling."""
    super().__post_init__()
    if self.optimization and not isinstance(self.optimization, TensorRTOptimization):
        raise ModelNavigatorWrongParameterError("Unsupported optimization type provided.")

    if self.platform and self.platform != Platform.TensorRTPlan:
        raise ModelNavigatorWrongParameterError(f"Unsupported platform provided. Use: {Platform.TensorRTPlan}.")

model_navigator.triton.TensorRTOptimization dataclass

TensorRTOptimization(cuda_graphs=False, gather_kernel_buffer_threshold=None, eager_batching=False)

Possible optimizations for TensorRT models.

Parameters:

  • cuda_graphs (bool, default: False ) –

    Use CUDA graphs API to capture model operations and execute them more efficiently.

  • gather_kernel_buffer_threshold (Optional[int], default: None ) –

    The backend may use a gather kernel to gather input data if the device has direct access to the source buffer and the destination buffer.

  • eager_batching (bool, default: False ) –

    Start preparing the next batch before the model instance is ready for the next inference.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/tensorrt_model_config.py
def __post_init__(self):
    """Validate the configuration for early error handling."""
    if not self.cuda_graphs and not self.gather_kernel_buffer_threshold and not self.eager_batching:
        raise ModelNavigatorWrongParameterError("At least one of the optimization options should be enabled.")

model_navigator.triton.TensorRTLLMModelConfig dataclass

TensorRTLLMModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=[], parameters={}, response_cache=False, warmup={}, inputs=[], outputs=[], encoder_dir=None, max_beam_width=None, batching_strategy=BatchingStrategy.INFLIGHT, batch_scheduler_policy=BatchSchedulerPolicy.MAX_UTILIZATION, decoding_mode=None, gpu_device_ids=[], gpu_weights_percent=None, kv_cache_config=None, peft_cache_config=None, enable_chunked_context=None, normalize_log_probs=None, cancellation_check_period_ms=None, stats_check_period_ms=None, request_stats_max_iterations=None, iter_stats_max_iterations=None, exclude_input_in_output=None, medusa_choices=None, _engine_dir=None)

Bases: BaseSpecializedModelConfig

Specialized model config for models served on the TensorRT-LLM platform.

Adapted from the TensorRT-LLM config: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt

Relevant TensorRT-LLM classes:

  • ExecutorConfig: https://nvidia.github.io/TensorRT-LLM/_cpp_gen/executor.html#_CPPv4N12tensorrt_llm8executor14ExecutorConfigE
  • KVCacheConfig: https://nvidia.github.io/TensorRT-LLM/_cpp_gen/executor.html#_CPPv4N12tensorrt_llm8executor13KvCacheConfigE
  • PeftCacheConfig: https://nvidia.github.io/TensorRT-LLM/_cpp_gen/executor.html#_CPPv4N12tensorrt_llm8executor15PeftCacheConfigE

Parameters:

  • engine_dir

    Path to the TensorRT engine directory.

  • encoder_dir (Optional[Path], default: None ) –

    Path to the encoder model directory.

  • max_beam_width (Optional[int], default: None ) –

Maximum width of each beam used in beam search.

  • batching_strategy (BatchingStrategy, default: INFLIGHT ) –

    Batching strategy for model.

  • batch_scheduler_policy (BatchSchedulerPolicy, default: MAX_UTILIZATION ) –

    Batching scheduler policy for model.

  • decoding_mode (Optional[DecodingMode], default: None ) –

    Decoding mode for model.

  • gpu_device_ids (List[int], default: [] ) –

    List of GPU device IDs on which the model runs.

  • gpu_weights_percent (Optional[float], default: None ) –

The fraction of GPU memory that should be allocated for weights.

  • kv_cache_config (Optional[KVCacheConfig], default: None ) –

    KV cache config for model.

  • peft_cache_config (Optional[PeftCacheConfig], default: None ) –

PEFT cache config for the model.

  • enable_chunked_context (Optional[bool], default: None ) –

Enable chunked context for the model.

  • normalize_log_probs (Optional[bool], default: None ) –

    Controls if log probabilities should be normalized or not.

  • cancellation_check_period_ms (Optional[int], default: None ) –

The request cancellation check period in ms.

  • stats_check_period_ms (Optional[int], default: None ) –

    The statistics checking period in ms.

  • request_stats_max_iterations (Optional[int], default: None ) –

    Controls the maximum number of iterations for which to keep per-request statistics.

  • iter_stats_max_iterations (Optional[int], default: None ) –

    Controls the maximum number of iterations for which to keep statistics.

  • exclude_input_in_output (Optional[bool], default: None ) –

    Controls if output tokens in Result should include the input tokens. Default is false.

  • medusa_choices (Optional[Union[List[int], List[List[int]], List[Tuple[int]]]], default: None ) –

    Medusa choices as in https://github.com/FasterDecoding/Medusa/blob/main/medusa/model/medusa_choices.py
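
A hedged sketch of a TensorRT-LLM config; whether the engine directory may be assigned after construction through the writable engine_dir property (as done here) rather than passed to the constructor is an assumption, and the path and values are illustrative only:

from pathlib import Path

from model_navigator.triton import (
    BatchingStrategy,
    KVCacheConfig,
    TensorRTLLMModelConfig,
)

config = TensorRTLLMModelConfig(
    max_batch_size=64,
    batching_strategy=BatchingStrategy.INFLIGHT,
    gpu_device_ids=[0],
    kv_cache_config=KVCacheConfig(free_gpu_memory_fraction=0.9),
)
# Assumption: the engine location can be set through the writable `engine_dir` property.
config.engine_dir = Path("/models/llama/trt_llm_engine")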

backend property

backend

Define backend value for config.

custom_fields property

custom_fields

Custom fields that are configured as parameters.

engine_dir property writable

engine_dir

Engine directory path.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/tensorrt_llm_model_config.py
def __post_init__(self):
    """Validate the configuration for early error handling."""
    super().__post_init__()
    self._validate_config()
    self._initialize_instance_groups()
    self._initialize_inputs()
    self._initialize_outputs()
    self._initialize_parameters()

model_navigator.triton.BatchingStrategy

Bases: Enum

Define the supported batch strategies.

model_navigator.triton.BatchSchedulerPolicy

Bases: Enum

Define the supported batch scheduler policies.

model_navigator.triton.DecodingMode

Bases: Enum

Define the supported decoding modes.

model_navigator.triton.KVCacheConfig dataclass

KVCacheConfig(enable_block_reuse=None, max_tokens=None, sink_token_length=None, max_attention_window=None, free_gpu_memory_fraction=None, host_cache_size=None, onboard_blocks=None)

Configuration of KV cache in TRT-LLM.

More: https://nvidia.github.io/TensorRT-LLM/_cpp_gen/executor.html#_CPPv4N12tensorrt_llm8executor13KvCacheConfigE

Parameters:

  • enable_block_reuse (Optional[bool], default: None ) –

    Controls if KV cache blocks can be reused for different requests.

  • max_tokens (Optional[int], default: None ) –

The maximum number of tokens that should be stored in the KV cache. If both max_tokens and free_gpu_memory_fraction are specified, memory corresponding to the minimum will be allocated.

  • sink_token_length (Optional[int], default: None ) –

Number of sink tokens (tokens to always keep in the attention window).

  • max_attention_window (Optional[int], default: None ) –

    Size of the attention window for each sequence. Only the last max_attention_window tokens of each sequence will be stored in the KV cache.

  • free_gpu_memory_fraction (Optional[float], default: None ) –

The fraction of GPU memory that should be allocated for the KV cache. Default is 90%. If both max_tokens and free_gpu_memory_fraction are specified, memory corresponding to the minimum will be allocated.

  • host_cache_size (Optional[int], default: None ) –

    Size of secondary memory pool in bytes. Default is 0. Having a secondary memory pool increases KV cache block reuse potential.

  • onboard_blocks (Optional[int], default: None ) –

    Controls whether offloaded blocks should be onboarded back into primary memory before being reused.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/tensorrt_llm_model_config.py
def __post_init__(self):
    """Validate the configuration for early error handling."""
    if self.max_tokens is not None and self.max_tokens <= 0:
        raise ModelNavigatorWrongParameterError("`max_tokens` must be greater than 0.")

    if self.sink_token_length is not None and self.sink_token_length <= 0:
        raise ModelNavigatorWrongParameterError("`sink_token_length` must be greater than 0.")

    if self.max_attention_window is not None and self.max_attention_window <= 0:
        raise ModelNavigatorWrongParameterError("`max_attention_window` must be greater than 0.")

    if self.free_gpu_memory_fraction is not None and (
        self.free_gpu_memory_fraction < 0.0 or self.free_gpu_memory_fraction > 1.0
    ):
        raise ModelNavigatorWrongParameterError("`free_gpu_memory_fraction` must be between 0.0 and 1.0.")

    if self.host_cache_size is not None and self.host_cache_size <= 0:
        raise ModelNavigatorWrongParameterError("`host_cache_size` must be greater than 0.")

    if self.onboard_blocks is not None and self.onboard_blocks <= 0:
        raise ModelNavigatorWrongParameterError("`onboard_blocks` must be greater than 0.")

as_parameters

as_parameters()

Convert dataclass to configuration flags passed to backend as parameters.

Source code in model_navigator/triton/specialized_configs/tensorrt_llm_model_config.py
def as_parameters(self):
    """Convert dataclass to configuration flags passed to backend as parameters."""
    data = {}
    for k, v in self.__dict__.items():
        if v is None:
            continue

        mapped_key = self._MAPPING.get(k, k)
        data[mapped_key] = v

    return data
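
For illustration, a sketch that builds a KV cache config and converts it to backend parameters; only non-None fields are exported, and keys may be remapped internally before they reach the model's config.pbtxt:

from model_navigator.triton import KVCacheConfig

kv_cache = KVCacheConfig(
    enable_block_reuse=True,
    max_attention_window=4096,      # must be > 0
    free_gpu_memory_fraction=0.85,  # must be within [0.0, 1.0]
)

# Only fields that are not None are converted to backend parameters.
print(kv_cache.as_parameters())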

model_navigator.triton.PeftCacheConfig dataclass

PeftCacheConfig(optimal_adapter_size=None, max_adapter_size=None, gpu_memory_fraction=None, host_memory_bytes=None)

Configuration of the PEFT cache.

More: https://nvidia.github.io/TensorRT-LLM/_cpp_gen/executor.html#_CPPv4N12tensorrt_llm8executor15PeftCacheConfigE

Parameters:

  • optimal_adapter_size (Optional[int], default: None ) –

    Optimal adapter size; must be greater than 0.

  • max_adapter_size (Optional[int], default: None ) –

    Maximum adapter size; must be greater than or equal to optimal_adapter_size.

  • gpu_memory_fraction (Optional[float], default: None ) –

    Fraction of GPU memory used for the PEFT cache; must be between 0.0 and 1.0.

  • host_memory_bytes (Optional[int], default: None ) –

    Size of the host memory pool for the PEFT cache, in bytes; must be greater than 0.

__post_init__

__post_init__()

Validate the configuration for early error handling.

Source code in model_navigator/triton/specialized_configs/tensorrt_llm_model_config.py
def __post_init__(self):
    """Validate the configuration for early error handling."""
    if self.optimal_adapter_size is not None and self.optimal_adapter_size <= 0:
        raise ModelNavigatorWrongParameterError("`optimal_adapter_size` must be greater than 0.")

    if self.max_adapter_size is not None and self.max_adapter_size <= 0:
        raise ModelNavigatorWrongParameterError("`max_adapter_size` must be greater than 0.")

    if self.max_adapter_size and self.optimal_adapter_size and self.max_adapter_size < self.optimal_adapter_size:
        raise ModelNavigatorWrongParameterError(
            "`max_adapter_size` must be greater than or equal to `optimal_adapter_size`."
        )

    if self.gpu_memory_fraction is not None and (self.gpu_memory_fraction < 0.0 or self.gpu_memory_fraction > 1.0):
        raise ModelNavigatorWrongParameterError("`gpu_memory_fraction` must be between 0.0 and 1.0.")

    if self.host_memory_bytes is not None and self.host_memory_bytes <= 0:
        raise ModelNavigatorWrongParameterError("`host_memory_bytes` must be greater than 0.")

as_parameters

as_parameters()

Convert dataclass to configuration flags passed to backend as parameters.

Source code in model_navigator/triton/specialized_configs/tensorrt_llm_model_config.py
def as_parameters(self):
    """Convert dataclass to configuration flags passed to backend as parameters."""
    data = {}
    for k, v in self.__dict__.items():
        if v is None:
            continue

        data[f"lora_cache_{k}"] = v

    return data
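
As the source above shows, every non-None field is exported with the lora_cache_ prefix. A short illustrative sketch:

from model_navigator.triton import PeftCacheConfig

peft_cache = PeftCacheConfig(
    optimal_adapter_size=8,
    max_adapter_size=64,       # must be >= optimal_adapter_size
    gpu_memory_fraction=0.05,  # must be within [0.0, 1.0]
)

# Each non-None field is emitted with the `lora_cache_` prefix, e.g.
# {"lora_cache_optimal_adapter_size": 8, "lora_cache_max_adapter_size": 64, ...}
print(peft_cache.as_parameters())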