Specialized Configs for Triton Backends
The Python API provides specialized configuration classes that expose only the options available for a given model type.
model_navigator.triton.BaseSpecializedModelConfig (dataclass)
BaseSpecializedModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=[], parameters={}, response_cache=False, warmup={}, inputs=[], outputs=[])
Bases: ABC
Common fields for specialized model configs.
Read more in the Triton Inference Server documentation.
Parameters:
- max_batch_size (int, default: 4) – The maximal batch size that will be handled by the model.
- batching (bool, default: True) – Flag to enable/disable batching for the model.
- default_model_filename (Optional[str], default: None) – Optional filename of the model file to use.
- batcher (Union[DynamicBatcher, SequenceBatcher], default: DynamicBatcher()) – Configuration of dynamic batching for the model.
- instance_groups (List[InstanceGroup], default: []) – Instance groups configuration for multiple instances of the model.
- parameters (Dict[str, str], default: {}) – Custom parameters for the model or backend.
- response_cache (bool, default: False) – Flag to enable/disable the response cache for the model.
- warmup (Dict[str, ModelWarmup], default: {}) – Warmup configuration for the model.
backend (abstract property)
Backend property that has to be overridden by specialized configs.
__post_init__
Validate the configuration for early error handling.
Source code in model_navigator/triton/specialized_configs/base_model_config.py
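The base class itself is abstract, but the shared fields above are set the same way on any concrete subclass. A minimal sketch follows, assuming the classes are importable from model_navigator.triton as named in this reference and that DynamicBatcher mirrors Triton's dynamic batching options (its field names are not listed in this section):

```python
# Minimal sketch: common fields configured through a concrete subclass.
# DynamicBatcher's field names are assumed here and are not documented
# in this section.
from model_navigator.triton import DynamicBatcher, ONNXModelConfig

config = ONNXModelConfig(
    max_batch_size=16,                      # maximal batch size handled by the model
    batching=True,                          # keep batching enabled
    batcher=DynamicBatcher(
        max_queue_delay_microseconds=100,   # assumed field name
        preferred_batch_size=[4, 8, 16],    # assumed field name
    ),
    parameters={"custom_key": "custom_value"},  # custom backend parameters
    response_cache=False,
)
```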
model_navigator.triton.ONNXModelConfig (dataclass)
ONNXModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=[], parameters={}, response_cache=False, warmup={}, inputs=[], outputs=[], platform=None, optimization=None)
Bases: BaseSpecializedModelConfig
Specialized model config for models supported by the ONNX backend.
Parameters:
- platform (Optional[Platform], default: None) – Override the backend parameter with a platform. Possible options: Platform.ONNXRuntimeONNX.
- optimization (Optional[ONNXOptimization], default: None) – Possible optimization for ONNX models.
__post_init__
Validate the configuration for early error handling.
Source code in model_navigator/triton/specialized_configs/onnx_model_config.py
model_navigator.triton.ONNXOptimization (dataclass)
Possible optimizations for ONNX models.
Parameters:
- accelerator (Union[OpenVINOAccelerator, TensorRTAccelerator]) – Execution accelerator for the model.
__post_init__
Validate the configuration for early error handling.
Source code in model_navigator/triton/specialized_configs/onnx_model_config.py
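A hedged sketch of pairing ONNXModelConfig with ONNXOptimization; TensorRTAccelerator is constructed with its defaults because its arguments are not covered in this section:

```python
# Sketch: ONNX backend config with the TensorRT execution accelerator.
from model_navigator.triton import ONNXModelConfig, ONNXOptimization, TensorRTAccelerator

onnx_config = ONNXModelConfig(
    max_batch_size=8,
    optimization=ONNXOptimization(
        accelerator=TensorRTAccelerator(),  # execution accelerator for the model
    ),
)
```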
model_navigator.triton.PythonModelConfig (dataclass)
PythonModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=[], parameters={}, response_cache=False, warmup={}, inputs=[], outputs=[])
Bases: BaseSpecializedModelConfig
Specialized model config for models supported by the Python backend.
Parameters:
- inputs (Sequence[InputTensorSpec], default: []) – Required definition of model inputs.
- outputs (Sequence[OutputTensorSpec], default: []) – Required definition of model outputs.
__post_init__
Validate the configuration for early error handling.
Source code in model_navigator/triton/specialized_configs/python_model_config.py
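A sketch of a Python backend config; InputTensorSpec and OutputTensorSpec are assumed to take a name, shape, and numpy dtype (their signatures are not part of this section), with -1 marking a dynamic dimension:

```python
# Sketch: Python backend config with the required input/output definitions.
import numpy as np

from model_navigator.triton import InputTensorSpec, OutputTensorSpec, PythonModelConfig

python_config = PythonModelConfig(
    max_batch_size=4,
    inputs=[InputTensorSpec(name="INPUT_1", shape=(-1,), dtype=np.dtype("float32"))],
    outputs=[OutputTensorSpec(name="OUTPUT_1", shape=(-1,), dtype=np.dtype("float32"))],
)
```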
model_navigator.triton.PyTorchModelConfig (dataclass)
PyTorchModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=[], parameters={}, response_cache=False, warmup={}, inputs=[], outputs=[], platform=None)
Bases: BaseSpecializedModelConfig
Specialized model config for models supported by the PyTorch backend.
Parameters:
- platform (Optional[Platform], default: None) – Override the backend parameter with a platform. Possible options: Platform.PyTorchLibtorch.
- inputs (Sequence[InputTensorSpec], default: []) – Required definition of model inputs.
- outputs (Sequence[OutputTensorSpec], default: []) – Required definition of model outputs.
__post_init__
Validate the configuration for early error handling.
Source code in model_navigator/triton/specialized_configs/pytorch_model_config.py
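The PyTorch variant adds the platform override listed above. The sketch below assumes Platform and the tensor spec classes are importable from model_navigator.triton; the tensor names are hypothetical:

```python
# Sketch: PyTorch backend config with the Platform.PyTorchLibtorch override.
import numpy as np

from model_navigator.triton import InputTensorSpec, OutputTensorSpec, Platform, PyTorchModelConfig

pytorch_config = PyTorchModelConfig(
    platform=Platform.PyTorchLibtorch,  # override the backend field with a platform
    inputs=[InputTensorSpec(name="input__0", shape=(-1, 3), dtype=np.dtype("float32"))],
    outputs=[OutputTensorSpec(name="output__0", shape=(-1, 2), dtype=np.dtype("float32"))],
)
```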
model_navigator.triton.TensorFlowModelConfig (dataclass)
TensorFlowModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=[], parameters={}, response_cache=False, warmup={}, inputs=[], outputs=[], platform=None, optimization=None)
Bases: BaseSpecializedModelConfig
Specialized model config for models supported by the TensorFlow backend.
Parameters:
- platform (Optional[Platform], default: None) – Override the backend parameter with a platform. Possible options: Platform.TensorFlowSavedModel, Platform.TensorFlowGraphDef.
- optimization (Optional[TensorFlowOptimization], default: None) – Possible optimization for TensorFlow models.
__post_init__
Validate the configuration for early error handling.
Source code in model_navigator/triton/specialized_configs/tensorflow_model_config.py
model_navigator.triton.TensorFlowOptimization (dataclass)
Possible optimizations for TensorFlow models.
Parameters:
- accelerator (Union[AutoMixedPrecisionAccelerator, GPUIOAccelerator, TensorRTAccelerator]) – Execution accelerator for the model.
__post_init__
Validate the configuration for early error handling.
Source code in model_navigator/triton/specialized_configs/tensorflow_model_config.py
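A sketch of a TensorFlow backend config with one of the execution accelerators listed above; AutoMixedPrecisionAccelerator is constructed with its defaults because its arguments are not covered in this section:

```python
# Sketch: TensorFlow backend config with an execution accelerator.
from model_navigator.triton import AutoMixedPrecisionAccelerator, TensorFlowModelConfig, TensorFlowOptimization

tf_config = TensorFlowModelConfig(
    max_batch_size=8,
    optimization=TensorFlowOptimization(
        accelerator=AutoMixedPrecisionAccelerator(),
    ),
)
```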
model_navigator.triton.TensorRTModelConfig (dataclass)
TensorRTModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=[], parameters={}, response_cache=False, warmup={}, inputs=[], outputs=[], platform=None, optimization=None)
Bases: BaseSpecializedModelConfig
Specialized model config for models supported by the TensorRT platform.
Parameters:
- platform (Optional[Platform], default: None) – Override the backend parameter with a platform. Possible options: Platform.TensorRTPlan.
- optimization (Optional[TensorRTOptimization], default: None) – Possible optimization for TensorRT models.
__post_init__
Validate the configuration for early error handling.
Source code in model_navigator/triton/specialized_configs/tensorrt_model_config.py
model_navigator.triton.TensorRTOptimization (dataclass)
Possible optimizations for TensorRT models.
Parameters:
- cuda_graphs (bool, default: False) – Use the CUDA graphs API to capture model operations and execute them more efficiently.
- gather_kernel_buffer_threshold (Optional[int], default: None) – The backend may use a gather kernel to gather input data if the device has direct access to the source buffer and the destination buffer.
- eager_batching (bool, default: False) – Start preparing the next batch before the model instance is ready for the next inference.
__post_init__
Validate the configuration for early error handling.
Source code in model_navigator/triton/specialized_configs/tensorrt_model_config.py
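A short sketch enabling the CUDA graphs optimization described above on a TensorRT config:

```python
# Sketch: TensorRT config with the CUDA graphs optimization enabled.
from model_navigator.triton import TensorRTModelConfig, TensorRTOptimization

trt_config = TensorRTModelConfig(
    max_batch_size=16,
    optimization=TensorRTOptimization(
        cuda_graphs=True,      # capture model operations with the CUDA graphs API
        eager_batching=False,  # keep the default batch preparation behaviour
    ),
)
```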
model_navigator.triton.TensorRTLLMModelConfig (dataclass)
TensorRTLLMModelConfig(max_batch_size=4, batching=True, default_model_filename=None, batcher=DynamicBatcher(), instance_groups=[], parameters={}, response_cache=False, warmup={}, inputs=[], outputs=[], encoder_dir=None, max_beam_width=None, batching_strategy=BatchingStrategy.INFLIGHT, batch_scheduler_policy=BatchSchedulerPolicy.MAX_UTILIZATION, decoding_mode=None, gpu_device_ids=[], gpu_weights_percent=None, kv_cache_config=None, peft_cache_config=None, enable_chunked_context=None, normalize_log_probs=None, cancellation_check_period_ms=None, stats_check_period_ms=None, request_stats_max_iterations=None, iter_stats_max_iterations=None, exclude_input_in_output=None, medusa_choices=None, _engine_dir=None)
Bases: BaseSpecializedModelConfig
Specialized model config for models supported by the TensorRT-LLM platform.
Adapted from the TensorRT-LLM backend config: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt
Relevant TensorRT-LLM classes:
- ExecutorConfig: https://nvidia.github.io/TensorRT-LLM/_cpp_gen/executor.html#_CPPv4N12tensorrt_llm8executor14ExecutorConfigE
- KVCacheConfig: https://nvidia.github.io/TensorRT-LLM/_cpp_gen/executor.html#_CPPv4N12tensorrt_llm8executor13KvCacheConfigE
- PeftCacheConfig: https://nvidia.github.io/TensorRT-LLM/_cpp_gen/executor.html#_CPPv4N12tensorrt_llm8executor15PeftCacheConfigE
Parameters:
- engine_dir – Path to the TensorRT engine directory.
- encoder_dir (Optional[Path], default: None) – Path to the encoder model directory.
- max_beam_width (Optional[int], default: None) – Maximal size of each beam in beam search.
- batching_strategy (BatchingStrategy, default: INFLIGHT) – Batching strategy for the model.
- batch_scheduler_policy (BatchSchedulerPolicy, default: MAX_UTILIZATION) – Batch scheduler policy for the model.
- decoding_mode (Optional[DecodingMode], default: None) – Decoding mode for the model.
- gpu_device_ids (List[int], default: []) – List of GPU devices on which the model is running.
- gpu_weights_percent (Optional[float], default: None) – The percentage of GPU memory that should be allocated for weights.
- kv_cache_config (Optional[KVCacheConfig], default: None) – KV cache config for the model.
- peft_cache_config (Optional[PeftCacheConfig], default: None) – Peft cache config for the model.
- enable_chunked_context (Optional[bool], default: None) – Enable chunked context for the model.
- normalize_log_probs (Optional[bool], default: None) – Controls if log probabilities should be normalized or not.
- cancellation_check_period_ms (Optional[int], default: None) – The request cancellation check period in ms.
- stats_check_period_ms (Optional[int], default: None) – The statistics checking period in ms.
- request_stats_max_iterations (Optional[int], default: None) – Controls the maximum number of iterations for which to keep per-request statistics.
- iter_stats_max_iterations (Optional[int], default: None) – Controls the maximum number of iterations for which to keep statistics.
- exclude_input_in_output (Optional[bool], default: None) – Controls if output tokens in Result should include the input tokens. Default is false.
- medusa_choices (Optional[Union[List[int], List[List[int]], List[Tuple[int]]]], default: None) – Medusa choices as in https://github.com/FasterDecoding/Medusa/blob/main/medusa/model/medusa_choices.py
__post_init__
Validate the configuration for early error handling.
Source code in model_navigator/triton/specialized_configs/tensorrt_llm_model_config.py
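A hedged sketch of a TensorRT-LLM config using in-flight batching and a KV cache configuration (KVCacheConfig is documented later in this section). The engine directory parameter is listed above, but its exact constructor keyword is unclear from the signature (engine_dir vs the private _engine_dir), so it is omitted here; note that __post_init__ validation may require it in practice:

```python
# Sketch: TensorRT-LLM backend config with in-flight batching and a KV cache
# configuration. The engine directory parameter is documented above but
# omitted here because its exact constructor keyword is unclear.
from model_navigator.triton import (
    BatchingStrategy,
    BatchSchedulerPolicy,
    KVCacheConfig,
    TensorRTLLMModelConfig,
)

trtllm_config = TensorRTLLMModelConfig(
    max_batch_size=64,
    batching_strategy=BatchingStrategy.INFLIGHT,
    batch_scheduler_policy=BatchSchedulerPolicy.MAX_UTILIZATION,
    max_beam_width=1,
    kv_cache_config=KVCacheConfig(
        enable_block_reuse=True,
        free_gpu_memory_fraction=0.8,  # fraction of GPU memory reserved for the KV cache
    ),
)
```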
model_navigator.triton.BatchingStrategy
model_navigator.triton.BatchSchedulerPolicy
model_navigator.triton.DecodingMode
model_navigator.triton.KVCacheConfig (dataclass)
KVCacheConfig(enable_block_reuse=None, max_tokens=None, sink_token_length=None, max_attention_window=None, free_gpu_memory_fraction=None, host_cache_size=None, onboard_blocks=None)
Configuration of KV cache in TRT-LLM.
More: https://nvidia.github.io/TensorRT-LLM/_cpp_gen/executor.html#_CPPv4N12tensorrt_llm8executor13KvCacheConfigE
Parameters:
- enable_block_reuse (Optional[bool], default: None) – Controls if KV cache blocks can be reused for different requests.
- max_tokens (Optional[int], default: None) – The maximum number of tokens that should be stored in the KV cache. If both max_tokens and free_gpu_memory_fraction are specified, memory corresponding to the minimum will be allocated.
- sink_token_length (Optional[int], default: None) – Number of sink tokens (tokens to always keep in the attention window).
- max_attention_window (Optional[int], default: None) – Size of the attention window for each sequence. Only the last max_attention_window tokens of each sequence will be stored in the KV cache.
- free_gpu_memory_fraction (Optional[float], default: None) – The fraction of GPU memory that should be allocated for the KV cache. Default is 90%. If both max_tokens and free_gpu_memory_fraction are specified, memory corresponding to the minimum will be allocated.
- host_cache_size (Optional[int], default: None) – Size of the secondary memory pool in bytes. Default is 0. Having a secondary memory pool increases KV cache block reuse potential.
- onboard_blocks (Optional[int], default: None) – Controls whether offloaded blocks should be onboarded back into primary memory before being reused.
__post_init__
Validate the configuration for early error handling.
Source code in model_navigator/triton/specialized_configs/tensorrt_llm_model_config.py
as_parameters
Convert dataclass to configuration flags passed to backend as parameters.
Source code in model_navigator/triton/specialized_configs/tensorrt_llm_model_config.py
model_navigator.triton.PeftCacheConfig (dataclass)
PeftCacheConfig(optimal_adapter_size=None, max_adapter_size=None, gpu_memory_fraction=None, host_memory_bytes=None)
Configuration of Peft Cache.
More: https://nvidia.github.io/TensorRT-LLM/_cpp_gen/executor.html#_CPPv4N12tensorrt_llm8executor15PeftCacheConfigE
Parameters:
- optimal_adapter_size (Optional[int], default: None)
- max_adapter_size (Optional[int], default: None)
- gpu_memory_fraction (Optional[float], default: None)
- host_memory_bytes (Optional[int], default: None)
__post_init__
Validate the configuration for early error handling.
Source code in model_navigator/triton/specialized_configs/tensorrt_llm_model_config.py
as_parameters
Convert dataclass to configuration flags passed to backend as parameters.