API Reference
pytriton.triton.TritonConfig
dataclass
Triton Inference Server configuration class for customization of server execution.
The arguments are optional. If a value is not provided, the Triton Inference Server defaults are used. Please refer to https://github.com/triton-inference-server/server/ for more details.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
id | Optional[str] | Identifier for this server. | None |
log_verbose | Optional[int] | Set verbose logging level. Zero (0) disables verbose logging and values >= 1 enable verbose logging. | None |
log_file | Optional[Path] | Set the name of the log output file. | None |
exit_timeout_secs | Optional[int] | Timeout (in seconds) when exiting to wait for in-flight inferences to finish. | None |
exit_on_error | Optional[bool] | Exit the inference server if an error occurs during initialization. | None |
strict_readiness | Optional[bool] | If true, the /v2/health/ready endpoint indicates ready only if the server is responsive and all models are available. | None |
allow_http | Optional[bool] | Allow the server to listen for HTTP requests. | None |
http_address | Optional[str] | The address for the HTTP server to bind to. Default is 0.0.0.0. | None |
http_port | Optional[int] | The port for the server to listen on for HTTP requests. Default is 8000. | None |
http_header_forward_pattern | Optional[str] | The regular expression pattern used for forwarding HTTP headers as inference request parameters. | None |
http_thread_count | Optional[int] | Number of threads handling HTTP requests. | None |
allow_grpc | Optional[bool] | Allow the server to listen for GRPC requests. | None |
grpc_address | Optional[str] | The address for the GRPC server to bind to. Default is 0.0.0.0. | None |
grpc_port | Optional[int] | The port for the server to listen on for GRPC requests. Default is 8001. | None |
grpc_header_forward_pattern | Optional[str] | The regular expression pattern used for forwarding GRPC headers as inference request parameters. | None |
grpc_infer_allocation_pool_size | Optional[int] | The maximum number of inference request/response objects that remain allocated for reuse. As long as the number of in-flight requests doesn't exceed this value there will be no allocation/deallocation of request/response objects. | None |
grpc_use_ssl | Optional[bool] | Use SSL authentication for GRPC requests. Default is false. | None |
grpc_use_ssl_mutual | Optional[bool] | Use mutual SSL authentication for GRPC requests. This option will preempt grpc_use_ssl if it is also specified. Default is false. | None |
grpc_server_cert | Optional[Path] | Path to file holding the PEM-encoded server certificate. Ignored unless grpc_use_ssl is true. | None |
grpc_server_key | Optional[Path] | Path to file holding the PEM-encoded server key. Ignored unless grpc_use_ssl is true. | None |
grpc_root_cert | Optional[Path] | Path to file holding the PEM-encoded root certificate. Ignored unless grpc_use_ssl is true. | None |
grpc_infer_response_compression_level | Optional[str] | The compression level to be used while returning the inference response to the peer. Allowed values are none, low, medium and high. Default is none. | None |
grpc_keepalive_time | Optional[int] | The period (in milliseconds) after which a keepalive ping is sent on the transport. | None |
grpc_keepalive_timeout | Optional[int] | The period (in milliseconds) the sender of the keepalive ping waits for an acknowledgement. | None |
grpc_keepalive_permit_without_calls | Optional[bool] | Allow keepalive pings to be sent even if there are no calls in flight. | None |
grpc_http2_max_pings_without_data | Optional[int] | The maximum number of pings that can be sent when there is no data/header frame to be sent. | None |
grpc_http2_min_recv_ping_interval_without_data | Optional[int] | If there are no data/header frames being sent on the transport, this channel argument on the server side controls the minimum time (in milliseconds) that gRPC Core would expect between receiving successive pings. | None |
grpc_http2_max_ping_strikes | Optional[int] | Maximum number of bad pings that the server will tolerate before sending an HTTP2 GOAWAY frame and closing the transport. | None |
grpc_restricted_protocol | | Specify restricted GRPC protocol setting. | required |
allow_metrics | Optional[bool] | Allow the server to provide Prometheus metrics. | None |
allow_gpu_metrics | Optional[bool] | Allow the server to provide GPU metrics. | None |
allow_cpu_metrics | Optional[bool] | Allow the server to provide CPU metrics. | None |
metrics_interval_ms | Optional[int] | Metrics will be collected once every metrics_interval_ms milliseconds. | None |
metrics_port | Optional[int] | The port reporting Prometheus metrics. | None |
metrics_address | Optional[str] | The address for the metrics server to bind to. Default is the same as http_address. | None |
allow_sagemaker | Optional[bool] | Allow the server to listen for SageMaker requests. | None |
sagemaker_port | Optional[int] | The port for the server to listen on for SageMaker requests. | None |
sagemaker_safe_port_range | Optional[str] | Set the allowed port range for endpoints other than the SageMaker endpoints. | None |
sagemaker_thread_count | Optional[int] | Number of threads handling SageMaker requests. | None |
allow_vertex_ai | Optional[bool] | Allow the server to listen for Vertex AI requests. | None |
vertex_ai_port | Optional[int] | The port for the server to listen on for Vertex AI requests. | None |
vertex_ai_thread_count | Optional[int] | Number of threads handling Vertex AI requests. | None |
vertex_ai_default_model | Optional[str] | The name of the model to use for single-model inference requests. | None |
metrics_config | Optional[List[str]] | Specify a metrics-specific configuration setting. | None |
trace_config | Optional[List[str]] | Specify a global or trace-mode-specific configuration setting. | None |
cache_config | Optional[List[str]] | Specify a cache-specific configuration setting. | None |
cache_directory | Optional[str] | The global directory searched for cache shared libraries. Default is '/opt/tritonserver/caches'. This directory is expected to contain a cache implementation as a shared library with the name 'libtritoncache.so'. | None |
buffer_manager_thread_count | Optional[int] | The number of threads used to accelerate copies and other operations required to manage input and output tensor contents. | None |
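A minimal sketch of customizing the server through this class (the option names follow the table above; the values are only illustrative, and anything left out falls back to the Triton defaults):
```python
from pytriton.triton import Triton, TritonConfig

# Illustrative values only; omitted options use the Triton Inference Server defaults.
config = TritonConfig(
    http_port=8000,
    grpc_port=8001,
    log_verbose=1,
    exit_on_error=True,
)

with Triton(config=config) as triton:
    ...  # bind models and serve
```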
__post_init__()
Validate configuration for early error handling.
from_dict(config)
classmethod
Creates a TritonConfig instance from an input dictionary. Values are converted into correct types.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config | Dict[str, Any] | A dictionary with all required fields. | required |
Returns:
Type | Description |
---|---|
TritonConfig | A TritonConfig instance created from the input dictionary. |
Source code in pytriton/triton.py
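A short sketch of from_dict usage, assuming the type conversion described above (the dictionary values are illustrative and may come from, e.g., a parsed configuration file):
```python
from pytriton.triton import TritonConfig

# Keys must match TritonConfig field names; values are converted
# to the proper types (e.g. the string "8001" becomes an int).
config = TritonConfig.from_dict({"grpc_port": "8001", "log_verbose": "1"})
```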
from_env()
classmethod
Creates TritonConfig from environment variables.
Environment variables should start with the PYTRITON_TRITON_CONFIG_ prefix. For example:
PYTRITON_TRITON_CONFIG_GRPC_PORT=45436
PYTRITON_TRITON_CONFIG_LOG_VERBOSE=4
Typical use:
triton_config = TritonConfig.from_env()
Returns:
Type | Description |
---|---|
TritonConfig | TritonConfig class instantiated from environment variables. |
Source code in pytriton/triton.py
pytriton.decorators
Inference callable decorators.
ConstantPadder(pad_value=0)
Padder that pads the given batches with a constant value.
Initialize the padder.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pad_value | int | Padding value. Defaults to 0. | 0 |
Source code in pytriton/decorators.py
__call__(batches_list)
Pad the given batches with the specified value to pad size enabling further batching to single arrays.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batches_list | List[Dict[str, ndarray]] | List of batches to pad. | required |
Returns:
Type | Description |
---|---|
InferenceResults | List[Dict[str, np.ndarray]]: List of padded batches. |
Raises:
Type | Description |
---|---|
PyTritonRuntimeError | If the input arrays for a given input name have different dtypes. |
Source code in pytriton/decorators.py
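A small illustration of what the padder does, assuming direct invocation on a list of batches as described by __call__ above (the input name and shapes are illustrative):
```python
import numpy as np

from pytriton.decorators import ConstantPadder

padder = ConstantPadder(pad_value=0)
batches = [
    {"tokens": np.array([[1, 2, 3]])},  # shape (1, 3)
    {"tokens": np.array([[4, 5]])},     # shape (1, 2)
]
# Arrays sharing an input name are padded with 0 to a common shape so the
# batches can later be concatenated into single arrays.
padded_batches = padder(batches)
```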
ModelConfigDict()
Bases: MutableMapping
Dictionary for storing model configs for inference callable.
Create ModelConfigDict object.
Source code in pytriton/decorators.py
__delitem__(infer_callable)
__getitem__(infer_callable)
__iter__()
__len__()
__setitem__(infer_callable, item)
Set model config for inference callable.
TritonContext
dataclass
Triton context definition class.
batch(wrapped, instance, args, kwargs)
Decorator for converting list of request dicts to dict of input batches.
Converts a list of request dicts to a dict of input batches. It passes **kwargs to the inference callable, where each named input contains a numpy array with the batch of requests received by the Triton server. We assume that each request has the same set of keys (you can apply the group_by_keys decorator before @batch if your requests may have different sets of keys).
Source code in pytriton/decorators.py
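A minimal sketch of a batched inference callable (the input and output names are illustrative):
```python
import numpy as np

from pytriton.decorators import batch


@batch
def infer_fn(**inputs):
    # Each named input holds a numpy array with the whole batch.
    x = inputs["input"]
    return {"output": 2 * x}
```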
convert_output(outputs, wrapped=None, instance=None, model_config=None)
Converts output from a tuple or list to a dictionary.
It is a utility function for mapping an output list into a dictionary of outputs. Currently, it is used in the @sample and @batch decorators (the user may return a list or tuple of outputs instead of a dictionary, as long as it matches the output list in the model config in size and order).
Source code in pytriton/decorators.py
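Since @batch applies convert_output internally, a callable may return a list or tuple that matches the model-config outputs in size and order. A hedged sketch (the output names mentioned in the comment are illustrative):
```python
import numpy as np

from pytriton.decorators import batch


@batch
def infer_fn(**inputs):
    x = inputs["input"]
    # Returning a list instead of a dictionary is allowed; convert_output maps
    # it onto the model-config outputs by position (e.g. "sum" then "product").
    return [x + 1, x * 2]
```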
fill_optionals(**defaults)
This decorator ensures that any missing inputs in requests are filled with default values specified by the user.
Default values should be NumPy arrays without batch axis.
If you plan to group requests, e.g. with the @group_by_keys or @group_by_values decorators, provide default values for optional parameters at the beginning of the decorator stack. The other decorators can then group requests into bigger batches, resulting in better model performance.
Typical use:
@fill_optionals()
@group_by_keys()
@batch
def infer_fun(**inputs):
...
return outputs
Parameters:
Name | Type | Description | Default |
---|---|---|---|
defaults | | Keyword arguments containing default values for missing inputs. | {} |
If you have default values for some optional parameters, it is a good idea to provide them at the very beginning of the decorator stack, so the other decorators (e.g. @group_by_keys) can make bigger consistent groups.
Source code in pytriton/decorators.py
first_value(*keys, squeeze_single_values=True, strict=True)
This decorator overwrites selected inputs with the first element of the given input.
It can be used in two ways:
- Wrapping a single-request inference callable by chaining it with the @batch decorator: @batch @first_value("temperature") def infer_fn(**inputs): ...
- Wrapping a multiple-requests inference callable: @first_value("temperature") def infer_fn(requests): ...
By default, the decorator squeezes single-value arrays to scalars. This behavior can be disabled by setting the squeeze_single_values flag to False.
By default, the decorator checks that all values of a selected input are equal. This behavior can be disabled by setting the strict flag to False.
The wrapper can only be used with models that support batching.
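A hedged sketch of the single-request form chained with @batch ("temperature" is an illustrative input name):
```python
from pytriton.decorators import batch, first_value


@batch
@first_value("temperature")
def infer_fn(**inputs):
    # "temperature" arrives as a scalar (the first value of its batch),
    # while the remaining inputs stay batched.
    temperature = inputs["temperature"]
    ...
```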
Parameters:
Name | Type | Description | Default |
---|---|---|---|
keys | str | The input keys selected for conversion. | () |
squeeze_single_values | | Squeeze single-value ND arrays to scalar values. Defaults to True. | True |
strict | bool | Enable checking if all values on a single selected input of the request are equal. Defaults to True. | True |
Raises:
Type | Description |
---|---|
PyTritonRuntimeError | If not all values on a single selected input of the request are equal. |
PyTritonBadParameterError | If any of the keys passed to the decorator are not allowed. |
Source code in pytriton/decorators.py
get_inference_request_batch_size(inference_request)
Get batch size from triton request.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inference_request | InferenceRequest | Triton request. | required |
Returns:
Name | Type | Description |
---|---|---|
int | int | Batch size. |
Source code in pytriton/decorators.py
get_model_config(wrapped, instance)
Retrieves instance of TritonModelConfig from callable.
It is used internally in the convert_output function to get the output list from the model. You can use this in custom decorators if you need access to model_config information. If you use the @triton_context decorator you do not need this function (you can get model_config directly from the triton_context by passing the function/callable to the dictionary getter).
Source code in pytriton/decorators.py
get_triton_context(wrapped, instance)
Retrieves triton context from callable.
It is used in @triton_context to get triton context registered by triton binding in inference callable. If you use @triton_context decorator you do not need this function.
Source code in pytriton/decorators.py
group_by_keys(wrapped, instance, args, kwargs)
Group by keys.
The decorator prepares groups of requests with the same set of keys and calls the wrapped function for each group separately (it is convenient to use this decorator before batching, because the batching decorator requires a consistent set of inputs, as it stacks them into batches).
Source code in pytriton/decorators.py
group_by_values(*keys, pad_fn=None)
Decorator for grouping requests by values of selected keys.
This function splits a batch into multiple sub-batches based on the values of the specified keys and calls the decorated function with each sub-batch. This is particularly useful when working with models that require dynamic parameters sent by the user.
For example, given an input of the form:
{"sentences": [b"Sentence1", b"Sentence2", b"Sentence3"], "param1": [1, 1, 2], "param2": [1, 1, 1]}
Using @group_by_values("param1", "param2") will split the batch into two sub-batches:
[
{"sentences": [b"Sentence1", b"Sentence2"], "param1": [1, 1], "param2": [1, 1]},
{"sentences": [b"Sentence3"], "param1": [2], "param2": [1]}
]
This decorator should be used after the @batch decorator.
Example usage:
@batch
@group_by_values("param1", "param2")
def infer_fun(**inputs):
...
return outputs
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*keys | | List of keys to group by. | () |
pad_fn | Optional[Callable[[InferenceRequests], InferenceRequests]] | Optional function to pad the batch to the same size before merging again into a single batch. | None |
Returns:
Type | Description |
---|---|
 | The decorator function. |
Source code in pytriton/decorators.py
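A sketch combining group_by_values with ConstantPadder as the pad_fn (assuming ConstantPadder matches the expected pad_fn signature; "param1" is an illustrative input name):
```python
from pytriton.decorators import ConstantPadder, batch, group_by_values


@batch
@group_by_values("param1", pad_fn=ConstantPadder(pad_value=0))
def infer_fn(**inputs):
    # Sub-batches grouped by "param1" may produce differently shaped results;
    # pad_fn pads them to a common shape before they are merged back together.
    ...
```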
pad_batch(wrapped, instance, args, kwargs)
Add padding to the inputs batches.
The decorator appends the last rows of the inputs multiple times to reach the desired batch size (the preferred batch size or the max batch size from the model config, whichever is closer to the current input size).
Source code in pytriton/decorators.py
sample(wrapped, instance, args, kwargs)
The decorator is used with non-batched inputs to convert a one-element list of requests into request kwargs.
It takes the first request and converts it into named inputs.
Useful with non-batching models: instead of a one-element list of requests, the callable receives named inputs (kwargs).
Source code in pytriton/decorators.py
triton_context(wrapped, instance, args, kwargs)
Adds triton context.
It gives you an additional argument, passed to the function in **kwargs under the name 'triton_context'. You can read the model config from it and, in the future, possibly interact with Triton through it.
Source code in pytriton/decorators.py
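A hedged sketch of accessing the injected context; the callable signature shown is an assumption for a non-batched callable, and the way the model config is looked up is described under get_model_config above:
```python
from pytriton.decorators import triton_context


@triton_context
def infer_fn(requests, **kwargs):
    # 'triton_context' is injected by the decorator as an extra keyword argument.
    ctx = kwargs["triton_context"]
    # The model config can be read from the context; see get_model_config above
    # for a helper that performs this lookup for a given callable.
    ...
```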
pytriton.triton.Triton(*, config=None, workspace=None)
Triton Inference Server for Python models.
Initialize Triton Inference Server context for starting server and loading models.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config | Optional[TritonConfig] | TritonConfig object with optional customizations for Triton Inference Server. Configuration can also be passed through environment variables; see the TritonConfig.from_env() class method for details. The config argument takes precedence over environment variables. | None |
workspace | Union[Workspace, str, Path, None] | Workspace or path where the Triton Model Store and files used by pytriton will be created. If workspace is None, a new workspace is created automatically. | None |
Source code in pytriton/triton.py
__enter__()
__exit__(*_)
Exit the context stopping the process and cleaning the workspace.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*_ | | Unused arguments. | () |
bind(model_name, infer_func, inputs, outputs, model_version=1, config=None, strict=False)
Create a model with given name and inference callable binding into Triton Inference Server.
More information about model configuration: https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md
Parameters:
Name | Type | Description | Default |
---|---|---|---|
infer_func | Union[Callable, Sequence[Callable]] | Inference callable(s) to handle requests/responses from Triton Inference Server. | required |
inputs | Sequence[Tensor] | Definition of model inputs. | required |
outputs | Sequence[Tensor] | Definition of model outputs. | required |
model_name | str | Name under which the model is available in Triton Inference Server. | required |
model_version | int | Version of the model. | 1 |
config | Optional[ModelConfig] | Model configuration for Triton Inference Server deployment. | None |
strict | bool | Enable strict validation between model config outputs and inference function results. | False |
Source code in pytriton/triton.py
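An end-to-end sketch of binding and serving a model (assuming the public imports from pytriton.model_config; the model name, tensor names, and shapes are illustrative):
```python
import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


@batch
def infer_fn(**inputs):
    x = inputs["input"]
    return {"output": 2 * x}


with Triton() as triton:
    triton.bind(
        model_name="Doubler",
        infer_func=infer_fn,
        inputs=[Tensor(name="input", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="output", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=8),
    )
    triton.serve()
```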
is_alive()
Verify if the deployed models and the server are alive.
Returns:
Type | Description |
---|---|
bool | True if the server and loaded models are alive, False otherwise. |
Source code in pytriton/triton.py
run()
Run Triton Inference Server.
Source code in pytriton/triton.py
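When the blocking serve() is not desired, run() and stop() can bracket the interaction; a sketch (the bind call is elided, see the example above):
```python
from pytriton.triton import Triton

triton = Triton()
# triton.bind(...) as in the example above
triton.run()      # non-blocking: the server keeps running in the background
try:
    ...           # interact with the model, e.g. via pytriton.client.ModelClient
finally:
    triton.stop() # shut the server down and clean up
```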
serve(monitoring_period_sec=MONITORING_PERIOD_SEC)
Run Triton Inference Server and block the thread to serve requests/responses.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
monitoring_period_sec | float | The period, in seconds, for monitoring whether Triton and the models are available. Every monitoring_period_sec seconds the main thread wakes up, checks whether the Triton server and the proxy backend are still alive, and sleeps again. If Triton or the proxy is not alive, the method returns. | MONITORING_PERIOD_SEC |
Source code in pytriton/triton.py
stop()
Stop Triton Inference Server.
Source code in pytriton/triton.py
pytriton.model_config.tensor.Tensor
dataclass
Model input and output definition for Triton deployment.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
shape | tuple | Shape of the input/output tensor. | required |
dtype | Union[dtype, Type[dtype], Type[object]] | Data type of the input/output tensor. | required |
name | Optional[str] | Name of the input/output of the model. | None |
optional | Optional[bool] | Flag to mark if the input is optional. | False |
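A couple of illustrative definitions, assuming the public pytriton.model_config import and that bytes is accepted as the dtype for string/bytes tensors:
```python
import numpy as np

from pytriton.model_config import Tensor

# -1 marks a dynamic dimension; optional marks an input the client may omit.
image = Tensor(name="image", dtype=np.float32, shape=(3, 224, 224))
text = Tensor(name="text", dtype=bytes, shape=(-1,), optional=True)
```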
__post_init__()
Override object values on post init or field override.
pytriton.model_config.common
Common structures for internal and external ModelConfig.
DeviceKind
Bases: Enum
Device kind for model deployment.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
KIND_AUTO | | Automatically select the device for model deployment. | required |
KIND_CPU | | Model is deployed on CPU. | required |
KIND_GPU | | Model is deployed on GPU. | required |
DynamicBatcher
dataclass
Dynamic batcher configuration.
More in Triton Inference Server documentation
Parameters:
Name | Type | Description | Default |
---|---|---|---|
max_queue_delay_microseconds | int | The maximum time, in microseconds, a request will be delayed in the scheduling queue to wait for additional requests for batching. | 0 |
preferred_batch_size | Optional[list] | Preferred batch sizes for dynamic batching. | None |
preserve_ordering | | Should the dynamic batcher preserve the ordering of responses to match the order of requests received by the scheduler. | False |
priority_levels | int | The number of priority levels to be enabled for the model. | 0 |
default_priority_level | int | The priority level used for requests that don't specify their priority. | 0 |
default_queue_policy | Optional[QueuePolicy] | The default queue policy used for requests. | None |
priority_queue_policy | Optional[Dict[int, QueuePolicy]] | Specify the queue policy for each priority level. | None |
QueuePolicy
dataclass
Model queue policy configuration.
More in Triton Inference Server documentation
Parameters:
Name | Type | Description | Default |
---|---|---|---|
timeout_action | TimeoutAction | The action applied to a timed-out request. | REJECT |
default_timeout_microseconds | int | The default timeout for every request, in microseconds. | 0 |
allow_timeout_override | bool | Whether individual requests can override the default timeout value. | False |
max_queue_size | int | The maximum queue size for holding requests. | 0 |
TimeoutAction
Bases: Enum
Timeout action definition for timeout_action QueuePolicy field.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
REJECT | | Reject the request and return an error message accordingly. | required |
DELAY | | Delay the request until all other requests at the same (or higher) priority levels that have not reached their timeouts are processed. | required |
pytriton.model_config.model_config.ModelConfig
dataclass
Additional model configuration for running model through Triton Inference Server.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batching | bool | Flag to enable/disable batching for the model. | True |
max_batch_size | int | The maximal batch size that would be handled by the model. | 4 |
batcher | DynamicBatcher | Configuration of Dynamic Batching for the model. | field(default_factory=DynamicBatcher) |
response_cache | bool | Flag to enable/disable the response cache for the model. | False |
decoupled | bool | Flag to enable/disable decoupled execution (responses decoupled from requests). | False |
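A sketch tying ModelConfig to the DynamicBatcher settings above (assuming both are importable from pytriton.model_config; the values are illustrative):
```python
from pytriton.model_config import DynamicBatcher, ModelConfig

model_config = ModelConfig(
    max_batch_size=16,
    batcher=DynamicBatcher(
        max_queue_delay_microseconds=100,
        preferred_batch_size=[4, 8, 16],
    ),
    response_cache=False,
)
```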
pytriton.client.client
Clients for easy interaction with models deployed on the Triton Inference Server.
Typical usage example:
with ModelClient("localhost", "MyModel") as client:
result_dict = client.infer_sample(input_a=a, input_b=b)
Inference inputs can be provided either as positional or keyword arguments:
result_dict = client.infer_sample(input1, input2)
result_dict = client.infer_sample(a=input1, b=input2)
Mixing of argument passing conventions is not supported and will raise PyTritonClientValueError.
AsyncioModelClient(url, model_name, model_version=None, *, lazy_init=True, init_timeout_s=None, inference_timeout_s=None)
Bases: BaseModelClient
Asyncio client for model deployed on the Triton Inference Server.
This client is based on Triton Inference Server Python clients and GRPC library:
* tritonclient.http.aio.InferenceServerClient
* tritonclient.grpc.aio.InferenceServerClient
It can wait for server to be ready with model loaded and then perform inference on it.
AsyncioModelClient supports the asyncio context manager protocol.
Typical usage:
from pytriton.client import AsyncioModelClient
import numpy as np
input1_sample = np.random.rand(1, 3, 224, 224).astype(np.float32)
input2_sample = np.random.rand(1, 3, 224, 224).astype(np.float32)
async with AsyncioModelClient("localhost", "MyModel") as client:
result_dict = await client.infer_sample(input1_sample, input2_sample)
print(result_dict["output_name"])
Inits ModelClient for given model deployed on the Triton Inference Server.
If the lazy_init argument is False, the model configuration will be read from the inference server during initialization.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The Triton Inference Server url, e.g. 'grpc://localhost:8001'. In case no scheme is provided, the http scheme will be used as default. In case no port is provided, the default port for the given scheme will be used: 8001 for the grpc scheme, 8000 for the http scheme. | required |
model_name | str | Name of the model to interact with. | required |
model_version | Optional[str] | Version of the model to interact with. If model_version is None, inference will be performed on the latest model. The latest versions of the model are numerically the greatest version numbers. | None |
lazy_init | bool | If initialization should be performed just before sending the first request to the inference server. | True |
init_timeout_s | Optional[float] | Timeout for the server and model being ready. | None |
Raises:
Type | Description |
---|---|
PyTritonClientModelUnavailableError | If the model with the given name (and version) is unavailable. |
PyTritonClientTimeoutError | If the wait time for the server and model being ready exceeds init_timeout_s. |
PyTritonClientUrlParseError | In case of problems with parsing the url. |
Source code in pytriton/client/client.py
model_config
async
property
Obtain configuration of model deployed on the Triton Inference Server.
Also waits for server to get into readiness state.
__aenter__()
async
Create context for using AsyncioModelClient as a context manager.
Source code in pytriton/client/client.py
__aexit__(*_)
async
Close resources used by AsyncioModelClient when exiting from context.
close()
async
Close resources used by _ModelClientBase.
get_lib()
infer_batch(*inputs, parameters=None, headers=None, **named_inputs)
async
Run asynchronous inference on batched data.
Typical usage:
async with AsyncioModelClient("localhost", "MyModel") as client:
result_dict = await client.infer_batch(input1, input2)
Inference inputs can be provided either as positional or keyword arguments:
result_dict = await client.infer_batch(input1, input2)
result_dict = await client.infer_batch(a=input1, b=input2)
Mixing of argument passing conventions is not supported and will raise PyTritonClientValueError.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*inputs | | Inference inputs provided as positional arguments. | () |
parameters | Optional[Dict[str, Union[str, int, bool]]] | Custom inference parameters. | None |
headers | Optional[Dict[str, Union[str, int, bool]]] | Custom inference headers. | None |
**named_inputs | | Inference inputs provided as named arguments. | {} |
Returns:
Type | Description |
---|---|
Dict[str, ndarray] | Dictionary with inference results, where dictionary keys are output names. |
Raises:
Type | Description |
---|---|
PyTritonClientValueError | If mixing of positional and named argument passing is detected. |
PyTritonClientTimeoutError | In case of the first method call, if the wait time for the server and model being ready exceeds init_timeout_s, or if the inference request exceeds inference_timeout_s. |
PyTritonClientModelDoesntSupportBatchingError | If the model doesn't support batching. |
PyTritonClientModelUnavailableError | If the model with the given name (and version) is unavailable. |
PyTritonClientInferenceServerError | If an error occurred on the inference callable or Triton Inference Server side. |
Source code in pytriton/client/client.py
infer_sample(*inputs, parameters=None, headers=None, **named_inputs)
async
Run asynchronous inference on single data sample.
Typical usage:
async with AsyncioModelClient("localhost", "MyModel") as client:
result_dict = await client.infer_sample(input1, input2)
Inference inputs can be provided either as positional or keyword arguments:
result_dict = await client.infer_sample(input1, input2)
result_dict = await client.infer_sample(a=input1, b=input2)
Mixing of argument passing conventions is not supported and will raise PyTritonClientValueError.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*inputs | | Inference inputs provided as positional arguments. | () |
parameters | Optional[Dict[str, Union[str, int, bool]]] | Custom inference parameters. | None |
headers | Optional[Dict[str, Union[str, int, bool]]] | Custom inference headers. | None |
**named_inputs | | Inference inputs provided as named arguments. | {} |
Returns:
Type | Description |
---|---|
Dict[str, ndarray] | Dictionary with inference results, where dictionary keys are output names. |
Raises:
Type | Description |
---|---|
PyTritonClientValueError | If mixing of positional and named argument passing is detected. |
PyTritonClientTimeoutError | In case of the first method call, if the wait time for the server and model being ready exceeds init_timeout_s, or if the inference request exceeds inference_timeout_s. |
PyTritonClientModelUnavailableError | If the model with the given name (and version) is unavailable. |
PyTritonClientInferenceServerError | If an error occurred on the inference callable or Triton Inference Server side. |
Source code in pytriton/client/client.py
wait_for_model(timeout_s)
async
Asynchronously wait for the Triton Inference Server and the model deployed on it to be ready.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
timeout_s | float | Timeout for the server and model to get into readiness state. | required |
Raises:
Type | Description |
---|---|
PyTritonClientTimeoutError | If the server and model are not in readiness state before the given timeout. |
PyTritonClientModelUnavailableError | If the model with the given name (and version) is unavailable. |
KeyboardInterrupt | If the hosting process receives SIGINT. |
Source code in pytriton/client/client.py
BaseModelClient(url, model_name, model_version=None, *, lazy_init=True, init_timeout_s=None, inference_timeout_s=None)
Base client for model deployed on the Triton Inference Server.
Inits BaseModelClient for given model deployed on the Triton Inference Server.
Common usage:
```
with ModelClient("localhost", "BERT") as client:
    result_dict = client.infer_sample(input1_sample, input2_sample)
```
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The Triton Inference Server url, e.g. 'grpc://localhost:8001'. In case no scheme is provided, the http scheme will be used as default. In case no port is provided, the default port for the given scheme will be used: 8001 for the grpc scheme, 8000 for the http scheme. | required |
model_name | str | Name of the model to interact with. | required |
model_version | Optional[str] | Version of the model to interact with. If model_version is None, inference will be performed on the latest model. The latest versions of the model are numerically the greatest version numbers. | None |
lazy_init | bool | If initialization should be performed just before sending the first request to the inference server. | True |
init_timeout_s | Optional[float] | Timeout in seconds for the server and model to be ready. If not passed, the default timeout of 300 seconds will be used. | None |
inference_timeout_s | Optional[float] | Timeout in seconds for a single model inference request. If not passed, the default timeout of 60 seconds will be used. | None |
Raises:
Type | Description |
---|---|
PyTritonClientModelUnavailableError | If the model with the given name (and version) is unavailable. |
PyTritonClientTimeoutError | If the wait time for the server and model being ready exceeds init_timeout_s. |
PyTritonClientInvalidUrlError | If the provided Triton Inference Server url is invalid. |
Source code in pytriton/client/client.py
create_client_from_url(url, network_timeout_s=None)
Create Triton Inference Server client.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | Url of the server to connect to. If the url doesn't contain a scheme (e.g. "localhost:8001") the http scheme is added. If the url doesn't contain a port (e.g. "localhost") the default port for the given scheme is added. | required |
network_timeout_s | Optional[float] | Timeout for client commands. Default value is 60.0 s. | None |
Returns:
Type | Description |
---|---|
 | Triton Inference Server client. |
Raises:
Type | Description |
---|---|
PyTritonClientInvalidUrlError | If the provided Triton Inference Server url is invalid. |
Source code in pytriton/client/client.py
DecoupledModelClient(url, model_name, model_version=None, *, lazy_init=True, init_timeout_s=None, inference_timeout_s=None)
Bases: ModelClient
Synchronous client for decoupled model deployed on the Triton Inference Server.
Inits DecoupledModelClient for given model deployed on the Triton Inference Server.
Source code in pytriton/client/client.py
FuturesModelClient(url, model_name, model_version=None, *, max_workers=None, init_timeout_s=None, inference_timeout_s=None)
A client for interacting with a model deployed on the Triton Inference Server using concurrent.futures.
This client allows asynchronous inference requests using a thread pool executor. It can be used to perform inference on a model by providing input data and receiving the corresponding output data. The client can be used in a with statement to ensure proper resource management.
Example usage:
```python
with FuturesModelClient("localhost", "MyModel") as client:
result_future = client.infer_sample(input1=input1_data, input2=input2_data)
# do something else
print(result_future.result())
```
Initializes the FuturesModelClient for a given model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The Triton Inference Server url, e.g. 'grpc://localhost:8001'. | required |
model_name | str | The name of the model to interact with. | required |
model_version | Optional[str] | The version of the model to interact with. If None, the latest version will be used. | None |
max_workers | Optional[int] | The maximum number of threads that can be used to execute the given calls. If None, the default number of worker threads will be used. | None |
init_timeout_s | Optional[float] | Timeout in seconds for the server and model being ready. If not passed, a default 60-second timeout will be used. | None |
inference_timeout_s | Optional[float] | Timeout in seconds for a single model inference request. If not passed, a default 60-second timeout will be used. | None |
Source code in pytriton/client/client.py
__enter__()
__exit__(exc_type, exc_value, traceback)
close(wait=True)
Close resources used by FuturesModelClient.
This method closes the resources used by the FuturesModelClient instance, including the Triton Inference Server connections. Once this method is called, the FuturesModelClient instance should not be used again.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
wait | | If True, then shutdown will not return until all running futures have finished executing. | True |
Source code in pytriton/client/client.py
infer_batch(*inputs, parameters=None, headers=None, **named_inputs)
Run asynchronous inference on batched data and return a Future object.
This method allows the user to perform inference on batched data by providing input data and receiving the corresponding output data. The method returns a Future object that wraps a dictionary of inference results, where dictionary keys are output names.
Example usage:
```python
with FuturesModelClient("localhost", "BERT") as client:
future = client.infer_batch(input1_sample, input2_sample)
# do something else
print(future.result())
```
Inference inputs can be provided either as positional or keyword arguments:
```python
future = client.infer_batch(input1, input2)
future = client.infer_batch(a=input1, b=input2)
```
Mixing of argument passing conventions is not supported and will raise PyTritonClientValueError.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*inputs | | Inference inputs provided as positional arguments. | () |
parameters | Optional[Dict[str, Union[str, int, bool]]] | Optional dictionary of inference parameters. | None |
headers | Optional[Dict[str, Union[str, int, bool]]] | Optional dictionary of HTTP headers for the inference request. | None |
**named_inputs | | Inference inputs provided as named arguments. | {} |
Returns:
Type | Description |
---|---|
Future | A Future object wrapping a dictionary of inference results, where dictionary keys are output names. |
Raises:
Type | Description |
---|---|
PyTritonClientClosedError | If the FuturesModelClient is closed. |
Source code in pytriton/client/client.py
infer_sample(*inputs, parameters=None, headers=None, **named_inputs)
Run asynchronous inference on a single data sample and return a Future object.
This method allows the user to perform inference on a single data sample by providing input data and receiving the corresponding output data. The method returns a Future object that wraps a dictionary of inference results, where dictionary keys are output names.
Example usage:
```python
with FuturesModelClient("localhost", "BERT") as client:
result_future = client.infer_sample(input1=input1_data, input2=input2_data)
# do something else
print(result_future.result())
```
Inference inputs can be provided either as positional or keyword arguments:
```python
future = client.infer_sample(input1, input2)
future = client.infer_sample(a=input1, b=input2)
```
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*inputs | | Inference inputs provided as positional arguments. | () |
parameters | Optional[Dict[str, Union[str, int, bool]]] | Optional dictionary of inference parameters. | None |
headers | Optional[Dict[str, Union[str, int, bool]]] | Optional dictionary of HTTP headers for the inference request. | None |
**named_inputs | | Inference inputs provided as named arguments. | {} |
Returns:
Type | Description |
---|---|
Future | A Future object wrapping a dictionary of inference results, where dictionary keys are output names. |
Raises:
Type | Description |
---|---|
PyTritonClientClosedError | If the FuturesModelClient is closed. |
Source code in pytriton/client/client.py
model_config()
Obtain the configuration of the model deployed on the Triton Inference Server.
This method returns a Future object that will contain the TritonModelConfig object when it is ready. Client will wait init_timeout_s for the server to get into readiness state before obtaining the model configuration.
Returns:
Type | Description |
---|---|
Future | A Future object that will contain the TritonModelConfig object when it is ready. |
Raises:
Type | Description |
---|---|
PyTritonClientClosedError | If the FuturesModelClient is closed. |
Source code in pytriton/client/client.py
wait_for_model(timeout_s)
Returns a Future object whose result will be None when the model is ready.
Typical usage:
```python
with FuturesModelClient("localhost", "BERT") as client:
    future = client.wait_for_model(300.)
    # do something else
    future.result()  # wait the rest of the timeout_s time,
                     # then return None if the model is ready
                     # or raise PyTritonClientTimeoutError
```
Parameters:
Name | Type | Description | Default |
---|---|---|---|
timeout_s | float | The maximum amount of time to wait for the model to be ready, in seconds. | required |
Returns:
Type | Description |
---|---|
Future | A Future object whose result is None when the model is ready. |
Source code in pytriton/client/client.py
ModelClient(url, model_name, model_version=None, *, lazy_init=True, init_timeout_s=None, inference_timeout_s=None)
Bases: BaseModelClient
Synchronous client for model deployed on the Triton Inference Server.
Inits ModelClient for given model deployed on the Triton Inference Server.
If the lazy_init argument is False, the model configuration will be read from the inference server during initialization.
Common usage:
with ModelClient("localhost", "BERT") as client:
    result_dict = client.infer_sample(input1_sample, input2_sample)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The Triton Inference Server url, e.g. 'grpc://localhost:8001'. In case no scheme is provided, the http scheme will be used as default. In case no port is provided, the default port for the given scheme will be used: 8001 for the grpc scheme, 8000 for the http scheme. | required |
model_name | str | Name of the model to interact with. | required |
model_version | Optional[str] | Version of the model to interact with. If model_version is None, inference will be performed on the latest model. The latest versions of the model are numerically the greatest version numbers. | None |
lazy_init | bool | If initialization should be performed just before sending the first request to the inference server. | True |
init_timeout_s | Optional[float] | Timeout for the maximum waiting time in the loop, which sends retry requests asking if the model is ready. It is applied at initialization time only when lazy_init is False; otherwise it is applied when the first inference request is sent. | None |
inference_timeout_s | Optional[float] | Timeout in seconds for the model inference process. If not passed, a default 60-second timeout will be used. For the HTTP client it is not only the inference timeout but any client request timeout - get model config, is model loaded. For the GRPC client it is only the inference timeout. | None |
Raises:
Type | Description |
---|---|
PyTritonClientModelUnavailableError | If the model with the given name (and version) is unavailable. |
PyTritonClientTimeoutError | If the wait time for the server and model being ready exceeds init_timeout_s. |
PyTritonClientUrlParseError | In case of problems with parsing the url. |
Source code in pytriton/client/client.py
is_batching_supported
property
Checks if model supports batching.
Also waits for server to get into readiness state.
model_config: TritonModelConfig
property
Obtain the configuration of the model deployed on the Triton Inference Server.
This method waits for the server to get into readiness state before obtaining the model configuration.
Returns:
Name | Type | Description |
---|---|---|
TritonModelConfig | TritonModelConfig | Configuration of the model deployed on the Triton Inference Server. |
Raises:
Type | Description |
---|---|
PyTritonClientTimeoutError | If the server and model are not in readiness state before the given timeout. |
PyTritonClientModelUnavailableError | If the model with the given name (and version) is unavailable. |
KeyboardInterrupt | If the hosting process receives SIGINT. |
PyTritonClientClosedError | If the ModelClient is closed. |
__enter__()
__exit__(*_)
close()
Close resources used by ModelClient.
This method closes the resources used by the ModelClient instance, including the Triton Inference Server connections. Once this method is called, the ModelClient instance should not be used again.
Source code in pytriton/client/client.py
get_lib()
infer_batch(*inputs, parameters=None, headers=None, **named_inputs)
Run synchronous inference on batched data.
Typical usage:
```python
with ModelClient("localhost", "MyModel") as client:
result_dict = client.infer_batch(input1, input2)
```
Inference inputs can be provided either as positional or keyword arguments:
```python
result_dict = client.infer_batch(input1, input2)
result_dict = client.infer_batch(a=input1, b=input2)
```
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*inputs | | Inference inputs provided as positional arguments. | () |
parameters | Optional[Dict[str, Union[str, int, bool]]] | Custom inference parameters. | None |
headers | Optional[Dict[str, Union[str, int, bool]]] | Custom inference headers. | None |
**named_inputs | | Inference inputs provided as named arguments. | {} |
Returns:
Type | Description |
---|---|
Dict[str, ndarray] | Dictionary with inference results, where dictionary keys are output names. |
Raises:
Type | Description |
---|---|
PyTritonClientValueError | If mixing of positional and named argument passing is detected. |
PyTritonClientTimeoutError | If the wait time for the server and model being ready exceeds init_timeout_s, or if the inference request exceeds inference_timeout_s. |
PyTritonClientModelUnavailableError | If the model with the given name (and version) is unavailable. |
PyTritonClientInferenceServerError | If an error occurred on the inference callable or Triton Inference Server side. |
PyTritonClientModelDoesntSupportBatchingError | If the model doesn't support batching. |
Source code in pytriton/client/client.py
infer_sample(*inputs, parameters=None, headers=None, **named_inputs)
Run synchronous inference on a single data sample.
Typical usage:
```python
with ModelClient("localhost", "MyModel") as client:
result_dict = client.infer_sample(input1, input2)
```
Inference inputs can be provided either as positional or keyword arguments:
```python
result_dict = client.infer_sample(input1, input2)
result_dict = client.infer_sample(a=input1, b=input2)
```
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*inputs | | Inference inputs provided as positional arguments. | () |
parameters | Optional[Dict[str, Union[str, int, bool]]] | Custom inference parameters. | None |
headers | Optional[Dict[str, Union[str, int, bool]]] | Custom inference headers. | None |
**named_inputs | | Inference inputs provided as named arguments. | {} |
Returns:
Type | Description |
---|---|
Dict[str, ndarray] | Dictionary with inference results, where dictionary keys are output names. |
Raises:
Type | Description |
---|---|
PyTritonClientValueError | If mixing of positional and named argument passing is detected. |
PyTritonClientTimeoutError | If the wait time for the server and model being ready exceeds init_timeout_s. |
PyTritonClientModelUnavailableError | If the model with the given name (and version) is unavailable. |
PyTritonClientInferenceServerError | If an error occurred on the inference callable or Triton Inference Server side. |
Source code in pytriton/client/client.py
wait_for_model(timeout_s)
Wait for the Triton Inference Server and the deployed model to be ready.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
timeout_s | float | Timeout in seconds to wait for the server and model to be ready. | required |
Raises:
Type | Description |
---|---|
PyTritonClientTimeoutError | If the server and model are not ready before the given timeout. |
PyTritonClientModelUnavailableError | If the model with the given name (and version) is unavailable. |
KeyboardInterrupt | If the hosting process receives SIGINT. |
PyTritonClientClosedError | If the ModelClient is closed. |