Model Clients

pytriton.client.ModelClient

ModelClient(url: str, model_name: str, model_version: Optional[str] = None, *, lazy_init: bool = True, init_timeout_s: Optional[float] = None, inference_timeout_s: Optional[float] = None, model_config: Optional[TritonModelConfig] = None, ensure_model_is_ready: bool = True)

Bases: BaseModelClient

Synchronous client for a model deployed on the Triton Inference Server.

Initializes ModelClient for the given model deployed on the Triton Inference Server.

If the lazy_init argument is False, the model configuration is read from the inference server during initialization.

Common usage:

client = ModelClient("localhost", "BERT")
result_dict = client.infer_sample(input1_sample, input2_sample)
client.close()

The client also supports the context manager protocol:

with ModelClient("localhost", "BERT") as client:
    result_dict = client.infer_sample(input1_sample, input2_sample)

Creating a client requires a connection to the server and downloading the model configuration. You can create a new client from an existing one using the same class:

client = ModelClient.from_existing_client(existing_client)

Parameters:

  • url (str) –

    The Triton Inference Server url, e.g. 'grpc://localhost:8001'. If no scheme is provided, the http scheme is used by default. If no port is provided, the default port for the given scheme is used: 8001 for grpc, 8000 for http.

  • model_name (str) –

    name of the model to interact with.

  • model_version (Optional[str], default: None ) –

    Version of the model to interact with. If model_version is None, inference is performed on the latest model version, i.e. the one with the numerically greatest version number.

  • lazy_init (bool, default: True ) –

    If True, initialization is deferred until just before the first request is sent to the inference server.

  • init_timeout_s (Optional[float], default: None ) –

    Maximum time in seconds to wait in the retry loop that checks whether the model is ready. It is applied at initialization time only when the lazy_init argument is False; otherwise the retry loop runs at the first inference.

  • inference_timeout_s (Optional[float], default: None ) –

    Timeout in seconds for the model inference process. If not passed, a default timeout of 60 seconds is used. For the HTTP client this timeout applies not only to inference but to any client request (e.g. getting the model config or checking whether the model is loaded). For the GRPC client it applies only to inference.

  • model_config (Optional[TritonModelConfig], default: None ) –

    Model configuration. If not passed, it is read from the inference server during initialization.

  • ensure_model_is_ready (bool, default: True ) –

    If True, the model is checked for readiness before the first inference request.

Raises:

  • PyTritonClientModelUnavailableError

    If model with given name (and version) is unavailable.

  • PyTritonClientTimeoutError

    If the lazy_init argument is False and the wait time for the server and model to become ready exceeds init_timeout_s.

  • PyTritonClientUrlParseError

    If the provided url cannot be parsed.

Source code in pytriton/client/client.py
def __init__(
    self,
    url: str,
    model_name: str,
    model_version: Optional[str] = None,
    *,
    lazy_init: bool = True,
    init_timeout_s: Optional[float] = None,
    inference_timeout_s: Optional[float] = None,
    model_config: Optional[TritonModelConfig] = None,
    ensure_model_is_ready: bool = True,
):
    """Inits ModelClient for given model deployed on the Triton Inference Server.

    If `lazy_init` argument is False, model configuration will be read
    from inference server during initialization.

    Common usage:

    ```python
    client = ModelClient("localhost", "BERT")
    result_dict = client.infer_sample(input1_sample, input2_sample)
    client.close()
    ```

    Client supports also context manager protocol:

    ```python
    with ModelClient("localhost", "BERT") as client:
        result_dict = client.infer_sample(input1_sample, input2_sample)
    ```

    The creation of client requires connection to the server and downloading model configuration. You can create client from existing client using the same class:

    ```python
    client = ModelClient.from_existing_client(existing_client)
    ```

    Args:
        url: The Triton Inference Server url, e.g. 'grpc://localhost:8001'.
            In case no scheme is provided http scheme will be used as default.
            In case no port is provided default port for given scheme will be used -
            8001 for grpc scheme, 8000 for http scheme.
        model_name: name of the model to interact with.
        model_version: version of the model to interact with.
            If model_version is None inference on latest model will be performed.
            The latest versions of the model are numerically the greatest version numbers.
        lazy_init: if initialization should be performed just before sending first request to inference server.
        init_timeout_s: timeout for maximum waiting time in loop, which sends retry requests ask if model is ready. It is applied at initialization time only when `lazy_init` argument is False. Default is to do retry loop at first inference.
        inference_timeout_s: timeout in seconds for the model inference process.
            If non passed default 60 seconds timeout will be used.
            For HTTP client it is not only inference timeout but any client request timeout
            - get model config, is model loaded. For GRPC client it is only inference timeout.
        model_config: model configuration. If not passed, it will be read from inference server during initialization.
        ensure_model_is_ready: if model should be checked if it is ready before first inference request.

    Raises:
        PyTritonClientModelUnavailableError: If model with given name (and version) is unavailable.
        PyTritonClientTimeoutError:
            if `lazy_init` argument is False and wait time for server and model being ready exceeds `init_timeout_s`.
        PyTritonClientUrlParseError: In case of problems with parsing url.
    """
    super().__init__(
        url=url,
        model_name=model_name,
        model_version=model_version,
        lazy_init=lazy_init,
        init_timeout_s=init_timeout_s,
        inference_timeout_s=inference_timeout_s,
        model_config=model_config,
        ensure_model_is_ready=ensure_model_is_ready,
    )
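
Below is a minimal end-to-end sketch. The grpc URL, model name ("MyModel"), input shape, and custom inference timeout are illustrative assumptions; substitute the values of your deployment:

import numpy as np
from pytriton.client import ModelClient

# Explicit grpc scheme and a custom inference timeout (assumed values)
with ModelClient("grpc://localhost:8001", "MyModel", inference_timeout_s=120.0) as client:
    input1 = np.random.rand(3, 224, 224).astype(np.float32)  # single sample, no batch axis
    result_dict = client.infer_sample(input1)
    print(sorted(result_dict.keys()))  # output names are defined by the model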

is_batching_supported property

is_batching_supported

Checks whether the model supports batching.

Also waits for the server to reach the readiness state.
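
The property can be used to pick between infer_sample and infer_batch, as in this sketch (server address, model name, and input shape are assumptions):

import numpy as np
from pytriton.client import ModelClient

batch = np.zeros((8, 16), dtype=np.float32)  # hypothetical batched input

with ModelClient("localhost", "MyModel") as client:
    if client.is_batching_supported:
        results = client.infer_batch(batch)
    else:
        # model has no batch dimension - send a single sample instead
        results = client.infer_sample(batch[0])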

model_config property

model_config: TritonModelConfig

Obtain the configuration of the model deployed on the Triton Inference Server.

This method waits for the server to reach the readiness state before obtaining the model configuration.

Returns:

  • TritonModelConfig ( TritonModelConfig ) –

    configuration of the model deployed on the Triton Inference Server.

Raises:

  • PyTritonClientTimeoutError

    If the server and model are not in readiness state before the given timeout.

  • PyTritonClientModelUnavailableError

    If the model with the given name (and version) is unavailable.

  • KeyboardInterrupt

    If the hosting process receives SIGINT.

  • PyTritonClientClosedError

    If the ModelClient is closed.
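
A short sketch of inspecting the returned configuration; it assumes the TritonModelConfig object exposes model_name, max_batch_size, and an inputs collection of tensor specs, and the server address and model name are placeholders:

from pytriton.client import ModelClient

with ModelClient("localhost", "MyModel") as client:
    config = client.model_config
    print(config.model_name, config.max_batch_size)
    for spec in config.inputs:  # assumed field holding tensor specs (name, dtype, shape)
        print(spec.name, spec.dtype, spec.shape)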

__enter__

__enter__()

Create context for using ModelClient as a context manager.

Source code in pytriton/client/client.py
def __enter__(self):
    """Create context for using ModelClient as a context manager."""
    return self

__exit__

__exit__(*_)

Close resources used by ModelClient instance when exiting from the context.

Source code in pytriton/client/client.py
def __exit__(self, *_):
    """Close resources used by ModelClient instance when exiting from the context."""
    self.close()

close

close()

Close resources used by ModelClient.

This method closes the resources used by the ModelClient instance, including the Triton Inference Server connections. Once this method is called, the ModelClient instance should not be used again.

Source code in pytriton/client/client.py
def close(self):
    """Close resources used by ModelClient.

    This method closes the resources used by the ModelClient instance,
    including the Triton Inference Server connections.
    Once this method is called, the ModelClient instance should not be used again.
    """
    _LOGGER.debug("Closing ModelClient")
    try:
        if self._general_client is not None:
            self._general_client.close()
        if self._infer_client is not None:
            self._infer_client.close()
        self._general_client = None
        self._infer_client = None
    except Exception as e:
        _LOGGER.error(f"Error while closing ModelClient resources: {e}")
        raise e

create_client_from_url

create_client_from_url(url: str, network_timeout_s: Optional[float] = None)

Create Triton Inference Server client.

Parameters:

  • url (str) –

    URL of the server to connect to. If the url doesn't contain a scheme (e.g. "localhost:8001"), the http scheme is added. If it doesn't contain a port (e.g. "localhost"), the default port for the given scheme is added.

  • network_timeout_s (Optional[float], default: None ) –

    timeout for client commands. Default value is 60.0 s.

Returns:

  • Triton Inference Server client.

Raises:

  • PyTritonClientInvalidUrlError

    If provided Triton Inference Server url is invalid.

Source code in pytriton/client/client.py
def create_client_from_url(self, url: str, network_timeout_s: Optional[float] = None):
    """Create Triton Inference Server client.

    Args:
        url: url of the server to connect to.
            If url doesn't contain scheme (e.g. "localhost:8001") http scheme is added.
            If url doesn't contain port (e.g. "localhost") default port for given scheme is added.
        network_timeout_s: timeout for client commands. Default value is 60.0 s.

    Returns:
        Triton Inference Server client.

    Raises:
        PyTritonClientInvalidUrlError: If provided Triton Inference Server url is invalid.
    """
    self._triton_url = TritonUrl.from_url(url)
    self._url = self._triton_url.without_scheme
    self._triton_client_lib = self.get_lib()
    self._monkey_patch_client()

    if self._triton_url.scheme == "grpc":
        # by default grpc client has very large number of timeout, thus we want to make it equal to http client timeout
        network_timeout_s = _DEFAULT_NETWORK_TIMEOUT_S if network_timeout_s is None else network_timeout_s
        warnings.warn(
            f"tritonclient.grpc doesn't support timeout for other commands than infer. Ignoring network_timeout: {network_timeout_s}.",
            NotSupportedTimeoutWarning,
            stacklevel=1,
        )

    triton_client_init_kwargs = self._get_init_extra_args()

    _LOGGER.debug(
        f"Creating InferenceServerClient for {self._triton_url.with_scheme} with {triton_client_init_kwargs}"
    )
    return self._triton_client_lib.InferenceServerClient(self._url, **triton_client_init_kwargs)

from_existing_client classmethod

from_existing_client(existing_client: BaseModelClient)

Create a new instance from an existing client using the same class.

Common usage:

client = BaseModelClient.from_existing_client(existing_client)

Parameters:

  • existing_client (BaseModelClient) –

    An instance of an already initialized subclass.

Returns:

  • A new instance of the same subclass with shared configuration and readiness state.

Source code in pytriton/client/client.py
@classmethod
def from_existing_client(cls, existing_client: "BaseModelClient"):
    """Create a new instance from an existing client using the same class.

    Common usage:
    ```python
    client = BaseModelClient.from_existing_client(existing_client)
    ```

    Args:
        existing_client: An instance of an already initialized subclass.

    Returns:
        A new instance of the same subclass with shared configuration and readiness state.
    """
    kwargs = {}
    # Copy model configuration and readiness state if present
    if hasattr(existing_client, "_model_config"):
        kwargs["model_config"] = existing_client._model_config
        kwargs["ensure_model_is_ready"] = False

    new_client = cls(
        url=existing_client._url,
        model_name=existing_client._model_name,
        model_version=existing_client._model_version,
        init_timeout_s=existing_client._init_timeout_s,
        inference_timeout_s=existing_client._inference_timeout_s,
        **kwargs,
    )

    return new_client
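
The sketch below creates a second client from an existing one; if the first client has already fetched the model configuration, the new instance reuses it and skips the readiness check (server address and model name are assumptions):

from pytriton.client import ModelClient

first_client = ModelClient("localhost", "MyModel")
first_client.wait_for_model(timeout_s=60.0)

# Shares configuration and readiness state with the first client when available
second_client = ModelClient.from_existing_client(first_client)

second_client.close()
first_client.close()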

get_lib

get_lib()

Returns tritonclient library for given scheme.

Source code in pytriton/client/client.py
def get_lib(self):
    """Returns tritonclient library for given scheme."""
    return {"grpc": tritonclient.grpc, "http": tritonclient.http}[self._triton_url.scheme.lower()]

infer_batch

infer_batch(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs) -> Dict[str, ndarray]

Run synchronous inference on batched data.

Typical usage:

client = ModelClient("localhost", "MyModel")
result_dict = client.infer_batch(input1, input2)
client.close()

Inference inputs can be provided either as positional or keyword arguments:

result_dict = client.infer_batch(input1, input2)
result_dict = client.infer_batch(a=input1, b=input2)

Parameters:

  • *inputs

    Inference inputs provided as positional arguments.

  • parameters (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    Custom inference parameters.

  • headers (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    Custom inference headers.

  • **named_inputs

    Inference inputs provided as named arguments.

Returns:

  • Dict[str, ndarray]

    Dictionary with inference results, where dictionary keys are output names.

Raises:

  • PyTritonClientValueError

    If mixing of positional and named arguments is detected.

  • PyTritonClientTimeoutError

    If the wait time for the server and model to become ready exceeds init_timeout_s (on the first call when lazy_init is False), or the inference request time exceeds the inference_timeout_s passed to __init__.

  • PyTritonClientModelUnavailableError

    If the model with the given name (and version) is unavailable.

  • PyTritonClientInferenceServerError

    If an error occurred on the inference callable or Triton Inference Server side.

  • PyTritonClientModelDoesntSupportBatchingError

    If the model doesn't support batching.

Source code in pytriton/client/client.py
def infer_batch(
    self,
    *inputs,
    parameters: Optional[Dict[str, Union[str, int, bool]]] = None,
    headers: Optional[Dict[str, Union[str, int, bool]]] = None,
    **named_inputs,
) -> Dict[str, np.ndarray]:
    """Run synchronous inference on batched data.

    Typical usage:

    ```python
    client = ModelClient("localhost", "MyModel")
    result_dict = client.infer_batch(input1, input2)
    client.close()
    ```

    Inference inputs can be provided either as positional or keyword arguments:

    ```python
    result_dict = client.infer_batch(input1, input2)
    result_dict = client.infer_batch(a=input1, b=input2)
    ```

    Args:
        *inputs: Inference inputs provided as positional arguments.
        parameters: Custom inference parameters.
        headers: Custom inference headers.
        **named_inputs: Inference inputs provided as named arguments.

    Returns:
        Dictionary with inference results, where dictionary keys are output names.

    Raises:
        PyTritonClientValueError: If mixing of positional and named arguments passing detected.
        PyTritonClientTimeoutError: If the wait time for the server and model being ready exceeds `init_timeout_s` or
            inference request time exceeds `inference_timeout_s`.
        PyTritonClientModelUnavailableError: If the model with the given name (and version) is unavailable.
        PyTritonClientInferenceServerError: If an error occurred on the inference callable or Triton Inference Server side.
        PyTritonClientModelDoesntSupportBatchingError: If the model doesn't support batching.
        PyTritonClientValueError: if mixing of positional and named arguments passing detected.
        PyTritonClientTimeoutError:
            in case of first method call, `lazy_init` argument is False
            and wait time for server and model being ready exceeds `init_timeout_s` or
            inference time exceeds `inference_timeout_s` passed to `__init__`.
        PyTritonClientModelUnavailableError: If model with given name (and version) is unavailable.
        PyTritonClientInferenceServerError: If error occurred on inference callable or Triton Inference Server side,
    """
    _verify_inputs_args(inputs, named_inputs)
    _verify_parameters(parameters)
    _verify_parameters(headers)

    if not self.is_batching_supported:
        raise PyTritonClientModelDoesntSupportBatchingError(
            f"Model {self.model_config.model_name} doesn't support batching - use infer_sample method instead"
        )

    return self._infer(inputs or named_inputs, parameters, headers)
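
A sketch of batched inference with stacked samples; the model name, the single-input assumption, and the array shapes are illustrative:

import numpy as np
from pytriton.client import ModelClient

samples = [np.random.rand(16).astype(np.float32) for _ in range(4)]
batch = np.stack(samples)  # shape (4, 16): the first axis is the batch dimension

with ModelClient("localhost", "MyModel") as client:
    result_dict = client.infer_batch(batch)
    for name, array in result_dict.items():
        print(name, array.shape)  # outputs keep the batch dimension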

infer_sample

infer_sample(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs) -> Dict[str, ndarray]

Run synchronous inference on a single data sample.

Typical usage:

client = ModelClient("localhost", "MyModel")
result_dict = client.infer_sample(input1, input2)
client.close()

Inference inputs can be provided either as positional or keyword arguments:

result_dict = client.infer_sample(input1, input2)
result_dict = client.infer_sample(a=input1, b=input2)

Parameters:

  • *inputs

    Inference inputs provided as positional arguments.

  • parameters (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    Custom inference parameters.

  • headers (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    Custom inference headers.

  • **named_inputs

    Inference inputs provided as named arguments.

Returns:

  • Dict[str, ndarray]

    Dictionary with inference results, where dictionary keys are output names.

Raises:

  • PyTritonClientValueError

    If mixing of positional and named arguments is detected.

  • PyTritonClientTimeoutError

    If the wait time for the server and model to become ready exceeds init_timeout_s or the inference request time exceeds inference_timeout_s.

  • PyTritonClientModelUnavailableError

    If the model with the given name (and version) is unavailable.

  • PyTritonClientInferenceServerError

    If an error occurred on the inference callable or Triton Inference Server side.

Source code in pytriton/client/client.py
def infer_sample(
    self,
    *inputs,
    parameters: Optional[Dict[str, Union[str, int, bool]]] = None,
    headers: Optional[Dict[str, Union[str, int, bool]]] = None,
    **named_inputs,
) -> Dict[str, np.ndarray]:
    """Run synchronous inference on a single data sample.

    Typical usage:

    ```python
    client = ModelClient("localhost", "MyModel")
    result_dict = client.infer_sample(input1, input2)
    client.close()
    ```

    Inference inputs can be provided either as positional or keyword arguments:

    ```python
    result_dict = client.infer_sample(input1, input2)
    result_dict = client.infer_sample(a=input1, b=input2)
    ```

    Args:
        *inputs: Inference inputs provided as positional arguments.
        parameters: Custom inference parameters.
        headers: Custom inference headers.
        **named_inputs: Inference inputs provided as named arguments.

    Returns:
        Dictionary with inference results, where dictionary keys are output names.

    Raises:
        PyTritonClientValueError: If mixing of positional and named arguments passing detected.
        PyTritonClientTimeoutError: If the wait time for the server and model being ready exceeds `init_timeout_s` or
            inference request time exceeds `inference_timeout_s`.
        PyTritonClientModelUnavailableError: If the model with the given name (and version) is unavailable.
        PyTritonClientInferenceServerError: If an error occurred on the inference callable or Triton Inference Server side.
    """
    _verify_inputs_args(inputs, named_inputs)
    _verify_parameters(parameters)
    _verify_parameters(headers)

    if self.is_batching_supported:
        if inputs:
            inputs = tuple(data[np.newaxis, ...] for data in inputs)
        elif named_inputs:
            named_inputs = {name: data[np.newaxis, ...] for name, data in named_inputs.items()}

    result = self._infer(inputs or named_inputs, parameters, headers)

    return self._debatch_result(result)

load_model

load_model(config: Optional[str] = None, files: Optional[dict] = None)

Load model on the Triton Inference Server.

Parameters:

  • config (Optional[str], default: None ) –

    Optional JSON representation of a model config for the load request; if provided, this config is used when loading the model.

  • files (Optional[dict], default: None ) –

    Optional dictionary mapping file paths (with a "file:" prefix) in the override model directory to the file contents as bytes. These files form the model directory the model is loaded from. If specified, config must also be provided as the model configuration of the override model directory.

Source code in pytriton/client/client.py
def load_model(self, config: Optional[str] = None, files: Optional[dict] = None):
    """Load model on the Triton Inference Server.

    Args:
        config: str - Optional JSON representation of a model config provided for
            the load request, if provided, this config will be used for
            loading the model.
        files: dict - Optional dictionary specifying file path (with "file:" prefix) in
            the override model directory to the file content as bytes.
            The files will form the model directory that the model will be
            loaded from. If specified, 'config' must be provided to be
            the model configuration of the override model directory.
    """
    self._general_client.load_model(self._model_name, config=config, files=files)
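
A hedged sketch of loading a model with an inline config and file override; the config JSON, the "file:1/model.py" key format, and the file payload are assumptions based on the Triton model repository layout and may need adjusting for your backend:

import json
from pytriton.client import ModelClient

client = ModelClient("localhost", "MyModel")

# Hypothetical minimal configuration for the override model directory
config = json.dumps({"name": "MyModel", "backend": "python", "max_batch_size": 8})
files = {"file:1/model.py": b"# model implementation bytes go here"}

client.load_model(config=config, files=files)
client.close()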

unload_model

unload_model()

Unload model from the Triton Inference Server.

Source code in pytriton/client/client.py
def unload_model(self):
    """Unload model from the Triton Inference Server."""
    self._general_client.unload_model(self._model_name)

wait_for_model

wait_for_model(timeout_s: float)

Wait for the Triton Inference Server and the deployed model to be ready.

Parameters:

  • timeout_s (float) –

    timeout in seconds to wait for the server and model to be ready.

Raises:

  • PyTritonClientTimeoutError

    If the server and model are not ready before the given timeout.

  • PyTritonClientModelUnavailableError

    If the model with the given name (and version) is unavailable.

  • KeyboardInterrupt

    If the hosting process receives SIGINT.

  • PyTritonClientClosedError

    If the ModelClient is closed.

Source code in pytriton/client/client.py
def wait_for_model(self, timeout_s: float):
    """Wait for the Triton Inference Server and the deployed model to be ready.

    Args:
        timeout_s: timeout in seconds to wait for the server and model to be ready.

    Raises:
        PyTritonClientTimeoutError: If the server and model are not ready before the given timeout.
        PyTritonClientModelUnavailableError: If the model with the given name (and version) is unavailable.
        KeyboardInterrupt: If the hosting process receives SIGINT.
        PyTritonClientClosedError: If the ModelClient is closed.
    """
    if self._general_client is None:
        raise PyTritonClientClosedError("ModelClient is closed")
    wait_for_model_ready(self._general_client, self._model_name, self._model_version, timeout_s=timeout_s)
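
With the default lazy_init=True, the client can be made to block until the model is ready before the first request, as in this sketch (address, model name, and timeout value are assumptions):

from pytriton.client import ModelClient

client = ModelClient("localhost", "MyModel")  # lazy_init=True by default
client.wait_for_model(timeout_s=120.0)  # block until the server and model report readiness
# ... run inference here ...
client.close()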

wait_for_server

wait_for_server(timeout_s: float)

Wait for Triton Inference Server readiness.

Parameters:

  • timeout_s (float) –

    Timeout in seconds for the server to reach the readiness state.

Raises:

  • PyTritonClientTimeoutError

    If the server is not in the readiness state before the given timeout.

  • KeyboardInterrupt

    If the hosting process receives SIGINT.

Source code in pytriton/client/client.py
def wait_for_server(self, timeout_s: float):
    """Wait for Triton Inference Server readiness.

    Args:
        timeout_s: timeout to server get into readiness state.

    Raises:
        PyTritonClientTimeoutError: If server is not in readiness state before given timeout.
        KeyboardInterrupt: If hosting process receives SIGINT
    """
    wait_for_server_ready(self._general_client, timeout_s=timeout_s)

pytriton.client.AsyncioModelClient

AsyncioModelClient(url: str, model_name: str, model_version: Optional[str] = None, *, lazy_init: bool = True, init_timeout_s: Optional[float] = None, inference_timeout_s: Optional[float] = None, model_config: Optional[TritonModelConfig] = None, ensure_model_is_ready: bool = True)

Bases: BaseModelClient

Asyncio client for a model deployed on the Triton Inference Server.

This client is based on the asyncio Triton Inference Server Python clients:
  • tritonclient.http.aio.InferenceServerClient
  • tritonclient.grpc.aio.InferenceServerClient

It can wait for the server to be ready and the model to be loaded, and then perform inference on it. AsyncioModelClient supports the asyncio context manager protocol.

Typical usage:

from pytriton.client import AsyncioModelClient
import numpy as np

input1_sample = np.random.rand(1, 3, 224, 224).astype(np.float32)
input2_sample = np.random.rand(1, 3, 224, 224).astype(np.float32)

client = AsyncioModelClient("localhost", "MyModel")
result_dict = await client.infer_sample(input1_sample, input2_sample)
print(result_dict["output_name"])
await client.close()

Initializes AsyncioModelClient for the given model deployed on the Triton Inference Server.

If the lazy_init argument is False, the model configuration is read from the inference server during initialization.

Parameters:

  • url (str) –

    The Triton Inference Server url, e.g. 'grpc://localhost:8001'. If no scheme is provided, the http scheme is used by default. If no port is provided, the default port for the given scheme is used: 8001 for grpc, 8000 for http.

  • model_name (str) –

    name of the model to interact with.

  • model_version (Optional[str], default: None ) –

    Version of the model to interact with. If model_version is None, inference is performed on the latest model version, i.e. the one with the numerically greatest version number.

  • lazy_init (bool, default: True ) –

    If True, initialization is deferred until just before the first request is sent to the inference server.

  • init_timeout_s (Optional[float], default: None ) –

    Timeout in seconds for the server and model to become ready.

  • model_config (Optional[TritonModelConfig], default: None ) –

    Model configuration. If not passed, it is read from the inference server during initialization.

  • ensure_model_is_ready (bool, default: True ) –

    If True, the model is checked for readiness before the first inference request.

Raises:

  • PyTritonClientModelUnavailableError

    If model with given name (and version) is unavailable.

  • PyTritonClientTimeoutError

    If the lazy_init argument is False and the wait time for the server and model to become ready exceeds init_timeout_s.

  • PyTritonClientUrlParseError

    If the provided url cannot be parsed.

Source code in pytriton/client/client.py
def __init__(
    self,
    url: str,
    model_name: str,
    model_version: Optional[str] = None,
    *,
    lazy_init: bool = True,
    init_timeout_s: Optional[float] = None,
    inference_timeout_s: Optional[float] = None,
    model_config: Optional[TritonModelConfig] = None,
    ensure_model_is_ready: bool = True,
):
    """Inits ModelClient for given model deployed on the Triton Inference Server.

    If `lazy_init` argument is False, model configuration will be read
    from inference server during initialization.

    Args:
        url: The Triton Inference Server url, e.g. 'grpc://localhost:8001'.
            In case no scheme is provided http scheme will be used as default.
            In case no port is provided default port for given scheme will be used -
            8001 for grpc scheme, 8000 for http scheme.
        model_name: name of the model to interact with.
        model_version: version of the model to interact with.
            If model_version is None inference on latest model will be performed.
            The latest versions of the model are numerically the greatest version numbers.
        lazy_init: if initialization should be performed just before sending first request to inference server.
        init_timeout_s: timeout for server and model being ready.
        model_config: model configuration. If not passed, it will be read from inference server during initialization.
        ensure_model_is_ready: if model should be checked if it is ready before first inference request.

    Raises:
        PyTritonClientModelUnavailableError: If model with given name (and version) is unavailable.
        PyTritonClientTimeoutError: if `lazy_init` argument is False and wait time for server and model being ready exceeds `init_timeout_s`.
        PyTritonClientUrlParseError: In case of problems with parsing url.
    """
    super().__init__(
        url=url,
        model_name=model_name,
        model_version=model_version,
        lazy_init=lazy_init,
        init_timeout_s=init_timeout_s,
        inference_timeout_s=inference_timeout_s,
        model_config=model_config,
        ensure_model_is_ready=ensure_model_is_ready,
    )
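
A sketch of driving the client from an asyncio entry point with the async context manager; the server address, model name, and input array are assumptions:

import asyncio
import numpy as np
from pytriton.client import AsyncioModelClient

async def main():
    sample = np.random.rand(16).astype(np.float32)
    async with AsyncioModelClient("localhost", "MyModel") as client:
        result_dict = await client.infer_sample(sample)
        print(sorted(result_dict.keys()))

asyncio.run(main())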

model_config async property

model_config

Obtain the configuration of the model deployed on the Triton Inference Server.

Also waits for the server to reach the readiness state.

__aenter__ async

__aenter__()

Create context for using AsyncioModelClient as a context manager.

Source code in pytriton/client/client.py
async def __aenter__(self):
    """Create context for use AsyncioModelClient as a context manager."""
    _LOGGER.debug("Entering AsyncioModelClient context")
    try:
        if not self._lazy_init:
            _LOGGER.debug("Waiting in AsyncioModelClient context for model to be ready")
            await self._wait_and_init_model_config(self._init_timeout_s)
            _LOGGER.debug("Model is ready in AsyncioModelClient context")
        return self
    except Exception as e:
        _LOGGER.error("Error occurred during AsyncioModelClient context initialization")
        await self.close()
        raise e

__aexit__ async

__aexit__(*_)

Close resources used by AsyncioModelClient when exiting from context.

Source code in pytriton/client/client.py
async def __aexit__(self, *_):
    """Close resources used by AsyncioModelClient when exiting from context."""
    await self.close()
    _LOGGER.debug("Exiting AsyncioModelClient context")

close async

close()

Close resources used by _ModelClientBase.

Source code in pytriton/client/client.py
async def close(self):
    """Close resources used by _ModelClientBase."""
    _LOGGER.debug("Closing InferenceServerClient")
    await self._general_client.close()
    await self._infer_client.close()
    _LOGGER.debug("InferenceServerClient closed")

create_client_from_url

create_client_from_url(url: str, network_timeout_s: Optional[float] = None)

Create Triton Inference Server client.

Parameters:

  • url (str) –

    URL of the server to connect to. If the url doesn't contain a scheme (e.g. "localhost:8001"), the http scheme is added. If it doesn't contain a port (e.g. "localhost"), the default port for the given scheme is added.

  • network_timeout_s (Optional[float], default: None ) –

    timeout for client commands. Default value is 60.0 s.

Returns:

  • Triton Inference Server client.

Raises:

  • PyTritonClientInvalidUrlError

    If provided Triton Inference Server url is invalid.

Source code in pytriton/client/client.py
def create_client_from_url(self, url: str, network_timeout_s: Optional[float] = None):
    """Create Triton Inference Server client.

    Args:
        url: url of the server to connect to.
            If url doesn't contain scheme (e.g. "localhost:8001") http scheme is added.
            If url doesn't contain port (e.g. "localhost") default port for given scheme is added.
        network_timeout_s: timeout for client commands. Default value is 60.0 s.

    Returns:
        Triton Inference Server client.

    Raises:
        PyTritonClientInvalidUrlError: If provided Triton Inference Server url is invalid.
    """
    self._triton_url = TritonUrl.from_url(url)
    self._url = self._triton_url.without_scheme
    self._triton_client_lib = self.get_lib()
    self._monkey_patch_client()

    if self._triton_url.scheme == "grpc":
        # by default grpc client has very large number of timeout, thus we want to make it equal to http client timeout
        network_timeout_s = _DEFAULT_NETWORK_TIMEOUT_S if network_timeout_s is None else network_timeout_s
        warnings.warn(
            f"tritonclient.grpc doesn't support timeout for other commands than infer. Ignoring network_timeout: {network_timeout_s}.",
            NotSupportedTimeoutWarning,
            stacklevel=1,
        )

    triton_client_init_kwargs = self._get_init_extra_args()

    _LOGGER.debug(
        f"Creating InferenceServerClient for {self._triton_url.with_scheme} with {triton_client_init_kwargs}"
    )
    return self._triton_client_lib.InferenceServerClient(self._url, **triton_client_init_kwargs)

from_existing_client classmethod

from_existing_client(existing_client: BaseModelClient)

Create a new instance from an existing client using the same class.

Common usage:

client = BaseModelClient.from_existing_client(existing_client)

Parameters:

  • existing_client (BaseModelClient) –

    An instance of an already initialized subclass.

Returns:

  • A new instance of the same subclass with shared configuration and readiness state.

Source code in pytriton/client/client.py
@classmethod
def from_existing_client(cls, existing_client: "BaseModelClient"):
    """Create a new instance from an existing client using the same class.

    Common usage:
    ```python
    client = BaseModelClient.from_existing_client(existing_client)
    ```

    Args:
        existing_client: An instance of an already initialized subclass.

    Returns:
        A new instance of the same subclass with shared configuration and readiness state.
    """
    kwargs = {}
    # Copy model configuration and readiness state if present
    if hasattr(existing_client, "_model_config"):
        kwargs["model_config"] = existing_client._model_config
        kwargs["ensure_model_is_ready"] = False

    new_client = cls(
        url=existing_client._url,
        model_name=existing_client._model_name,
        model_version=existing_client._model_version,
        init_timeout_s=existing_client._init_timeout_s,
        inference_timeout_s=existing_client._inference_timeout_s,
        **kwargs,
    )

    return new_client

get_lib

get_lib()

Get Triton Inference Server Python client library.

Source code in pytriton/client/client.py
def get_lib(self):
    """Get Triton Inference Server Python client library."""
    return {"grpc": tritonclient.grpc.aio, "http": tritonclient.http.aio}[self._triton_url.scheme.lower()]

infer_batch async

infer_batch(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs)

Run asynchronous inference on batched data.

Typical usage:

client = AsyncioModelClient("localhost", "MyModel")
result_dict = await client.infer_batch(input1, input2)
await client.close()

Inference inputs can be provided either as positional or keyword arguments:

result_dict = await client.infer_batch(input1, input2)
result_dict = await client.infer_batch(a=input1, b=input2)

Mixing of argument passing conventions is not supported and will raise PyTritonClientValueError.

Parameters:

  • *inputs

    inference inputs provided as positional arguments.

  • parameters (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    custom inference parameters.

  • headers (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    custom inference headers.

  • **named_inputs

    inference inputs provided as named arguments.

Returns:

  • dictionary with inference results, where dictionary keys are output names.

Raises:

  • PyTritonClientValueError

    if mixing of positional and named arguments is detected.

  • PyTritonClientTimeoutError

    if, on the first method call, lazy_init is False and the wait time for the server and model to become ready exceeds init_timeout_s, or the inference time exceeds inference_timeout_s.

  • PyTritonClientModelDoesntSupportBatchingError

    if the model doesn't support batching.

  • PyTritonClientModelUnavailableError

    If model with given name (and version) is unavailable.

  • PyTritonClientInferenceServerError

    If error occurred on inference callable or Triton Inference Server side.

Source code in pytriton/client/client.py
async def infer_batch(
    self,
    *inputs,
    parameters: Optional[Dict[str, Union[str, int, bool]]] = None,
    headers: Optional[Dict[str, Union[str, int, bool]]] = None,
    **named_inputs,
):
    """Run asynchronous inference on batched data.

    Typical usage:

    ```python
    client = AsyncioModelClient("localhost", "MyModel")
    result_dict = await client.infer_batch(input1, input2)
    await client.close()
    ```

    Inference inputs can be provided either as positional or keyword arguments:

    ```python
    result_dict = await client.infer_batch(input1, input2)
    result_dict = await client.infer_batch(a=input1, b=input2)
    ```

    Mixing of argument passing conventions is not supported and will raise PyTritonClientValueError.

    Args:
        *inputs: inference inputs provided as positional arguments.
        parameters: custom inference parameters.
        headers: custom inference headers.
        **named_inputs: inference inputs provided as named arguments.

    Returns:
        dictionary with inference results, where dictionary keys are output names.

    Raises:
        PyTritonClientValueError: if mixing of positional and named arguments passing detected.
        PyTritonClientTimeoutError:
            in case of first method call, `lazy_init` argument is False
            and wait time for server and model being ready exceeds `init_timeout_s`
            or inference time exceeds `timeout_s`.
        PyTritonClientModelDoesntSupportBatchingError: if model doesn't support batching.
        PyTritonClientModelUnavailableError: If model with given name (and version) is unavailable.
        PyTritonClientInferenceServerError: If error occurred on inference callable or Triton Inference Server side.
    """
    _verify_inputs_args(inputs, named_inputs)
    _verify_parameters(parameters)
    _verify_parameters(headers)

    _LOGGER.debug(f"Running inference for {self._model_name}")
    model_config = await self.model_config
    _LOGGER.debug(f"Model config for {self._model_name} obtained")

    model_supports_batching = model_config.max_batch_size > 0
    if not model_supports_batching:
        _LOGGER.error(f"Model {model_config.model_name} doesn't support batching")
        raise PyTritonClientModelDoesntSupportBatchingError(
            f"Model {model_config.model_name} doesn't support batching - use infer_sample method instead"
        )

    _LOGGER.debug(f"Running _infer for {self._model_name}")
    result = await self._infer(inputs or named_inputs, parameters, headers)
    _LOGGER.debug(f"_infer for {self._model_name} finished")
    return result
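
Because the call is awaitable, several requests can run concurrently, for example one client per task as in this sketch (address, model name, and inputs are assumptions; sharing a single client instance between tasks is not shown here):

import asyncio
import numpy as np
from pytriton.client import AsyncioModelClient

async def infer_one(sample):
    async with AsyncioModelClient("localhost", "MyModel") as client:
        return await client.infer_sample(sample)

async def main():
    samples = [np.random.rand(16).astype(np.float32) for _ in range(4)]
    results = await asyncio.gather(*(infer_one(s) for s in samples))
    print(len(results))

asyncio.run(main())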

infer_sample async

infer_sample(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs)

Run asynchronous inference on a single data sample.

Typical usage:

client = AsyncioModelClient("localhost", "MyModel")
result_dict = await client.infer_sample(input1, input2)
await client.close()

Inference inputs can be provided either as positional or keyword arguments:

result_dict = await client.infer_sample(input1, input2)
result_dict = await client.infer_sample(a=input1, b=input2)

Mixing of argument passing conventions is not supported and will raise PyTritonClientValueError.

Parameters:

  • *inputs

    inference inputs provided as positional arguments.

  • parameters (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    custom inference parameters.

  • headers (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    custom inference headers.

  • **named_inputs

    inference inputs provided as named arguments.

Returns:

  • dictionary with inference results, where dictionary keys are output names.

Raises:

  • PyTritonClientValueError

    if mixing of positional and named arguments is detected.

  • PyTritonClientTimeoutError

    if, on the first method call, lazy_init is False and the wait time for the server and model to become ready exceeds init_timeout_s, or the inference time exceeds inference_timeout_s.

  • PyTritonClientModelUnavailableError

    If model with given name (and version) is unavailable.

  • PyTritonClientInferenceServerError

    If error occurred on inference callable or Triton Inference Server side.

Source code in pytriton/client/client.py
async def infer_sample(
    self,
    *inputs,
    parameters: Optional[Dict[str, Union[str, int, bool]]] = None,
    headers: Optional[Dict[str, Union[str, int, bool]]] = None,
    **named_inputs,
):
    """Run asynchronous inference on single data sample.

    Typical usage:

    ```python
    client = AsyncioModelClient("localhost", "MyModel")
    result_dict = await client.infer_sample(input1, input2)
    await client.close()
    ```

    Inference inputs can be provided either as positional or keyword arguments:

    ```python
    result_dict = await client.infer_sample(input1, input2)
    result_dict = await client.infer_sample(a=input1, b=input2)
    ```

    Mixing of argument passing conventions is not supported and will raise PyTritonClientRuntimeError.

    Args:
        *inputs: inference inputs provided as positional arguments.
        parameters: custom inference parameters.
        headers: custom inference headers.
        **named_inputs: inference inputs provided as named arguments.

    Returns:
        dictionary with inference results, where dictionary keys are output names.

    Raises:
        PyTritonClientValueError: if mixing of positional and named arguments passing detected.
        PyTritonClientTimeoutError:
            in case of first method call, `lazy_init` argument is False
            and wait time for server and model being ready exceeds `init_timeout_s`
            or inference time exceeds `timeout_s`.
        PyTritonClientModelUnavailableError: If model with given name (and version) is unavailable.
        PyTritonClientInferenceServerError: If error occurred on inference callable or Triton Inference Server side.
    """
    _verify_inputs_args(inputs, named_inputs)
    _verify_parameters(parameters)
    _verify_parameters(headers)

    _LOGGER.debug(f"Running inference for {self._model_name}")
    model_config = await self.model_config
    _LOGGER.debug(f"Model config for {self._model_name} obtained")

    model_supports_batching = model_config.max_batch_size > 0
    if model_supports_batching:
        if inputs:
            inputs = tuple(data[np.newaxis, ...] for data in inputs)
        elif named_inputs:
            named_inputs = {name: data[np.newaxis, ...] for name, data in named_inputs.items()}

    _LOGGER.debug(f"Running _infer for {self._model_name}")
    result = await self._infer(inputs or named_inputs, parameters, headers)
    _LOGGER.debug(f"_infer for {self._model_name} finished")
    if model_supports_batching:
        result = {name: data[0] for name, data in result.items()}

    return result

wait_for_model async

wait_for_model(timeout_s: float)

Asynchronously wait for the Triton Inference Server and the deployed model to be ready.

Parameters:

  • timeout_s (float) –

    Timeout in seconds for the server and model to reach the readiness state.

Raises:

  • PyTritonClientTimeoutError

    If the server and model are not in the readiness state before the given timeout.

  • PyTritonClientModelUnavailableError

    If model with given name (and version) is unavailable.

  • KeyboardInterrupt

    If the hosting process receives SIGINT.

Source code in pytriton/client/client.py
async def wait_for_model(self, timeout_s: float):
    """Asynchronous wait for Triton Inference Server and deployed on it model readiness.

    Args:
        timeout_s: timeout to server and model get into readiness state.

    Raises:
        PyTritonClientTimeoutError: If server and model are not in readiness state before given timeout.
        PyTritonClientModelUnavailableError: If model with given name (and version) is unavailable.
        KeyboardInterrupt: If hosting process receives SIGINT
    """
    _LOGGER.debug(f"Waiting for model {self._model_name} to be ready")
    try:
        await asyncio.wait_for(
            asyncio_wait_for_model_ready(
                self._general_client, self._model_name, self._model_version, timeout_s=timeout_s
            ),
            self._init_timeout_s,
        )
    except asyncio.TimeoutError as e:
        message = f"Timeout while waiting for model {self._model_name} to be ready for {self._init_timeout_s}s"
        _LOGGER.error(message)
        raise PyTritonClientTimeoutError(message) from e

pytriton.client.DecoupledModelClient

DecoupledModelClient(url: str, model_name: str, model_version: Optional[str] = None, *, lazy_init: bool = True, init_timeout_s: Optional[float] = None, inference_timeout_s: Optional[float] = None, model_config: Optional[TritonModelConfig] = None, ensure_model_is_ready: bool = True)

Bases: ModelClient

Synchronous client for a decoupled model deployed on the Triton Inference Server.

Initializes DecoupledModelClient for the given decoupled model deployed on the Triton Inference Server.

Common usage:

client = DecoupledModelClient("localhost", "BERT")
for response in client.infer_sample(input1_sample, input2_sample):
    print(response)
client.close()

Parameters:

  • url (str) –

    The Triton Inference Server url, e.g. grpc://localhost:8001. If no scheme is provided, the http scheme is used by default. If no port is provided, the default port for the given scheme is used: 8001 for grpc, 8000 for http.

  • model_name (str) –

    name of the model to interact with.

  • model_version (Optional[str], default: None ) –

    Version of the model to interact with. If model_version is None, inference is performed on the latest model version, i.e. the one with the numerically greatest version number.

  • lazy_init (bool, default: True ) –

    If True, initialization is deferred until just before the first request is sent to the inference server.

  • init_timeout_s (Optional[float], default: None ) –

    timeout in seconds for the server and model to be ready. If not passed, the default timeout of 300 seconds will be used.

  • inference_timeout_s (Optional[float], default: None ) –

    timeout in seconds for a single model inference request. If not passed, the default timeout of 60 seconds will be used.

  • model_config (Optional[TritonModelConfig], default: None ) –

    model configuration. If not passed, it will be read from inference server during initialization.

  • ensure_model_is_ready (bool, default: True ) –

    If True, the model is checked for readiness before the first inference request.

Raises:

  • PyTritonClientModelUnavailableError

    If model with given name (and version) is unavailable.

  • PyTritonClientTimeoutError

    If the lazy_init argument is False and the wait time for the server and model to become ready exceeds init_timeout_s.

  • PyTritonClientInvalidUrlError

    If provided Triton Inference Server url is invalid.

Source code in pytriton/client/client.py
def __init__(
    self,
    url: str,
    model_name: str,
    model_version: Optional[str] = None,
    *,
    lazy_init: bool = True,
    init_timeout_s: Optional[float] = None,
    inference_timeout_s: Optional[float] = None,
    model_config: Optional[TritonModelConfig] = None,
    ensure_model_is_ready: bool = True,
):
    """Inits DecoupledModelClient for given decoupled model deployed on the Triton Inference Server.

    Common usage:

    ```python
    client = DecoupledModelClient("localhost", "BERT")
    for response in client.infer_sample(input1_sample, input2_sample):
        print(response)
    client.close()
    ```

    Args:
        url: The Triton Inference Server url, e.g. `grpc://localhost:8001`.
            In case no scheme is provided http scheme will be used as default.
            In case no port is provided default port for given scheme will be used -
            8001 for grpc scheme, 8000 for http scheme.
        model_name: name of the model to interact with.
        model_version: version of the model to interact with.
            If model_version is None inference on latest model will be performed.
            The latest versions of the model are numerically the greatest version numbers.
        lazy_init: if initialization should be performed just before sending first request to inference server.
        init_timeout_s: timeout in seconds for the server and model to be ready. If not passed, the default timeout of 300 seconds will be used.
        inference_timeout_s: timeout in seconds for a single model inference request. If not passed, the default timeout of 60 seconds will be used.
        model_config: model configuration. If not passed, it will be read from inference server during initialization.
        ensure_model_is_ready: if model should be checked if it is ready before first inference request.

    Raises:
        PyTritonClientModelUnavailableError: If model with given name (and version) is unavailable.
        PyTritonClientTimeoutError:
            if `lazy_init` argument is False and wait time for server and model being ready exceeds `init_timeout_s`.
        PyTritonClientInvalidUrlError: If provided Triton Inference Server url is invalid.
    """
    super().__init__(
        url,
        model_name,
        model_version,
        lazy_init=lazy_init,
        init_timeout_s=init_timeout_s,
        inference_timeout_s=inference_timeout_s,
        model_config=model_config,
        ensure_model_is_ready=ensure_model_is_ready,
    )
    if self._triton_url.scheme == "http":
        raise PyTritonClientValueError("DecoupledModelClient is only supported for grpc protocol")
    self._queue = Queue()
    self._lock = Lock()
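
A sketch of consuming the streamed partial results with the context manager; the grpc address, model name, and the text input are assumptions (DecoupledModelClient requires the grpc scheme):

import numpy as np
from pytriton.client import DecoupledModelClient

sample = np.array(["a prompt"], dtype=np.bytes_)  # hypothetical text input

with DecoupledModelClient("grpc://localhost:8001", "MyModel") as client:
    # infer_sample returns an iterator over partial result dictionaries
    for partial_result in client.infer_sample(sample):
        print(sorted(partial_result.keys()))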

is_batching_supported property

is_batching_supported

Checks whether the model supports batching.

Also waits for the server to reach the readiness state.

model_config property

model_config: TritonModelConfig

Obtain the configuration of the model deployed on the Triton Inference Server.

This method waits for the server to reach the readiness state before obtaining the model configuration.

Returns:

  • TritonModelConfig ( TritonModelConfig ) –

    configuration of the model deployed on the Triton Inference Server.

Raises:

  • PyTritonClientTimeoutError

    If the server and model are not in readiness state before the given timeout.

  • PyTritonClientModelUnavailableError

    If the model with the given name (and version) is unavailable.

  • KeyboardInterrupt

    If the hosting process receives SIGINT.

  • PyTritonClientClosedError

    If the ModelClient is closed.

__enter__

__enter__()

Create context for using ModelClient as a context manager.

Source code in pytriton/client/client.py
def __enter__(self):
    """Create context for using ModelClient as a context manager."""
    return self

__exit__

__exit__(*_)

Close resources used by ModelClient instance when exiting from the context.

Source code in pytriton/client/client.py
def __exit__(self, *_):
    """Close resources used by ModelClient instance when exiting from the context."""
    self.close()

close

close()

Close resources used by DecoupledModelClient.

Source code in pytriton/client/client.py
def close(self):
    """Close resources used by DecoupledModelClient."""
    _LOGGER.debug("Closing DecoupledModelClient")
    if self._lock.acquire(blocking=False):
        try:
            super().close()
        finally:
            self._lock.release()
    else:
        _LOGGER.warning("DecoupledModelClient is stil streaming answers")
        self._infer_client.stop_stream(False)
        super().close()

create_client_from_url

create_client_from_url(url: str, network_timeout_s: Optional[float] = None)

Create Triton Inference Server client.

Parameters:

  • url (str) –

    URL of the server to connect to. If the url doesn't contain a scheme (e.g. "localhost:8001"), the http scheme is added. If it doesn't contain a port (e.g. "localhost"), the default port for the given scheme is added.

  • network_timeout_s (Optional[float], default: None ) –

    timeout for client commands. Default value is 60.0 s.

Returns:

  • Triton Inference Server client.

Raises:

  • PyTritonClientInvalidUrlError

    If provided Triton Inference Server url is invalid.

Source code in pytriton/client/client.py
def create_client_from_url(self, url: str, network_timeout_s: Optional[float] = None):
    """Create Triton Inference Server client.

    Args:
        url: url of the server to connect to.
            If url doesn't contain scheme (e.g. "localhost:8001") http scheme is added.
            If url doesn't contain port (e.g. "localhost") default port for given scheme is added.
        network_timeout_s: timeout for client commands. Default value is 60.0 s.

    Returns:
        Triton Inference Server client.

    Raises:
        PyTritonClientInvalidUrlError: If provided Triton Inference Server url is invalid.
    """
    self._triton_url = TritonUrl.from_url(url)
    self._url = self._triton_url.without_scheme
    self._triton_client_lib = self.get_lib()
    self._monkey_patch_client()

    if self._triton_url.scheme == "grpc":
        # by default grpc client has very large number of timeout, thus we want to make it equal to http client timeout
        network_timeout_s = _DEFAULT_NETWORK_TIMEOUT_S if network_timeout_s is None else network_timeout_s
        warnings.warn(
            f"tritonclient.grpc doesn't support timeout for other commands than infer. Ignoring network_timeout: {network_timeout_s}.",
            NotSupportedTimeoutWarning,
            stacklevel=1,
        )

    triton_client_init_kwargs = self._get_init_extra_args()

    _LOGGER.debug(
        f"Creating InferenceServerClient for {self._triton_url.with_scheme} with {triton_client_init_kwargs}"
    )
    return self._triton_client_lib.InferenceServerClient(self._url, **triton_client_init_kwargs)

from_existing_client classmethod

from_existing_client(existing_client: BaseModelClient)

Create a new instance from an existing client using the same class.

Common usage:

client = BaseModelClient.from_existing_client(existing_client)

Parameters:

  • existing_client (BaseModelClient) –

    An instance of an already initialized subclass.

Returns:

  • A new instance of the same subclass with shared configuration and readiness state.

Source code in pytriton/client/client.py
@classmethod
def from_existing_client(cls, existing_client: "BaseModelClient"):
    """Create a new instance from an existing client using the same class.

    Common usage:
    ```python
    client = BaseModelClient.from_existing_client(existing_client)
    ```

    Args:
        existing_client: An instance of an already initialized subclass.

    Returns:
        A new instance of the same subclass with shared configuration and readiness state.
    """
    kwargs = {}
    # Copy model configuration and readiness state if present
    if hasattr(existing_client, "_model_config"):
        kwargs["model_config"] = existing_client._model_config
        kwargs["ensure_model_is_ready"] = False

    new_client = cls(
        url=existing_client._url,
        model_name=existing_client._model_name,
        model_version=existing_client._model_version,
        init_timeout_s=existing_client._init_timeout_s,
        inference_timeout_s=existing_client._inference_timeout_s,
        **kwargs,
    )

    return new_client

get_lib

get_lib()

Returns tritonclient library for given scheme.

Source code in pytriton/client/client.py
def get_lib(self):
    """Returns tritonclient library for given scheme."""
    return {"grpc": tritonclient.grpc, "http": tritonclient.http}[self._triton_url.scheme.lower()]

infer_batch

infer_batch(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs) -> Dict[str, ndarray]

Run synchronous inference on batched data.

Typical usage:

client = ModelClient("localhost", "MyModel")
result_dict = client.infer_batch(input1, input2)
client.close()

Inference inputs can be provided either as positional or keyword arguments:

result_dict = client.infer_batch(input1, input2)
result_dict = client.infer_batch(a=input1, b=input2)

Parameters:

  • *inputs

    Inference inputs provided as positional arguments.

  • parameters (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    Custom inference parameters.

  • headers (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    Custom inference headers.

  • **named_inputs

    Inference inputs provided as named arguments.

Returns:

  • Dict[str, ndarray]

    Dictionary with inference results, where dictionary keys are output names.

Raises:

  • PyTritonClientValueError

    If mixing of positional and named arguments passing detected.

  • PyTritonClientTimeoutError

    If the wait time for the server and model being ready exceeds init_timeout_s (in case of the first method call, when the lazy_init argument is False) or the inference request time exceeds inference_timeout_s passed to __init__.

  • PyTritonClientModelUnavailableError

    If the model with the given name (and version) is unavailable.

  • PyTritonClientInferenceServerError

    If an error occurred on the inference callable or Triton Inference Server side.

  • PyTritonClientModelDoesntSupportBatchingError

    If the model doesn't support batching.

Source code in pytriton/client/client.py
def infer_batch(
    self,
    *inputs,
    parameters: Optional[Dict[str, Union[str, int, bool]]] = None,
    headers: Optional[Dict[str, Union[str, int, bool]]] = None,
    **named_inputs,
) -> Dict[str, np.ndarray]:
    """Run synchronous inference on batched data.

    Typical usage:

    ```python
    client = ModelClient("localhost", "MyModel")
    result_dict = client.infer_batch(input1, input2)
    client.close()
    ```

    Inference inputs can be provided either as positional or keyword arguments:

    ```python
    result_dict = client.infer_batch(input1, input2)
    result_dict = client.infer_batch(a=input1, b=input2)
    ```

    Args:
        *inputs: Inference inputs provided as positional arguments.
        parameters: Custom inference parameters.
        headers: Custom inference headers.
        **named_inputs: Inference inputs provided as named arguments.

    Returns:
        Dictionary with inference results, where dictionary keys are output names.

    Raises:
        PyTritonClientValueError: If mixing of positional and named arguments passing detected.
        PyTritonClientTimeoutError: If the wait time for the server and model being ready exceeds `init_timeout_s`
            (in case of the first method call, when the `lazy_init` argument is False) or
            the inference request time exceeds `inference_timeout_s` passed to `__init__`.
        PyTritonClientModelUnavailableError: If the model with the given name (and version) is unavailable.
        PyTritonClientInferenceServerError: If an error occurred on the inference callable or Triton Inference Server side.
        PyTritonClientModelDoesntSupportBatchingError: If the model doesn't support batching.
    """
    _verify_inputs_args(inputs, named_inputs)
    _verify_parameters(parameters)
    _verify_parameters(headers)

    if not self.is_batching_supported:
        raise PyTritonClientModelDoesntSupportBatchingError(
            f"Model {self.model_config.model_name} doesn't support batching - use infer_sample method instead"
        )

    return self._infer(inputs or named_inputs, parameters, headers)

infer_sample

infer_sample(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs) -> Dict[str, ndarray]

Run synchronous inference on a single data sample.

Typical usage:

client = ModelClient("localhost", "MyModel")
result_dict = client.infer_sample(input1, input2)
client.close()

Inference inputs can be provided either as positional or keyword arguments:

result_dict = client.infer_sample(input1, input2)
result_dict = client.infer_sample(a=input1, b=input2)

Parameters:

  • *inputs

    Inference inputs provided as positional arguments.

  • parameters (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    Custom inference parameters.

  • headers (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    Custom inference headers.

  • **named_inputs

    Inference inputs provided as named arguments.

Returns:

  • Dict[str, ndarray]

    Dictionary with inference results, where dictionary keys are output names.

Raises:

  • PyTritonClientValueError

    If mixing of positional and named arguments passing detected.

  • PyTritonClientTimeoutError

    If the wait time for the server and model being ready exceeds init_timeout_s or inference request time exceeds inference_timeout_s.

  • PyTritonClientModelUnavailableError

    If the model with the given name (and version) is unavailable.

  • PyTritonClientInferenceServerError

    If an error occurred on the inference callable or Triton Inference Server side.

Source code in pytriton/client/client.py
def infer_sample(
    self,
    *inputs,
    parameters: Optional[Dict[str, Union[str, int, bool]]] = None,
    headers: Optional[Dict[str, Union[str, int, bool]]] = None,
    **named_inputs,
) -> Dict[str, np.ndarray]:
    """Run synchronous inference on a single data sample.

    Typical usage:

    ```python
    client = ModelClient("localhost", "MyModel")
    result_dict = client.infer_sample(input1, input2)
    client.close()
    ```

    Inference inputs can be provided either as positional or keyword arguments:

    ```python
    result_dict = client.infer_sample(input1, input2)
    result_dict = client.infer_sample(a=input1, b=input2)
    ```

    Args:
        *inputs: Inference inputs provided as positional arguments.
        parameters: Custom inference parameters.
        headers: Custom inference headers.
        **named_inputs: Inference inputs provided as named arguments.

    Returns:
        Dictionary with inference results, where dictionary keys are output names.

    Raises:
        PyTritonClientValueError: If mixing of positional and named arguments passing detected.
        PyTritonClientTimeoutError: If the wait time for the server and model being ready exceeds `init_timeout_s` or
            inference request time exceeds `inference_timeout_s`.
        PyTritonClientModelUnavailableError: If the model with the given name (and version) is unavailable.
        PyTritonClientInferenceServerError: If an error occurred on the inference callable or Triton Inference Server side.
    """
    _verify_inputs_args(inputs, named_inputs)
    _verify_parameters(parameters)
    _verify_parameters(headers)

    if self.is_batching_supported:
        if inputs:
            inputs = tuple(data[np.newaxis, ...] for data in inputs)
        elif named_inputs:
            named_inputs = {name: data[np.newaxis, ...] for name, data in named_inputs.items()}

    result = self._infer(inputs or named_inputs, parameters, headers)

    return self._debatch_result(result)

load_model

load_model(config: Optional[str] = None, files: Optional[dict] = None)

Load model on the Triton Inference Server.
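
For example, a model can be explicitly (re)loaded with an override configuration (a minimal sketch; the JSON content is illustrative only and assumes explicit model control is enabled on the server):

config_json = '{"backend": "python", "max_batch_size": 8}'
client.load_model(config=config_json)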

Parameters:

  • config (Optional[str], default: None ) –

    Optional JSON representation of a model config provided for the load request. If provided, this config will be used for loading the model.

  • files (Optional[dict], default: None ) –

    Optional dictionary mapping file paths (with "file:" prefix) in the override model directory to the file content as bytes. The files will form the model directory that the model will be loaded from. If specified, 'config' must be provided as the model configuration of the override model directory.

Source code in pytriton/client/client.py
def load_model(self, config: Optional[str] = None, files: Optional[dict] = None):
    """Load model on the Triton Inference Server.

    Args:
        config: str - Optional JSON representation of a model config provided for
            the load request, if provided, this config will be used for
            loading the model.
        files: dict - Optional dictionary specifying file path (with "file:" prefix) in
            the override model directory to the file content as bytes.
            The files will form the model directory that the model will be
            loaded from. If specified, 'config' must be provided to be
            the model configuration of the override model directory.
    """
    self._general_client.load_model(self._model_name, config=config, files=files)

unload_model

unload_model()

Unload model from the Triton Inference Server.
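
A model can be taken out of service and later restored, for example (a sketch; assumes explicit model control is enabled on the server):

client.unload_model()
# ... the model is no longer served ...
client.load_model()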

Source code in pytriton/client/client.py
def unload_model(self):
    """Unload model from the Triton Inference Server."""
    self._general_client.unload_model(self._model_name)

wait_for_model

wait_for_model(timeout_s: float)

Wait for the Triton Inference Server and the deployed model to be ready.
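
Typical usage (a sketch; "MyModel", the 300 s timeout, and input1_sample are placeholders):

client = ModelClient("localhost", "MyModel")
client.wait_for_model(timeout_s=300.0)
result_dict = client.infer_sample(input1_sample)
client.close()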

Parameters:

  • timeout_s (float) –

    timeout in seconds to wait for the server and model to be ready.

Raises:

  • PyTritonClientTimeoutError

    If the server and model are not ready before the given timeout.

  • PyTritonClientModelUnavailableError

    If the model with the given name (and version) is unavailable.

  • KeyboardInterrupt

    If the hosting process receives SIGINT.

  • PyTritonClientClosedError

    If the ModelClient is closed.

Source code in pytriton/client/client.py
def wait_for_model(self, timeout_s: float):
    """Wait for the Triton Inference Server and the deployed model to be ready.

    Args:
        timeout_s: timeout in seconds to wait for the server and model to be ready.

    Raises:
        PyTritonClientTimeoutError: If the server and model are not ready before the given timeout.
        PyTritonClientModelUnavailableError: If the model with the given name (and version) is unavailable.
        KeyboardInterrupt: If the hosting process receives SIGINT.
        PyTritonClientClosedError: If the ModelClient is closed.
    """
    if self._general_client is None:
        raise PyTritonClientClosedError("ModelClient is closed")
    wait_for_model_ready(self._general_client, self._model_name, self._model_version, timeout_s=timeout_s)

wait_for_server

wait_for_server(timeout_s: float)

Wait for Triton Inference Server readiness.
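
Typical usage (a sketch; "MyModel" and the 60 s timeout are placeholders):

client = ModelClient("localhost", "MyModel")
client.wait_for_server(timeout_s=60.0)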

Parameters:

  • timeout_s (float) –

    timeout in seconds for the server to reach the readiness state.

Raises:

  • PyTritonClientTimeoutError

    If server is not in readiness state before given timeout.

  • KeyboardInterrupt

    If the hosting process receives SIGINT.

Source code in pytriton/client/client.py
def wait_for_server(self, timeout_s: float):
    """Wait for Triton Inference Server readiness.

    Args:
        timeout_s: timeout in seconds for the server to reach the readiness state.

    Raises:
        PyTritonClientTimeoutError: If server is not in readiness state before given timeout.
        KeyboardInterrupt: If hosting process receives SIGINT
    """
    wait_for_server_ready(self._general_client, timeout_s=timeout_s)

pytriton.client.AsyncioDecoupledModelClient

AsyncioDecoupledModelClient(url: str, model_name: str, model_version: Optional[str] = None, *, lazy_init: bool = True, init_timeout_s: Optional[float] = None, inference_timeout_s: Optional[float] = None, model_config: Optional[TritonModelConfig] = None, ensure_model_is_ready: bool = True)

Bases: AsyncioModelClient

Asyncio client for model deployed on the Triton Inference Server.

This client is based on Triton Inference Server Python clients and GRPC library:

  • tritonclient.grpc.aio.InferenceServerClient

It can wait for the server to be ready with the model loaded and then perform inference on it. AsyncioDecoupledModelClient supports the asyncio context manager protocol.

The client is intended to be used with decoupled models and will raise an error if the model is not decoupled.

Typical usage:

from pytriton.client import AsyncioDecoupledModelClient
import numpy as np

input1_sample = np.random.rand(1, 3, 224, 224).astype(np.float32)
input2_sample = np.random.rand(1, 3, 224, 224).astype(np.float32)

async with AsyncioDecoupledModelClient("grpc://localhost", "MyModel") as client:
    async for result_dict in client.infer_sample(input1_sample, input2_sample):
        print(result_dict["output_name"])

Inits ModelClient for given model deployed on the Triton Inference Server.

If lazy_init argument is False, model configuration will be read from inference server during initialization.

Parameters:

  • url (str) –

    The Triton Inference Server url, e.g. 'grpc://localhost:8001'. In case no scheme is provided http scheme will be used as default. In case no port is provided default port for given scheme will be used - 8001 for grpc scheme, 8000 for http scheme.

  • model_name (str) –

    name of the model to interact with.

  • model_version (Optional[str], default: None ) –

    version of the model to interact with. If model_version is None inference on latest model will be performed. The latest versions of the model are numerically the greatest version numbers.

  • lazy_init (bool, default: True ) –

    if initialization should be performed just before sending first request to inference server.

  • init_timeout_s (Optional[float], default: None ) –

    timeout for server and model being ready.

  • model_config (Optional[TritonModelConfig], default: None ) –

    model configuration. If not passed, it will be read from inference server during initialization.

  • ensure_model_is_ready (bool, default: True ) –

    if model should be checked if it is ready before first inference request.

Raises:

  • PyTritonClientModelUnavailableError

    If model with given name (and version) is unavailable.

  • PyTritonClientTimeoutError

    if lazy_init argument is False and wait time for server and model being ready exceeds init_timeout_s.

  • PyTritonClientUrlParseError

    In case of problems with parsing url.

Source code in pytriton/client/client.py
def __init__(
    self,
    url: str,
    model_name: str,
    model_version: Optional[str] = None,
    *,
    lazy_init: bool = True,
    init_timeout_s: Optional[float] = None,
    inference_timeout_s: Optional[float] = None,
    model_config: Optional[TritonModelConfig] = None,
    ensure_model_is_ready: bool = True,
):
    """Inits ModelClient for given model deployed on the Triton Inference Server.

    If `lazy_init` argument is False, model configuration will be read
    from inference server during initialization.

    Args:
        url: The Triton Inference Server url, e.g. 'grpc://localhost:8001'.
            In case no scheme is provided http scheme will be used as default.
            In case no port is provided default port for given scheme will be used -
            8001 for grpc scheme, 8000 for http scheme.
        model_name: name of the model to interact with.
        model_version: version of the model to interact with.
            If model_version is None inference on latest model will be performed.
            The latest versions of the model are numerically the greatest version numbers.
        lazy_init: if initialization should be performed just before sending first request to inference server.
        init_timeout_s: timeout for server and model being ready.
        model_config: model configuration. If not passed, it will be read from inference server during initialization.
        ensure_model_is_ready: if model should be checked if it is ready before first inference request.

    Raises:
        PyTritonClientModelUnavailableError: If model with given name (and version) is unavailable.
        PyTritonClientTimeoutError: if `lazy_init` argument is False and wait time for server and model being ready exceeds `init_timeout_s`.
        PyTritonClientUrlParseError: In case of problems with parsing url.
    """
    super().__init__(
        url=url,
        model_name=model_name,
        model_version=model_version,
        lazy_init=lazy_init,
        init_timeout_s=init_timeout_s,
        inference_timeout_s=inference_timeout_s,
        model_config=model_config,
        ensure_model_is_ready=ensure_model_is_ready,
    )

model_config async property

model_config

Obtain configuration of model deployed on the Triton Inference Server.

Also waits for server to get into readiness state.
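
Because model_config is an asynchronous property, it has to be awaited (a sketch; "MyModel" is a placeholder):

async with AsyncioDecoupledModelClient("grpc://localhost", "MyModel") as client:
    config = await client.model_config
    print(config.max_batch_size)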

__aenter__ async

__aenter__()

Create context for using AsyncioModelClient as a context manager.

Source code in pytriton/client/client.py
async def __aenter__(self):
    """Create context for use AsyncioModelClient as a context manager."""
    _LOGGER.debug("Entering AsyncioModelClient context")
    try:
        if not self._lazy_init:
            _LOGGER.debug("Waiting in AsyncioModelClient context for model to be ready")
            await self._wait_and_init_model_config(self._init_timeout_s)
            _LOGGER.debug("Model is ready in AsyncioModelClient context")
        return self
    except Exception as e:
        _LOGGER.error("Error occurred during AsyncioModelClient context initialization")
        await self.close()
        raise e

__aexit__ async

__aexit__(*_)

Close resources used by AsyncioModelClient when exiting from context.

Source code in pytriton/client/client.py
async def __aexit__(self, *_):
    """Close resources used by AsyncioModelClient when exiting from context."""
    await self.close()
    _LOGGER.debug("Exiting AsyncioModelClient context")

close async

close()

Close resources used by _ModelClientBase.
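
When the client is not used as an async context manager, close should be awaited explicitly (a sketch; "MyModel", input1_sample, and "output_name" are placeholders):

client = AsyncioDecoupledModelClient("grpc://localhost", "MyModel")
try:
    async for result_dict in client.infer_sample(input1_sample):
        print(result_dict["output_name"])
finally:
    await client.close()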

Source code in pytriton/client/client.py
async def close(self):
    """Close resources used by _ModelClientBase."""
    _LOGGER.debug("Closing InferenceServerClient")
    await self._general_client.close()
    await self._infer_client.close()
    _LOGGER.debug("InferenceServerClient closed")

create_client_from_url

create_client_from_url(url: str, network_timeout_s: Optional[float] = None)

Create Triton Inference Server client.

Parameters:

  • url (str) –

    url of the server to connect to. If url doesn't contain scheme (e.g. "localhost:8001") http scheme is added. If url doesn't contain port (e.g. "localhost") default port for given scheme is added.

  • network_timeout_s (Optional[float], default: None ) –

    timeout for client commands. Default value is 60.0 s.

Returns:

  • Triton Inference Server client.

Raises:

  • PyTritonClientInvalidUrlError

    If provided Triton Inference Server url is invalid.

Source code in pytriton/client/client.py
def create_client_from_url(self, url: str, network_timeout_s: Optional[float] = None):
    """Create Triton Inference Server client.

    Args:
        url: url of the server to connect to.
            If url doesn't contain scheme (e.g. "localhost:8001") http scheme is added.
            If url doesn't contain port (e.g. "localhost") default port for given scheme is added.
        network_timeout_s: timeout for client commands. Default value is 60.0 s.

    Returns:
        Triton Inference Server client.

    Raises:
        PyTritonClientInvalidUrlError: If provided Triton Inference Server url is invalid.
    """
    self._triton_url = TritonUrl.from_url(url)
    self._url = self._triton_url.without_scheme
    self._triton_client_lib = self.get_lib()
    self._monkey_patch_client()

    if self._triton_url.scheme == "grpc":
        # by default the grpc client has a very large timeout, so we make it equal to the http client timeout
        network_timeout_s = _DEFAULT_NETWORK_TIMEOUT_S if network_timeout_s is None else network_timeout_s
        warnings.warn(
            f"tritonclient.grpc doesn't support timeout for other commands than infer. Ignoring network_timeout: {network_timeout_s}.",
            NotSupportedTimeoutWarning,
            stacklevel=1,
        )

    triton_client_init_kwargs = self._get_init_extra_args()

    _LOGGER.debug(
        f"Creating InferenceServerClient for {self._triton_url.with_scheme} with {triton_client_init_kwargs}"
    )
    return self._triton_client_lib.InferenceServerClient(self._url, **triton_client_init_kwargs)

from_existing_client classmethod

from_existing_client(existing_client: BaseModelClient)

Create a new instance from an existing client using the same class.

Common usage:

client = BaseModelClient.from_existing_client(existing_client)

Parameters:

  • existing_client (BaseModelClient) –

    An instance of an already initialized subclass.

Returns:

  • A new instance of the same subclass with shared configuration and readiness state.

Source code in pytriton/client/client.py
@classmethod
def from_existing_client(cls, existing_client: "BaseModelClient"):
    """Create a new instance from an existing client using the same class.

    Common usage:
    ```python
    client = BaseModelClient.from_existing_client(existing_client)
    ```

    Args:
        existing_client: An instance of an already initialized subclass.

    Returns:
        A new instance of the same subclass with shared configuration and readiness state.
    """
    kwargs = {}
    # Copy model configuration and readiness state if present
    if hasattr(existing_client, "_model_config"):
        kwargs["model_config"] = existing_client._model_config
        kwargs["ensure_model_is_ready"] = False

    new_client = cls(
        url=existing_client._url,
        model_name=existing_client._model_name,
        model_version=existing_client._model_version,
        init_timeout_s=existing_client._init_timeout_s,
        inference_timeout_s=existing_client._inference_timeout_s,
        **kwargs,
    )

    return new_client

get_lib

get_lib()

Get Triton Inference Server Python client library.

Source code in pytriton/client/client.py
def get_lib(self):
    """Get Triton Inference Server Python client library."""
    return {"grpc": tritonclient.grpc.aio, "http": tritonclient.http.aio}[self._triton_url.scheme.lower()]

infer_batch async

infer_batch(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs)

Run asynchronous inference on batched data.

Typical usage:

async with AsyncioDecoupledModelClient("grpc://localhost", "MyModel") as client:
    async for result_dict in client.infer_batch(input1_sample, input2_sample):
        print(result_dict["output_name"])

Inference inputs can be provided either as positional or keyword arguments:

results_iterator = client.infer_batch(input1, input2)
results_iterator = client.infer_batch(a=input1, b=input2)

Mixing of argument passing conventions is not supported and will raise PyTritonClientRuntimeError.

Parameters:

  • *inputs

    inference inputs provided as positional arguments.

  • parameters (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    custom inference parameters.

  • headers (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    custom inference headers.

  • **named_inputs

    inference inputs provided as named arguments.

Returns:

  • Asynchronous generator, which generates dictionaries with partial inference results, where dictionary keys are output names.

Raises:

  • PyTritonClientValueError

    if mixing of positional and named arguments passing detected.

  • PyTritonClientTimeoutError

    in case of first method call, lazy_init argument is False and wait time for server and model being ready exceeds init_timeout_s or inference time exceeds timeout_s.

  • PyTritonClientModelDoesntSupportBatchingError

    if model doesn't support batching.

  • PyTritonClientModelUnavailableError

    If model with given name (and version) is unavailable.

  • PyTritonClientInferenceServerError

    If error occurred on inference callable or Triton Inference Server side.

Source code in pytriton/client/client.py
async def infer_batch(
    self,
    *inputs,
    parameters: Optional[Dict[str, Union[str, int, bool]]] = None,
    headers: Optional[Dict[str, Union[str, int, bool]]] = None,
    **named_inputs,
):
    """Run asynchronous inference on batched data.

    Typical usage:

    ```python
    async with AsyncioDecoupledModelClient("grpc://localhost", "MyModel") as client:
        async for result_dict in client.infer_batch(input1_sample, input2_sample):
            print(result_dict["output_name"])
    ```

    Inference inputs can be provided either as positional or keyword arguments:

    ```python
    results_iterator = client.infer_batch(input1, input2)
    results_iterator = client.infer_batch(a=input1, b=input2)
    ```

    Mixing of argument passing conventions is not supported and will raise PyTritonClientRuntimeError.

    Args:
        *inputs: inference inputs provided as positional arguments.
        parameters: custom inference parameters.
        headers: custom inference headers.
        **named_inputs: inference inputs provided as named arguments.

    Returns:
        Asynchronous generator, which generates dictionaries with partial inference results, where dictionary keys are output names.

    Raises:
        PyTritonClientValueError: if mixing of positional and named arguments passing detected.
        PyTritonClientTimeoutError:
            in case of first method call, `lazy_init` argument is False
            and wait time for server and model being ready exceeds `init_timeout_s`
            or inference time exceeds `timeout_s`.
        PyTritonClientModelDoesntSupportBatchingError: if model doesn't support batching.
        PyTritonClientModelUnavailableError: If model with given name (and version) is unavailable.
        PyTritonClientInferenceServerError: If error occurred on inference callable or Triton Inference Server side.
    """
    _verify_inputs_args(inputs, named_inputs)
    _verify_parameters(parameters)
    _verify_parameters(headers)

    _LOGGER.debug(f"Running inference for {self._model_name}")
    model_config = await self.model_config
    _LOGGER.debug(f"Model config for {self._model_name} obtained")

    model_supports_batching = model_config.max_batch_size > 0
    if not model_supports_batching:
        _LOGGER.error(f"Model {model_config.model_name} doesn't support batching")
        raise PyTritonClientModelDoesntSupportBatchingError(
            f"Model {model_config.model_name} doesn't support batching - use infer_sample method instead"
        )

    _LOGGER.debug(f"Running _infer for {self._model_name}")
    result = self._infer(inputs or named_inputs, parameters, headers)
    _LOGGER.debug(f"_infer for {self._model_name} finished")
    async for item in result:
        yield item

infer_sample async

infer_sample(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs)

Run asynchronous inference on single data sample.

Typical usage:

async with AsyncioDecoupledModelClient("grpc://localhost", "MyModel") as client:
    async for result_dict in client.infer_sample(input1_sample, input2_sample):
        print(result_dict["output_name"])

Inference inputs can be provided either as positional or keyword arguments:

results_iterator = client.infer_sample(input1, input2)
results_iterator = client.infer_sample(a=input1, b=input2)

Mixing of argument passing conventions is not supported and will raise PyTritonClientRuntimeError.

Parameters:

  • *inputs

    inference inputs provided as positional arguments.

  • parameters (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    custom inference parameters.

  • headers (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    custom inference headers.

  • **named_inputs

    inference inputs provided as named arguments.

Returns:

  • Asynchronous generator, which generates dictionaries with partial inference results, where dictionary keys are output names.

Raises:

  • PyTritonClientValueError

    if mixing of positional and named arguments passing detected.

  • PyTritonClientTimeoutError

    in case of first method call, lazy_init argument is False and wait time for server and model being ready exceeds init_timeout_s or inference time exceeds timeout_s.

  • PyTritonClientModelUnavailableError

    If model with given name (and version) is unavailable.

  • PyTritonClientInferenceServerError

    If error occurred on inference callable or Triton Inference Server side.

Source code in pytriton/client/client.py
async def infer_sample(
    self,
    *inputs,
    parameters: Optional[Dict[str, Union[str, int, bool]]] = None,
    headers: Optional[Dict[str, Union[str, int, bool]]] = None,
    **named_inputs,
):
    """Run asynchronous inference on single data sample.

    Typical usage:

    ```python
    async with AsyncioDecoupledModelClient("grpc://localhost", "MyModel") as client:
        async for result_dict in client.infer_sample(input1_sample, input2_sample):
            print(result_dict["output_name"])
    ```

    Inference inputs can be provided either as positional or keyword arguments:

    ```python
    results_iterator = client.infer_sample(input1, input2)
    results_iterator = client.infer_sample(a=input1, b=input2)
    ```

    Mixing of argument passing conventions is not supported and will raise PyTritonClientRuntimeError.

    Args:
        *inputs: inference inputs provided as positional arguments.
        parameters: custom inference parameters.
        headers: custom inference headers.
        **named_inputs: inference inputs provided as named arguments.

    Returns:
        Asynchronous generator, which generates dictionaries with partial inference results, where dictionary keys are output names.

    Raises:
        PyTritonClientValueError: if mixing of positional and named arguments passing detected.
        PyTritonClientTimeoutError:
            in case of first method call, `lazy_init` argument is False
            and wait time for server and model being ready exceeds `init_timeout_s`
            or inference time exceeds `timeout_s`.
        PyTritonClientModelUnavailableError: If model with given name (and version) is unavailable.
        PyTritonClientInferenceServerError: If error occurred on inference callable or Triton Inference Server side.
    """
    _verify_inputs_args(inputs, named_inputs)
    _verify_parameters(parameters)
    _verify_parameters(headers)

    _LOGGER.debug(f"Running inference for {self._model_name}")
    model_config = await self.model_config
    _LOGGER.debug(f"Model config for {self._model_name} obtained")

    model_supports_batching = model_config.max_batch_size > 0
    if model_supports_batching:
        if inputs:
            inputs = tuple(data[np.newaxis, ...] for data in inputs)
        elif named_inputs:
            named_inputs = {name: data[np.newaxis, ...] for name, data in named_inputs.items()}

    _LOGGER.debug(f"Running _infer for {self._model_name}")
    result = self._infer(inputs or named_inputs, parameters, headers)
    _LOGGER.debug(f"_infer for {self._model_name} finished")

    async for item in result:
        if model_supports_batching:
            debatched_item = {name: data[0] for name, data in item.items()}
            yield debatched_item
        else:
            yield item

wait_for_model async

wait_for_model(timeout_s: float)

Asynchronously wait for the Triton Inference Server and the model deployed on it to be ready.
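
Typical usage (a sketch; "MyModel", the 300 s timeout, input1_sample, and "output_name" are placeholders):

async with AsyncioDecoupledModelClient("grpc://localhost", "MyModel") as client:
    await client.wait_for_model(timeout_s=300.0)
    async for result_dict in client.infer_sample(input1_sample):
        print(result_dict["output_name"])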

Parameters:

  • timeout_s (float) –

    timeout in seconds for the server and model to reach the readiness state.

Raises:

  • PyTritonClientTimeoutError

    If server and model are not in readiness state before given timeout.

  • PyTritonClientModelUnavailableError

    If model with given name (and version) is unavailable.

  • KeyboardInterrupt

    If the hosting process receives SIGINT.

Source code in pytriton/client/client.py
async def wait_for_model(self, timeout_s: float):
    """Asynchronous wait for Triton Inference Server and deployed on it model readiness.

    Args:
        timeout_s: timeout in seconds for the server and model to reach the readiness state.

    Raises:
        PyTritonClientTimeoutError: If server and model are not in readiness state before given timeout.
        PyTritonClientModelUnavailableError: If model with given name (and version) is unavailable.
        KeyboardInterrupt: If hosting process receives SIGINT
    """
    _LOGGER.debug(f"Waiting for model {self._model_name} to be ready")
    try:
        await asyncio.wait_for(
            asyncio_wait_for_model_ready(
                self._general_client, self._model_name, self._model_version, timeout_s=timeout_s
            ),
            self._init_timeout_s,
        )
    except asyncio.TimeoutError as e:
        message = f"Timeout while waiting for model {self._model_name} to be ready for {self._init_timeout_s}s"
        _LOGGER.error(message)
        raise PyTritonClientTimeoutError(message) from e

pytriton.client.FuturesModelClient

FuturesModelClient(url: str, model_name: str, model_version: Optional[str] = None, *, max_workers: int = 128, max_queue_size: int = 128, non_blocking: bool = False, init_timeout_s: Optional[float] = None, inference_timeout_s: Optional[float] = None)

A client for interacting with a model deployed on the Triton Inference Server using concurrent.futures.

This client allows asynchronous inference requests using a thread pool executor. It can be used to perform inference on a model by providing input data and receiving the corresponding output data. The client can be used in a with statement to ensure proper resource management.

Example usage with context manager:

with FuturesModelClient("localhost", "MyModel") as client:
    result_future = client.infer_sample(input1=input1_data, input2=input2_data)
    # do something else
    print(result_future.result())

Usage without context manager:

client = FuturesModelClient("localhost", "MyModel")
result_future = client.infer_sample(input1=input1_data, input2=input2_data)
# do something else
print(result_future.result())
client.close()

Initializes the FuturesModelClient for a given model.

Parameters:

  • url (str) –

    The Triton Inference Server url, e.g. grpc://localhost:8001.

  • model_name (str) –

    The name of the model to interact with.

  • model_version (Optional[str], default: None ) –

    The version of the model to interact with. If None, the latest version will be used.

  • max_workers (int, default: 128 ) –

    The maximum number of threads that can be used to execute the given calls. If None, there is no limit on the number of threads.

  • max_queue_size (int, default: 128 ) –

    The maximum number of requests that can be queued. If None, there is no limit on the number of requests.

  • non_blocking (bool, default: False ) –

    If True, the client will raise a PyTritonClientQueueFullError if the queue is full. If False, the client will block until the queue is not full.

  • init_timeout_s (Optional[float], default: None ) –

    Timeout in seconds for the server and model to be ready. If none is passed, the default 60 seconds timeout will be used.

  • inference_timeout_s (Optional[float], default: None ) –

    Timeout in seconds for a single model inference request. If none is passed, the default 60 seconds timeout will be used.
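
The queueing options above can be combined when constructing the client, for example (a sketch; the values are illustrative only):

client = FuturesModelClient(
    "grpc://localhost:8001",
    "MyModel",
    max_workers=16,
    max_queue_size=32,
    non_blocking=True,
)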

Source code in pytriton/client/client.py
def __init__(
    self,
    url: str,
    model_name: str,
    model_version: Optional[str] = None,
    *,
    max_workers: int = 128,
    max_queue_size: int = 128,
    non_blocking: bool = False,
    init_timeout_s: Optional[float] = None,
    inference_timeout_s: Optional[float] = None,
):
    """Initializes the FuturesModelClient for a given model.

    Args:
        url: The Triton Inference Server url, e.g. `grpc://localhost:8001`.
        model_name: The name of the model to interact with.
        model_version: The version of the model to interact with. If None, the latest version will be used.
        max_workers: The maximum number of threads that can be used to execute the given calls. If None, there is no limit on the number of threads.
        max_queue_size: The maximum number of requests that can be queued. If None, there is no limit on the number of requests.
        non_blocking: If True, the client will raise a PyTritonClientQueueFullError if the queue is full. If False, the client will block until the queue is not full.
        init_timeout_s: Timeout in seconds for the server and model to be ready. If none is passed, the default 60 seconds timeout will be used.
        inference_timeout_s: Timeout in seconds for a single model inference request. If none is passed, the default 60 seconds timeout will be used.
    """
    self._url = url
    self._model_name = model_name
    self._model_version = model_version
    self._threads = []
    self._max_workers = max_workers
    self._max_queue_size = max_queue_size
    self._non_blocking = non_blocking

    if self._max_workers is not None and self._max_workers <= 0:
        raise ValueError("max_workers must be greater than 0")
    if self._max_queue_size is not None and self._max_queue_size <= 0:
        raise ValueError("max_queue_size must be greater than 0")

    kwargs = {}
    if self._max_queue_size is not None:
        kwargs["maxsize"] = self._max_queue_size
    self._queue = Queue(**kwargs)
    self._queue.put((_INIT, None, None))
    self._init_timeout_s = _DEFAULT_FUTURES_INIT_TIMEOUT_S if init_timeout_s is None else init_timeout_s
    self._inference_timeout_s = inference_timeout_s
    self._closed = False
    self._lock = Lock()
    self._existing_client = None

__enter__

__enter__()

Create context for using FuturesModelClient as a context manager.

Source code in pytriton/client/client.py
def __enter__(self):
    """Create context for using FuturesModelClient as a context manager."""
    return self

__exit__

__exit__(exc_type, exc_value, traceback)

Close resources used by FuturesModelClient instance when exiting from the context.

Source code in pytriton/client/client.py
def __exit__(self, exc_type, exc_value, traceback):
    """Close resources used by FuturesModelClient instance when exiting from the context."""
    self.close()

close

close(wait=True)

Close resources used by FuturesModelClient.

This method closes the resources used by the FuturesModelClient instance, including the Triton Inference Server connections. Once this method is called, the FuturesModelClient instance should not be used again.

Parameters:

  • wait

    If True, then shutdown will not return until all running futures have finished executing.
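
Typical usage outside a context manager (a sketch; "MyModel" and input1_data are placeholders):

client = FuturesModelClient("localhost", "MyModel")
future = client.infer_sample(input1=input1_data)
print(future.result())
client.close(wait=True)  # block until all pending futures have finished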

Source code in pytriton/client/client.py
def close(self, wait=True):
    """Close resources used by FuturesModelClient.

    This method closes the resources used by the FuturesModelClient instance, including the Triton Inference Server connections.
    Once this method is called, the FuturesModelClient instance should not be used again.

    Args:
        wait: If True, then shutdown will not return until all running futures have finished executing.
    """
    if self._closed:
        return
    _LOGGER.debug("Closing FuturesModelClient.")

    self._closed = True
    for _ in range(len(self._threads)):
        self._queue.put((_CLOSE, None, None))

    if wait:
        _LOGGER.debug("Waiting for futures to finish.")
        for thread in self._threads:
            thread.join()

infer_batch

infer_batch(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs) -> Future

Run asynchronous inference on batched data and return a Future object.

This method allows the user to perform inference on batched data by providing input data and receiving the corresponding output data. The method returns a Future object that wraps a dictionary of inference results, where dictionary keys are output names.

Example usage:

with FuturesModelClient("localhost", "BERT") as client:
    future = client.infer_batch(input1_sample, input2_sample)
    # do something else
    print(future.result())

Inference inputs can be provided either as positional or keyword arguments:

future = client.infer_batch(input1, input2)
future = client.infer_batch(a=input1, b=input2)

Mixing of argument passing conventions is not supported and will raise PyTritonClientValueError.

Parameters:

  • *inputs

    Inference inputs provided as positional arguments.

  • parameters (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    Optional dictionary of inference parameters.

  • headers (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    Optional dictionary of HTTP headers for the inference request.

  • **named_inputs

    Inference inputs provided as named arguments.

Returns:

  • Future

    A Future object wrapping a dictionary of inference results, where dictionary keys are output names.

Raises:

  • PyTritonClientClosedError

    If the FuturesModelClient is closed.

Source code in pytriton/client/client.py
def infer_batch(
    self,
    *inputs,
    parameters: Optional[Dict[str, Union[str, int, bool]]] = None,
    headers: Optional[Dict[str, Union[str, int, bool]]] = None,
    **named_inputs,
) -> Future:
    """Run asynchronous inference on batched data and return a Future object.

    This method allows the user to perform inference on batched data by providing input data and receiving the corresponding output data.
    The method returns a Future object that wraps a dictionary of inference results, where dictionary keys are output names.

    Example usage:

    ```python
    with FuturesModelClient("localhost", "BERT") as client:
        future = client.infer_batch(input1_sample, input2_sample)
        # do something else
        print(future.result())
    ```

    Inference inputs can be provided either as positional or keyword arguments:

    ```python
    future = client.infer_batch(input1, input2)
    future = client.infer_batch(a=input1, b=input2)
    ```

    Mixing of argument passing conventions is not supported and will raise PyTritonClientValueError.

    Args:
        *inputs: Inference inputs provided as positional arguments.
        parameters: Optional dictionary of inference parameters.
        headers: Optional dictionary of HTTP headers for the inference request.
        **named_inputs: Inference inputs provided as named arguments.

    Returns:
        A Future object wrapping a dictionary of inference results, where dictionary keys are output names.

    Raises:
        PyTritonClientClosedError: If the FuturesModelClient is closed.
    """
    return self._execute(name=_INFER_BATCH, request=(inputs, parameters, headers, named_inputs))

infer_sample

infer_sample(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs) -> Future

Run asynchronous inference on a single data sample and return a Future object.

This method allows the user to perform inference on a single data sample by providing input data and receiving the corresponding output data. The method returns a Future object that wraps a dictionary of inference results, where dictionary keys are output names.

Example usage:

with FuturesModelClient("localhost", "BERT") as client:
    result_future = client.infer_sample(input1=input1_data, input2=input2_data)
    # do something else
    print(result_future.result())

Inference inputs can be provided either as positional or keyword arguments:

future = client.infer_sample(input1, input2)
future = client.infer_sample(a=input1, b=input2)

Parameters:

  • *inputs

    Inference inputs provided as positional arguments.

  • parameters (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    Optional dictionary of inference parameters.

  • headers (Optional[Dict[str, Union[str, int, bool]]], default: None ) –

    Optional dictionary of HTTP headers for the inference request.

  • **named_inputs

    Inference inputs provided as named arguments.

Returns:

  • Future

    A Future object wrapping a dictionary of inference results, where dictionary keys are output names.

Raises:

  • PyTritonClientClosedError

    If the FuturesModelClient is closed.

Source code in pytriton/client/client.py
def infer_sample(
    self,
    *inputs,
    parameters: Optional[Dict[str, Union[str, int, bool]]] = None,
    headers: Optional[Dict[str, Union[str, int, bool]]] = None,
    **named_inputs,
) -> Future:
    """Run asynchronous inference on a single data sample and return a Future object.

    This method allows the user to perform inference on a single data sample by providing input data and receiving the
    corresponding output data. The method returns a Future object that wraps a dictionary of inference results, where dictionary keys are output names.

    Example usage:

    ```python
    with FuturesModelClient("localhost", "BERT") as client:
        result_future = client.infer_sample(input1=input1_data, input2=input2_data)
        # do something else
        print(result_future.result())
    ```

    Inference inputs can be provided either as positional or keyword arguments:

    ```python
    future = client.infer_sample(input1, input2)
    future = client.infer_sample(a=input1, b=input2)
    ```

    Args:
        *inputs: Inference inputs provided as positional arguments.
        parameters: Optional dictionary of inference parameters.
        headers: Optional dictionary of HTTP headers for the inference request.
        **named_inputs: Inference inputs provided as named arguments.

    Returns:
        A Future object wrapping a dictionary of inference results, where dictionary keys are output names.

    Raises:
        PyTritonClientClosedError: If the FuturesModelClient is closed.
    """
    return self._execute(
        name=_INFER_SAMPLE,
        request=(inputs, parameters, headers, named_inputs),
    )

model_config

model_config() -> Future

Obtain the configuration of the model deployed on the Triton Inference Server.

This method returns a Future object that will contain the TritonModelConfig object when it is ready. Client will wait init_timeout_s for the server to get into readiness state before obtaining the model configuration.
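
Example usage (a sketch; "MyModel" is a placeholder):

with FuturesModelClient("localhost", "MyModel") as client:
    config = client.model_config().result()
    print(config.max_batch_size)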

Returns:

  • Future

    A Future object that will contain the TritonModelConfig object when it is ready.

Raises:

  • PyTritonClientClosedError

    If the FuturesModelClient is closed.

Source code in pytriton/client/client.py
def model_config(self) -> Future:
    """Obtain the configuration of the model deployed on the Triton Inference Server.

    This method returns a Future object that will contain the TritonModelConfig object when it is ready.
    Client will wait init_timeout_s for the server to get into readiness state before obtaining the model configuration.

    Returns:
        A Future object that will contain the TritonModelConfig object when it is ready.

    Raises:
        PyTritonClientClosedError: If the FuturesModelClient is closed.
    """
    return self._execute(name=_MODEL_CONFIG)

wait_for_model

wait_for_model(timeout_s: float) -> Future

Returns a Future object whose result will be None when the model is ready.

Typical usage:

with FuturesModelClient("localhost", "BERT") as client
    future = client.wait_for_model(300.)
    # do something else
    future.result()   # wait rest of timeout_s time
                        # till return None if model is ready
                        # or raise PyTritonClientTimeutError

Parameters:

  • timeout_s (float) –

    The maximum amount of time to wait for the model to be ready, in seconds.

Returns:

  • Future

    A Future object whose result is None when the model is ready.

Source code in pytriton/client/client.py
def wait_for_model(self, timeout_s: float) -> Future:
    """Returns a Future object which result will be None when the model is ready.

    Typical usage:

    ```python
    with FuturesModelClient("localhost", "BERT") as client
        future = client.wait_for_model(300.)
        # do something else
        future.result()   # wait rest of timeout_s time
                            # till return None if model is ready
                            # or raise PyTritonClientTimeutError
    ```

    Args:
        timeout_s: The maximum amount of time to wait for the model to be ready, in seconds.

    Returns:
        A Future object whose result is None when the model is ready.
    """
    return self._execute(
        name=_WAIT_FOR_MODEL,
        request=timeout_s,
    )