Model Clients
pytriton.client.ModelClient
ModelClient(url: str, model_name: str, model_version: Optional[str] = None, *, lazy_init: bool = True, init_timeout_s: Optional[float] = None, inference_timeout_s: Optional[float] = None, model_config: Optional[TritonModelConfig] = None, ensure_model_is_ready: bool = True)
Bases: BaseModelClient
Synchronous client for a model deployed on the Triton Inference Server.
Initializes ModelClient for the given model deployed on the Triton Inference Server.
If the lazy_init argument is False, the model configuration is read from the inference server during initialization.
Common usage:
client = ModelClient("localhost", "BERT")
result_dict = client.infer_sample(input1_sample, input2_sample)
client.close()
The client also supports the context manager protocol:
with ModelClient("localhost", "BERT") as client:
    result_dict = client.infer_sample(input1_sample, input2_sample)
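For illustration, a minimal end-to-end sketch; the model name "MyModel" and its two float32 image-shaped inputs are assumptions, mirroring the AsyncioModelClient example later on this page:
import numpy as np
from pytriton.client import ModelClient
# assumed sample inputs; the required shapes depend on the deployed model
input1_sample = np.random.rand(1, 3, 224, 224).astype(np.float32)
input2_sample = np.random.rand(1, 3, 224, 224).astype(np.float32)
with ModelClient("localhost", "MyModel") as client:
    result_dict = client.infer_sample(input1_sample, input2_sample)
    print(result_dict)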
Creating a client requires a connection to the server and downloading the model configuration. You can also create a new client from an existing client of the same class using from_existing_client (see below).
Parameters:
- url (str) – The Triton Inference Server url, e.g. 'grpc://localhost:8001'. If no scheme is provided, the http scheme is used as default. If no port is provided, the default port for the given scheme is used: 8001 for grpc, 8000 for http.
- model_name (str) – name of the model to interact with.
- model_version (Optional[str], default: None) – version of the model to interact with. If model_version is None, inference is performed on the latest model. The latest version of the model is the numerically greatest version number.
- lazy_init (bool, default: True) – if True, initialization is performed just before the first request is sent to the inference server.
- init_timeout_s (Optional[float], default: None) – maximum time in seconds spent in the retry loop that asks whether the model is ready. It is applied at initialization time only when the lazy_init argument is False; otherwise the retry loop runs at the first inference.
- inference_timeout_s (Optional[float], default: None) – timeout in seconds for the model inference process. If not passed, the default 60 second timeout is used. For the HTTP client it applies not only to inference but to any client request (get model config, model readiness checks); for the gRPC client it applies to inference only.
- model_config (Optional[TritonModelConfig], default: None) – model configuration. If not passed, it is read from the inference server during initialization.
- ensure_model_is_ready (bool, default: True) – whether the model should be checked for readiness before the first inference request.
Raises:
- PyTritonClientModelUnavailableError – if the model with the given name (and version) is unavailable.
- PyTritonClientTimeoutError – if the lazy_init argument is False and the wait time for the server and model to be ready exceeds init_timeout_s.
- PyTritonClientUrlParseError – in case of problems with parsing the url.
Source code in pytriton/client/client.py
is_batching_supported
property
Checks whether the model supports batching.
Also waits for the server to reach readiness state.
model_config
property
Obtain the configuration of the model deployed on the Triton Inference Server.
This method waits for the server to get into readiness state before obtaining the model configuration.
Returns:
- TritonModelConfig (TritonModelConfig) – configuration of the model deployed on the Triton Inference Server.
Raises:
- PyTritonClientTimeoutError – if the server and model are not in readiness state before the given timeout.
- PyTritonClientModelUnavailableError – if the model with the given name (and version) is unavailable.
- KeyboardInterrupt – if the hosting process receives SIGINT.
- PyTritonClientClosedError – if the ModelClient is closed.
__enter__
__exit__
close
Close resources used by ModelClient.
This method closes the resources used by the ModelClient instance, including the Triton Inference Server connections. Once this method is called, the ModelClient instance should not be used again.
Source code in pytriton/client/client.py
create_client_from_url
Create Triton Inference Server client.
Parameters:
- url (str) – url of the server to connect to. If the url doesn't contain a scheme (e.g. "localhost:8001"), the http scheme is added. If the url doesn't contain a port (e.g. "localhost"), the default port for the given scheme is added.
- network_timeout_s (Optional[float], default: None) – timeout for client commands. Default value is 60.0 s.
Returns:
- Triton Inference Server client.
Raises:
- PyTritonClientInvalidUrlError – if the provided Triton Inference Server url is invalid.
Source code in pytriton/client/client.py
from_existing_client
classmethod
Create a new instance from an existing client using the same class.
Common usage:
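A minimal sketch, assuming an already initialized ModelClient named existing_client; the second client shares its configuration and readiness state:
existing_client = ModelClient("localhost", "BERT")
new_client = ModelClient.from_existing_client(existing_client)
result_dict = new_client.infer_sample(input1_sample, input2_sample)
new_client.close()
existing_client.close()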
Parameters:
- existing_client (BaseModelClient) – an instance of an already initialized subclass.
Returns:
- A new instance of the same subclass with shared configuration and readiness state.
Source code in pytriton/client/client.py
get_lib
infer_batch
infer_batch(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs) -> Dict[str, ndarray]
Run synchronous inference on batched data.
Typical usage:
client = ModelClient("localhost", "MyModel")
result_dict = client.infer_batch(input1, input2)
client.close()
Inference inputs can be provided either as positional or keyword arguments:
result_dict = client.infer_batch(input1, input2)
result_dict = client.infer_batch(a=input1, b=input2)
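Unlike infer_sample, infer_batch expects arrays with a leading batch dimension. A hedged sketch of building such a batch with numpy, continuing the client created above (the sample shapes are assumptions, not requirements of the API):
import numpy as np
# stack two single samples along a new leading batch axis
input1 = np.stack([np.random.rand(3).astype(np.float32) for _ in range(2)])
input2 = np.stack([np.random.rand(3).astype(np.float32) for _ in range(2)])
result_dict = client.infer_batch(input1, input2)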
Parameters:
- *inputs – Inference inputs provided as positional arguments.
- parameters (Optional[Dict[str, Union[str, int, bool]]], default: None) – Custom inference parameters.
- headers (Optional[Dict[str, Union[str, int, bool]]], default: None) – Custom inference headers.
- **named_inputs – Inference inputs provided as named arguments.
Returns:
- Dict[str, ndarray] – dictionary with inference results, where dictionary keys are output names.
Raises:
- PyTritonClientValueError – if mixing of positional and named argument passing is detected.
- PyTritonClientTimeoutError – if the wait time for the server and model to be ready exceeds init_timeout_s (on the first call, when lazy_init is False) or the inference request time exceeds inference_timeout_s passed to __init__.
- PyTritonClientModelUnavailableError – if the model with the given name (and version) is unavailable.
- PyTritonClientInferenceServerError – if an error occurred on the inference callable or Triton Inference Server side.
- PyTritonClientModelDoesntSupportBatchingError – if the model doesn't support batching.
Source code in pytriton/client/client.py
infer_sample
infer_sample(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs) -> Dict[str, ndarray]
Run synchronous inference on a single data sample.
Typical usage:
client = ModelClient("localhost", "MyModel")
result_dict = client.infer_sample(input1, input2)
client.close()
Inference inputs can be provided either as positional or keyword arguments:
result_dict = client.infer_sample(input1, input2)
result_dict = client.infer_sample(a=input1, b=input2)
Parameters:
- *inputs – Inference inputs provided as positional arguments.
- parameters (Optional[Dict[str, Union[str, int, bool]]], default: None) – Custom inference parameters.
- headers (Optional[Dict[str, Union[str, int, bool]]], default: None) – Custom inference headers.
- **named_inputs – Inference inputs provided as named arguments.
Returns:
- Dict[str, ndarray] – dictionary with inference results, where dictionary keys are output names.
Raises:
- PyTritonClientValueError – if mixing of positional and named argument passing is detected.
- PyTritonClientTimeoutError – if the wait time for the server and model to be ready exceeds init_timeout_s or the inference request time exceeds inference_timeout_s.
- PyTritonClientModelUnavailableError – if the model with the given name (and version) is unavailable.
- PyTritonClientInferenceServerError – if an error occurred on the inference callable or Triton Inference Server side.
Source code in pytriton/client/client.py
load_model
Load model on the Triton Inference Server.
Parameters:
- config (Optional[str], default: None) – optional JSON representation of the model config provided for the load request. If provided, this config is used for loading the model.
- files (Optional[dict], default: None) – optional dictionary mapping file paths (with "file:" prefix) in the override model directory to the file content as bytes. The files form the model directory that the model is loaded from. If specified, 'config' must be provided as the model configuration of the override model directory.
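A hedged sketch of a load_model call with an inline config and an override file; the config contents and the file name model.py are illustrative assumptions, not values required by the API:
import json
config = json.dumps({"backend": "python", "max_batch_size": 8})
with open("model.py", "rb") as f:
    files = {"file:1/model.py": f.read()}
client.load_model(config=config, files=files)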
Source code in pytriton/client/client.py
unload_model
wait_for_model
wait_for_model(timeout_s: float)
Wait for the Triton Inference Server and the deployed model to be ready.
Parameters:
- timeout_s (float) – timeout in seconds to wait for the server and model to be ready.
Raises:
- PyTritonClientTimeoutError – if the server and model are not ready before the given timeout.
- PyTritonClientModelUnavailableError – if the model with the given name (and version) is unavailable.
- KeyboardInterrupt – if the hosting process receives SIGINT.
- PyTritonClientClosedError – if the ModelClient is closed.
Source code in pytriton/client/client.py
wait_for_server
wait_for_server(timeout_s: float)
Wait for Triton Inference Server readiness.
Parameters:
- timeout_s (float) – timeout in seconds for the server to reach readiness state.
Raises:
- PyTritonClientTimeoutError – if the server is not in readiness state before the given timeout.
- KeyboardInterrupt – if the hosting process receives SIGINT.
Source code in pytriton/client/client.py
pytriton.client.AsyncioModelClient
AsyncioModelClient(url: str, model_name: str, model_version: Optional[str] = None, *, lazy_init: bool = True, init_timeout_s: Optional[float] = None, inference_timeout_s: Optional[float] = None, model_config: Optional[TritonModelConfig] = None, ensure_model_is_ready: bool = True)
Bases: BaseModelClient
Asyncio client for a model deployed on the Triton Inference Server.
This client is based on the Triton Inference Server Python clients and GRPC library:
- tritonclient.http.aio.InferenceServerClient
- tritonclient.grpc.aio.InferenceServerClient
It can wait for the server to be ready with the model loaded and then perform inference on it. AsyncioModelClient supports the asyncio context manager protocol.
Typical usage:
from pytriton.client import AsyncioModelClient
import numpy as np
input1_sample = np.random.rand(1, 3, 224, 224).astype(np.float32)
input2_sample = np.random.rand(1, 3, 224, 224).astype(np.float32)
client = AsyncioModelClient("localhost", "MyModel")
result_dict = await client.infer_sample(input1_sample, input2_sample)
print(result_dict["output_name"])
await client.close()
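Since await is only valid inside a coroutine, the snippet above would typically be wrapped in an async function and driven with asyncio.run; a minimal sketch under the same assumptions:
import asyncio
async def main():
    client = AsyncioModelClient("localhost", "MyModel")
    result_dict = await client.infer_sample(input1_sample, input2_sample)
    print(result_dict["output_name"])
    await client.close()
asyncio.run(main())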
Initializes ModelClient for the given model deployed on the Triton Inference Server.
If the lazy_init argument is False, the model configuration is read from the inference server during initialization.
Parameters:
- url (str) – The Triton Inference Server url, e.g. 'grpc://localhost:8001'. If no scheme is provided, the http scheme is used as default. If no port is provided, the default port for the given scheme is used: 8001 for grpc, 8000 for http.
- model_name (str) – name of the model to interact with.
- model_version (Optional[str], default: None) – version of the model to interact with. If model_version is None, inference is performed on the latest model. The latest version of the model is the numerically greatest version number.
- lazy_init (bool, default: True) – if True, initialization is performed just before the first request is sent to the inference server.
- init_timeout_s (Optional[float], default: None) – timeout in seconds for the server and model to be ready.
- model_config (Optional[TritonModelConfig], default: None) – model configuration. If not passed, it is read from the inference server during initialization.
- ensure_model_is_ready (bool, default: True) – whether the model should be checked for readiness before the first inference request.
Raises:
- PyTritonClientModelUnavailableError – if the model with the given name (and version) is unavailable.
- PyTritonClientTimeoutError – if the lazy_init argument is False and the wait time for the server and model to be ready exceeds init_timeout_s.
- PyTritonClientUrlParseError – in case of problems with parsing the url.
Source code in pytriton/client/client.py
model_config
async
property
Obtain the configuration of the model deployed on the Triton Inference Server.
Also waits for the server to reach readiness state.
__aenter__
async
Create a context for using AsyncioModelClient as a context manager.
Source code in pytriton/client/client.py
__aexit__
async
Close resources used by AsyncioModelClient when exiting from context.
close
async
Close resources used by the client.
Source code in pytriton/client/client.py
create_client_from_url
Create Triton Inference Server client.
Parameters:
- url (str) – url of the server to connect to. If the url doesn't contain a scheme (e.g. "localhost:8001"), the http scheme is added. If the url doesn't contain a port (e.g. "localhost"), the default port for the given scheme is added.
- network_timeout_s (Optional[float], default: None) – timeout for client commands. Default value is 60.0 s.
Returns:
- Triton Inference Server client.
Raises:
- PyTritonClientInvalidUrlError – if the provided Triton Inference Server url is invalid.
Source code in pytriton/client/client.py
from_existing_client
classmethod
Create a new instance from an existing client using the same class.
Common usage: see the example under ModelClient.from_existing_client above.
Parameters:
- existing_client (BaseModelClient) – an instance of an already initialized subclass.
Returns:
- A new instance of the same subclass with shared configuration and readiness state.
Source code in pytriton/client/client.py
get_lib
infer_batch
async
infer_batch(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs)
Run asynchronous inference on batched data.
Typical usage:
client = AsyncioModelClient("localhost", "MyModel")
result_dict = await client.infer_batch(input1, input2)
await client.close()
Inference inputs can be provided either as positional or keyword arguments:
result_dict = await client.infer_batch(input1, input2)
result_dict = await client.infer_batch(a=input1, b=input2)
Mixing of argument passing conventions is not supported and will raise PyTritonClientValueError.
Parameters:
- *inputs – inference inputs provided as positional arguments.
- parameters (Optional[Dict[str, Union[str, int, bool]]], default: None) – custom inference parameters.
- headers (Optional[Dict[str, Union[str, int, bool]]], default: None) – custom inference headers.
- **named_inputs – inference inputs provided as named arguments.
Returns:
- dictionary with inference results, where dictionary keys are output names.
Raises:
- PyTritonClientValueError – if mixing of positional and named argument passing is detected.
- PyTritonClientTimeoutError – if, on the first method call, the lazy_init argument is False and the wait time for the server and model to be ready exceeds init_timeout_s, or the inference time exceeds timeout_s.
- PyTritonClientModelDoesntSupportBatchingError – if the model doesn't support batching.
- PyTritonClientModelUnavailableError – if the model with the given name (and version) is unavailable.
- PyTritonClientInferenceServerError – if an error occurred on the inference callable or Triton Inference Server side.
Source code in pytriton/client/client.py
infer_sample
async
infer_sample(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs)
Run asynchronous inference on a single data sample.
Typical usage:
client = AsyncioModelClient("localhost", "MyModel")
result_dict = await client.infer_sample(input1, input2)
await client.close()
Inference inputs can be provided either as positional or keyword arguments:
result_dict = await client.infer_sample(input1, input2)
result_dict = await client.infer_sample(a=input1, b=input2)
Mixing of argument passing conventions is not supported and will raise PyTritonClientValueError.
Parameters:
- *inputs – inference inputs provided as positional arguments.
- parameters (Optional[Dict[str, Union[str, int, bool]]], default: None) – custom inference parameters.
- headers (Optional[Dict[str, Union[str, int, bool]]], default: None) – custom inference headers.
- **named_inputs – inference inputs provided as named arguments.
Returns:
- dictionary with inference results, where dictionary keys are output names.
Raises:
- PyTritonClientValueError – if mixing of positional and named argument passing is detected.
- PyTritonClientTimeoutError – if, on the first method call, the lazy_init argument is False and the wait time for the server and model to be ready exceeds init_timeout_s, or the inference time exceeds timeout_s.
- PyTritonClientModelUnavailableError – if the model with the given name (and version) is unavailable.
- PyTritonClientInferenceServerError – if an error occurred on the inference callable or Triton Inference Server side.
Source code in pytriton/client/client.py
wait_for_model
async
wait_for_model(timeout_s: float)
Asynchronously wait for the Triton Inference Server and the deployed model to be ready.
Parameters:
- timeout_s (float) – timeout in seconds for the server and model to reach readiness state.
Raises:
- PyTritonClientTimeoutError – if the server and model are not in readiness state before the given timeout.
- PyTritonClientModelUnavailableError – if the model with the given name (and version) is unavailable.
- KeyboardInterrupt – if the hosting process receives SIGINT.
Source code in pytriton/client/client.py
pytriton.client.DecoupledModelClient
DecoupledModelClient(url: str, model_name: str, model_version: Optional[str] = None, *, lazy_init: bool = True, init_timeout_s: Optional[float] = None, inference_timeout_s: Optional[float] = None, model_config: Optional[TritonModelConfig] = None, ensure_model_is_ready: bool = True)
Bases: ModelClient
Synchronous client for a decoupled model deployed on the Triton Inference Server.
Initializes DecoupledModelClient for the given decoupled model deployed on the Triton Inference Server.
Common usage:
client = DecoupledModelClient("localhost", "BERT")
for response in client.infer_sample(input1_sample, input2_sample):
    print(response)
client.close()
Parameters:
- url (str) – The Triton Inference Server url, e.g. grpc://localhost:8001. If no scheme is provided, the http scheme is used as default. If no port is provided, the default port for the given scheme is used: 8001 for grpc, 8000 for http.
- model_name (str) – name of the model to interact with.
- model_version (Optional[str], default: None) – version of the model to interact with. If model_version is None, inference is performed on the latest model. The latest version of the model is the numerically greatest version number.
- lazy_init (bool, default: True) – if True, initialization is performed just before the first request is sent to the inference server.
- init_timeout_s (Optional[float], default: None) – timeout in seconds for the server and model to be ready. If not passed, the default timeout of 300 seconds is used.
- inference_timeout_s (Optional[float], default: None) – timeout in seconds for a single model inference request. If not passed, the default timeout of 60 seconds is used.
- model_config (Optional[TritonModelConfig], default: None) – model configuration. If not passed, it is read from the inference server during initialization.
- ensure_model_is_ready (bool, default: True) – whether the model should be checked for readiness before the first inference request.
Raises:
- PyTritonClientModelUnavailableError – if the model with the given name (and version) is unavailable.
- PyTritonClientTimeoutError – if the lazy_init argument is False and the wait time for the server and model to be ready exceeds init_timeout_s.
- PyTritonClientInvalidUrlError – if the provided Triton Inference Server url is invalid.
Source code in pytriton/client/client.py
is_batching_supported
property
Checks whether the model supports batching.
Also waits for the server to reach readiness state.
model_config
property
Obtain the configuration of the model deployed on the Triton Inference Server.
This method waits for the server to get into readiness state before obtaining the model configuration.
Returns:
- TritonModelConfig (TritonModelConfig) – configuration of the model deployed on the Triton Inference Server.
Raises:
- PyTritonClientTimeoutError – if the server and model are not in readiness state before the given timeout.
- PyTritonClientModelUnavailableError – if the model with the given name (and version) is unavailable.
- KeyboardInterrupt – if the hosting process receives SIGINT.
- PyTritonClientClosedError – if the ModelClient is closed.
__enter__
__exit__
close
Close resources used by DecoupledModelClient.
Source code in pytriton/client/client.py
create_client_from_url
Create Triton Inference Server client.
Parameters:
- url (str) – url of the server to connect to. If the url doesn't contain a scheme (e.g. "localhost:8001"), the http scheme is added. If the url doesn't contain a port (e.g. "localhost"), the default port for the given scheme is added.
- network_timeout_s (Optional[float], default: None) – timeout for client commands. Default value is 60.0 s.
Returns:
- Triton Inference Server client.
Raises:
- PyTritonClientInvalidUrlError – if the provided Triton Inference Server url is invalid.
Source code in pytriton/client/client.py
from_existing_client
classmethod
Create a new instance from an existing client using the same class.
Common usage: see the example under ModelClient.from_existing_client above.
Parameters:
- existing_client (BaseModelClient) – an instance of an already initialized subclass.
Returns:
- A new instance of the same subclass with shared configuration and readiness state.
Source code in pytriton/client/client.py
get_lib
infer_batch
infer_batch(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs) -> Dict[str, ndarray]
Run synchronous inference on batched data.
Typical usage:
client = ModelClient("localhost", "MyModel")
result_dict = client.infer_batch(input1, input2)
client.close()
Inference inputs can be provided either as positional or keyword arguments:
result_dict = client.infer_batch(input1, input2)
result_dict = client.infer_batch(a=input1, b=input2)
Parameters:
- *inputs – Inference inputs provided as positional arguments.
- parameters (Optional[Dict[str, Union[str, int, bool]]], default: None) – Custom inference parameters.
- headers (Optional[Dict[str, Union[str, int, bool]]], default: None) – Custom inference headers.
- **named_inputs – Inference inputs provided as named arguments.
Returns:
Raises:
- PyTritonClientValueError – if mixing of positional and named argument passing is detected.
- PyTritonClientTimeoutError – if the wait time for the server and model to be ready exceeds init_timeout_s (on the first call, when lazy_init is False) or the inference request time exceeds inference_timeout_s passed to __init__.
- PyTritonClientModelUnavailableError – if the model with the given name (and version) is unavailable.
- PyTritonClientInferenceServerError – if an error occurred on the inference callable or Triton Inference Server side.
- PyTritonClientModelDoesntSupportBatchingError – if the model doesn't support batching.
Source code in pytriton/client/client.py
infer_sample
infer_sample(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs) -> Dict[str, ndarray]
Run synchronous inference on a single data sample.
Typical usage:
client = ModelClient("localhost", "MyModel")
result_dict = client.infer_sample(input1, input2)
client.close()
Inference inputs can be provided either as positional or keyword arguments:
result_dict = client.infer_sample(input1, input2)
result_dict = client.infer_sample(a=input1, b=input2)
Parameters:
- *inputs – Inference inputs provided as positional arguments.
- parameters (Optional[Dict[str, Union[str, int, bool]]], default: None) – Custom inference parameters.
- headers (Optional[Dict[str, Union[str, int, bool]]], default: None) – Custom inference headers.
- **named_inputs – Inference inputs provided as named arguments.
Returns:
Raises:
- PyTritonClientValueError – if mixing of positional and named argument passing is detected.
- PyTritonClientTimeoutError – if the wait time for the server and model to be ready exceeds init_timeout_s or the inference request time exceeds inference_timeout_s.
- PyTritonClientModelUnavailableError – if the model with the given name (and version) is unavailable.
- PyTritonClientInferenceServerError – if an error occurred on the inference callable or Triton Inference Server side.
Source code in pytriton/client/client.py
load_model
Load model on the Triton Inference Server.
Parameters:
-
config
(Optional[str]
, default:None
) –str - Optional JSON representation of a model config provided for the load request, if provided, this config will be used for loading the model.
-
files
(Optional[dict]
, default:None
) –dict - Optional dictionary specifying file path (with "file:" prefix) in the override model directory to the file content as bytes. The files will form the model directory that the model will be loaded from. If specified, 'config' must be provided to be the model configuration of the override model directory.
Source code in pytriton/client/client.py
unload_model
wait_for_model
wait_for_model(timeout_s: float)
Wait for the Triton Inference Server and the deployed model to be ready.
Parameters:
- timeout_s (float) – timeout in seconds to wait for the server and model to be ready.
Raises:
- PyTritonClientTimeoutError – if the server and model are not ready before the given timeout.
- PyTritonClientModelUnavailableError – if the model with the given name (and version) is unavailable.
- KeyboardInterrupt – if the hosting process receives SIGINT.
- PyTritonClientClosedError – if the ModelClient is closed.
Source code in pytriton/client/client.py
wait_for_server
wait_for_server(timeout_s: float)
Wait for Triton Inference Server readiness.
Parameters:
-
timeout_s
(float
) –timeout to server get into readiness state.
Raises:
-
PyTritonClientTimeoutError
–If server is not in readiness state before given timeout.
-
KeyboardInterrupt
–If hosting process receives SIGINT
Source code in pytriton/client/client.py
pytriton.client.AsyncioDecoupledModelClient
AsyncioDecoupledModelClient(url: str, model_name: str, model_version: Optional[str] = None, *, lazy_init: bool = True, init_timeout_s: Optional[float] = None, inference_timeout_s: Optional[float] = None, model_config: Optional[TritonModelConfig] = None, ensure_model_is_ready: bool = True)
Bases: AsyncioModelClient
Asyncio client for a model deployed on the Triton Inference Server.
This client is based on the Triton Inference Server Python clients and GRPC library:
- tritonclient.grpc.aio.InferenceServerClient
It can wait for the server to be ready with the model loaded and then perform inference on it. AsyncioDecoupledModelClient supports the asyncio context manager protocol. The client is intended to be used with decoupled models and will raise an error if the model is coupled.
Typical usage:
from pytriton.client import AsyncioDecoupledModelClient
import numpy as np
input1_sample = np.random.rand(1, 3, 224, 224).astype(np.float32)
input2_sample = np.random.rand(1, 3, 224, 224).astype(np.float32)
async with AsyncioDecoupledModelClient("grpc://localhost", "MyModel") as client:
    async for result_dict in client.infer_sample(input1_sample, input2_sample):
        print(result_dict["output_name"])
Initializes ModelClient for the given model deployed on the Triton Inference Server.
If the lazy_init argument is False, the model configuration is read from the inference server during initialization.
Parameters:
- url (str) – The Triton Inference Server url, e.g. 'grpc://localhost:8001'. If no scheme is provided, the http scheme is used as default. If no port is provided, the default port for the given scheme is used: 8001 for grpc, 8000 for http.
- model_name (str) – name of the model to interact with.
- model_version (Optional[str], default: None) – version of the model to interact with. If model_version is None, inference is performed on the latest model. The latest version of the model is the numerically greatest version number.
- lazy_init (bool, default: True) – if True, initialization is performed just before the first request is sent to the inference server.
- init_timeout_s (Optional[float], default: None) – timeout in seconds for the server and model to be ready.
- model_config (Optional[TritonModelConfig], default: None) – model configuration. If not passed, it is read from the inference server during initialization.
- ensure_model_is_ready (bool, default: True) – whether the model should be checked for readiness before the first inference request.
Raises:
- PyTritonClientModelUnavailableError – if the model with the given name (and version) is unavailable.
- PyTritonClientTimeoutError – if the lazy_init argument is False and the wait time for the server and model to be ready exceeds init_timeout_s.
- PyTritonClientUrlParseError – in case of problems with parsing the url.
Source code in pytriton/client/client.py
model_config
async
property
Obtain the configuration of the model deployed on the Triton Inference Server.
Also waits for the server to reach readiness state.
__aenter__
async
Create a context for using AsyncioModelClient as a context manager.
Source code in pytriton/client/client.py
__aexit__
async
Close resources used by AsyncioModelClient when exiting from context.
close
async
Close resources used by the client.
Source code in pytriton/client/client.py
create_client_from_url
Create Triton Inference Server client.
Parameters:
- url (str) – url of the server to connect to. If the url doesn't contain a scheme (e.g. "localhost:8001"), the http scheme is added. If the url doesn't contain a port (e.g. "localhost"), the default port for the given scheme is added.
- network_timeout_s (Optional[float], default: None) – timeout for client commands. Default value is 60.0 s.
Returns:
- Triton Inference Server client.
Raises:
- PyTritonClientInvalidUrlError – if the provided Triton Inference Server url is invalid.
Source code in pytriton/client/client.py
from_existing_client
classmethod
Create a new instance from an existing client using the same class.
Common usage: see the example under ModelClient.from_existing_client above.
Parameters:
- existing_client (BaseModelClient) – an instance of an already initialized subclass.
Returns:
- A new instance of the same subclass with shared configuration and readiness state.
Source code in pytriton/client/client.py
get_lib
infer_batch
async
infer_batch(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs)
Run asynchronous inference on batched data.
Typical usage:
async with AsyncioDecoupledModelClient("grpc://localhost", "MyModel") as client:
    async for result_dict in client.infer_batch(input1_sample, input2_sample):
        print(result_dict["output_name"])
Inference inputs can be provided either as positional or keyword arguments:
results_iterator = client.infer_batch(input1, input2)
results_iterator = client.infer_batch(a=input1, b=input2)
Mixing of argument passing conventions is not supported and will raise PyTritonClientValueError.
Parameters:
- *inputs – inference inputs provided as positional arguments.
- parameters (Optional[Dict[str, Union[str, int, bool]]], default: None) – custom inference parameters.
- headers (Optional[Dict[str, Union[str, int, bool]]], default: None) – custom inference headers.
- **named_inputs – inference inputs provided as named arguments.
Returns:
- Asynchronous generator, which generates dictionaries with partial inference results, where dictionary keys are output names.
Raises:
- PyTritonClientValueError – if mixing of positional and named argument passing is detected.
- PyTritonClientTimeoutError – if, on the first method call, the lazy_init argument is False and the wait time for the server and model to be ready exceeds init_timeout_s, or the inference time exceeds timeout_s.
- PyTritonClientModelDoesntSupportBatchingError – if the model doesn't support batching.
- PyTritonClientModelUnavailableError – if the model with the given name (and version) is unavailable.
- PyTritonClientInferenceServerError – if an error occurred on the inference callable or Triton Inference Server side.
Source code in pytriton/client/client.py
infer_sample
async
infer_sample(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs)
Run asynchronous inference on a single data sample.
Typical usage:
async with AsyncioDecoupledModelClient("grpc://localhost", "MyModel") as client:
    async for result_dict in client.infer_sample(input1_sample, input2_sample):
        print(result_dict["output_name"])
Inference inputs can be provided either as positional or keyword arguments:
results_iterator = client.infer_sample(input1, input2)
results_iterator = client.infer_sample(a=input1, b=input2)
Mixing of argument passing conventions is not supported and will raise PyTritonClientValueError.
Parameters:
- *inputs – inference inputs provided as positional arguments.
- parameters (Optional[Dict[str, Union[str, int, bool]]], default: None) – custom inference parameters.
- headers (Optional[Dict[str, Union[str, int, bool]]], default: None) – custom inference headers.
- **named_inputs – inference inputs provided as named arguments.
Returns:
- Asynchronous generator, which generates dictionaries with partial inference results, where dictionary keys are output names.
Raises:
- PyTritonClientValueError – if mixing of positional and named argument passing is detected.
- PyTritonClientTimeoutError – if, on the first method call, the lazy_init argument is False and the wait time for the server and model to be ready exceeds init_timeout_s, or the inference time exceeds timeout_s.
- PyTritonClientModelUnavailableError – if the model with the given name (and version) is unavailable.
- PyTritonClientInferenceServerError – if an error occurred on the inference callable or Triton Inference Server side.
Source code in pytriton/client/client.py
wait_for_model
async
wait_for_model(timeout_s: float)
Asynchronously wait for the Triton Inference Server and the deployed model to be ready.
Parameters:
- timeout_s (float) – timeout in seconds for the server and model to reach readiness state.
Raises:
- PyTritonClientTimeoutError – if the server and model are not in readiness state before the given timeout.
- PyTritonClientModelUnavailableError – if the model with the given name (and version) is unavailable.
- KeyboardInterrupt – if the hosting process receives SIGINT.
Source code in pytriton/client/client.py
pytriton.client.FuturesModelClient
FuturesModelClient(url: str, model_name: str, model_version: Optional[str] = None, *, max_workers: int = 128, max_queue_size: int = 128, non_blocking: bool = False, init_timeout_s: Optional[float] = None, inference_timeout_s: Optional[float] = None)
A client for interacting with a model deployed on the Triton Inference Server using concurrent.futures.
This client allows asynchronous inference requests using a thread pool executor. It can be used to perform inference on a model by providing input data and receiving the corresponding output data. The client can be used in a with statement to ensure proper resource management.
Example usage with context manager:
with FuturesModelClient("localhost", "MyModel") as client:
    result_future = client.infer_sample(input1=input1_data, input2=input2_data)
    # do something else
    print(result_future.result())
Usage without context manager:
client = FuturesModelClient("localhost", "MyModel")
result_future = client.infer_sample(input1=input1_data, input2=input2_data)
# do something else
print(result_future.result())
client.close()
Initializes the FuturesModelClient for a given model.
Parameters:
- url (str) – The Triton Inference Server url, e.g. grpc://localhost:8001.
- model_name (str) – The name of the model to interact with.
- model_version (Optional[str], default: None) – The version of the model to interact with. If None, the latest version is used.
- max_workers (int, default: 128) – The maximum number of threads that can be used to execute the given calls. If None, there is no limit on the number of threads.
- max_queue_size (int, default: 128) – The maximum number of requests that can be queued. If None, there is no limit on the number of requests.
- non_blocking (bool, default: False) – If True, the client raises PyTritonClientQueueFullError when the queue is full. If False, the client blocks until the queue is no longer full.
- init_timeout_s (Optional[float], default: None) – Timeout in seconds for the server and model to be ready. If not passed, the default 60 second timeout is used.
- inference_timeout_s (Optional[float], default: None) – Timeout in seconds for a single model inference request. If not passed, the default 60 second timeout is used.
Source code in pytriton/client/client.py
__enter__
__exit__
close
Close resources used by FuturesModelClient.
This method closes the resources used by the FuturesModelClient instance, including the Triton Inference Server connections. Once this method is called, the FuturesModelClient instance should not be used again.
Parameters:
- wait – If True, shutdown will not return until all running futures have finished executing.
Source code in pytriton/client/client.py
infer_batch
infer_batch(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs) -> Future
Run asynchronous inference on batched data and return a Future object.
This method allows the user to perform inference on batched data by providing input data and receiving the corresponding output data. The method returns a Future object that wraps a dictionary of inference results, where dictionary keys are output names.
Example usage:
with FuturesModelClient("localhost", "BERT") as client:
    future = client.infer_batch(input1_sample, input2_sample)
    # do something else
    print(future.result())
Inference inputs can be provided either as positional or keyword arguments:
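future = client.infer_batch(input1, input2)
future = client.infer_batch(a=input1, b=input2)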
Mixing of argument passing conventions is not supported and will raise PyTritonClientValueError.
Parameters:
- *inputs – Inference inputs provided as positional arguments.
- parameters (Optional[Dict[str, Union[str, int, bool]]], default: None) – Optional dictionary of inference parameters.
- headers (Optional[Dict[str, Union[str, int, bool]]], default: None) – Optional dictionary of HTTP headers for the inference request.
- **named_inputs – Inference inputs provided as named arguments.
Returns:
- Future – A Future object wrapping a dictionary of inference results, where dictionary keys are output names.
Raises:
- PyTritonClientClosedError – If the FuturesModelClient is closed.
Source code in pytriton/client/client.py
infer_sample
infer_sample(*inputs, parameters: Optional[Dict[str, Union[str, int, bool]]] = None, headers: Optional[Dict[str, Union[str, int, bool]]] = None, **named_inputs) -> Future
Run asynchronous inference on a single data sample and return a Future object.
This method allows the user to perform inference on a single data sample by providing input data and receiving the corresponding output data. The method returns a Future object that wraps a dictionary of inference results, where dictionary keys are output names.
Example usage:
with FuturesModelClient("localhost", "BERT") as client:
    result_future = client.infer_sample(input1=input1_data, input2=input2_data)
    # do something else
    print(result_future.result())
Inference inputs can be provided either as positional or keyword arguments:
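result_future = client.infer_sample(input1, input2)
result_future = client.infer_sample(a=input1, b=input2)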
Parameters:
- *inputs – Inference inputs provided as positional arguments.
- parameters (Optional[Dict[str, Union[str, int, bool]]], default: None) – Optional dictionary of inference parameters.
- headers (Optional[Dict[str, Union[str, int, bool]]], default: None) – Optional dictionary of HTTP headers for the inference request.
- **named_inputs – Inference inputs provided as named arguments.
Returns:
- Future – A Future object wrapping a dictionary of inference results, where dictionary keys are output names.
Raises:
- PyTritonClientClosedError – If the FuturesModelClient is closed.
Source code in pytriton/client/client.py
model_config
model_config() -> Future
Obtain the configuration of the model deployed on the Triton Inference Server.
This method returns a Future object that will contain the TritonModelConfig object when it is ready. Client will wait init_timeout_s for the server to get into readiness state before obtaining the model configuration.
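A minimal usage sketch, reusing the client names from the examples above:
with FuturesModelClient("localhost", "BERT") as client:
    config_future = client.model_config()
    # do something else
    model_config = config_future.result()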
Returns:
- Future – A Future object that will contain the TritonModelConfig object when it is ready.
Raises:
- PyTritonClientClosedError – If the FuturesModelClient is closed.
Source code in pytriton/client/client.py
wait_for_model
Returns a Future object whose result will be None when the model is ready.
Typical usage:
with FuturesModelClient("localhost", "BERT") as client:
    future = client.wait_for_model(300.)
    # do something else
    future.result()  # waits the rest of timeout_s,
    # then returns None if the model is ready
    # or raises PyTritonClientTimeoutError
Parameters:
- timeout_s (float) – The maximum amount of time to wait for the model to be ready, in seconds.
Returns:
- Future – A Future object whose result is None when the model is ready.