High-Level Design
The diagram below shows how Python models are served through Triton Inference Server using PyTriton. The solution consists of two main components:
- Triton Inference Server: exposes the HTTP/gRPC API and provides performance features such as dynamic batching and response caching.
- Python Model Environment: the environment in which your Python model is executed.
The Triton Inference Server binaries are provided as part of the PyTriton installation. The Triton Server is installed in your current environment (system or container). PyTriton controls the Triton Server process through the Triton Controller.
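As a rough sketch of this process control, the snippet below shows how PyTriton can start and stop the bundled Triton Server process from the current Python environment; the `TritonConfig` port values are illustrative assumptions, not required settings.

```python
from pytriton.triton import Triton, TritonConfig

# Illustrative port values; TritonConfig forwards options to the Triton Server binary.
config = TritonConfig(http_port=8000, grpc_port=8001, metrics_port=8002)

with Triton(config=config) as triton:
    # Models are bound here before the server starts (see the sketch further below).
    triton.run()   # launch the bundled Triton Server process in this environment
    ...            # the server now handles HTTP/gRPC traffic in the background
    triton.stop()  # stop the server process (also performed on context exit)
```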
Exposing a model through PyTriton requires defining an Inference Callable - a Python function that is connected to Triton Inference Server and executes the model or ensemble for predictions. The integration layer binds the Inference Callable to Triton Server and exposes it through the Triton HTTP/gRPC API under the provided `<model name>`. Once the integration is done, the defined Inference Callable receives data sent to the HTTP/gRPC endpoint `v2/models/<model name>/infer`. Read more about the HTTP/gRPC interface in the Triton Inference Server documentation.
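Below is a minimal sketch of how such an Inference Callable could be defined and bound; the model name `MyModel`, the tensor names `data` and `output`, and the doubling computation are illustrative placeholders.

```python
import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


@batch
def infer_fn(data):
    # Placeholder model: doubles the batched numpy input; a real Inference
    # Callable would run its model or ensemble prediction here.
    return {"output": data * 2}


with Triton() as triton:
    # Bind the Inference Callable under the model name "MyModel"; Triton then
    # exposes it at v2/models/MyModel/infer over HTTP/gRPC.
    triton.bind(
        model_name="MyModel",
        infer_func=infer_fn,
        inputs=[Tensor(name="data", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="output", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=16),
    )
    triton.serve()  # block and handle requests until interrupted
```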
The HTTP/gRPC requests sent to `v2/models/<model name>/infer` are handled by Triton Inference Server. The server batches requests and passes them to the Proxy Backend, which sends the batched requests to the appropriate Inference Callable. The data is passed as numpy arrays. Once the Inference Callable finishes executing the model prediction, the result is returned to the Proxy Backend, and Triton Server creates the response.
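As an illustration of the client side, the sketch below sends a numpy array to the endpoint of the model bound in the earlier sketch; the address `localhost:8000` and the names `MyModel`, `data`, and `output` are assumptions carried over from that example.

```python
import numpy as np

from pytriton.client import ModelClient

# "localhost:8000" and "MyModel" match the placeholders from the serving sketch above.
with ModelClient("localhost:8000", "MyModel") as client:
    # Inputs and outputs travel as numpy arrays; the request goes to the
    # v2/models/MyModel/infer endpoint and is batched by Triton Server.
    result = client.infer_batch(data=np.array([[1.0, 2.0, 3.0]], dtype=np.float32))
    print(result["output"])
```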