PyTriton
PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments. In general,
PyTriton can serve any Python function. Model Navigator provides a runner, an abstraction that connects the model
checkpoint with its runtime, making the inference process more accessible and straightforward. The runner is a Python
API through which an optimized model can serve inference.
Obtaining runner from Package
The Navigator Package provides an API for obtaining the model for serving inference. One option is to obtain the
runner from the package.
The default behavior is to select the model and runner that obtained the smallest latency and the largest throughput
during profiling. This runner is considered optimal for serving inference queries. Learn more about the get_runner
method in the Navigator Package API.
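The selection rule can be illustrated with a small sketch that does not depend on Model Navigator. The
ProfilingResult record, the pick_runner helper, the runner names, and the numbers are all hypothetical and exist only
for illustration. The sketch also assumes one possible reading of "smallest latency and largest throughput": prefer
the highest throughput and break ties by the lowest latency. The actual strategy implemented by the library may
combine the two metrics differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfilingResult:
    """Hypothetical record of one profiled runner (not a Model Navigator type)."""

    runner_name: str
    latency_ms: float
    throughput: float  # inferences per second


def pick_runner(results):
    """Return the result with the highest throughput; ties go to the lowest latency.

    This ordering is an assumption made for the sketch, not the library's documented rule.
    """
    if not results:
        raise ValueError("no profiling results to choose from")
    return max(results, key=lambda r: (r.throughput, -r.latency_ms))


# Made-up measurements, used only to exercise the helper.
candidates = [
    ProfilingResult("runner_a", latency_ms=4.0, throughput=250.0),
    ProfilingResult("runner_b", latency_ms=2.5, throughput=400.0),
    ProfilingResult("runner_c", latency_ms=3.0, throughput=400.0),
]

best = pick_runner(candidates)
print(best.runner_name)
```

Sorting on a (throughput, negated latency) tuple keeps the rule in one expression and makes the tie-break explicit.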
To use the runner in PyTriton, additional information about the served model is required. For that purpose, we provide
a PyTritonAdapter that contains the minimal information required to prepare a successful deployment of the model
using PyTriton.
Using PyTritonAdapter
Model Navigator provides a dedicated PyTritonAdapter to retrieve the runner and the other information required to
bind a model for serving inference. Following that, you can initialize the PyTriton server using the adapter
information:
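import model_navigator as nav
from pytriton.decorators import batch
from pytriton.triton import Triton

# `package` is a Navigator Package that was optimized or loaded earlier.
pytriton_adapter = nav.pytriton.PyTritonAdapter(package=package, strategy=nav.MaxThroughputStrategy())
runner = pytriton_adapter.runner
runner.activate()  # load the model into its runtime before serving


@batch
def infer_func(**inputs):
    # Forward the batched inputs to the selected runner.
    return runner.infer(inputs)


with Triton() as triton:
    triton.bind(
        model_name="resnet50",
        infer_func=infer_func,
        inputs=pytriton_adapter.inputs,
        outputs=pytriton_adapter.outputs,
        config=pytriton_adapter.config,
    )
    triton.serve()  # blocks until the server is stopped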
Once the Python script is executed, model inference is served through HTTP/gRPC endpoints.
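As a rough illustration of querying the HTTP endpoint, the sketch below builds a request in the KServe v2 inference
protocol format that Triton exposes. Several details are assumptions rather than facts from this page: the server is
taken to listen on localhost at Triton's default HTTP port 8000, and the input name input__0, the FP32 datatype, and
the tiny shape are placeholders that must be replaced with the values reported by the adapter's inputs. The request
is only constructed here; sending it requires a running server.

```python
import json
import urllib.request


def build_infer_request(model_name, input_name, shape, data, host="localhost", port=8000):
    """Build (but do not send) a KServe v2 HTTP inference request.

    The host, port, input name, and datatype are illustrative assumptions.
    """
    url = f"http://{host}:{port}/v2/models/{model_name}/infer"
    body = {
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": "FP32",
                "data": list(data),  # flattened, row-major tensor contents
            }
        ]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder input; a real ResNet-50 request would carry an image-sized tensor.
request = build_infer_request("resnet50", "input__0", (1, 4), [0.0, 0.1, 0.2, 0.3])
print(request.full_url)

# To send the request against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

For Python clients, the client utilities shipped with PyTriton are usually more convenient than hand-built HTTP requests; the raw request is shown only to make the endpoint format visible.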