PyTriton

PyTriton is a Flask/FastAPI-like interface that simplifies deploying Triton Inference Server in Python environments. With PyTriton you can serve any Python function. Model Navigator provides a runner - an abstraction that connects the model checkpoint with its runtime, making the inference process more accessible and straightforward. The runner is a Python API through which an optimized model can serve inference.

Obtaining runner from Package

The Navigator Package provides an API for obtaining a model for serving inference. One option is to obtain the runner:

runner = package.get_runner()

By default, the model and runner that achieved the lowest latency and the highest throughput during profiling are selected. This runner is considered the best choice for serving inference queries. Learn more about the get_runner method in the Navigator Package API.
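The selection idea can be illustrated with a small, self-contained sketch. Note that the class, field names, and profiling numbers below are hypothetical and only illustrate the "highest throughput, lowest latency" heuristic; they are not Model Navigator's actual internals:

```python
from dataclasses import dataclass


@dataclass
class ProfilingResult:
    # Hypothetical profiling record; field names are illustrative only.
    runner_name: str
    latency_ms: float      # lower is better
    throughput_qps: float  # higher is better


def select_best(results):
    """Pick the runner with the highest throughput, breaking ties by lowest latency."""
    return max(results, key=lambda r: (r.throughput_qps, -r.latency_ms))


# Made-up profiling numbers for three runners of the same model.
results = [
    ProfilingResult("TorchScript", latency_ms=8.1, throughput_qps=410.0),
    ProfilingResult("TensorRT", latency_ms=2.3, throughput_qps=980.0),
    ProfilingResult("ONNXRuntime", latency_ms=4.7, throughput_qps=640.0),
]
print(select_best(results).runner_name)  # TensorRT
```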

To use the runner in PyTriton, additional information about the served model is required. For that purpose we provide a PyTritonAdapter that gathers the minimal information needed for a successful deployment of a model with PyTriton.

Using PyTritonAdapter

Model Navigator provides a dedicated PyTritonAdapter to retrieve the runner and the other information required to bind a model for serving inference. You can then initialize the PyTriton server using the adapter information:

import model_navigator as nav
from pytriton.decorators import batch
from pytriton.triton import Triton

pytriton_adapter = nav.pytriton.PyTritonAdapter(package=package, strategy=nav.MaxThroughputStrategy())
runner = pytriton_adapter.runner

# Load the model and prepare the runner for serving inference.
runner.activate()


@batch
def infer_func(**inputs):
    return runner.infer(inputs)


with Triton() as triton:
    triton.bind(
        model_name="resnet50",
        infer_func=infer_func,
        inputs=pytriton_adapter.inputs,
        outputs=pytriton_adapter.outputs,
        config=pytriton_adapter.config,
    )
    triton.serve()
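The @batch decorator above lets infer_func operate on whole batches: PyTriton gathers inputs from individual requests, passes them to the function as batched arrays, and splits the outputs back per request. A minimal sketch of that idea, assuming each request is a dict of numpy arrays with a leading batch dimension (this simple_batch helper is illustrative, not PyTriton's implementation):

```python
import numpy as np


def simple_batch(infer_fn):
    """Illustrative stand-in for pytriton.decorators.batch:
    stack per-request inputs along axis 0, run inference once,
    then split the outputs back per request."""
    def wrapper(requests):
        # Concatenate each named input across all requests.
        names = requests[0].keys()
        batched = {name: np.concatenate([r[name] for r in requests]) for name in names}
        outputs = infer_fn(**batched)
        # Split results back according to each request's batch size.
        sizes = [next(iter(r.values())).shape[0] for r in requests]
        offsets = np.cumsum([0] + sizes)
        return [
            {name: out[offsets[i]:offsets[i + 1]] for name, out in outputs.items()}
            for i in range(len(requests))
        ]
    return wrapper


@simple_batch
def infer_func(**inputs):
    # Toy "model": double the input tensor.
    return {"doubled": inputs["x"] * 2}


responses = infer_func([{"x": np.ones((2, 3))}, {"x": np.zeros((1, 3))}])
print([r["doubled"].shape for r in responses])  # [(2, 3), (1, 3)]
```

Batching requests this way lets the underlying runner exploit the hardware's parallelism instead of running inference once per request.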

Once the Python script is executed, model inference is served through HTTP/gRPC endpoints.