Custom HTTP/gRPC headers and parameters

This document provides guidelines for using custom HTTP/gRPC headers and parameters with PyTriton. The original Triton documentation on parameters can be found here. An undecorated inference function now accepts a list of Request instances. The Request class contains the following fields:

data - the inputs, stored as a dictionary; they can also be accessed through the request's dict interface, e.g. request["input_name"]
parameters - the combined parameters and HTTP/gRPC headers
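The two fields can be illustrated with a small stand-in object. This is a minimal sketch only: FakeRequest is a hypothetical class written for this illustration, not PyTriton's own Request, and the input names "a" and "b" are placeholders.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class FakeRequest:
    """Hypothetical stand-in that mimics the two fields described above."""

    data: Dict[str, Any]
    parameters: Dict[str, Any] = field(default_factory=dict)

    def __getitem__(self, input_name: str) -> Any:
        # Dict-style access delegates to the inputs stored in `data`.
        return self.data[input_name]


request = FakeRequest(
    data={"a": [2.0, 2.0], "b": [1.0, 1.0]},  # placeholder input names
    parameters={"parameter_multiplier": 2, "header_divisor": "3"},
)

first_input = request["a"]  # same object as request.data["a"]
multiplier = float(request.parameters["parameter_multiplier"])
divisor = float(request.parameters["header_divisor"])  # header values are strings
```

Parameters and headers share one dictionary, so the callable reads both in the same way; converting with float() keeps the code indifferent to whether a value arrived as a number or as a string.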

Parameters/headers usage limitations

Currently, custom parameters and headers can be accessed only in an undecorated inference function (they do not work with decorators). A separate example shows how to use parameters and headers in a preprocessing step (see here).

Parameters

Parameters are passed to the inference callable as a dictionary. The dictionary is sent in the body (payload) of the HTTP/gRPC request.
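To make "in the request body" concrete, the sketch below builds an HTTP JSON body by hand. The layout reflects the Triton (KServe v2) HTTP inference protocol as understood here and should be checked against the Triton documentation; the input names "a" and "b" are placeholders. In normal use ModelClient assembles the request, so this code is illustrative only.

```python
import json

# Assumed layout: a top-level "parameters" object next to the "inputs" list.
request_body = {
    "parameters": {"parameter_multiplier": 2},
    "inputs": [
        {"name": "a", "shape": [2, 1], "datatype": "FP32", "data": [2.0, 2.0]},
        {"name": "b", "shape": [2, 1], "datatype": "FP32", "data": [1.0, 1.0]},
    ],
}

payload = json.dumps(request_body)  # this string would travel as the request body
decoded = json.loads(payload)       # what a server-side JSON parser would recover
```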

HTTP/gRPC headers

Custom HTTP/gRPC headers are passed to the inference callable in the same dictionary as parameters, but they travel in the HTTP/gRPC request headers instead of the request body. For headers, a header prefix must also be specified in the Triton config. The prefix distinguishes custom headers from standard ones: only headers with the specified prefix are passed to the inference callable.
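The effect of a prefix pattern can be sketched with Python's re module. The helper select_forwarded_headers is hypothetical and only imitates the filtering idea; how Triton itself matches the pattern (anchoring, case sensitivity) is an assumption here and may differ.

```python
import re
from typing import Dict


def select_forwarded_headers(headers: Dict[str, str], pattern: str) -> Dict[str, str]:
    """Hypothetical helper: keep headers whose name matches `pattern` from its start."""
    compiled = re.compile(pattern)
    return {name: value for name, value in headers.items() if compiled.match(name)}


incoming = {
    "header_divisor": "3",               # custom header carrying the prefix
    "content-type": "application/json",  # standard header, no prefix
    "x-trace-id": "abc",                 # custom header without the prefix
}

forwarded = select_forwarded_headers(incoming, "header.*")
print(forwarded)  # {'header_divisor': '3'}
```

Only header_divisor survives the filter, which mirrors the rule that headers without the configured prefix never reach the inference callable.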

Usage

Define the inference callable (this one uses one parameter and one header):

import numpy as np
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton, TritonConfig

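def _infer_with_params_and_headers(requests):
    # Undecorated callable: receives a list of Request objects, returns one dict per request.
    responses = []
    for req in requests:
        # Inputs in the order they were bound to the model.
        a_batch, b_batch = req.values()
        # Headers and parameters share one dictionary; cast to float before use.
        scaled_add_batch = (a_batch + b_batch) / float(req.parameters["header_divisor"])
        scaled_sub_batch = (a_batch - b_batch) * float(req.parameters["parameter_multiplier"])
        responses.append({"scaled_add": scaled_add_batch, "scaled_sub": scaled_sub_batch})
    return responses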

Bind the inference callable to Triton ("header" is the prefix for custom headers):

triton = Triton(config=TritonConfig(http_header_forward_pattern="header.*"))
triton.bind(
    model_name="ParamsAndHeaders",
    infer_func=_infer_with_params_and_headers,
    inputs=[
        Tensor(dtype=np.float32, shape=(-1,)),
        Tensor(dtype=np.float32, shape=(-1,)),
    ],
    outputs=[
        Tensor(name="scaled_add", dtype=np.float32, shape=(-1,)),
        Tensor(name="scaled_sub", dtype=np.float32, shape=(-1,)),
    ],
    config=ModelConfig(max_batch_size=128),
)

triton.run()

Call the model using ModelClient:

import numpy as np
from pytriton.client import ModelClient

batch_size = 2
a_batch = np.ones((batch_size, 1), dtype=np.float32) * 2
b_batch = np.ones((batch_size, 1), dtype=np.float32)

with ModelClient("localhost", "ParamsAndHeaders") as client:
    result_batch = client.infer_batch(
        a_batch,
        b_batch,
        parameters={"parameter_multiplier": 2},
        headers={"header_divisor": 3},
    )
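As a sanity check, the arithmetic for a single element of this call can be worked through in plain Python. This is a sketch of the expected values under the inputs above (every element of a_batch is 2, every element of b_batch is 1); it does not contact a server, and it assumes the header value reaches the callable as the string "3".

```python
a, b = 2.0, 1.0            # one element from a_batch and from b_batch
header_divisor = "3"       # assumed to arrive as a string
parameter_multiplier = 2   # sent in the request body

scaled_add = (a + b) / float(header_divisor)        # (2 + 1) / 3
scaled_sub = (a - b) * float(parameter_multiplier)  # (2 - 1) * 2

print(scaled_add, scaled_sub)  # 1.0 2.0
```

If the server behaves as described, result_batch should hold these values for every element under the output names "scaled_add" and "scaled_sub".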