There is no one-to-one match between our solution and Triton Inference Server features, particularly when it comes to supporting a user model store.
Running multiple scripts hosting PyTriton on the same machine or container is not feasible.
Deadlocks may occur in some models that use the NCCL communication library when multiple Inference Callables are triggered concurrently. This issue can be observed when deploying multiple instances of the same model or multiple models within a single server script, as in the sketch below. Additional information about this issue can be found here.
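As a minimal sketch of the deployment pattern that can trigger this, the example below binds two models in a single PyTriton server script; the model names and the placeholder inference callable are hypothetical, and the actual deadlock would only appear with NCCL-backed models invoked concurrently:

```python
import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


@batch
def identity_fn(input):
    # Placeholder inference callable: echoes the input batch back.
    # A real NCCL-based model would run distributed inference here.
    return {"output": input}


with Triton() as triton:
    # Binding multiple models (or multiple instances of the same model)
    # in one server script is the scenario where concurrent Inference
    # Callables may deadlock when NCCL is involved.
    for name in ("ModelA", "ModelB"):
        triton.bind(
            model_name=name,
            infer_func=identity_fn,
            inputs=[Tensor(name="input", dtype=np.float32, shape=(-1,))],
            outputs=[Tensor(name="output", dtype=np.float32, shape=(-1,))],
            config=ModelConfig(max_batch_size=16),
        )
    triton.serve()
```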
Enabling verbose logging may cause a significant performance drop in model inference.
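Verbose logging is typically enabled through the server configuration. A minimal sketch, assuming the log_verbose option of TritonConfig controls the server's verbose log level:

```python
from pytriton.triton import Triton, TritonConfig

# Verbose logs help with debugging, but expect a noticeable inference slowdown.
config = TritonConfig(log_verbose=1)

with Triton(config=config) as triton:
    # ... bind models here, as in the example above ...
    triton.serve()
```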
GRPC ModelClient doesn't support timeouts for model configuration and model metadata requests due to a limitation in the underlying tritonclient library.
HTTP ModelClient may not respect the specified timeouts for model initialization and inference requests, especially when they are shorter than 1 second, which can result in longer waiting times. This issue stems from the underlying implementation of the HTTP protocol.
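For both client limitations above, the timeouts in question are the ones passed when the client is constructed. A minimal sketch, assuming init_timeout_s and inference_timeout_s constructor arguments; the endpoint, model name, and timeout values are illustrative:

```python
import numpy as np

from pytriton.client import ModelClient

# Hypothetical endpoint, model name, and timeout values.
# Sub-second timeouts (e.g. 0.5 s) are the ones the HTTP client may not honor precisely.
with ModelClient(
    "http://localhost:8000",
    "ModelA",
    init_timeout_s=60.0,
    inference_timeout_s=0.5,
) as client:
    result = client.infer_sample(input=np.array([1.0, 2.0], dtype=np.float32))
```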