Feature Disparity: PyTriton does not offer full feature parity with Triton Inference Server; in particular, it does not support a user model store.
Concurrent Execution: Running multiple scripts hosting PyTriton on the same machine or container is not supported.
Stability and Performance Issues
NCCL Deadlocks: When using the NCCL communication library, deadlocks may occur if multiple Inference Callables are triggered concurrently (such as when deploying multiple instances of the same model or multiple models within a single server script). For more details, see the NCCL documentation.
Logging Performance Impact: Enabling verbose logging can significantly reduce model inference performance.
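One way to keep Inference Callables from overlapping is to run them under a shared lock. The sketch below is a minimal illustration, not a documented PyTriton fix: it assumes the callables execute as threads inside one Python process, and the names `_NCCL_LOCK`, `serialized`, and `infer_fn` are hypothetical. Serializing calls trades throughput for safety.

```python
import threading
from functools import wraps

# Hypothetical process-wide lock shared by every inference callable.
# Assumption: all callables run as threads in the same Python process.
_NCCL_LOCK = threading.Lock()


def serialized(fn):
    """Run the wrapped callable under the shared lock so calls never overlap."""

    @wraps(fn)
    def wrapper(*args, **kwargs):
        with _NCCL_LOCK:
            return fn(*args, **kwargs)

    return wrapper


@serialized
def infer_fn(**inputs):
    # Placeholder body: a real callable would run a model that issues
    # NCCL collectives here. This stub simply echoes its inputs.
    return inputs
```

Applying the same decorator to every callable bound in a server script makes them take turns, whether they belong to one model with several instances or to different models.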
Client Limitations
GRPC Timeout Support: The GRPC ModelClient does not support timeouts for model configuration and model metadata requests due to limitations in the underlying tritonclient library.
HTTP Timeout Handling: The HTTP ModelClient may not correctly respect specified timeouts for model initialization and inference requests, especially for timeouts under 1 second. This is caused by the underlying HTTP protocol implementation.
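Where a short deadline matters, the caller can enforce it on its own side instead of relying on the client's timeout argument. The following is a sketch using only the Python standard library; `call_with_deadline` is a hypothetical helper, not part of PyTriton or tritonclient. It stops waiting after the deadline but does not cancel the underlying request, which keeps running in a background thread.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_deadline(fn, timeout_s, *args, **kwargs):
    """Call fn(*args, **kwargs) and raise TimeoutError after timeout_s seconds.

    Hypothetical helper: the blocking call is not interrupted, the caller
    only stops waiting for its result.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        raise TimeoutError(f"call exceeded {timeout_s} s") from None
    finally:
        # Do not block on the worker thread; it finishes in the background.
        pool.shutdown(wait=False)
```

Any blocking client call can be passed as `fn`, for example a bound inference method of a client object, with the request data passed as the remaining arguments.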
Benign Issues
False Error Messages: Triton logs may contain a spurious error message: Failed to set config modification time: model_config_content_name_ is empty. This message can be safely ignored.