Changelog

0.13.1

0.13.0

  • new: Introduced custom_args in TensorConfig for use by custom runners, which allows setting up dynamic shapes for TorchTensorRT compilation
  • new: autocast_dtype added to Torch runner configuration to set the dtype for autocast
  • new: New version of ONNX Runtime 1.20 for Python version >= 3.10
  • new: Use torch.compile path in heuristic search for max batch size
  • change: Removed TensorFlow dependencies for nav.jax.optimize
  • change: Removed PyTorch dependencies from nav.profile
  • change: Collect all Python packages in status instead of filtered list
  • change: Use default throughput cutoff threshold for max batch size heuristic when None provided in configuration
  • change: Updated default ONNX opset to 20 for Torch >= 2.5
  • fix: Exception is raised with Python >=3.11 due to wrong dataclass initialization
  • fix: Removed option from ExportOption that was removed in Torch 2.5
  • fix: Improved preprocessing stage in Torch based runners
  • fix: Warn when using autocast with bfloat16 in Torch
  • fix: Pass runner configuration to runners in nav.profile

  • Version of external components used during testing:

0.12.0

  • new: simple and detailed reporting of the optimization process
  • new: adjusted exporting TensorFlow SavedModel for Keras 3.x
  • new: inform user when a wrapped module is not called during optimize
  • new: inform user when module uses a custom forward function
  • new: support for dynamic shapes in Torch ExportedProgram
  • new: use ExportedProgram for Torch-TensorRT conversion
  • new: support back-off policy during profiling to avoid reporting local minimum
  • new: automatically scale conversion batch size when modules have different batch sizes in scope of a single pipeline
  • change: TensorRT conversion max batch size search relies on saturating throughput for base formats
  • change: adjusted profiling configuration for throughput cutoff search
  • change: include optimized pipeline to list of examined variants during nav.profile
  • change: performance command is not executed when correctness failed for the format and runtime
  • change: verify command is not executed when verify function is not provided
  • change: do not create a model copy before executing torch.compile
  • fix: pipelines sometimes obtain model and tensors on different devices during nav.profile
  • fix: extract graph from ExportedProgram for running inference
  • fix: runner configuration not propagated to pre-processing steps
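
The throughput-saturation idea behind the max batch size search mentioned in the entries above can be illustrated with a minimal sketch. This is not Model Navigator's implementation: the function find_max_batch_size, its parameter names, the doubling schedule, and the 5% default threshold are all assumptions made for illustration.

```python
from typing import Callable


def find_max_batch_size(
    measure_throughput: Callable[[int], float],
    max_batch_size: int = 1024,
    throughput_cutoff_threshold: float = 0.05,  # hypothetical default
) -> int:
    """Double the batch size until the relative throughput gain drops below the threshold."""
    best_batch_size = 1
    best_throughput = measure_throughput(best_batch_size)
    batch_size = 2
    while batch_size <= max_batch_size:
        throughput = measure_throughput(batch_size)
        gain = (throughput - best_throughput) / best_throughput
        if gain < throughput_cutoff_threshold:
            break  # throughput has saturated; a larger batch only adds latency
        best_batch_size, best_throughput = batch_size, throughput
        batch_size *= 2
    return best_batch_size


def fake_throughput(batch_size: int) -> float:
    # Synthetic curve: grows linearly, then saturates at 800 samples/s.
    return min(100.0 * batch_size, 800.0)


result = find_max_batch_size(fake_throughput)  # 8 for this synthetic curve
```

In practice measure_throughput would run the model on real samples; the synthetic curve here only makes the stopping rule easy to follow.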

  • Version of external components used during testing:

0.11.0

  • new: Python 3.12 support
  • new: Improved logging
  • new: optimized in-place module can be stored to Triton model repository
  • new: multi-profile support for TensorRT model build and runtime
  • new: measure duration of each command executed in optimization pipeline
  • new: TensorRT-LLM model store generation for deployment on Triton Inference Server
  • change: filter unsupported runners instead of raising an error when running optimize
  • change: moved JAX support to experimental module with limited support
  • change: use autocast=True for Torch based runners
  • change: use torch.inference_mode or torch.no_grad context in nav.profile measurements
  • change: use multiple strategies to select optimized runtime, defaults to [MaxThroughputAndMinLatencyStrategy, MinLatencyStrategy]
  • change: trt_profiles are not set automatically for module when using nav.optimize
  • fix: properly revert log level after torch onnx dynamo export

  • Version of external components used during testing:

0.10.1

0.10.0

  • new: inplace nav.Module accepts a batching flag, which overrides a config setting, and a precision, which allows setting the appropriate configuration for TensorRT
  • new: Allow to set device when loading optimized modules using nav.load_optimized()
  • new: Add support for custom i/o names and dynamic shapes in Torch ONNX Dynamo path
  • new: Added nav.bundle.save and nav.bundle.load to save and load optimized models from cache
  • change: Improved optimize and profile status in inplace mode
  • change: Improved handling defaults for ONNX Dynamo when executing nav.package.optimize
  • fix: Maintaining modules device in nav.profile()
  • fix: Add support for all precisions for TensorRT in nav.profile()
  • fix: Forward method not passed to other inplace modules.

  • Version of external components used during testing:

0.9.0

  • new: TensorRT Timing Tactics Cache Management - using timing tactics cache files for optimization performance improvements
  • new: Added throughput saturation verification in nav.profile() (enabled by default)
  • new: Allow to override Inplace cache dir through MODEL_NAVIGATOR_DEFAULT_CACHE_DIR env variable
  • new: inplace nav.Module can now receive a function name to be used instead of __call__ in modules/submodules, which allows customizing modules with non-standard calls
  • fix: torch dynamo export and torch dynamo onnx export
  • fix: measurement stabilization in nav.profile()
  • fix: inplace inference through Torch
  • fix: trt_profiles argument handling in ONNX to TRT conversion
  • fix: optimal shape configuration for batch size in Inplace API
  • change: Disable TensorRT profile builder
  • change: nav.optimize() does not override module configuration
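
As a small illustration of the MODEL_NAVIGATOR_DEFAULT_CACHE_DIR entry above, the variable can be set from Python with the standard library. The directory name is hypothetical, and setting the variable before importing the library is an assumption rather than documented behaviour.

```python
import os
import tempfile

# Hypothetical cache location; any writable directory would do.
cache_dir = os.path.join(tempfile.gettempdir(), "model_navigator_cache")
os.makedirs(cache_dir, exist_ok=True)

# The variable name comes from the changelog entry above.
os.environ["MODEL_NAVIGATOR_DEFAULT_CACHE_DIR"] = cache_dir

# Assumption: the library reads the variable at import or first use,
# so it is set before the import below would run.
# import model_navigator as nav
```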

  • Version of external components used during testing:

0.8.1

0.8.0

0.7.7

0.7.6

0.7.5

  • new: FP8 precision support for TensorRT
  • new: Support for autocast and inference mode configuration for Torch runners
  • new: Allow to select device for Torch and ONNX runners
  • new: Add support for default_model_filename in Triton model configuration
  • new: Detailed profiling of inference steps (pre- and postprocessing, memcpy and compute)
  • fix: JAX export and TensorRT conversion fails when custom workspace is used
  • fix: Missing max workspace size passed to TensorRT conversion
  • fix: Execution of TensorRT optimize raises error during handling output metadata
  • fix: Limited Polygraphy version to work correctly with onnxruntime-gpu package

  • Version of external components used during testing:

0.7.4

  • new: decoupled mode configuration in Triton Model Config
  • new: support for PyTorch ExportedProgram and ONNX dynamo export
  • new: added GraphSurgeon ONNX optimization
  • fix: compatibility of generating PyTriton model config through adapter
  • fix: installation of packages that are platform dependent
  • fix: update package config with model loaded from source
  • change: in TensorRT runner, when TensorType.TORCH is the return type, lazily convert tensor to Torch
  • change: move from Polygraphy CLI to Polygraphy Python API
  • change: removed Windows from support list

  • Version of external components used during testing:

0.7.3

  • new: Data dependent dynamic control flow support in nav.Module (multiple computation graphs per module)
  • new: Added find max batch size utility
  • new: Added utilities API documentation
  • new: Add Timer class for measuring execution time of models and Inplace modules.
  • fix: Use wide range of shapes for TensorRT conversion
  • fix: Sorting of samples loaded from workspace
  • change: in Inplace, store one sample by default per module and store shape info for all samples
  • change: always execute export for all supported formats

  • Known issues and limitations:

    • nav.Module moves original torch.nn.Module to the CPU, in case of weight sharing that might result in unexpected behaviour
    • For data dependent dynamic control flow (multiple computation graphs) nav.Module might copy the weights for each separate graph
  • Version of external components used during testing:

0.7.2

0.7.1

0.7.0

  • new: Inplace Optimize feature - optimize models directly in the Python code
  • new: Non-tensor inputs and outputs support
  • new: Model warmup support in Triton model configuration
  • new: nav.tensorrt.optimize api added for testing and measuring performance of TensorRT models
  • new: Extended custom configs to pass arguments directly to export and conversion operations like torch.onnx.export or polygraphy convert
  • new: Collect GPU clock during model profiling
  • new: Add option to configure minimal trials and stabilization windows for performance verification and profiling
  • change: Navigator package version changed to 0.2.3. Custom configurations now use trt_profiles list instead of a single value
  • change: Store separate reproduction scripts for runners used during correctness and profiling

  • Version of external components used during testing:

0.6.3

0.6.2

0.6.1

0.6.0

  • new: Zero-copy runners for Torch, ONNX and TensorRT - omit H2D and D2H memory copy between runners execution
  • new: nav.package.profile API method to profile generated models on provided dataloader
  • change: ProfilerConfig replaced with OptimizationProfile:
    • new: OptimizationProfile impacts the conversion for TensorRT
    • new: batch_sizes and max_batch_size limit the max profile in TensorRT conversion
    • new: Allow to provide separate dataloader for profiling - first sample used only
  • new: allow to run nav.package.optimize on empty package - status generation only
  • new: use torch.inference_mode for inference runner when PyTorch 2.x is available
  • fix: Missing model in config when passing package generated during nav.{framework}.optimize directly to nav.package.optimize command
  • Other minor fixes and improvements

  • Version of external components used during testing:

0.5.6

0.5.5

0.5.4

  • new: Custom implementation for ONNX and TensorRT runners
  • new: Use CUDA 12 for JAX in unit tests and functional tests
  • new: Step-by-step examples
  • new: Updated documentation
  • new: TensorRTCUDAGraph runner introduced with support for CUDA graphs
  • fix: Optimal shape not set correctly during adaptive conversion
  • fix: Find max batch size command for JAX
  • fix: Save stdout to logfiles in debug mode

  • Version of external components used during testing:

0.5.3

0.5.2

0.5.1

0.5.0

  • new: Support for PyTriton deployment
  • new: Support for Python models with python.optimize API
  • new: PyTorch 2 compile CPU and CUDA runners
  • new: Collect conversion max batch size in status
  • new: PyTorch runners with compile support
  • change: Improved handling CUDA and CPU runners
  • change: Reduced finding device max batch size time by running it once as separate pipeline
  • change: Stored find max batch size result in separate field in status

  • Version of external components used during testing:

0.4.4

0.4.3

0.4.2

0.4.1

0.4.0

  • new: optimize method that replaces export and performs max batch size search and improved profiling during process
  • new: Introduced custom configs in optimize for better parametrization of export/conversion commands
  • new: Support for adding user runners for model correctness and profiling
  • new: Search for max possible batch size per format during conversion and profiling
  • new: API for creating Triton model store from Navigator Package and user provided models
  • change: Improved status structure for Navigator Package
  • deprecated: Optimize for Triton Inference Server support
  • deprecated: HuggingFace contrib module
  • Bug fixes and other improvements

  • Version of external components used during testing:

0.3.8

0.3.7

0.3.6

  • Updated NVIDIA containers defaults to 22.09
  • Model Navigator Export API:

    • new: cast int64 input data to int32 in runner for Torch-TensorRT
    • new: cast 64-bit data samples to 32-bit values for TensorRT
    • new: verbose flag for logging export and conversion commands to console
    • new: debug flag to enable debug mode for export and conversion commands
    • change: logs from commands are streamed to console during command run
    • change: package load omits the log files and autogenerated scripts
  • Version of external components used during testing:

0.3.5

  • Updated NVIDIA containers defaults to 22.08
  • Model Navigator Export API:

    • new: TRTExec runner uses use_cuda_graph=True by default
    • new: log warning instead of raising error when dataloader dumps inputs with nan or inf values
    • new: enabled logging for command input parameters
    • fix: invalid use of Polygraphy TRT profile when trt_dynamic_axes is passed to export function
  • Version of external components used during testing:

0.3.4

  • Updated NVIDIA containers defaults to 22.07
  • Model Navigator OTIS:
    • deprecated: TF32 precision for TensorRT from CLI options - will be removed in future versions
    • fix: TensorFlow module was imported when obtaining model signature during conversion
  • Model Navigator Export API:

    • new: Support for building framework containers with Model Navigator installed
    • new: Example for loading Navigator Package for reproducing the results
    • new: Create reproducing script for correctness and performance steps
    • new: TrtexecRunner for correctness and performance tests with trtexec tool
    • new: Use TF32 support by default for models with FP32 precision
    • new: Reset conversion parameters to defaults when using load for package
    • new: Testing all options for JAX export enable_xla and jit_compile parameters
    • change: Profiling stability improvements
    • change: Rename of onnx_runtimes export function parameters to runtimes
    • deprecated: TF32 precision for TensorRT from available options - will be removed in future versions
    • fix: Do not save TF-TRT models to the .nav package
    • fix: Do not save TF-TRT models from the .nav package
    • fix: Correctly load .nav packages when _input_names or _output_names specified
    • fix: Adjust TF and TF-TRT model signatures to match input_names
    • fix: Save ONNX opset for CLI configuration inside package
    • fix: Reproduction scripts were missing for failing paths
  • Version of external components used during testing:

0.3.3

  • Model Navigator Export API:

    • new: Improved handling inputs and outputs metadata
    • new: Navigator Package version updated to 0.1.3
    • new: Backward compatibility with previous versions of Navigator Package
    • fix: Dynamic shapes for output shapes were read incorrectly
  • Version of external components used during testing:

0.3.2

  • Updated NVIDIA containers defaults to 22.06
  • Model Navigator OTIS:
    • new: Perf Analyzer profiling data uses base64 format for content
    • fix: Signature for TensorRT model when it has uint64 or int64 inputs and/or outputs defined
  • Model Navigator Export API:

    • new: Updated navigator package format to 0.1.1
    • new: Added Model Navigator version to status file
    • new: Add atol and rtol configuration to CLI config for model
    • new: Added experimental support for JAX models
    • new: In case of export or conversion failures prepare minimal scripts to reproduce errors
    • fix: Conversion parameters are not stored in Navigator Package for CLI execution
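
The atol and rtol settings mentioned above are absolute and relative tolerances for comparing model outputs. A minimal sketch of such a check follows, assuming the common NumPy-style rule |candidate - reference| <= atol + rtol * |reference|. The function outputs_match and its defaults are hypothetical and not part of Model Navigator.

```python
from typing import Sequence


def outputs_match(
    reference: Sequence[float],
    candidate: Sequence[float],
    atol: float = 1e-5,  # hypothetical default
    rtol: float = 1e-5,  # hypothetical default
) -> bool:
    """Return True when every candidate value is within tolerance of its reference value."""
    if len(reference) != len(candidate):
        return False
    return all(
        abs(c - r) <= atol + rtol * abs(r)
        for r, c in zip(reference, candidate)
    )


# A converted model with small numeric drift passes; a large deviation fails
# unless the absolute tolerance is loosened.
close_enough = outputs_match([1.0, 2.0], [1.000001, 2.0])
too_far = outputs_match([1.0], [1.1])
loosened = outputs_match([1.0], [1.1], atol=0.2)
```

Reduced-precision formats typically need looser tolerances than FP32, which is one reason to make both values configurable.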
  • Version of external components used during testing:

0.3.1

  • Updated NVIDIA containers defaults to 22.05
  • Model Navigator OTIS:
    • fix: Saving paths inside the Triton package status file
    • fix: Empty list of gpus causes the process to run on CPU only
    • fix: Reading content from zipped Navigator Package
    • fix: When no GPU is available or target device is set to CPU, optimize avoids running unsupported conversions in CLI
    • new: Converter accepts target device kind to select CPU or GPU supported conversions
    • new: Added support for OpenVINO accelerator for ONNXRuntime
    • new: Added option --config-search-early-exit-enable for Model Analyzer early exit support in manual profiling mode
    • new: Added option --model-config-name to the select command. It allows to pick a particular model configuration for deployment from the set of all configurations generated by Triton Model Analyzer, even if it's not the best performing one.
    • removed: The --tensorrt-strict-types option has been removed due to deprecation of the functionality in upstream libraries.
  • Model Navigator Export API:

    • new: Added dynamic shapes support and trt dynamic shapes support for TensorFlow2 export
    • new: Improved per format logging
    • new: PyTorch to Torch-TRT precision selection added
    • new: Advanced profiling (measurement windows, configurable batch sizes)
  • Version of external components used during testing:

0.3.0

  • Updated NVIDIA containers defaults to 22.04
  • Model Navigator Export API
    • Support for exporting models from TensorFlow2 and PyTorch source code to supported target formats
    • Support for conversion from ONNX to supported target formats
    • Support for exporting HuggingFace models
    • Conversion, Correctness and performance tests for exported models
    • Definition of package structure for storing all exported models and additional metadata
  • Model Navigator OTIS:
    • change: run command has been deprecated and may be removed in a future release
    • new: optimize command replaces run and produces an output *.triton.nav package
    • new: select selects the best-performing configuration from *.triton.nav package and creates a Triton Inference Server model repository
    • new: Added support for using shared memory option for Perf Analyzer
  • Remove wkhtmltopdf package dependency

  • Version of external components used during testing:

0.2.7

  • Updated NVIDIA containers defaults to 22.02
  • Removed support for Python 3.7
  • Triton Model configuration related:
    • Support dynamic batching without setting preferred batch size value
  • Profiling related:

    • Deprecated --config-search-max-preferred-batch-size flag as it is no longer supported in Triton Model Analyzer
  • Version of external components used during testing:

0.2.6

  • Updated NVIDIA containers defaults to 22.01
  • Removed support for Python 3.6 due to EOL
  • Conversion related:
    • Added support for Torch-TensorRT conversion
  • Fixes and improvements

    • Processes inside containers started by Model Navigator now run without root privileges
    • Fix for volume mounts while running Triton Inference Server in container from other container
    • Fix for conversion of models without file extension on input and output paths
    • Fix using --model-format argument when input and output files have no extension
  • Version of external components used during testing:

  • Known issues and limitations

    • missing support for stateful models (ex. time-series one)
    • no verification of conversion results for conversions: TF -> ONNX, TF->TF-TRT, TorchScript -> ONNX
    • possible to define a single profile for TensorRT
    • no custom ops support
    • Triton Inference Server stays in the background when the profile process is interrupted by the user
    • TF-TRT conversion loses output shapes info

0.2.5

  • Updated NVIDIA containers defaults to 21.12
  • Conversion related:
    • [Experimental] TF-TRT - fixed default dataset profile generation
  • Configuration Model on Triton related

    • Fixed name for onnxruntime backend in Triton model deployment configuration
  • Version of external components used during testing:

  • Known issues and limitations

    • missing support for stateful models (ex. time-series one)
    • no verification of conversion results for conversions: TF -> ONNX, TF->TF-TRT, TorchScript -> ONNX
    • possible to define a single profile for TensorRT
    • no custom ops support
    • Triton Inference Server stays in the background when the profile process is interrupted by the user
    • TF-TRT conversion loses output shapes info

0.2.4

  • Updated NVIDIA containers defaults to 21.10
  • Fixed generating profiling data when dtypes are not passed
  • Conversion related:
    • [Experimental] Added support for TF-TRT conversion
  • Configuration Model on Triton related
    • Added possibility to select batching mode - default, dynamic and disabled options supported
  • Install dependencies from pip packages instead of wheels for Polygraphy and Triton Model Analyzer
  • fixes and improvements

  • Version of external components used during testing:

  • Known issues and limitations

    • missing support for stateful models (ex. time-series one)
    • no verification of conversion results for conversions: TF -> ONNX, TF->TF-TRT, TorchScript -> ONNX
    • possible to define a single profile for TensorRT
    • no custom ops support
    • Triton Inference Server stays in the background when the profile process is interrupted by the user
    • TF-TRT conversion loses output shapes info

0.2.3

  • Updated NVIDIA containers defaults to 21.09
  • Improved naming of arguments specific for TensorRT conversion and acceleration with backward compatibility
  • Use pip package for Triton Model Analyzer installation with minimal version 1.8.0
  • Fixed model_repository path to not be relative to <navigator_workspace> dir
  • Handle exit codes correctly from CLI commands
  • Support for using device ids in the --gpus argument
  • Conversion related
    • Added support for precision modes to support multiple precisions during conversion to TensorRT
    • Added --tensorrt-sparse-weights flag for sparse weight optimization for TensorRT
    • Added --tensorrt-strict-types flag forcing it to choose tactics based on the layer precision for TensorRT
    • Added --tensorrt-explicit-precision flag enabling explicit precision mode
    • Fixed nan values appearing in relative tolerance during conversion to TensorRT
  • Configuration Model on Triton related
    • Removed default value for engine_count_per_device
    • Added possibility to define Triton Custom Backend parameters with triton_backend_parameters command
    • Added possibility to define max workspace size for TensorRT backend accelerator using argument tensorrt_max_workspace_size
  • Profiling related
    • Added config_search prefix to all profiling parameters (BREAKING CHANGE)
    • Added config_search_max_preferred_batch_size parameter
    • Added config_search_backend_parameters parameter
  • fixes and improvements

  • Versions of used external components:

  • Known issues and limitations

    • missing support for stateful models (ex. time-series one)
    • missing support for models without batching support
    • no verification of conversion results for conversions: TF -> ONNX, TorchScript -> ONNX
    • possible to define a single profile for TensorRT

0.2.2

  • Updated NVIDIA containers defaults to 21.08

  • Versions of used external components:

  • Known issues and limitations

    • missing support for stateful models (ex. time-series one)
    • missing support for models without batching support
    • no verification of conversion results for conversions: TF -> ONNX, TorchScript -> ONNX
    • possible to define a single profile for TensorRT

0.2.1

  • Fixed triton-model-config error when tensorrt_capture_cuda_graph flag is not passed
  • Dump Conversion Comparator inputs and outputs into JSON files
  • Added information in logs on the tolerance parameters values to pass the conversion verification
  • Use count_windows mode as default option for Perf Analyzer
  • Added possibility to define custom docker images
  • Bugfixes

  • Versions of used external components:

  • Known issues and limitations

    • missing support for stateful models (ex. time-series one)
    • missing support for models without batching support
    • no verification of conversion results for conversions: TF -> ONNX, TorchScript -> ONNX
    • possible to define a single profile for TensorRT
    • TensorRT backend acceleration not supported for ONNX Runtime in Triton Inference Server ver. 21.07

0.2.0

  • comprehensive refactor of command-line API in order to provide more gradual pipeline steps execution

  • Versions of used external components:

    • Triton Model Analyzer: 21.05
    • tf2onnx: v1.8.5 (support for ONNX opset 13, tf 1.15 and 2.5)
    • Other component versions depend on the used framework and Triton Inference Server containers versions. See its support matrix for a detailed summary.
  • Known issues and limitations

    • missing support for stateful models (ex. time-series one)
    • missing support for models without batching support
    • no verification of conversion results for conversions: TF -> ONNX, TorchScript -> ONNX
    • issues with TorchScript -> ONNX conversion due to issue in PyTorch 1.8
      • affected NVIDIA PyTorch containers: 20.12, 21.02, 21.03
      • workaround: use PyTorch containers newer than 21.03
    • possible to define a single profile for TensorRT

0.1.1

  • documentation update

0.1.0

  • Release of main components:

    • Model Converter - converts the model to a set of variants optimized for inference or to be later optimized by Triton Inference Server backend.
    • Model Repo Builder - sets up Triton Inference Server Model Repository, including its configuration.
    • Model Analyzer - selects optimal Triton Inference Server configuration based on the model's compute and memory requirements, available computation infrastructure, and model application constraints.
    • Helm Chart Generator - deploy Triton Inference Server and model with optimal configuration to cloud.
  • Versions of used external components:

    • Triton Model Analyzer: 21.03+616e8a30
    • tf2onnx: v1.8.4 (support for ONNX opset 13, tf 1.15 and 2.4)
    • Other component versions depend on the used framework and Triton Inference Server containers versions. Refer to its support matrix for a detailed summary.
  • Known issues

    • missing support for stateful models (ex. time-series one)
    • missing support for models without batching support
    • no verification of conversion results for conversions: TF -> ONNX, TorchScript -> ONNX
    • issues with TorchScript -> ONNX conversion due to issue in PyTorch 1.8
      • affected NVIDIA PyTorch containers: 20.12, 21.03
      • workaround: use containers different from above
    • Triton Inference Server stays in the background when the profile process is interrupted by the user