You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This commit was created on GitHub.com and signed with GitHub’s verified signature.
Release notes:
New API
Graph Slice Operation: Introduced the graph.slice operation for slicing input tensors. Refer to docs/operations/Slice.md for detailed documentation and samples/cpp/misc/slice.cpp for a C++ sample. Pybinds for this operation have also been added.
SM Carveout Feature: Added the set_sm_count(int32_t type) graph property to support the SM Carveout feature introduced in Ampere and Hopper GPUs. Engines that do not support SM_COUNT will return NOT_SUPPORTED.
Bug Fixes
Convolution Mode Attribute: Added the missing set_convolution_mode attribute to convolution attributes in forward propagation (fprop), data gradient (dgrad), and weight gradient (wgrad). Previously, this was hardcoded to CUDNN_CROSS_CORRELATION in the 1.x API.
SDPA FP8 Backward Node: Fixed an issue with the deserialization of the sdpa_fp8_backward node.
Enhancements
Graph Execution Overhead: Reduced the overhead of graph.execute() by optimizing sub-node tree traversal, collected UIDs, workspace modifications, and workspace size.
Graph Validation Performance: Significantly improved (~10x) the performance of graph.validate() by deferring graph expansion to a later stage (build_operation_graph).
Optional Running Stats for BatchNorm: Made the running statistics for the batch normalization operation optional, supported by cuDNN backend version 9.3.0 and later.
Shape and Stride Inferencing: Enhanced shape and stride inferencing to preserve the stride order of the input.
Diagnostic Error Message: Added a diagnostic error message to create_execution_plans if called without the preceding build_operation_graph.
JSON Schema and Deserialization: Improved the JSON schema and deserialization logic with additional checks.
Logging Overhead: Reduced logging overhead, resulting in faster graph.build() calls.
CMake Integration: Replaced CMAKE_SOURCE_DIR with PROJECT_SOURCE_DIR in CMake files for better integration. See the relevant pull request for more details.
Samples
Jupyter Notebooks: Added Jupyter notebooks for RMSNorm, InstanceNorm, and LayerNorm. Refer to the samples/python folder for more information.