# cudnn frontend v1.8 release
## New API

### Paged Attention API
The SDPA forward operation now supports paged attention on cudnn 9.5.0 and later by setting the appropriate page table descriptors. `SDPA_attributes` now accepts `set_paged_attention_k_table` and `set_paged_attention_v_table` to pass in these descriptors. Please refer to the samples for usage: cpp samples, python samples. See the docs for more API details. Paged attention allows for more efficient memory usage by storing the K/V caches in non-contiguous memory and using page tables to reconstruct them. For more information, refer to the cudnn_graph Library documentation and the Paged Attention paper.
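The page-table indirection described above can be illustrated with a small, self-contained sketch. This is a conceptual model only, not the cudnn API: the page pool layout, page size, and function names below are illustrative assumptions.

```python
# Conceptual sketch (NOT the cudnn API): reconstructing a logically
# contiguous K (or V) cache from non-contiguous pages via a page table.
# Page size, pool layout, and names here are illustrative assumptions.

PAGE_SIZE = 2  # tokens per page (assumed for illustration)

# Physical page pool: pages live at arbitrary, non-contiguous locations.
# Each page holds PAGE_SIZE token "vectors" (tiny int lists for clarity).
page_pool = {
    7: [[10, 11], [12, 13]],  # physical page 7
    3: [[20, 21], [22, 23]],  # physical page 3
    0: [[30, 31], [32, 33]],  # physical page 0
}

# Page table for one sequence: logical page index -> physical page id.
page_table = [3, 0, 7]

def gather_cache(page_table, page_pool, seq_len):
    """Gather the logically contiguous cache for one sequence."""
    out = []
    for token in range(seq_len):
        logical_page, offset = divmod(token, PAGE_SIZE)
        physical_page = page_table[logical_page]
        out.append(page_pool[physical_page][offset])
    return out

k_cache = gather_cache(page_table, page_pool, seq_len=6)
```

Only the small page table must be contiguous per sequence; the pages themselves can be allocated and reused freely, which is where the memory savings come from.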
### cuda Graph API
The cudnn graph now allows users to directly build a native CUDA graph for a given sub-graph (requires cudnn 9.5.0). There are two APIs:

- `populate_cuda_graph`: adds the cudnn nodes to the empty CUDA graph provided as input.
- `update_cuda_graph`: updates the populated CUDA graph with the necessary data pointers.

See the docs and backend documentation for more details.
## Enhancements
- Kernel caches for dynamic shapes are now supported in Python. Added a sample to showcase usage.
- `graph.deselect_engines(str: )` now has a Python equivalent through pybind11.
- `graph.tensor(...)` can now accept `int64_t` scalars directly. (Previously limited to `int32_t`, `float`, and `fp16` data types.)
- fp8 SDPA attention now allows a dropout and a padding mask. Requires cudnn 9.5.0 and above.
- More enhancements to pointwise output stride inferencing (for the broadcast operation). For non-unary operands, the broadcast tensor can now be either IN_0 or IN_1.
- The SDPA backward operation now allows d up to 256 on Hopper. Requires cudnn 9.5.0 and above.
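The broadcast stride-inference enhancement above means the smaller (broadcast) operand may sit in either input slot. The sketch below shows the general shape rule in plain Python; it is a conceptual illustration of numpy-style broadcasting, not the cudnn frontend's actual inference code.

```python
def broadcast_shape(a, b):
    """Infer the output shape of an elementwise binary op under
    numpy-style broadcasting. Input order does not matter, i.e. the
    broadcast operand can be in either slot (IN_0 or IN_1)."""
    out = []
    # Walk dimensions from the trailing end, padding the shorter
    # shape with size-1 dimensions.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and 1 not in (da, db):
            raise ValueError(f"incompatible dims {da} and {db}")
        out.append(max(da, db))
    return tuple(reversed(out))

# The broadcast tensor can be either input:
full, vec = (8, 128, 64), (1, 1, 64)
assert broadcast_shape(full, vec) == (8, 128, 64)
assert broadcast_shape(vec, full) == (8, 128, 64)
```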
## Bug fixes
- Fixed an issue while querying `cudnnGetLastErrorString()` from the backend. The `error_t` object will now carry a more meaningful message.
- Fixed build issues seen with the clang-19 compiler.
- Fixed an issue where it was assumed that a graph with a bias in sdpa_bprop will always have a dbias.