# cudnn frontend v1.8 release
## New API

### Paged Attention API
The SDPA forward operation now supports paged attention on cudnn 9.5.0 and later by setting the appropriate page table descriptors. `SDPA_attributes` now accepts `set_paged_attention_k_table` and `set_paged_attention_v_table` to pass in these descriptors. Please refer to the samples for usage: cpp samples, python samples. See the docs for more API details. Paged attention allows for more efficient memory usage by storing the K/V caches in non-contiguous memory and using page tables to reconstruct them. For more information, refer to the cudnn_graph Library documentation and the Paged Attention paper.
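The page-table indirection described above can be illustrated with a small, self-contained sketch. This is a conceptual model only, not the cudnn API: the page pool layout, page size, and function names below are illustrative assumptions.

```python
# Conceptual sketch (NOT the cudnn API): reconstructing a logically
# contiguous K (or V) cache from non-contiguous pages via a page table.
# Page size, pool layout, and names here are illustrative assumptions.

PAGE_SIZE = 2  # tokens per page (assumed for illustration)

# Physical page pool: pages live at arbitrary, non-contiguous locations.
# Each page holds PAGE_SIZE token "vectors" (tiny int lists for clarity).
page_pool = {
    7: [[10, 11], [12, 13]],  # physical page 7
    3: [[20, 21], [22, 23]],  # physical page 3
    0: [[30, 31], [32, 33]],  # physical page 0
}

# Page table for one sequence: logical page index -> physical page id.
page_table = [3, 0, 7]

def gather_cache(page_table, page_pool, seq_len):
    """Gather the logically contiguous cache for one sequence."""
    out = []
    for token in range(seq_len):
        logical_page, offset = divmod(token, PAGE_SIZE)
        physical_page = page_table[logical_page]
        out.append(page_pool[physical_page][offset])
    return out

k_cache = gather_cache(page_table, page_pool, seq_len=6)
```

Only the small page table must be contiguous per sequence; the pages themselves can be allocated and reused freely, which is where the memory savings come from.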
### cuda Graph API
The cudnn graph now allows users to directly build a native CUDA graph for a given sub-graph (requires cudnn 9.5.0). There are two APIs:

- `populate_cuda_graph`: adds the cudnn nodes to the empty CUDA graph provided as input.
- `update_cuda_graph`: updates the populated CUDA graph with the necessary data pointers.

See the docs and backend documentation for more details.
## Enhancements
- Kernel caches for dynamic shapes are now supported in Python. Added a sample to showcase usage.
- `graph.deselect_engines(str: )` now has a Python equivalent through pybind11.
- `graph.tensor(...)` can now accept `int64_t` scalars directly. (Previously limited to `int32_t`, `float`, and `fp16` data types.)
- fp8 SDPA attention now allows a dropout and a padding mask. Requires cudnn 9.5.0 and above.
- More enhancements to pointwise output stride inferencing (for the broadcast operation). For non-unary operands, the broadcast tensor can now be either IN_0 or IN_1.
- The SDPA backward operation now allows d up to 256 on Hopper. Requires cudnn 9.5.0 and above.
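The broadcast stride-inference enhancement above means the smaller (broadcast) operand may sit in either input slot. The sketch below shows the general shape rule in plain Python; it is a conceptual illustration of numpy-style broadcasting, not the cudnn frontend's actual inference code.

```python
def broadcast_shape(a, b):
    """Infer the output shape of an elementwise binary op under
    numpy-style broadcasting. Input order does not matter, i.e. the
    broadcast operand can be in either slot (IN_0 or IN_1)."""
    out = []
    # Walk dimensions from the trailing end, padding the shorter
    # shape with size-1 dimensions.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and 1 not in (da, db):
            raise ValueError(f"incompatible dims {da} and {db}")
        out.append(max(da, db))
    return tuple(reversed(out))

# The broadcast tensor can be either input:
full, vec = (8, 128, 64), (1, 1, 64)
assert broadcast_shape(full, vec) == (8, 128, 64)
assert broadcast_shape(vec, full) == (8, 128, 64)
```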
## Bug fixes
- Fixed an issue while querying `cudnnGetLastErrorString()` from the backend. The `error_t` object will now carry a more meaningful message.
- Fixed build issues seen with the clang-19 compiler.
- Fixed an issue where it was assumed that a graph with a bias in sdpa_bprop will always have a dbias.