Releases: NVIDIA/cudnn-frontend
cudnn FE 1.0 pre-release 3
cudnn prerelease_3:
Improvements over prerelease 2:
[Feature] Added SDPA flash attention backward node.
[Bug fix] Resolved an issue where the computed Alibi slopes were copied to GPU memory on the default stream instead of the user-specified stream in the handle.
[Bug fix] Fixed a Windows compilation error that occurred when pedantic warnings are treated as errors.
[Bug fix] Fixed issue in causal padding where the masked values were `std::numeric_limits<float>::min()` instead of `std::numeric_limits<float>::lowest()`
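The difference between the two constants in the causal padding fix above is easy to miss. `std::numeric_limits<float>::min()` is the smallest positive normalized float, while `lowest()` is the most negative finite float. The following is a minimal sketch of why that matters for a masked score that is later fed to a softmax. The helper names are illustrative and are not part of cudnn_frontend.

```cpp
#include <cmath>
#include <limits>

// Smallest POSITIVE normalized float (about 1.18e-38). Not a large negative number.
inline float smallest_positive() { return std::numeric_limits<float>::min(); }

// Most negative finite float (about -3.4e38). Suitable as a "minus infinity" stand-in.
inline float most_negative() { return std::numeric_limits<float>::lowest(); }

// Unnormalized softmax weight of one score relative to the row maximum.
inline float softmax_weight(float score, float row_max) {
    return std::exp(score - row_max);
}
```

With a row maximum of 1.0f, a score masked with `most_negative()` gets a weight of exactly zero, whereas a score masked with `smallest_positive()` still contributes a weight of roughly exp(-1).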
Under investigation and development:
- We are still working on additional features for SDPA back prop.
- Better error messages and logging
cudnn FE 1.0 pre-release 2
Release Notes:
Improvements over prerelease 1:
[Feature] Added missing python bindings for several pointwise ops.
[Feature] SDPA flash attention feature parity with the backend API.
[Bug fixes] Shape inferencing fixes for dgrad, wgrad where the output dimension cannot be computed deterministically.
Under investigation and development:
- We are still working on additional features for SDPA back prop.
- CPU overhead when using the python bindings is under investigation.
- Better error messages and logging
Miscellaneous updates to the v0.x API:
[Bug fix] Some tests were failing on Ampere GPUs because no plans with 0 size were available. This has been fixed.
[Bug fix] Median of three sampling was incorrectly sorting the results when cudnnFind was used. This has been fixed.
[Feature] Layer Norm API has been added and can be used with the v0.x API.
This release is experimental
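For readers unfamiliar with the median of three sampling mentioned in the v0.x bug fix above, the idea is to time a candidate three times and keep the middle value, which discards one-off outliers. This is a minimal sketch of the technique only. It is not the library's implementation, and `median_of_three` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <array>

// Returns the middle value of three timing samples (for example, in milliseconds).
inline float median_of_three(float a, float b, float c) {
    std::array<float, 3> samples{a, b, c};
    std::sort(samples.begin(), samples.end());  // ascending order
    return samples[1];                          // middle element is the median
}
```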
v1.0-pre-release
cudnn_frontend v1.0 prerelease introduces a new API aimed at simplifying graph construction.
The purpose of this pre-release is to solicit feedback on the new API and gather requests for enhancement.
Please create a GitHub issue for any changes or enhancements you would like to see.
[New API] In the FE v1.0 API, users can describe multiple operations that
form a subgraph through a cudnn_frontend::graph::Graph object.
Unlike the FE v0.x API, users don't need to worry about specifying shapes
and sizes of the intermediate virtual tensors. See README.FE.1.0.md for
more details.
[New Feature] Python bindings for the FE 1.0 API. See the Python API
section in README.md for building the python bindings. Details of the python
API and its kw arguments are in README.FE.1.0.md. Python API samples
are in samples/python/*.py
[Deprecation] The v0.x API is now labelled deprecated and may be removed in v2.0.
Consider moving to the v1.0 API. If there are issues or missing features, please create a
GitHub issue.
v0.9.2
v0.9.1
[Bug Fix] Updated version numbers of the cudnn frontend release.
[Update] Updated the documentation to reflect latest version numbers.
[Update] Readme updated with cmake build instructions.
[Samples] Added a new Batch Norm forward and backward sample.
v0.9
[Enhancement] Added ability to filter by shape of tensors to errata filter.
[Enhancement] Added ability to override the default feature vector in the opGraph manually.
[Enhancement] Added support for CUDNN_POINTWISE_RECIPROCAL pointwise operation.
[Enhancement] Added an option to limit the number of kernels benchmarked in find-plan.
[Bug Fix] Fixed "Scale Bias Conv BNGenstats" test case where the sum and square sum channel dimensions were incorrect.
[Bug Fix] Fixed a compiler error "dereferencing type-punned pointer will break strict-aliasing rules" seen with certain compilers when type-casting floating point alpha/beta to int64_t.
[Bug Fix] Waived "ConvScaleBiasAct_int8 sample" for V100 because of lack of int8 support.
[Samples] Added BF16/FP16/FP8 Flash Attention Fprop/Bprop samples.
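The strict-aliasing error in the v0.9 bug fix above typically appears when a floating point value is reinterpreted through a pointer cast such as `*(int64_t*)&alpha`. A common, well-defined alternative is to copy the bytes with `std::memcpy`. The sketch below assumes a `double` source value and uses hypothetical helper names. It illustrates the general technique and is not the actual change made in cudnn_frontend.

```cpp
#include <cstdint>
#include <cstring>

// Copies the bit pattern of a double into an int64_t without a type-punned pointer.
inline std::int64_t bits_of(double value) {
    static_assert(sizeof(std::int64_t) == sizeof(double), "size mismatch");
    std::int64_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));  // well-defined byte copy
    return bits;
}

// Inverse of bits_of: rebuilds the double from its stored bit pattern.
inline double from_bits(std::int64_t bits) {
    double value = 0.0;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}
```

Compilers generally optimize the `memcpy` into a plain register move, so this form usually costs nothing at runtime.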
v0.8.1
v0.8
[New API] Added support for Reshape operation.
[New API] Added support for DgradDreluBNBwdWeight operation
[Minor Enhancement] Added cudnn frontend enums to simplify Resample operation creation.
[Minor Enhancement] Added alpha and beta values as key for the plan caches.
[Bug Fix] Fixed an error which was causing reference code to fail with segmentation fault.
[Bug Fix] Fixed an issue where stride/padding and dilation values were incorrectly cached for 2d convolutions.
[Bug Fix] Fixed issues where error statuses were not handled correctly during tensor creation.
[Samples] Added a new sample to showcase how an fMHA graph can be programmed through the FE API. This sample contains both fprop and backprop graphs.
[Samples] Added a new sample to showcase the DgradDreluBNBwdWeight operation.
[Samples] Added a modular block which models the fprop of a ResNet residual block.
v0.7.3
Release Notes:
[Enhancement] Added a CUDNN_FRONTEND_VERSION macro to cudnn_frontend.
[Enhancement] Added the inline keyword to the get_plan functions to enable inclusion in multiple compilation units.
[Bug fix] Replaced CUDNN with CUDNN_VERSION, which is the correct macro name.
v0.7.2
Release Notes:
cudnn_frontend v0.7 aims to target the new features introduced in cudnn version v8.5 (https://developer.nvidia.com/cudnn). The following are the changes in the v0.7 release.
[New API] Added support for Resample operation.
[New API] Tensor class has a clone method which allows a user to quickly create a new Tensor object with similar attributes.
[New API] Added support for new pointwise operations CUDNN_POINTWISE_ERF, CUDNN_POINTWISE_GELU_APPROX_TANH_FWD, CUDNN_POINTWISE_GELU_APPROX_TANH_BWD, CUDNN_POINTWISE_IDENTITY.
[New API] Several API names have been unified and made consistent across multiple descriptors for readability.
setComputePrecision/setMathPrecision/setMathType have been unified into setComputeType in cudnn_frontend_ConvDesc.h, cudnn_frontend_MatMulDesc.h, cudnn_frontend_Operation.h, cudnn_frontend_PointWiseDesc.h, cudnn_frontend_ReductionDesc.h, cudnn_frontend_Resample.h
Math operation descriptors such as ConvDesc and ResampleDesc have getSpatialDimCount instead of getDimCount to avoid confusion with Tensor dimensions.
Accessors for arrays follow the [g,s]et[Spatial] naming pattern. [Spatial] is only needed when the attribute is common to both the Tensor descriptor and the Operation descriptor. Currently, only the Stride and DimCount attributes have this ambiguity.
setArray functions take the size and a pointer as arguments, e.g. setStride(int dim, int64_t* arr), setSpatialStride(int dim, int64_t* arr).
getArray functions return a pointer to the array whose size is determined by getDimCount or getSpatialDimCount.
[Minor Enhancement] Execution plans and Operation Graphs print out more information in their describe() methods.
[Bug Fixes] Some samples have been updated to go over all fallback configs to ensure that a successful plan is built.
[Bug Fixes] Execution plans had a wrongly initialized numerical note, CUDNN_NUMERICAL_NOTE_TYPE_TENSOR_CORE. This has been fixed.
[Samples] Added a new sample that does scale and bias of two tensors, adds them followed by a ReLU operation to show how fused operations work.
[Samples] Added a sample to demonstrate how the resample operation works.
[Samples] Added a new sample which shows convolution followed by multiple scales.
[Samples] Added a sample to show Fully Connected Layer fused with GeLU forward.
[Samples] Added a new sample to show fused backward activation, backward bias and backward Data Grad operation.
The current FE is designed to be compatible with all minor releases in the cuDNN 8.x version
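The array accessor convention described in the v0.7 notes above (setters take a size and a pointer, getters return a pointer whose length comes from a separate count accessor) can be illustrated with a small mock. `MockConvDesc` and `copy_spatial_stride` are hypothetical and are not cudnn_frontend classes or functions. The mock also derives the spatial dimension count from the setter call and takes a `const` pointer, both of which are simplifications made for this sketch.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an operation descriptor. NOT a cudnn_frontend class.
class MockConvDesc {
public:
    // setArray style: element count first, then a pointer to the values.
    MockConvDesc& setSpatialStride(int dim, const std::int64_t* arr) {
        spatial_stride_.assign(arr, arr + dim);
        return *this;
    }
    // Count accessor that tells callers how long the returned array is.
    std::int64_t getSpatialDimCount() const {
        return static_cast<std::int64_t>(spatial_stride_.size());
    }
    // getArray style: returns a pointer; the length is getSpatialDimCount().
    const std::int64_t* getSpatialStride() const { return spatial_stride_.data(); }

private:
    std::vector<std::int64_t> spatial_stride_;
};

// Reads the array back by pairing the pointer with the count accessor.
inline std::vector<std::int64_t> copy_spatial_stride(const MockConvDesc& desc) {
    const std::int64_t* ptr = desc.getSpatialStride();
    return std::vector<std::int64_t>(ptr, ptr + desc.getSpatialDimCount());
}
```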
v0.7.1
[Enhancement] Additional commit to remove an extraneous include of cudnn_ops_infer.h
v0.7.2
[Enhancement] Fixed issues in the code which caused warnings in MSVC and clang compilers.
[Enhancement] Fixed errors in get_heuristics_list where, for certain heuristics modes in older cuDNN versions, the returned heuristics list might be incorrect.
[Bug fixes] Fixed several test cases that failed on unsupported GPUs so that they exit gracefully.
[Samples] Added a sample to showcase fp8 convolution forward on NVIDIA Hopper GPUs. The sample also showcases post convolution book-keeping operations such as scaling and absolute maximum reduction.
[Samples] Added a sample which converts fp16 tensor to fp8 and performs transpose and absolute maximum reduction.
[Samples] Added a sample to demonstrate Max pooling operation including tensor index dump, necessary to speed up the backward pass.
[Samples] Added a sample to showcase the backward pooling operation.
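The fp8 samples above mention scaling and absolute maximum (amax) reduction as book-keeping steps. The following CPU-side sketch shows what those two quantities are and one common way they relate: the amax of a tensor is used to pick a scale that maps its values into the representable range of the target format. The function names are hypothetical, the default of 448 assumes the FP8 E4M3 format, and the samples themselves run these steps on the GPU through cuDNN graphs rather than with code like this.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Absolute maximum reduction: the largest magnitude found in the tensor.
inline float abs_max(const std::vector<float>& values) {
    float amax = 0.0f;
    for (float v : values) {
        amax = std::max(amax, std::fabs(v));
    }
    return amax;
}

// Scale that maps [-amax, amax] onto [-format_max, format_max].
// Assumption: 448 is the largest finite value of the FP8 E4M3 format.
inline float scale_for_range(float amax, float format_max = 448.0f) {
    return amax > 0.0f ? format_max / amax : 1.0f;  // avoid division by zero
}
```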