Releases: NVIDIA/cudnn-frontend

v1.2.0

12 Mar 19:26
b780db8

[New artifacts] Pre-built (alpha version) pip-installable wheels for Linux will be made available as part of this release. The pip wheels are compatible with Python 3.8 through 3.12. Source builds will continue to work as expected.

[Documentation] We are updating our contribution policy and will accept small PRs targeting improvements to cudnn-frontend. For the full contribution guide, refer to our contribution policy.

[API updates] [Python] The graph.execute function in Python now takes an optional handle. This lets users pass a custom handle to the execute function (and achieves parity with the C++ API).

[API updates] Pointwise ops can now take scalars directly as an argument. This simplifies the graph creation process in general. For example:

auto C = graph.pointwise(A,
        graph.tensor(5.0f),
        fe::graph::Pointwise_attributes()
        .set_mode(fe::PointwiseMode_t::ADD)
        .set_compute_data_type(fe::DataType_t::FLOAT));

[Installation] Addresses RFE #64 to provide installation as cmake install

[Installation] Addresses RFE #63 to provide custom installation of catch2. If catch2 is not found, cudnn frontend fetches it automatically from the upstream github repository.

[Logging] Improved logging to print legible tensor names. We will be working on further improvements in future releases to make the logging more streamlined.

[Samples] Add a sample for showcasing auto-tuning to select the best plan among the ones returned from heuristics.
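
The selection step in auto-tuning can be sketched roughly as follows. This is a minimal illustration and not the sample's actual code: the helper name pick_fastest_plan and the convention that a negative time marks a failed plan are assumptions made here.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper (not part of cudnn-frontend): given one measured time
// in milliseconds per candidate plan, return the index of the fastest plan.
// A negative time marks a plan that failed to build or run and is skipped.
// Returns -1 when no plan is usable.
inline int64_t pick_fastest_plan(std::vector<float> const& times_ms) {
    int64_t best_index = -1;
    float best_time    = std::numeric_limits<float>::max();
    for (int64_t i = 0; i < static_cast<int64_t>(times_ms.size()); ++i) {
        float const t = times_ms[static_cast<std::size_t>(i)];
        if (t < 0.0f) continue;  // skip failed candidates
        if (t < best_time) {
            best_time  = t;
            best_index = i;
        }
    }
    return best_index;
}
```

The returned index, rather than a hard-coded 0, is what a caller would then pass on when executing the chosen plan.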

[Samples] As part of the v1.2 release, we have created new Jupyter notebooks showcasing Python API usage. At this point, these work on A100 and H100 cards only, as mentioned in the notebooks. In future releases, we plan to simplify the installation process and elaborate on the API usage. Please refer to the samples/python directory.

[Bug fixes] Fixed an auto-tuning issue where plan 0 was always executed, even when a different plan was chosen as the best candidate.

[Unit Tests] We are adding some unit tests which will provide a way for developers to test parts of their code before submitting pull requests. Adding unit tests and samples before submitting a pull request is highly encouraged.

Note on source installation of python bindings:
On Ubuntu 22.04 (Debian-based systems), when installing without a virtual environment, set ENV DEB_PYTHON_INSTALL_LAYOUT=deb_system. See the related issue.

v1.1.2

28 Feb 18:06
150798f

[Bug fix] Fixed an issue where the heuristic threw an error when it returned 0 results. This case is now treated as a success, since the result can be combined with the fallback heuristics to build a valid plan.
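
A rough sketch of why an empty heuristic result need not be an error: the candidate list is simply the primary results followed by the fallback results. EngineConfig and merge_heuristics are hypothetical stand-ins used only for illustration; they are not cudnn-frontend types.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an engine configuration returned by a heuristic.
struct EngineConfig {
    std::string name;
};

// Append the fallback results after the primary results. An empty primary
// list is valid input: the merged list then contains only fallback configs.
inline std::vector<EngineConfig> merge_heuristics(std::vector<EngineConfig> primary,
                                                  std::vector<EngineConfig> const& fallback) {
    primary.insert(primary.end(), fallback.begin(), fallback.end());
    return primary;
}
```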

v1.1.1

27 Feb 18:16
fd27d76

[Bug Fix] Fixed an issue in older cudnn versions where heuristics would return SUCCESS even if the number of heuristic results was zero.

v1.1.0

07 Feb 18:50
c29d609

[New API] A new overloaded variant of execute has been added which allows the variant pack to be specified as pairs of "uid, device pointer". To use this, the user is expected to provide the uid for each tensor created.

error_t
cudnn_frontend::graph::Graph::execute(cudnnHandle_t handle, 
            std::unordered_map<int64_t, void*>& tensor_to_pointer_map, void *workspace) const;
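
As a minimal sketch of what such a variant pack might look like on the caller's side, assuming the uids below were the ones assigned to the tensors at creation time. The uid values and the helper make_variant_pack are illustrative assumptions, not library API.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical uids; in real code these must match the uids given to the
// tensors when the graph was constructed.
constexpr int64_t X_UID = 1;
constexpr int64_t W_UID = 2;
constexpr int64_t Y_UID = 3;

// Build the "uid -> device pointer" map that the overloaded execute accepts.
// The pointers are expected to be device allocations owned by the caller.
inline std::unordered_map<int64_t, void*> make_variant_pack(void* x, void* w, void* y) {
    return {{X_UID, x}, {W_UID, w}, {Y_UID, y}};
}
```

The resulting map would then be passed as tensor_to_pointer_map, together with the handle and workspace.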

[New API] Serialization: The Graph class now supports serialization and deserialization after the final plan is built. Serialization is only supported on runtime-compiled engines in the cuDNN backend as of today, but may be extended to other engines in the future. Deserialization requires a cuDNN handle created for a GPU identical to the one the original graph/plan was created on. New samples showcasing this have been added in samples/cpp/serialization.cpp.

error_t
cudnn_frontend::graph::Graph::serialize(std::vector<uint8_t>& data) const;

error_t
cudnn_frontend::graph::Graph::deserialize(cudnnHandle_t handle, 
                   std::vector<uint8_t> const& data);
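
Since serialize fills a plain byte vector, persisting it between runs is ordinary file I/O. The helpers below are a sketch under that assumption; save_blob and load_blob are hypothetical names, not part of cudnn-frontend.

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Write the serialized bytes to a binary file. Returns false on I/O failure.
inline bool save_blob(std::string const& path, std::vector<uint8_t> const& data) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(reinterpret_cast<char const*>(data.data()),
              static_cast<std::streamsize>(data.size()));
    return out.good();
}

// Read the whole file back into a byte vector suitable for deserialize.
inline bool load_blob(std::string const& path, std::vector<uint8_t>& data) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    data.assign(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
    return true;
}
```

A caller would pass the vector filled by serialize to save_blob, and later hand the vector filled by load_blob to deserialize along with a handle created for an identical GPU.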

[New API] Autotuning: If the graph allows multiple engine configs for a given topology, each of these can now be built and executed in parallel. The expected flow is that the user queries the number of plans present and spawns a new thread for each plan to be finalized in parallel. The APIs that support this are as follows:

int64_t 
Graph::get_execution_plan_count() const;

error_t
Graph::build_plan_at_index(cudnnHandle_t const &handle, int64_t index);

error_t
Graph::execute_plan_at_index(cudnnHandle_t const &handle, 
                         std::unordered_map<int64_t, void*>& ,  
                         void* workspace,  
                         int64_t plan_index) const;

int64_t
Graph::get_workspace_size_plan_at_index(int64_t plan_index) const;
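
The one-thread-per-plan flow described above could be sketched as follows. The callable build_one stands in for a call such as build_plan_at_index on a real graph and handle; the helper build_all_plans itself is a hypothetical illustration, not library code.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Spawn one thread per plan index and record whether each build succeeded.
// std::vector<char> is used instead of std::vector<bool> so that each thread
// writes to its own distinct byte, which avoids a data race.
inline std::vector<char> build_all_plans(int64_t plan_count,
                                         std::function<bool(int64_t)> const& build_one) {
    std::vector<char> ok(static_cast<std::size_t>(plan_count), 0);
    std::vector<std::thread> workers;
    workers.reserve(ok.size());
    for (int64_t i = 0; i < plan_count; ++i) {
        workers.emplace_back([&ok, &build_one, i] {
            ok[static_cast<std::size_t>(i)] = build_one(i) ? 1 : 0;
        });
    }
    for (auto& worker : workers) worker.join();
    return ok;
}
```

In real use, plan_count would come from get_execution_plan_count(), and build_one would wrap the per-index build call and report whether its returned error was good.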

[New feature] sdpa_node now allows ragged offset to be set in the input and output tensors.

[Bug Fix] Certain parts of the FE code used to throw exceptions even with the DISABLE_EXCEPTION flag set. This has been cleaned up.

[Bug Fix] For sdpa node, cudnn now correctly returns NOT_SUPPORTED when s_q is not a multiple of 64 and padding mask is on and cudnn version is less than 9.0.0.

[Bug Fix] For sdpa backward node, cudnn now correctly returns NOT_SUPPORTED when s_q is less than 64 and cudnn version is less than 9.0.0.
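
The two sdpa conditions above can be restated as predicates. This is only a paraphrase of the release notes, not the library's actual check; the function names are hypothetical, and the version is assumed to be encoded as a single integer in the style of CUDNN_VERSION (90000 for 9.0.0).

```cpp
#include <cstdint>

// Assumed integer encoding of cuDNN 9.0.0, in the style of CUDNN_VERSION.
constexpr int64_t CUDNN_9_0_0 = 90000;

// Forward sdpa: rejected when s_q is not a multiple of 64, the padding mask
// is on, and the cuDNN version is older than 9.0.0.
inline bool sdpa_fprop_not_supported(int64_t s_q, bool padding_mask, int64_t cudnn_version) {
    return cudnn_version < CUDNN_9_0_0 && padding_mask && (s_q % 64 != 0);
}

// Backward sdpa: rejected when s_q is less than 64 and the cuDNN version is
// older than 9.0.0.
inline bool sdpa_bprop_not_supported(int64_t s_q, int64_t cudnn_version) {
    return cudnn_version < CUDNN_9_0_0 && s_q < 64;
}
```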

[Bug Fix] Fixed an issue with pointwise Modulo operation.

[Bug Fix] Fixed an issue in sdpa node, where the intermediate data types were wrong.

[Samples] Added a sample to showcase matmul with int8 and FP8 precisions.

[Cleanup] Python samples have moved from samples/python to tests/python_fe.

[Cleanup] Removed the cudnn_frontend::throw_if function.

v1.0.3 release

31 Jan 20:37
a86ad70

[Bug fix] Fixed an issue where in some cases with padding, SDPA backward node can produce NaNs.

[Bug fix] In some older CUDA toolkits, e.g. CUDA 11.4, float to half conversion is not implicit. This was raised in PR-57. Thanks @drisspg for reporting this. A more explicit fix using __float2half has been implemented in this patch.

[Enhancement] Accepting github PR-55. Thanks @r-barnes for the suggestion.

v1.0.2 release

09 Jan 22:33
9e17716

[Cleanup] Remove the cudnn_backend.h dependency, since the correct header is already included in cudnn.h

v1.0.1 release

04 Jan 19:04
f87101b

[Bug Fix] Fixed an issue in the sdpa node when kv-sequence length is not a multiple of 64 and padding mask is not enabled. This allows graphs with kv-sequence length not a multiple of 64 to be executed on cudnn version 8.9.5 onwards. cudnn versions prior to this now correctly return NOT_SUPPORTED as expected.

[Bug Fix] Fixed an issue where creation of graph object leads to compilation error in some compilers.

[Bug Fix] cudnn frontend now correctly sets the stream on the handle. This affected only the Python bindings.

[Internal change] Streamlined includes of cudnn graph API header files into cudnn_frontend.h.

v1.0.0 release

04 Dec 23:45

The cudnn_frontend v1.0 release introduces a new API aimed at simplifying graph construction.

[New API] In the FE v1.0 API, users can describe multiple operations that form a subgraph through the cudnn_frontend::graph::Graph object.
Unlike the FE v0.x API, users don't need to worry about specifying shapes and sizes of the intermediate virtual tensors. See README.FE.1.0.md for
more details. For more information on historical 1.0 changes, pre-release release notes are here.

The Graph class consists of three types of API, viz.

  • APIs that return reference to the graph itself.
    This is necessary for chaining.
    These can be used for setting the global properties of the graph. Example,
graph.set_compute_data_type(...).set_io_data_type(...);
  • APIs that return a shared pointer to the tensor. These are required to denote entry tensors or output of nodes which can be exit points of graph or inputs to other nodes. Example,
X = graph.tensor(...); 
W = graph.tensor(...);
Y = graph.conv_fprop(X, W, Conv_fprop_attributes(...));
  • APIs that return an error type, which is a combination of an error code and an error message. These APIs generally mutate the graph object or are responsible for calling the cudnn backend API. Example,
auto error = graph.validate();
auto error = graph.build_operation_graph(handle); 
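
To make the three return styles above concrete, here is a small self-contained mock. MiniGraph, Tensor and Error are hypothetical types written for this illustration only; they mimic the shape of the API but are not the cudnn_frontend classes.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Error code plus message, mirroring the "error type" style of return.
struct Error {
    bool ok;
    std::string message;
    bool is_good() const { return ok; }
};

struct Tensor {
    std::string name;
};

class MiniGraph {
   public:
    // Style 1: return a reference to the graph so that calls can be chained.
    MiniGraph& set_io_data_type(std::string type) {
        io_type_ = std::move(type);
        return *this;
    }
    MiniGraph& set_compute_data_type(std::string type) {
        compute_type_ = std::move(type);
        return *this;
    }

    // Style 2: return a shared pointer to a tensor owned jointly by the
    // graph and the caller, so it can feed other nodes or act as an output.
    std::shared_ptr<Tensor> tensor(std::string name) {
        tensors_.push_back(std::make_shared<Tensor>(Tensor{std::move(name)}));
        return tensors_.back();
    }

    // Style 3: return an error object; [[nodiscard]] nudges callers to check it.
    [[nodiscard]] Error validate() const {
        if (io_type_.empty() || compute_type_.empty()) return {false, "data types not set"};
        if (tensors_.empty()) return {false, "graph has no tensors"};
        return {true, ""};
    }

   private:
    std::string io_type_;
    std::string compute_type_;
    std::vector<std::shared_ptr<Tensor>> tensors_;
};
```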

[New Feature] Python bindings for the FE 1.0 API. See the Python API section in README.md for building the Python bindings. Details of the Python
API and its keyword arguments are in README.FE.1.0.md. Python API samples are in samples/python/*.py

[New Feature] Added a compound SDPA op (both forward and back prop). More details in docs/operations/Attention.md

[New Feature] Better error reporting: in addition to error codes, we also provide error messages with more information on the specific cause of failure.

[Deprecation] The v0.x APIs are now labelled deprecated and may be removed in v2.0. Consider moving to the v1.0 API. If there are issues or missing features, please create a GitHub issue.

Changes over pre-release-5:

[New Feature] Scaled_Dot_Product_Attention op now supports GQA in Fprop and bprop.

[Breaking change] Output dims and strides of SDPA fprop and bprop outputs are now mandatory, since inference of the output shapes is non-deterministic.

[Samples] Added samples to showcase:

  • INT8 convolution ("Conv with Int8 datatypes")
  • Mixed precision multiplication ("Mixed Precision Matmul")
  • Simple convolutions, matmuls, and matmuls with simple epilogues (matmuls.cpp, wgrads.cpp, dgrads.cpp)

[Update] The default value of cudnnNanPropagation_t has been set to CUDNN_PROPAGATE_NAN instead of CUDNN_NOT_PROPAGATE_NAN.

[Update] Added a typedef, SDPA, for scaled_dot_product_flash_attention as a convenience.

Miscellaneous updates to v0.x API and the legacy samples:

[Bug fix] Some tests were failing on Ampere GPUs because no plans with 0 size were available. This has been fixed.

[Bug fix] Median of three sampling was incorrectly sorting the results, when cudnnFind was used. This has been fixed.
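
For reference, a minimal sketch of the median-of-three idea, assuming it means timing a plan three times and keeping the middle value so that a single outlier run does not skew the comparison. The helper name is hypothetical and this is not the library's implementation.

```cpp
#include <algorithm>
#include <array>

// Return the middle value of three timing samples.
inline float median_of_three(float a, float b, float c) {
    std::array<float, 3> samples{a, b, c};
    std::sort(samples.begin(), samples.end());
    return samples[1];
}
```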

[Bug fix] Thanks to @Riottomsk for pointing out the bug in the port count of Pointwise mode POW in his PR (#49). This fix has been incorporated.

[Bug fix] Fixed a bug in the resample backprop operation, where CUDNN_ATTR_OPERATION_RESAMPLE_BWD_XDESC and CUDNN_ATTR_OPERATION_RESAMPLE_BWD_YDESC were not set correctly.

[Feature] A Layer Norm API has been added and can be used with the v0.x API.

cudnn FE 1.0 pre-release-5

20 Nov 20:56
Pre-release

Pre-release-5 release notes:

[API change] Based on user feedback, we have removed the distinction between the graph and plan objects. With the new API, the plan remains embedded in the graph and all operations are performed on the graph object.

Previously,

    REQUIRE(graph.validate().is_good());
    REQUIRE(graph.build_operation_graph(handle).is_good());
    auto plans = graph.get_execution_plan_list({fe::HeurMode_t::A});
    REQUIRE(plans.check_support(handle).is_good());
    REQUIRE(graph.set_execution_plans(plans).is_good());

Now,

    REQUIRE(graph.validate().is_good());
    REQUIRE(graph.build_operation_graph(handle).is_good());
    REQUIRE(graph.create_execution_plans({fe::HeurMode_t::A}).is_good());
    REQUIRE(graph.check_support(handle).is_good());
    REQUIRE(graph.build_plans(handle).is_good());

Also, with this change, the following new APIs have been introduced on the graph class.

error_t
build_plans(cudnnHandle_t const &handle,
            BuildPlanPolicy_t const policy     = BuildPlanPolicy_t::HEURISTICS_CHOICE,
            bool const do_multithreaded_builds = false);

Graph & deselect_workspace_greater_than(int64_t const workspace);

Graph & deselect_behavior_notes(std::vector<BehaviorNote_t> const &notes);

Graph & deselect_numeric_notes(std::vector<NumericalNote_t> const &notes);

int64_t get_workspace_size() const;

int64_t get_autotune_workspace_size() const;

error_t autotune(cudnnHandle_t handle,
             std::unordered_map<std::shared_ptr<Tensor_attributes>, void *> variants,
             void *workspace,
             void *user_impl = nullptr);

[API change] Removes the implicit validate call made in build_operation_graph. Now, the expectation is that the user explicitly calls validate on the graph before calling build_operation_graph. This helps the user distinguish between errors from malformed graphs and errors occurring due to lowering into cudnn.

[API change] Return error codes from the graph API have now been marked nodiscard.

[New API] Added a new API, graph::key() -> int64_t, that returns a hash of the graph object. This can be used as a key for graph caching. An example of this usage is shown in the samples.
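
A graph cache keyed on such a hash might look roughly like the sketch below. GraphCache is a hypothetical helper written for illustration and is not the code from the samples. Because the key is a hash, distinct graphs could in principle collide, which this sketch does not handle.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <unordered_map>

// Cache of already-built graphs keyed by an int64_t hash. BuiltGraph stands
// in for whatever object holds a fully built graph and its plan.
template <typename BuiltGraph>
class GraphCache {
   public:
    // Return the cached graph for this key, or call build() once and cache
    // its result when the key has not been seen before.
    std::shared_ptr<BuiltGraph> get_or_build(
        int64_t key, std::function<std::shared_ptr<BuiltGraph>()> const& build) {
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;
        auto built = build();
        cache_.emplace(key, built);
        return built;
    }

    std::size_t size() const { return cache_.size(); }

   private:
    std::unordered_map<int64_t, std::shared_ptr<BuiltGraph>> cache_;
};
```

In real use the key would come from the graph's key() call, and build would run the validate, build and plan-creation steps only on a cache miss.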

[New API] Added new Python APIs create_handle, destroy_handle, set_stream, and get_stream to allow custom handle and stream management on the graph object.

[New functionality] sdpa backward can now compute dbias if the fprop had a bias operation. This functionality was added in cudnn 8.9.6.

[Enhancement] The behavior of CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT has been extended. This is documented in docs/operation/Attention.md

[Enhancement] Added better error checks to make sure all the tensors of the node have been created. This prevents unexpected segmentation faults seen earlier.

[Bug Fix] Fix issues in instancenorm, which had caused invalid memory access earlier.

[Enhancement] Moved the v0.9 API samples to the samples/legacy_samples folder for better organization.

cudnn FE 1.0 pre-release-4

19 Oct 03:29
Pre-release

[API change] Scaled_dot_product_flash_attention_attributes and Scaled_dot_product_flash_attention_backward_attributes now accept K, V tensors instead of K-transpose and V-transpose. This is a deviation from the backend API. This change is based on feedback from multiple customers.

[New API] Add a tensor_like Python API which accepts a DLPack-compatible tensor. This simplifies cudnn tensor creation.

[New Feature] Setting the CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT environment variable allows choosing between different optimized cudnn backend kernels. See docs/operations/mha for more details.
[New Feature] Add RMSNorm and InstanceNorm forward and backward implementations.
[New Feature] Add alibi, padding, layout support for attention bprop node.
[New Feature] Introduce Python bindings for plans. These allow validating graphs and filtering plans.

[Bug Fix] Fix relative includes of filenames in cudnn_frontend headers. This resolves compilation issues in certain toolchains
[Bug Fix] Fix Segfault when dropout was set for some scaled dot product flash attention nodes.

[New samples] Add new samples for apply_rope, layernorm forward and backward, rmsnorm forward and backward