📚Modern CUDA Learn Notes with PyTorch: Tensor/CUDA Cores, 📖150+ CUDA Kernels with PyTorch bindings, 📖HGEMM/SGEMM (95%~99% cuBLAS performance), 📖100+ LLM/CUDA Blogs.
Updated Nov 15, 2024 - Cuda
This is a series on GPU optimization that explains in detail how to optimize CUDA kernels. I introduce several basic kernel optimizations, including elementwise, reduce, sgemv, and sgemm. The performance of these kernels is at or near the theoretical limit.
Standard library strided math functions.
Strided array math operations.
Base strided.
Compute the absolute value.
Standard library strided array special math functions.
Standard library special math functions.
Apply a function to each element in an array and assign the result to an element in an output array, iterating from right to left.
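To make the strided and right-to-left descriptions above concrete, here is a minimal Python sketch. It is not code from any of the listed packages: the names `map_right` and `strided_abs`, their signatures, and the handling of negative strides (starting from the far end of the array, as BLAS-style routines commonly do) are assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence


def map_right(x: Sequence, fcn: Callable) -> List:
    """Apply fcn to each element of x, visiting indices from last to first.

    Hypothetical helper: each result is written to the matching index of a
    new output list, so only the visiting order differs from a plain map.
    """
    out = [None] * len(x)
    for i in range(len(x) - 1, -1, -1):
        out[i] = fcn(x[i])
    return out


def strided_abs(n: int, x: Sequence, stride_x: int, y: List, stride_y: int) -> List:
    """Write the absolute value of n strided elements of x into y.

    Hypothetical helper: a stride is the index step between consecutive
    elements. For a negative stride we assume iteration starts at the far
    end of the array so that every index stays in bounds.
    """
    ix = 0 if stride_x >= 0 else (1 - n) * stride_x
    iy = 0 if stride_y >= 0 else (1 - n) * stride_y
    for _ in range(n):
        y[iy] = abs(x[ix])
        ix += stride_x
        iy += stride_y
    return y


if __name__ == "__main__":
    # Every other element of x, written to consecutive slots of y.
    print(strided_abs(3, [-1, 9, -2, 9, -3, 9], 2, [0, 0, 0], 1))
    # Same results as a normal map; only the order of the calls is reversed.
    print(map_right([1, 2, 3], lambda v: v * 2))
```

The right-to-left order only matters when the applied function has side effects or depends on call order; for a pure function the output matches an ordinary left-to-right map.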