📚 Modern CUDA Learn Notes with PyTorch for Beginners: covers Tensor/CUDA Cores, TF32/F16/BF16/F8, 📖150+ CUDA Kernels🔥🔥 with PyTorch bindings, 📖30+ LLM/VLM🔥, 📖40+ CV/C++...🔥 and 📖50+ CUDA/CuTe...🔥 blogs, and a fully optimized 📖HGEMM/SGEMM🔥🔥; see the 📖HGEMM/SGEMM Supported Matrix👇 for details. Please 🌟 star this repo to support me, many thanks ~ 🎉🎉
Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, the HGEMM (WMMA/MMA) implemented in this repo (sky blue 🔵) achieves 95%~99% of the performance of cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP` (orange 🟠). Please check the hgemm benchmark for more details.
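As a rough illustration of how a relative figure such as 95%~99% is usually derived from kernel timings, here is a minimal sketch. It relies only on the standard GEMM operation count of 2·M·N·K floating-point operations; the function names and the timings are hypothetical and are not taken from this repo's benchmark scripts.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of one M x N x K GEMM in TFLOPS.

    Each of the M*N*K multiply-accumulates counts as two floating-point ops.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * m * n * k / seconds / 1e12


def relative_performance(custom_seconds: float, baseline_seconds: float) -> float:
    """Fraction of the baseline's throughput reached by the custom kernel.

    For the same problem size the FLOP count cancels out, so the throughput
    ratio reduces to baseline_time / custom_time.
    """
    if custom_seconds <= 0 or baseline_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / custom_seconds


# Hypothetical timings for a 4096 x 4096 x 4096 HGEMM (illustrative, not measured).
M = N = K = 4096
baseline_s = 1.50e-3  # assumed time of the cuBLAS baseline
custom_s = 1.55e-3    # assumed time of the custom kernel

print(f"baseline: {gemm_tflops(M, N, K, baseline_s):.1f} TFLOPS")
print(f"custom:   {gemm_tflops(M, N, K, custom_s):.1f} TFLOPS")
print(f"ratio:    {relative_performance(custom_s, baseline_s):.1%}")
```

In practice the timings would come from averaged GPU measurements after warm-up iterations; the arithmetic stays the same.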
CUDA Cores | Sliced K(Loop over K) | Tile Block | Tile Thread |
---|---|---|---|
✔️ | ✔️ | ✔️ | ✔️ |
WMMA(m16n16k16) | MMA(m16n8k16) | Pack LDST(128 bits) | SMEM Padding |
✔️ | ✔️ | ✔️ | ✔️ |
Copy Async | Tile MMA(More Threads) | Tile Warp(More Values) | Multi Stages |
✔️ | ✔️ | ✔️ | ✔️ |
Reg Double Buffers | Block Swizzle | Warp Swizzle | Collective Store(Shfl) |
✔️ | ✔️ | ✔️ | ✔️ |
Row Major(NN) | Col Major(TN) | SGEMM TF32 | SMEM Swizzle(Permuted) |
✔️ | ✔️ | ✔️ | ... |
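
The first row of the matrix above (Sliced K, Tile Block) describes a loop structure that can be sketched on the CPU. The following is a plain-Python model of that structure, not the repo's CUDA kernels: the function name `tiled_gemm` and the tile sizes `bm`, `bn`, `bk` are assumptions made for illustration. In a real kernel each (`m0`, `n0`) tile would map to one thread block, the staged tiles would live in shared memory, and the accumulators would live in registers.

```python
def tiled_gemm(a, b, bm=2, bn=2, bk=2):
    """Compute C = A @ B for row-major lists of lists, one output tile at a time."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    if len(b) != k:
        raise ValueError("inner dimensions of A and B must match")

    c = [[0.0] * n for _ in range(m)]
    for m0 in range(0, m, bm):            # "Tile Block": one BM x BN output tile
        for n0 in range(0, n, bn):
            rows = min(bm, m - m0)        # edge tiles may be smaller
            cols = min(bn, n - n0)
            acc = [[0.0] * cols for _ in range(rows)]  # per-tile accumulators

            for k0 in range(0, k, bk):    # "Sliced K": walk K in chunks of BK
                # Stage the current slices (shared memory in a CUDA kernel).
                a_tile = [row[k0:k0 + bk] for row in a[m0:m0 + rows]]
                b_tile = [row[n0:n0 + cols] for row in b[k0:k0 + bk]]
                depth = len(b_tile)       # actual slice depth, <= bk

                for i in range(rows):
                    for j in range(cols):
                        for p in range(depth):
                            acc[i][j] += a_tile[i][p] * b_tile[p][j]

            # Write the finished tile back once, after the whole K loop.
            for i in range(rows):
                for j in range(cols):
                    c[m0 + i][n0 + j] = acc[i][j]
    return c


# Small usage example with shapes that do not divide evenly by the tile sizes.
A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(tiled_gemm(A, B))
```

The point of the structure is data reuse: each staged slice of A and B is read once per tile and used for every output element in that tile, which is what the shared-memory and register tiling in the table aims at.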
📖 150+ CUDA Kernels 🔥🔥 (Common Interview Questions)
Workflow: custom CUDA kernel impl -> PyTorch Python bindings -> run tests. 👉TIPS: `*` = Tensor Cores (WMMA/MMA), otherwise CUDA Cores; `/` = not supported; ✔️ = supported; ❔ = planned.
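📖 LLM | Multimodal | Diffusion | Inference Optimization (by the author)
📖 CV Inference Deployment | C++ | Algorithms | Technical Notes (by the author)
📖 CUTLASS | CuTe | NCCL | CUDA | Recommended Articles (by other authors)
💡Note: These articles are excellent, and I have learned a lot from them. PRs recommending more great articles are welcome!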
©️License
GNU General Public License v3.0
🎉Contribute
How to contribute? Please check 🌤🌤CONTRIBUTE🎉🎉.