This repository has been archived by the owner on Oct 4, 2024. It is now read-only.



YconquestY/Needle


Needle

An imperative deep learning framework with customized GPU and CPU backends.

TODO

  • To refactor the codebase
    Needle currently adopts a monolithic design, e.g., the entire NDArray backend for a device lives in a single file. A modular design would allow agile optimization of operators, layers, modules, etc.
  • To extend the reservoir of operators
    Supporting self-attention requires a batched matrix multiplication operator. BatchNorm2d can be made more efficient if fused. Other customized operators include fused differentiable volume rendering, sparse matrix multiplication for graph embeddings,[1][2] I/O kernels, etc.
  • To optimize the NDArray backend
    This summary gathers a series of blog posts on maximizing the throughput of operators. Also refer to Programming Massively Parallel Processors for more topics on CUDA. The goal is to exceed the performance of official CUDA libraries such as cuBLAS with hand-crafted kernels on certain tasks.
  • To incorporate tcnn as MLP intrinsics
  • To accelerate computational graph traversal with CUDA 12
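To make the batched matrix multiplication item concrete, here is a minimal NumPy reference for the semantics such an operator would need; the function name and shapes are illustrative assumptions, not Needle's actual API, and a real backend would replace the Python loop with a CUDA or CPU kernel.

```python
import numpy as np

def batched_matmul(a, b):
    """Reference semantics for a batched matrix multiply.

    a: (batch, m, k), b: (batch, k, n) -> out: (batch, m, n).
    A hand-written backend would launch one GEMM per batch element
    (or a single strided-batched kernel) instead of this loop.
    """
    batch, m, k = a.shape
    batch2, k2, n = b.shape
    assert batch == batch2 and k == k2, "incompatible batched shapes"
    out = np.empty((batch, m, n), dtype=a.dtype)
    for i in range(batch):
        out[i] = a[i] @ b[i]  # one (m, k) x (k, n) multiply per batch
    return out

# Self-attention scores follow exactly this pattern:
# scores = batched_matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)
```

The attention line at the bottom shows why the operator is a prerequisite: Q and K carry a leading batch (and head) dimension, so a plain 2-D matmul does not suffice.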
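For the graph-traversal item, the sequential baseline to beat is a topological ordering of the computational graph; gradient propagation visits nodes in the reverse of this order. Below is a sketch using Kahn's algorithm, with hypothetical node-id and dependency-map arguments (not Needle's actual data structures); a CUDA version would process each zero-indegree frontier level in parallel.

```python
from collections import defaultdict, deque

def topo_order(nodes, inputs):
    """Order computational-graph nodes so every node follows its inputs.

    nodes:  iterable of node ids.
    inputs: dict mapping a node id to the ids it depends on.
    Implements Kahn's algorithm: repeatedly emit nodes whose
    remaining (unemitted) dependencies number zero.
    """
    indegree = {n: 0 for n in nodes}
    consumers = defaultdict(list)
    for n in nodes:
        for p in inputs.get(n, []):
            indegree[n] += 1
            consumers[p].append(n)
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()  # the whole frontier could run in parallel
        order.append(n)
        for c in consumers[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return order
```

Each iteration's `ready` frontier is independent, which is what makes the level-by-level structure amenable to GPU parallelization.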

Acknowledgement

This project is inspired by 10-414/714 Deep Learning Systems by Carnegie Mellon University. Switch to the hw branch for homework, the proj branch for the course project, and the lec branch for lectures.

Footnotes

  1. Gunrock: a high-performance graph processing library on the GPU

  2. GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU