Hi, Marlin does not use any INT4 tensor cores; 4-bit weights are decompressed on the fly and the actual computation is then carried out in FP16. The reason Turing is not supported is that Marlin relies heavily on the cp.async instruction, which was introduced with compute capability 8.0. It allows explicitly fetching global memory in the background while doing other work at the same time, which is crucial to reaching peak performance in an FP16xINT4 setting. While you could probably reuse quite a bit of Marlin's work when writing a Turing kernel, some significant changes will likely be necessary.
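To make the "decompress, then compute in floating point" idea concrete, here is a minimal pure-Python sketch. It is only an illustration under stated assumptions: the nibble order, the zero point of 8, the single per-tensor scale, and the function names `pack_int4`, `unpack_dequant`, and `dot_w4` are all hypothetical and are not Marlin's actual layout or API. It also uses ordinary Python floats rather than FP16 and does not model cp.async at all.

```python
def pack_int4(values):
    """Pack unsigned 4-bit integers (0..15) two per byte, low nibble first.

    The nibble order is an assumption made for this sketch.
    """
    if len(values) % 2:
        raise ValueError("need an even number of values")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))


def unpack_dequant(packed, scale, zero_point=8):
    """Expand each byte into two weights and map them back to floats.

    zero_point=8 and a single shared scale are assumptions for illustration.
    """
    out = []
    for byte in packed:
        for q in (byte & 0xF, byte >> 4):
            out.append((q - zero_point) * scale)
    return out


def dot_w4(activations, packed, scale):
    """Dot product of float activations with packed 4-bit weights.

    The weights are decompressed first; the multiply-accumulate itself
    runs entirely in floating point, never on 4-bit integers.
    """
    weights = unpack_dequant(packed, scale)
    return sum(a * w for a, w in zip(activations, weights))


packed = pack_int4([8, 9, 7, 15])
result = dot_w4([1.0, 2.0, 3.0, 4.0], packed, scale=0.5)
```

The point of the sketch is that the arithmetic never touches integer hardware: the 4-bit values only exist in storage, which is why INT4 tensor cores do not help here.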
Why is Ampere or Ada (RTX 3000 and RTX 4000 series) required to support this?
Turing (RTX 2000 series) has INT4 tensor cores.