Integrate marlin fp16/bf16-int4/int8 matrix multiplication kernel #239
Comments
This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
The kernel has been integrated into the quanto CUDA extension in https://github.com/huggingface/optimum-quanto/tree/add_marlin_int4_kernel (thanks to initial work by @shcho1118).
@dacorvo what should be done to integrate this at inference?
What is missing is a
This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
Done in #333
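For context on the inference question above, here is a minimal end-user sketch, assuming the standard `optimum-quanto` quantize/freeze workflow with `qint4` weights. The model name is illustrative, and whether a given linear layer actually dispatches to the Marlin kernel depends on the GPU and the installed quanto version.

```python
# A minimal sketch, assuming the standard optimum-quanto workflow
# (quantize/freeze with qint4 weights). The model name is illustrative;
# dispatching to the Marlin kernel happens inside quanto and depends on
# GPU support, which is an assumption here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.quanto import quantize, freeze, qint4

model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

quantize(model, weights=qint4)  # mark linear weights for int4 quantization
freeze(model)                   # materialize the packed int4 weights

inputs = tokenizer("Marlin kernels make int4 matmuls", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0]))
```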
Since the introduction of the mixed-precision fp16-int4 MARLIN (Mixed Auto-Regressive Linear) kernels by IST-DASLab, new mixed-precision MARLIN kernels have appeared for other data types.
In particular, mixed-precision fp16/bf16-int4/int8 kernels have been contributed to TGI and could be integrated into `optimum-quanto` as well, with companion `Int8MarlinQBytesTensor` and `Int4MarlinQBitsTensor` classes to pack the weights.
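As a rough illustration of what such companion classes would encapsulate, below is a minimal packing sketch in plain PyTorch. All names are hypothetical, and the layout is a naive two-int4-values-per-byte scheme rather than the interleaved tile layout the actual Marlin kernel expects.

```python
# Hypothetical sketch of the weight packing a companion tensor class
# (e.g. the proposed Int4MarlinQBitsTensor) would perform. The layout
# here is a naive two-int4-per-byte scheme for illustration only; the
# real Marlin kernel requires its own interleaved tile layout.
import torch

def pack_int4(w_q: torch.Tensor) -> torch.Tensor:
    """Pack unsigned int4 values (uint8 in [0, 15]) two per byte."""
    assert w_q.shape[-1] % 2 == 0
    low = w_q[..., 0::2].to(torch.int32)
    high = w_q[..., 1::2].to(torch.int32)
    return ((high << 4) | low).to(torch.uint8)

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Recover the original int4 values from the packed bytes."""
    p = packed.to(torch.int32)
    low, high = p & 0x0F, (p >> 4) & 0x0F
    return torch.stack((low, high), dim=-1).flatten(-2).to(torch.uint8)

# Symmetric per-output-channel int4 quantization of fp weights, then packing.
# A Marlin-style matmul consumes the packed weights and scales directly
# instead of dequantizing back to fp16 first.
w = torch.randn(128, 256)
scale = w.abs().amax(dim=1, keepdim=True) / 7.0
w_q = torch.clamp(torch.round(w / scale) + 8, 0, 15).to(torch.uint8)
packed = pack_int4(w_q)  # half the storage of the uint8 representation
assert torch.equal(unpack_int4(packed), w_q)
```

The real companion classes would additionally have to rearrange the packed values into the tile order the Marlin kernel expects and carry the scales alongside, which is the packing work the issue proposes to encapsulate in them.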