Integrate marlin fp16/bf16-int4/int8 matrix multiplication kernel #239
Comments
This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
The kernel has been integrated into the quanto CUDA extension in https://github.com/huggingface/optimum-quanto/tree/add_marlin_int4_kernel (thanks to initial work by @shcho1118).
@dacorvo what should be done to integrate this at inference?
What is missing is a
This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
Done in #333
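For context on the inference question above, here is a minimal end-user sketch, assuming the standard `optimum-quanto` quantize/freeze workflow with `qint4` weights. The model name is illustrative, and whether a given linear layer actually dispatches to the Marlin kernel depends on the GPU and the installed quanto version.

```python
# A minimal sketch, assuming the standard optimum-quanto workflow
# (quantize/freeze with qint4 weights). The model name is illustrative;
# dispatching to the Marlin kernel happens inside quanto and depends on
# GPU support, which is an assumption here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.quanto import quantize, freeze, qint4

model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

quantize(model, weights=qint4)  # mark linear weights for int4 quantization
freeze(model)                   # materialize the packed int4 weights

inputs = tokenizer("Marlin kernels make int4 matmuls", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0]))
```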
Since the introduction of the mixed-precision fp16-int4 MARLIN (Mixed Auto-Regressive Linear) kernels by IST-DASLab, new mixed-precision MARLIN kernels have appeared for other data types.
In particular, mixed-precision fp16/bf16-int4/int8 kernels have been contributed to TGI and could be integrated into `optimum-quanto` as well, with companion `Int8MarlinQBytesTensor` and `Int4MarlinQBitsTensor` classes to pack the weights.
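As a rough illustration of what such companion classes would encapsulate, below is a minimal packing sketch in plain PyTorch. All names are hypothetical, and the layout is a naive two-int4-values-per-byte scheme rather than the interleaved tile layout the actual Marlin kernel expects.

```python
# Hypothetical sketch of the weight packing a companion tensor class
# (e.g. the proposed Int4MarlinQBitsTensor) would perform. The layout
# here is a naive two-int4-per-byte scheme for illustration only; the
# real Marlin kernel requires its own interleaved tile layout.
import torch

def pack_int4(w_q: torch.Tensor) -> torch.Tensor:
    """Pack unsigned int4 values (uint8 in [0, 15]) two per byte."""
    assert w_q.shape[-1] % 2 == 0
    low = w_q[..., 0::2].to(torch.int32)
    high = w_q[..., 1::2].to(torch.int32)
    return ((high << 4) | low).to(torch.uint8)

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Recover the original int4 values from the packed bytes."""
    p = packed.to(torch.int32)
    low, high = p & 0x0F, (p >> 4) & 0x0F
    return torch.stack((low, high), dim=-1).flatten(-2).to(torch.uint8)

# Symmetric per-output-channel int4 quantization of fp weights, then packing.
# A Marlin-style matmul consumes the packed weights and scales directly
# instead of dequantizing back to fp16 first.
w = torch.randn(128, 256)
scale = w.abs().amax(dim=1, keepdim=True) / 7.0
w_q = torch.clamp(torch.round(w / scale) + 8, 0, 15).to(torch.uint8)
packed = pack_int4(w_q)  # half the storage of the uint8 representation
assert torch.equal(unpack_int4(packed), w_q)
```

The real companion classes would additionally have to rearrange the packed values into the tile order the Marlin kernel expects and carry the scales alongside, which is the packing work the issue proposes to encapsulate in them.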