Add marlin int4 kernel #333

dacorvo · 2024-10-06T14:35:41Z

What does this PR do?

This adds a modified Marlin fp16/int4 kernel to the library and creates two new QTensor subclasses to use it:

MarlinInt4PackedTensor,
MarlinInt4WeightQBitsTensor.

Readback of the weights, scales and zero-points breaks as soon as parallelization increases: output features above 128 are corrupted once enough inputs are processed in parallel.

As a result, the AWQ kernel is still the one in use, even though its performance drops as the number of tokens increases.

The code is nevertheless merged as is, and #332 has been created to investigate these issues.

source: https://github.com/shcho1118/marlin-scaled-zero-point

Original fix in vLLM project: The reason for the crash was the inline PTX assembly that introduced the async_copy with streaming behavior. The solution is to use the more standard PTX for async_copy (without the fractional L2 policy for "evict_first"). There is no performance difference between standard async_copy PTX and the previous one.

This ensures that the output of the Marlin kernels is close to the output obtained using dequantized weights.
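The comparison against dequantized weights can be sketched in plain Python. This is only an illustration under stated assumptions: it uses one scale and one scaled zero-point per output feature (the real kernel works on groups), it assumes the convention `w = q * scale - scaled_zp` with `scaled_zp = zp * scale`, and `fused_gemm` is a hypothetical stand-in for the CUDA kernel, not the library op.

```python
import random


def dequantize(q, scale, scaled_zp):
    """Rebuild float weights from int4 codes: w[k][n] = q[k][n] * scale[n] - scaled_zp[n]."""
    return [[code * scale[n] - scaled_zp[n] for n, code in enumerate(row)] for row in q]


def matmul(x, w):
    """Plain (tokens x in_features) @ (in_features x out_features) product."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def fused_gemm(x, q, scale, scaled_zp):
    """Hypothetical stand-in for a fused kernel: accumulate on the integer codes,
    then apply the scale and the scaled zero-point once per output feature."""
    acc = matmul(x, q)
    row_sums = [sum(row) for row in x]
    return [
        [acc[m][n] * scale[n] - row_sums[m] * scaled_zp[n] for n in range(len(scale))]
        for m in range(len(x))
    ]


def max_abs_diff(a, b):
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))


def make_case(tokens, in_features, out_features, seed=0):
    """Random inputs, int4 codes in [0, 15], per-feature scales and scaled zero-points."""
    rng = random.Random(seed)
    x = [[rng.uniform(-1, 1) for _ in range(in_features)] for _ in range(tokens)]
    q = [[rng.randrange(16) for _ in range(out_features)] for _ in range(in_features)]
    scale = [rng.uniform(0.01, 0.1) for _ in range(out_features)]
    zp = [rng.randrange(16) for _ in range(out_features)]
    scaled_zp = [z * s for z, s in zip(zp, scale)]
    return x, q, scale, scaled_zp


x, q, scale, scaled_zp = make_case(tokens=4, in_features=64, out_features=8)
reference = matmul(x, dequantize(q, scale, scaled_zp))
candidate = fused_gemm(x, q, scale, scaled_zp)
# The two paths are algebraically identical, so only rounding noise should remain.
assert max_abs_diff(reference, candidate) < 1e-9
```

A real test would replace `fused_gemm` with the kernel call and use a looser tolerance suited to fp16 arithmetic.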

Adding more tests revealed a bug in the Marlin int4 kernel when the weights and inputs are large enough. Failing configurations are marked as xfail.
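One way to mark only the failing configurations as xfail is to attach the mark per parameter set. The sketch below assumes pytest; the thresholds, the helper names and the test name are hypothetical and are not taken from the PR, which does not list the exact failing shapes.

```python
import pytest

# Hypothetical thresholds, chosen only to illustrate the pattern.
MAX_SAFE_OUT_FEATURES = 128
MIN_FAILING_TOKENS = 16


def is_known_failure(tokens, out_features):
    """Assumed failure condition: many output features and enough parallel inputs."""
    return out_features > MAX_SAFE_OUT_FEATURES and tokens >= MIN_FAILING_TOKENS


def make_params(token_counts, out_feature_sizes):
    """Build the parameter grid, tagging known-bad configurations as xfail."""
    params = []
    for tokens in token_counts:
        for out_features in out_feature_sizes:
            marks = ()
            if is_known_failure(tokens, out_features):
                marks = (pytest.mark.xfail(reason="Marlin int4 output corruption, see #332"),)
            params.append(pytest.param(tokens, out_features, marks=marks))
    return params


PARAMS = make_params(token_counts=[1, 16], out_feature_sizes=[128, 256])


@pytest.mark.parametrize("tokens, out_features", PARAMS)
def test_marlin_gemm_matches_reference(tokens, out_features):
    """Placeholder body: a real test would compare the kernel output
    with the output obtained from dequantized weights."""
```

Marking individual parameter sets keeps the passing configurations as regular tests, so a regression in a configuration that currently works is still reported.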

dacorvo force-pushed the add_marlin_int4_kernel branch from 6803193 to 4d7bd36 Compare October 10, 2024 09:15

dacorvo and others added 10 commits October 10, 2024 12:07

test(qbits): increase coverage for weight linear tests

19e6513

feat(library): add original marlin fp16i4 kernel

b1c07d3

feat: modify marlin fp16 int4 kernel to use scaled zeropoint

0130cc6

source: https://github.com/shcho1118/marlin-scaled-zero-point

feat(library): add Marlin gemm_f16_i4 op

72c3382

feat: add MarlinInt4PackedTensor

cf895d0

feat(marlin): add scales/shifts permutations

dfc0acf

test: add test_gemm_marlin_fp16_int4

f6fc238

This ensures that the output of the Marlin kernels is close to the output obtained using dequantized weights.

perf: add Marlin to w4a16 benchmark

1ffc4e5

feat(qtensor): add MarlinQBitsTensor

ce90118

Adding more tests revealed a bug in the Marlin int4 kernel when the weights and inputs are large enough. Failing configurations are marked as xfail.

dacorvo force-pushed the add_marlin_int4_kernel branch from 4d7bd36 to ce90118 Compare October 10, 2024 10:07

dacorvo merged commit 852bb9c into main Oct 10, 2024
16 checks passed

dacorvo deleted the add_marlin_int4_kernel branch October 10, 2024 11:38

dacorvo mentioned this pull request Oct 18, 2024

Integrate marlin fp16/bf16-int4/int8 matrix multiplication kernel #239

Closed
