
Conversation

@lsy323 lsy323 commented Jun 10, 2024

int4 weight can be enabled via torch.ops.xla.quantized_matmul(x, weight, weight_scaler, int4_weight=True) or XlaQuantizedLinear(..., int4_weight=True).
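
A minimal usage sketch of both entry points, assuming the import path, the (in, out) weight layout, the XlaQuantizedLinear constructor arguments, and the load_quantized_weight call from the existing quantized matmul module; only the int4_weight=True flag is what this PR adds.

```python
import torch
import torch_xla.core.xla_model as xm
# Import path is an assumption; adjust to wherever XlaQuantizedLinear lives.
from torch_xla.experimental.xla_quantized_matmul import XlaQuantizedLinear

device = xm.xla_device()

# int4 values are carried unpacked in an int8 tensor, so they stay in [-8, 7].
w_int4 = torch.randint(-8, 8, (64, 128), dtype=torch.int8)  # (in, out) layout assumed
scaler = torch.rand(128, dtype=torch.bfloat16)              # per-output-channel scale
x = torch.rand(4, 64, dtype=torch.bfloat16)

# Functional op, as described above.
out = torch.ops.xla.quantized_matmul(
    x.to(device), w_int4.to(device), scaler.to(device), int4_weight=True)

# Module form; load_quantized_weight is assumed to be the weight-loading hook.
layer = XlaQuantizedLinear(64, 128, int4_weight=True)
layer.load_quantized_weight(w_int4, scaler)
layer = layer.to(device)
out2 = layer(x.to(device))
```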

The matmul with int4 weight works as follows (a reference sketch follows the list):

  1. The int4 weight is stored unpacked in an int8 container.
  2. During HLO lowering, an xla::Literal is created for the int4 weight.
  3. F.linear is applied to the activation and the int4 weight.
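
The following pure-PyTorch sketch illustrates steps 1 and 3 as reference semantics. The quantization helper is hypothetical and not part of this PR; it only shows int4-range values living unpacked in an int8 tensor and the F.linear computation the lowered matmul should match.

```python
import torch
import torch.nn.functional as F

def quantize_per_channel_int4(w_fp: torch.Tensor):
    """Hypothetical helper: symmetric per-channel quantization to the int4
    range [-8, 7], stored unpacked in an int8 container (step 1 above)."""
    max_abs = w_fp.abs().amax(dim=1, keepdim=True)
    scaler = max_abs / 7.0
    w_int4 = torch.clamp(torch.round(w_fp / scaler), -8, 7).to(torch.int8)
    return w_int4, scaler.squeeze(1)

# Step 3 reference: F.linear on the activation and the (dequantized) int4 weight.
w_fp = torch.randn(128, 64)  # (out_features, in_features)
w_int4, scaler = quantize_per_channel_int4(w_fp)
x = torch.randn(4, 64)
out_ref = F.linear(x, w_int4.to(torch.float32) * scaler.unsqueeze(1))
```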

The original plan was to pack the int4 values into an int8 container and reinterpret-cast them, but reinterpret cast doesn't currently work on TPU.

Test:
Added tests for the quantized op and the quantized linear module.

@lsy323 lsy323 marked this pull request as ready for review June 10, 2024 22:41
@lsy323 lsy323 force-pushed the lsiyuan/int4-quant-ops branch from 4330117 to 03f46f1 June 10, 2024 23:01
@JackCaoG JackCaoG self-requested a review June 10, 2024 23:05
@lsy323 lsy323 requested a review from JackCaoG June 10, 2024 23:17

lsy323 commented Jun 10, 2024

Removed the pack/unpack logic and its test since they are no longer used.

@lsy323 lsy323 merged commit ac371fb into master Jun 11, 2024
@miladm miladm assigned miladm and lsy323 and unassigned miladm Jun 13, 2024
@lsy323 lsy323 deleted the lsiyuan/int4-quant-ops branch December 6, 2024 18:54
