@lsy323 lsy323 commented Jun 5, 2024

Add the first XLA quantized op: per-channel weight-only quantized matmul.

The math is out[bf16] = matmul(act[bf16], weight[s8]) * scale[bf16], the same as what was adopted in the XLA Llama quantization implementation.
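As a minimal reference sketch of that math (shapes and tensor names below are illustrative, not taken from this PR), in plain PyTorch:

```python
import torch

# Illustrative shapes: a batch of bf16 activations against an int8 per-channel
# quantized weight with one bf16 scale per output channel.
act = torch.randn(4, 16).to(torch.bfloat16)                    # [batch, in_features]
weight = torch.randint(-128, 128, (16, 8), dtype=torch.int8)   # [in_features, out_features]
scale = torch.rand(8).to(torch.bfloat16)                       # [out_features]

# out[bf16] = matmul(act[bf16], weight[s8]) * scale[bf16]:
# the int8 weight is upcast for the matmul, then rescaled per output channel.
out = torch.matmul(act, weight.to(torch.bfloat16)) * scale     # [batch, out_features]
```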

User experience:

  • Call the quantized op directly with an already-quantized weight in model code
  • Swap the nn.Linear module with the added quantized module in model code (see the sketch below)

More details about user experience can be found in the added README.
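A rough sketch of the module-swap flow; the module below is a stand-in written for illustration, and the op and module actually added in this PR may have different names and signatures (see the README):

```python
import torch
import torch.nn as nn

# Stand-in weight-only quantized linear module, illustrating the swap flow only.
class WeightOnlyQuantizedLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        # Quantized weight and per-channel scale are produced offline and loaded here.
        self.register_buffer('weight', torch.zeros(out_features, in_features, dtype=torch.int8))
        self.register_buffer('scale', torch.ones(out_features, dtype=torch.bfloat16))

    def load_quantized_weight(self, weight, scale):
        self.weight = weight
        self.scale = scale

    def forward(self, x):
        # Same math as above: matmul against the upcast int8 weight, then per-channel rescale.
        return torch.matmul(x, self.weight.to(x.dtype).t()) * self.scale

# Swap the nn.Linear in an existing model with the quantized module.
model = nn.Sequential(nn.Linear(16, 8))
q_weight = torch.randint(-128, 128, (8, 16), dtype=torch.int8)  # [out_features, in_features]
per_channel_scale = torch.rand(8).to(torch.bfloat16)
q_linear = WeightOnlyQuantizedLinear(16, 8)
q_linear.load_quantized_weight(q_weight, per_channel_scale)
model[0] = q_linear

out = model(torch.randn(2, 16).to(torch.bfloat16))  # [2, 8], bf16
```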

Changes:

  • Added a custom torch op and an nn.Module for the quantized op
  • Added a user guide

Test:

  • Test that the lowered HLO does what we expect
  • Test that it works with Dynamo
  • Test numerical correctness

int4 and blockwise quantization support will be added in follow-up PRs.

@lsy323 lsy323 requested review from JackCaoG and miladm June 5, 2024 23:42
@lsy323 lsy323 marked this pull request as ready for review June 5, 2024 23:44
@lsy323 lsy323 changed the title from Add int8 per channel quantized matmul to Add int8 per channel weight-only quantized matmul Jun 5, 2024
@lsy323 lsy323 requested a review from qihqi June 6, 2024 00:08

@JackCaoG JackCaoG left a comment

mostly lgtm, minor nits

@lsy323 lsy323 requested a review from JackCaoG June 6, 2024 16:54
@lsy323 lsy323 merged commit 56ddd5d into master Jun 7, 2024
@lsy323 lsy323 deleted the lsiyuan/quant-ops branch June 7, 2024 04:57