Skip non-selected experts for mixtral and qwen2_moe #32429
Conversation
| @ArthurZucker, I ran into a conflict between torch.fx and the dynamic for loop in my implementation. I haven't found a concise solution for it yet. Could you help with this, or should we just give it up? |
| Thanks for the info. I am not quite familiar with 'torch.fx'. Seems like there is a trade-off between enabling FX tracing and skipping experts. |
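A minimal sketch of the conflict being discussed (not the transformers code; module and argument names are made up for illustration): `torch.fx` symbolic tracing sees `Proxy` placeholders rather than tensor values, so a loop whose trip count depends on which experts the router actually selected cannot be traced.

```python
import torch
from torch import fx


class SkipExperts(torch.nn.Module):
    """Toy MoE-style module that only runs the experts selected at runtime."""

    def __init__(self, num_experts=4, dim=8):
        super().__init__()
        self.experts = torch.nn.ModuleList(torch.nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, hidden_states, selected_experts):
        out = torch.zeros_like(hidden_states)
        # The set of experts to run depends on tensor *values*, which symbolic
        # tracing cannot observe -- it only manipulates Proxy objects.
        for expert_idx in torch.unique(selected_experts).tolist():
            out = out + self.experts[expert_idx](hidden_states)
        return out


try:
    fx.symbolic_trace(SkipExperts())
except Exception as err:
    # Expected: tracing fails because a Proxy cannot be iterated or turned
    # into a Python list -- the trade-off mentioned in the comments above.
    print(type(err).__name__, err)
```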
ArthurZucker left a comment
#31173 is related as well.
Yeah, if fx is failing I'm not 100% sure we want to do this, but we do need benches of some sort regarding the reduced IO, because for qwen, with a lot of experts, it can be significant.
ArthurZucker left a comment
Still down to get this merged! Would you mind just producing small benches of before / after?
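A minimal before/after micro-benchmark sketch along the lines requested (assumptions: a CUDA device is available, `accelerate` is installed for `device_map`, and the `Qwen/Qwen1.5-MoE-A2.7B` checkpoint is used as the qwen2_moe example — swap in whichever Mixtral / Qwen2-MoE checkpoint is actually being measured). Run it once on `main` and once on this branch and compare the timings.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-MoE-A2.7B"  # illustrative qwen2_moe checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)

# Warm up, then time a fixed number of generated tokens.
model.generate(**inputs, max_new_tokens=8)
torch.cuda.synchronize()
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
print(f"128 new tokens in {time.perf_counter() - start:.2f}s")
```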
| same problem. |
| Thanks for the PR 🤗 |
ArthurZucker left a comment
Actually, I would like to get rid of the tolist() that introduces a device sync.
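A small sketch of the concern raised here (variable names are illustrative): calling `.tolist()` on a CUDA tensor copies values back to the host, which blocks the CPU until all queued kernels have finished — an implicit device synchronization in the middle of the MoE forward pass.

```python
import torch

if torch.cuda.is_available():
    # Pretend this is the per-expert hit mask computed by the router.
    expert_hit = torch.rand(60, device="cuda") > 0.9

    # .tolist() forces a device-to-host copy, so the CPU stalls until every
    # queued CUDA kernel has completed -- the sync the review comment refers to.
    hit_indices = torch.nonzero(expert_hit).squeeze(-1).tolist()

    for expert_idx in hit_indices:
        ...  # dispatch only the experts that were actually selected
```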
| The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
ArthurZucker left a comment
LGTM, not that bad
* Skip non-selected experts for mixtral and qwen2_moe
* Fix: tensor tolist()
* WIP: tokenization test
* fix modular source of truth
* nits

---------

Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
- Replace data-dependent .nonzero() operation with static expert loop
- Resolves GuardOnDataDependentSymNode error during torch.export
- Maintains identical functionality while enabling export compatibility
- Fixes issue introduced in PR huggingface#32429
- Add tests for torch.export compatibility
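For context, a hedged sketch of what the "static expert loop" pattern mentioned above can look like (all names are illustrative, not the exact transformers implementation; requires a PyTorch version that ships `torch.export`): the trip count is fixed at `num_experts` and no data-dependent `.nonzero()` shape appears, so the exporter has nothing to guard on. The cost is that every expert runs on every token and unselected contributions are zeroed out by the dense routing mask.

```python
import torch
from torch import nn


class StaticMoE(nn.Module):
    def __init__(self, hidden_dim=16, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(num_experts))

    def forward(self, hidden_states):  # (num_tokens, hidden_dim)
        logits = self.gate(hidden_states)
        weights, selected = torch.topk(torch.softmax(logits, dim=-1), self.top_k, dim=-1)
        # Dense (num_tokens, num_experts) routing weights; zero for unselected experts.
        dense_weights = torch.zeros_like(logits).scatter(-1, selected, weights)

        out = torch.zeros_like(hidden_states)
        for expert_idx in range(len(self.experts)):  # static trip count, export-safe
            # Every expert runs on every token; the mask zeroes out non-routed ones.
            out = out + self.experts[expert_idx](hidden_states) * dense_weights[:, expert_idx : expert_idx + 1]
        return out


example = torch.randn(10, 16)
exported = torch.export.export(StaticMoE(), (example,))
print(exported)
```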
Fixes #32283
This PR avoids redundant computation for some MoE models (mixtral and qwen2_moe).
The current implementation loops over all the experts and therefore loads every expert's weights, which adds unnecessary IO cost.
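A hedged sketch of the idea, with illustrative names rather than the exact modeling_mixtral / modeling_qwen2_moe code: derive the set of experts that actually received tokens from the routing mask and visit only those, so the weights of unselected experts are never touched. (The `.tolist()` call is the device-sync trade-off discussed in the review comments above.)

```python
import torch


def dispatch_selected_experts(hidden_states, routing_weights, selected_experts, experts, num_experts):
    # hidden_states:    (num_tokens, hidden_dim)
    # routing_weights:  (num_tokens, top_k) normalized router probabilities
    # selected_experts: (num_tokens, top_k) int64 expert ids from the router
    final = torch.zeros_like(hidden_states)
    expert_mask = torch.nn.functional.one_hot(selected_experts, num_classes=num_experts).permute(2, 1, 0)

    # Indices of experts hit by at least one token; .tolist() incurs a device sync.
    expert_hit = torch.greater(expert_mask.sum(dim=(-1, -2)), 0).nonzero().squeeze(-1).tolist()

    for expert_idx in expert_hit:
        idx, top_x = torch.where(expert_mask[expert_idx])
        current_state = hidden_states[top_x]
        # Apply the expert only to its routed tokens, scale by routing weight,
        # and scatter-add the result back into the output buffer.
        final.index_add_(0, top_x, experts[expert_idx](current_state) * routing_weights[top_x, idx, None])
    return final
```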
@ArthurZucker