Skip to content

Conversation

@wenduwan
Copy link
Contributor

@wenduwan wenduwan commented Jun 7, 2023

There is a bug in the current impl which attempts to compare local rank vs global rank. Furthermore, when this fails it triggers a double free on the error path, i.e. crash.

This change set addresss the root cause.

Copy link
Contributor

@lrbison lrbison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Wenduo. No issue with the changes, but two requests:

  1. The double-free seems like it only happens on error path. Could you expend the existing commit message from just "bugfix" to something to that effect, or mention the function name. Bugfix isn't particularly helpful.

  2. The rank issue seems to be more important to me. Seems like this PR's description should focus on that rather than the double-free.

I assume these problems were introduced recently and from the same PR, otherwise I would suggest separate PRs.

wenduwan added 2 commits June 7, 2023 15:57
This patch fixes a bug in package rank calculation. The peer rank returned via PMIX_LOCAL_PEERS is the global rank. Matching it with local rank was wrong. Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
Fix a double free bug on the error path. This causes a harsh crash instead of returning an error code. Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
@wenduwan wenduwan changed the title opal/mca/ofi: fix a double free opal/mca/ofi: Patch series to fix a crash due to double free on error path Jun 7, 2023
@wenduwan
Copy link
Contributor Author

wenduwan commented Jun 7, 2023

@lrbison Yes sir. Addressed.

@wenduwan wenduwan requested a review from lrbison June 7, 2023 16:03
Copy link
Contributor

@lrbison lrbison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@wenduwan
Copy link
Contributor Author

wenduwan commented Jun 7, 2023

@lrbison Can we merge? I think 1 review is good enough for this PR.

@gpaulsen
Copy link
Member

gpaulsen commented Jun 8, 2023

bot:ibm:retest

@wenduwan
Copy link
Contributor Author

wenduwan commented Jun 8, 2023

@gpaulsen Is it possible to share the IBM CI failure? Thanks

@gpaulsen
Copy link
Member

gpaulsen commented Jun 9, 2023

Sorry, the IBM failure appears to be a firewall issue on our end. Working to get it resolved soon.

@gpaulsen
Copy link
Member

gpaulsen commented Jun 9, 2023

bot:ibm:retest

@gpaulsen
Copy link
Member

IBM lab folks are still investigating the IBM CI failure this morning.

@wenduwan
Copy link
Contributor Author

@gpaulsen Thank you. Please let me know if the issue is directly caused by this change.

@awlauria
Copy link
Contributor

@wenduwan it is not..it's a lab machine/hardware issue.

@awlauria
Copy link
Contributor

IBM CI is offline - please disregard the failures. We disabled our CI until the issues are resolved.

@wenduwan
Copy link
Contributor Author

@lrbison Since we can bypass IBM CI, shall we merge this change?

@lrbison lrbison merged commit ceceeaf into open-mpi:main Jun 12, 2023
@lrbison
Copy link
Contributor

lrbison commented Jun 12, 2023

Yes, merged. Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

4 participants