Member

@hjelmn hjelmn commented Jan 5, 2025

The btl/uct code can be quite aggressive about sending connection messages over the connection endpoint, which in some cases leads to a large number of unnecessary messages. This commit adds code to limit retries to at most one every 2 ms. This interval is controlled by a new MCA variable: btl_uct_connection_retry_timeout.
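
The rate limiting described above can be illustrated with a minimal sketch. This is not the code from the commit: the struct, the function names, and the nanosecond units are all hypothetical, and the real btl/uct implementation may track the retry time differently. The idea is only that each endpoint remembers when it last sent a connection message and skips the resend if the timeout has not yet elapsed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-endpoint retry state (not the actual btl/uct structure). */
typedef struct {
    bool     sent_once;     /* has any connection message been sent yet? */
    uint64_t last_send_ns;  /* timestamp of the most recent send */
} conn_retry_state_t;

/* Return true if a connection message may be sent at time now_ns.
 * A send is allowed the first time, and afterwards only once at least
 * timeout_ns has passed since the previous send. */
static bool conn_retry_allowed(conn_retry_state_t *st, uint64_t now_ns,
                               uint64_t timeout_ns)
{
    if (st->sent_once && (now_ns - st->last_send_ns) < timeout_ns) {
        return false;  /* too soon: suppress the retry */
    }
    st->sent_once = true;
    st->last_send_ns = now_ns;
    return true;
}

/* Simulate a progress loop that polls every step_ns for duration_ns and
 * count how many connection messages would actually be sent. */
unsigned count_sends(uint64_t duration_ns, uint64_t step_ns,
                     uint64_t timeout_ns)
{
    conn_retry_state_t st = { false, 0 };
    unsigned sends = 0;

    for (uint64_t now = 0; now < duration_ns; now += step_ns) {
        if (conn_retry_allowed(&st, now, timeout_ns)) {
            sends++;
        }
    }
    return sends;
}
```

In this simulation, polling every 0.5 ms for 10 ms with a 2 ms timeout sends 5 messages instead of 20; a timeout of 0 disables the limit.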

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
@hppritcha
Member

Is there a simple test that demonstrates the problem this PR addresses?

Member Author

hjelmn commented Jan 13, 2025

UD is a bit different on our hardware, and the extra messages don't break things per se, but I added logging of UD sends and, for a 384-process run (2 nodes * 192 ppn), there were over 250,000 UD messages in a mostly connected case when there should have been no more than 2 * 192^2 = 73,728 (each process should send one for each off-node connection). With this fix the number dropped to 20k because it was not fully connecting.

I may actually axe the retry entirely, since UD and other similar TLs are treated as reliable by UCT. I have another change coming in to clean up the code a bit (leaving the retry in place). It has been fully tested, but I need some time to test without the retry mechanism.

Member

@hppritcha hppritcha left a comment

I can't really test this, but it built for me and passed basic multi-node sanity testing using ob1.

@hjelmn hjelmn merged commit a1544c0 into open-mpi:main Feb 4, 2025
15 checks passed
2 participants