Conversation

@rpsilva-aws
Collaborator

Fixes #8537

@rpsilva-aws rpsilva-aws marked this pull request as ready for review January 7, 2025 01:33
@rpsilva-aws
Collaborator Author

rpsilva-aws commented Jan 7, 2025

Test results without the deterministic serialization (instead, relying on the former SerializeAsString()):

[ RUN      ] XlaUtilTest.TestDeterministicComputationSerialization
torch_xla/csrc/runtime/xla_util_test.cc:281: Failure
Expected equality of these values:
  hash1
    Which is: 43931100196028486903611743554166252076
  hash2
    Which is: 6626235922799979895316908395105923211
Hashes should match regardless of the frontend attribute ordering
[  FAILED  ] XlaUtilTest.TestDeterministicComputationSerialization (0 ms)
[----------] 5 tests from XlaUtilTest (1 ms total)

So hash1 and hash2 differ, even though it is serializing the exact same HLO module proto.
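For reference, here is a minimal sketch of what deterministic proto serialization looks like with standard protobuf APIs (the helper name DeterministicSerialize is hypothetical and not necessarily the exact helper this PR adds); CodedOutputStream::SetSerializationDeterministic(true) is what gives map fields, such as the HLO frontend attributes, a stable serialization order:

#include <string>

#include "google/protobuf/io/coded_stream.h"
#include "google/protobuf/io/zero_copy_stream_impl_lite.h"
#include "google/protobuf/message_lite.h"

// Hypothetical helper: serialize a proto so that equal messages always
// produce byte-identical output. Plain SerializeAsString() does not
// guarantee a stable ordering for map fields (e.g. frontend attributes),
// which is why the two hashes above diverge.
std::string DeterministicSerialize(const google::protobuf::MessageLite& proto) {
  std::string serialized;
  {
    google::protobuf::io::StringOutputStream string_stream(&serialized);
    google::protobuf::io::CodedOutputStream coded_stream(&string_stream);
    coded_stream.SetSerializationDeterministic(true);
    proto.SerializeToCodedStream(&coded_stream);
  }  // Streams flush on destruction before `serialized` is returned.
  return serialized;
}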

@rpsilva-aws rpsilva-aws force-pushed the rpsilva_computation_hash branch 2 times, most recently from 5e406d7 to 4372f2e on January 7, 2025 06:52
@tengyifei tengyifei self-requested a review January 8, 2025 22:02
@tengyifei
Collaborator

LGTM, however the tests fail.

From what I can tell, both failed tests involve an XlaComputation: the test_conditional test feeds an XlaComputation into an HLO Cond op, and the scan test feeds an XlaComputation into an HLO While op. So the hash is probably broken for ComputationPtr in a way that causes different graphs to hash to the same value.

@rpsilva-aws
Collaborator Author

Indeed, looking into it. A first look shows that the resulting hash for both user computations is the same, even though the computation IRs differ in the constant operand:

  • 0.9 case:
HloModule test_conditional.18, entry_computation_layout={(f32[], f32[2,2]{1,0}, f32[2,2]{1,0})->f32[2,2]{1,0}}

%CondTrue.7 (p0.8: (f32[2,2], f32[2,2])) -> f32[2,2] {
  %p0.8 = (f32[2,2]{1,0}, f32[2,2]{1,0}) parameter(0)
  %get-tuple-element.9 = f32[2,2]{1,0} get-tuple-element((f32[2,2]{1,0}, f32[2,2]{1,0}) %p0.8), index=0
  %get-tuple-element.10 = f32[2,2]{1,0} get-tuple-element((f32[2,2]{1,0}, f32[2,2]{1,0}) %p0.8), index=1
  ROOT %add.11 = f32[2,2]{1,0} add(f32[2,2]{1,0} %get-tuple-element.9, f32[2,2]{1,0} %get-tuple-element.10)
}

%CondFalse.12 (p0.13: (f32[2,2], f32[2,2])) -> f32[2,2] {
  %p0.13 = (f32[2,2]{1,0}, f32[2,2]{1,0}) parameter(0)
  %get-tuple-element.14 = f32[2,2]{1,0} get-tuple-element((f32[2,2]{1,0}, f32[2,2]{1,0}) %p0.13), index=0
  %get-tuple-element.15 = f32[2,2]{1,0} get-tuple-element((f32[2,2]{1,0}, f32[2,2]{1,0}) %p0.13), index=1
  ROOT %subtract.16 = f32[2,2]{1,0} subtract(f32[2,2]{1,0} %get-tuple-element.14, f32[2,2]{1,0} %get-tuple-element.15)
}

ENTRY %test_conditional.18 (p0.1: f32[], p1.2: f32[2,2], p2.3: f32[2,2]) -> f32[2,2] {
  %p0.1 = f32[] parameter(0)
  %constant.4 = f32[] constant(0.9)
  %compare.5 = pred[] compare(f32[] %p0.1, f32[] %constant.4), direction=GT
  %p1.2 = f32[2,2]{1,0} parameter(1)
  %p2.3 = f32[2,2]{1,0} parameter(2)
  %tuple.6 = (f32[2,2]{1,0}, f32[2,2]{1,0}) tuple(f32[2,2]{1,0} %p1.2, f32[2,2]{1,0} %p2.3)
  ROOT %conditional.17 = f32[2,2]{1,0} conditional(pred[] %compare.5, (f32[2,2]{1,0}, f32[2,2]{1,0}) %tuple.6, (f32[2,2]{1,0}, f32[2,2]{1,0}) %tuple.6), true_computation=%CondTrue.7, false_computation=%CondFalse.12
}
  • 0.1 case:
HloModule test_conditional.18, entry_computation_layout={(f32[], f32[2,2]{1,0}, f32[2,2]{1,0})->f32[2,2]{1,0}}

%CondTrue.7 (p0.8: (f32[2,2], f32[2,2])) -> f32[2,2] {
  %p0.8 = (f32[2,2]{1,0}, f32[2,2]{1,0}) parameter(0)
  %get-tuple-element.9 = f32[2,2]{1,0} get-tuple-element((f32[2,2]{1,0}, f32[2,2]{1,0}) %p0.8), index=0
  %get-tuple-element.10 = f32[2,2]{1,0} get-tuple-element((f32[2,2]{1,0}, f32[2,2]{1,0}) %p0.8), index=1
  ROOT %add.11 = f32[2,2]{1,0} add(f32[2,2]{1,0} %get-tuple-element.9, f32[2,2]{1,0} %get-tuple-element.10)
}

%CondFalse.12 (p0.13: (f32[2,2], f32[2,2])) -> f32[2,2] {
  %p0.13 = (f32[2,2]{1,0}, f32[2,2]{1,0}) parameter(0)
  %get-tuple-element.14 = f32[2,2]{1,0} get-tuple-element((f32[2,2]{1,0}, f32[2,2]{1,0}) %p0.13), index=0
  %get-tuple-element.15 = f32[2,2]{1,0} get-tuple-element((f32[2,2]{1,0}, f32[2,2]{1,0}) %p0.13), index=1
  ROOT %subtract.16 = f32[2,2]{1,0} subtract(f32[2,2]{1,0} %get-tuple-element.14, f32[2,2]{1,0} %get-tuple-element.15)
}

ENTRY %test_conditional.18 (p0.1: f32[], p1.2: f32[2,2], p2.3: f32[2,2]) -> f32[2,2] {
  %p0.1 = f32[] parameter(0)
  %constant.4 = f32[] constant(0.1)
  %compare.5 = pred[] compare(f32[] %p0.1, f32[] %constant.4), direction=GT
  %p1.2 = f32[2,2]{1,0} parameter(1)
  %p2.3 = f32[2,2]{1,0} parameter(2)
  %tuple.6 = (f32[2,2]{1,0}, f32[2,2]{1,0}) tuple(f32[2,2]{1,0} %p1.2, f32[2,2]{1,0} %p2.3)
  ROOT %conditional.17 = f32[2,2]{1,0} conditional(pred[] %compare.5, (f32[2,2]{1,0}, f32[2,2]{1,0}) %tuple.6, (f32[2,2]{1,0}, f32[2,2]{1,0}) %tuple.6), true_computation=%CondTrue.7, false_computation=%CondFalse.12
}

I'll see what is happening with the hash here.

@rpsilva-aws rpsilva-aws force-pushed the rpsilva_computation_hash branch from 50ed7c5 to a8b5b31 on January 9, 2025 01:55
@rpsilva-aws
Collaborator Author

Thanks for catching it, @tengyifei. I was using the computation object argument after it had already been moved from.
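For context, a hypothetical sketch of that bug class (ComputationCache and AddAndHash are made-up names for illustration; only xla::XlaComputation::proto() and SerializeAsString() are real APIs): the computation is moved into storage first, and the hash is then taken from the moved-from object, whose proto is typically empty, so distinct graphs collide on the same hash.

#include <functional>
#include <string>
#include <utility>
#include <vector>
// Assumes xla::XlaComputation is available; its header path varies by version.

// Hypothetical illustration of the use-after-move bug class described above.
class ComputationCache {
 public:
  size_t AddAndHash(xla::XlaComputation computation) {
    computations_.push_back(std::move(computation));
    // BUG: `computation` was just moved from, so its proto is (typically)
    // empty and every distinct graph hashes to the same value.
    return std::hash<std::string>()(computation.proto().SerializeAsString());
    // Fix: compute the hash before the move, or hash computations_.back().
  }

 private:
  std::vector<xla::XlaComputation> computations_;
};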

@tengyifei tengyifei self-requested a review January 9, 2025 22:06
@bhavya01
Collaborator

bhavya01 commented Jan 9, 2025

> @bhavya01, it seems that the TPU CI has an issue? Seeing the same failure on other runs:

Yifei reverted the breaking change #8547. I expect it to pass now.

@rpsilva-aws rpsilva-aws force-pushed the rpsilva_computation_hash branch from a8b5b31 to 72f8c27 on January 9, 2025 23:43
@tengyifei tengyifei merged commit 196cab3 into pytorch:master Jan 10, 2025
12 checks passed
@rpsilva-aws rpsilva-aws deleted the rpsilva_computation_hash branch January 10, 2025 05:27