@pulpdrew pulpdrew commented Oct 31, 2025

Closes HDX-2699

Summary

This PR adds a Service Map feature to HyperDX, based on (sampled) trace data.

Demo

Screen.Recording.2025-10-31.at.2.33.16.PM.mov

How the service map is constructed

The service map is created by querying client-server (or producer-consumer) relationships from a Trace source. Two spans have a client-server/producer-consumer relationship if (a) they have the same trace ID and (b) the server/consumer's parent span ID is equal to the client/producer's span ID. This is accomplished via a self-join on the Trace table (the query can be found in useServiceMap.ts).
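
As a rough illustration of the matching rule (not the actual query, which is SQL), the same pairing can be sketched in memory. The `Span` shape and the `buildServiceEdges` name are hypothetical and exist only for this example. Unlike the LEFT JOIN in the real query, this sketch simply drops server/consumer spans that have no matching client/producer.

```typescript
// Hypothetical span shape for illustration; the real query reads these
// fields from the Trace source.
type SpanKind = 'Client' | 'Server' | 'Producer' | 'Consumer';

interface Span {
  traceId: string;
  spanId: string;
  parentSpanId: string;
  serviceName: string;
  kind: SpanKind;
}

// Returns a map from "client -> server" to the number of matching span pairs.
function buildServiceEdges(spans: Span[]): Map<string, number> {
  // Index client/producer spans by (traceId, spanId) so that each
  // server/consumer span can look up its parent directly.
  const clients = new Map<string, Span>();
  for (const s of spans) {
    if (s.kind === 'Client' || s.kind === 'Producer') {
      clients.set(`${s.traceId}/${s.spanId}`, s);
    }
  }

  const edges = new Map<string, number>();
  for (const s of spans) {
    if (s.kind !== 'Server' && s.kind !== 'Consumer') continue;
    // (a) same trace ID and (b) parent span ID equals the client's span ID.
    const parent = clients.get(`${s.traceId}/${s.parentSpanId}`);
    if (!parent) continue;
    const key = `${parent.serviceName} -> ${s.serviceName}`;
    edges.set(key, (edges.get(key) ?? 0) + 1);
  }
  return edges;
}
```

Keying the index on both trace ID and span ID mirrors the two join conditions, so a span ID that happens to repeat across traces does not produce a false edge.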

To help keep this join performant, users can set a sampling level as low as 1% and up to 100%. Lower sampling levels will result in fewer rows being joined, and thus a faster service map load. Sampling is done on cityHash64(TraceId) to ensure that either a trace is included in its entirety or not included at all.
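
To show why hashing the trace ID keeps traces whole, here is a small sketch of hash-based sampling. It is an assumption-laden stand-in: it uses a 32-bit FNV-1a hash instead of ClickHouse's cityHash64, and the "hash modulo 100 below the sampling percentage" rule and the `isTraceSampled` name are hypothetical, not taken from the query.

```typescript
// Stand-in hash (32-bit FNV-1a over UTF-16 code units). The real sampling
// uses ClickHouse's cityHash64(TraceId); FNV-1a is used here only because
// it is short enough to write inline.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Keeps a trace when its hash falls into one of the lowest
// `samplingPercent` buckets out of 100 (assumed rule, for illustration).
function isTraceSampled(traceId: string, samplingPercent: number): boolean {
  return fnv1a(traceId) % 100 < samplingPercent;
}
```

Because the decision depends only on the trace ID, every span of a trace gets the same answer, so both sides of the self-join see the same set of traces.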

changeset-bot bot commented Oct 31, 2025

🦋 Changeset detected

Latest commit: 153ace9

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages
| Name | Type |
| --- | --- |
| @hyperdx/app | Minor |
| @hyperdx/api | Minor |

vercel bot commented Oct 31, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| hyperdx-v2-oss-app | Ready | Preview | Comment | Nov 4, 2025 8:11pm |

claude bot commented Oct 31, 2025

PR Review: Service Map Feature

Critical Issues

✅ No critical issues found.

Code Quality Observations

Good:

  • Comprehensive test coverage (340+ lines of tests)
  • Proper separation of concerns (presentation/container components)
  • TypeScript types well-defined
  • Error handling implemented with user notifications
  • Query optimization via sampling mechanism
  • Follows existing patterns (TanStack Query, Mantine UI, dynamic imports)

Minor suggestions (non-blocking):

  • ⚠️ DBServiceMapPage.tsx:148 - @ts-ignore on getLayout → Add proper type definition for getLayout property
  • ⚠️ useServiceMap.tsx:292 - staleTime: Infinity prevents fresh data on re-mount → Consider time-based staleness (e.g., 5 min) for better UX
  • ⚠️ ServiceMap.tsx:244-252 - Duplicate error notification (query already errors) → TanStack Query onError callback may be cleaner
  • ⚠️ useServiceMap.tsx:129 - Self-exclusion filter ServerSpans.serviceName != ClientSpans.serviceName → Consider documenting why (prevents self-loops)

Notes

  • Feature properly marked as "Beta" in UI
  • Dynamic import correctly disables SSR for React Flow
  • Sampling logic correctly bypassed for single trace view
  • TODO comment about React Flow support (line 211) is reasonable

github-actions bot commented Oct 31, 2025

E2E Test Results

All tests passed • 39 passed • 3 skipped • 304s

| Status | Count |
| --- | --- |
| ✅ Passed | 39 |
| ❌ Failed | 0 |
| ⚠️ Flaky | 0 |
| ⏭️ Skipped | 3 |

@pulpdrew pulpdrew marked this pull request as ready for review October 31, 2025 18:47
@pulpdrew pulpdrew requested review from a team and wrn14897 and removed request for a team October 31, 2025 18:52
Comment on lines +125 to +128
FROM ServerSpans
LEFT JOIN ClientSpans
ON ServerSpans.traceId = ClientSpans.traceId
AND ServerSpans.parentSpanId = ClientSpans.spanId

Member

Perf (not within the scope of this ticket): I'm concerned about the performance implications of this join, since the default schema doesn't have indexes on traceId or spanId.

Contributor Author

Yes, this join in particular is expected to not be very performant. Sampling is included in this PR to attempt to minimize the issue, but there are additional proposed steps to improve performance in the future.

navigateToTraceSearch({
dateRange,
source,
where: `${source.serviceNameExpression} = '${serviceName}' AND ${source.spanKindExpression} IN ('Server', 'Consumer')`,

Member

security: we should probably escape the serviceName here

Contributor Author

Good idea, I've wrapped these with SqlString
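
For readers following along, a minimal sketch of the kind of escaping discussed here. The `escapeSqlString` and `buildServiceFilter` helpers are hypothetical stand-ins written for this example; the actual change relies on SqlString rather than hand-rolled escaping.

```typescript
// Hypothetical stand-in for a SQL string-literal escaper: wraps the value
// in single quotes and backslash-escapes backslashes and single quotes.
// Illustration only; a maintained library is preferable in real code.
function escapeSqlString(value: string): string {
  return "'" + value.replace(/\\/g, '\\\\').replace(/'/g, "\\'") + "'";
}

// Builds the filter from the snippet above with the service name escaped
// instead of being interpolated between hand-written quotes.
function buildServiceFilter(
  serviceNameExpression: string,
  spanKindExpression: string,
  serviceName: string,
): string {
  return (
    `${serviceNameExpression} = ${escapeSqlString(serviceName)} ` +
    `AND ${spanKindExpression} IN ('Server', 'Consumer')`
  );
}
```

With this in place, a service name containing a single quote stays inside the string literal rather than terminating it early.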

format: 'JSON',
abort_signal: signal,
clickhouse_settings: {
max_execution_time: 60,

Member

Should we specify join_algorithm? Maybe 'auto' for now.

Contributor Author

Added join_algorithm: auto
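
A sketch of what the resulting settings might look like, assuming the new setting sits next to `max_execution_time` from the snippet above. The `queryOptions` name is hypothetical, and `abort_signal` is left out to keep the fragment self-contained.

```typescript
// Hypothetical options object mirroring the reviewed snippet.
const queryOptions = {
  format: 'JSON',
  clickhouse_settings: {
    max_execution_time: 60, // seconds
    // Lets ClickHouse choose the join algorithm rather than pinning one.
    join_algorithm: 'auto',
  },
};
```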

wrn14897 commented Nov 4, 2025

We can handle this issue later. I noticed that the number of requests in the service map under the 'Trace' panel shouldn’t be approximate.

image

@wrn14897 wrn14897 left a comment

Awesome feature. I’m really excited about this! 🎉

@kodiakhq kodiakhq bot merged commit 91e443f into main Nov 4, 2025
9 checks passed
@kodiakhq kodiakhq bot deleted the drew/service-map branch November 4, 2025 20:20