Skip to content

Conversation

masseyke
Copy link
Member

This builds on #134766 and #134585, adding the logic to perform random sampling. For now, it is only tested via unit tests because there is no REST API yet to create a sampling configuration, or to retrieve a sample. Those are coming up next.
This implements the basics of sampling behind a feature flag, including:

  • support for rates
  • support for conditional scripts
  • support for max_samples
  • basic statistics about sampling

It does not yet support the following, which will come in subsequent PRs:

  • TTL
  • sampling config changes
  • limits on memory usage
@masseyke masseyke added >non-issue :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v9.2.0 labels Sep 29, 2025
@masseyke masseyke marked this pull request as ready for review September 30, 2025 16:53
@masseyke masseyke requested a review from a team as a code owner September 30, 2025 16:53
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Sep 30, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

Copy link
Contributor

@seanzatzdev seanzatzdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, i left some small suggestions to improve readability

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements the core random sampling functionality for Elasticsearch behind a feature flag, focusing on the sampling logic and statistics collection without REST API endpoints.

  • Adds complete sampling implementation with support for rates, conditional scripts, max samples, and statistics
  • Implements memory-efficient storage using SoftReferences to prevent memory issues
  • Provides comprehensive test coverage for all sampling scenarios and edge cases

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
SamplingService.java Core implementation with sampling logic, statistics tracking, and data structures
SamplingServiceTests.java Comprehensive unit tests covering sampling scenarios, conditions, and limits
SamplingServiceSampleStatsTests.java Tests for statistics serialization and combination logic
SamplingServiceRawDocumentTests.java Tests for raw document serialization and mutation
NodeConstruction.java Updates service construction to include required dependencies
IngestService.java Simplifies sampling integration by removing unnecessary IndexRequest copying

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link
Contributor

@seanzatzdev seanzatzdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@masseyke masseyke merged commit aef5c27 into elastic:main Oct 1, 2025
34 checks passed
@masseyke masseyke deleted the random-sampling branch October 1, 2025 16:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >non-issue Team:Data Management Meta label for data/management team v9.2.0

3 participants