
[RFC] Context Aware Segments #13183

Open
@RS146BIJAY

Description

Abstract

This RFC proposes a new context-aware segment creation and merging strategy for OpenSearch that improves query performance by co-locating related data within the same physical segments, particularly benefiting log analytics and metrics use cases.

Motivation

In OpenSearch, a typical workload involves log analytics and metrics data, where for the majority of search queries only a subset of the data is relevant. For instance, when analyzing application logs, users are often most interested in error (4xx) and/or fault (5xx) requests, which generally constitute only a minor portion of the logs. The current segment creation (via flush) and merge policies (both Tiered and LogByteSize) do not incorporate the anticipated query context while grouping data into segments.

  • Tiered merge policy: Minimizes write amplification by always favoring merges with lower skew (size of the largest segment divided by that of the smallest), smaller segment sizes, and more reclaimable deletes. However, this often disperses related data (data that will be queried together) across multiple segments.
  • LogByteSizeMergePolicy: Preserves the initial insertion order of documents. However, the insertion order of data may differ from the order in which it will be queried. While this helps for time series data where documents are sorted by timestamp, it does not work well for out-of-order data or for queries on fields other than the timestamp.

This leads to segments containing a mix of relevant and less relevant documents. We can improve query performance by grouping relevant data together and removing this dispersion:

  • Grouping relevant documents in the same segment reduces the need to search through multiple segments, potentially boosting performance since fewer segments are involved in serving a query. Continuing the application-logs example, since users are often most interested in error (4xx) and/or fault (5xx) requests, segregating these logs from the 2xx logs prevents their dispersion across multiple segments, improving query efficiency because fewer segments need to be examined.
  • Without such grouping, each segment contains only a fraction of the data relevant to a query, forcing iteration over many less relevant documents as well. While the collectors' optimization of using the BKD tree to skip non-competitive documents significantly improves query performance, the actual number of documents iterated still depends on the arrangement of data in the segment and on how the underlying BKD tree gets constructed. Grouping relevant data together mitigates the sparsity of relevant documents in a segment and reduces the need to sift through a large amount of less relevant data to isolate the relevant portion.
  • With the introduction of segment groups and the identification of less relevant data during ingestion, less relevant data can be stored directly in cheaper remote storage (e.g., AWS S3, Google Cloud Storage, MinIO), while more relevant data is kept in hot storage (on the node's disk). This strategy conserves memory and disk space on local storage without affecting latency.
  • Storing related data together enables pre-computing aggregations for frequently executed queries (e.g., count, minimum, maximum, average, percentiles) and storing them as separate metadata. Such queries can then be served from the metadata itself, optimizing both latency and compute.

Proposal

The solution introduces context-aware segments that group related data in the same physical segments. Grouping is driven by a user-defined predicate specified as a configuration, and the predicate is evaluated during both the flush and segment-merge flows, ensuring that related data is consistently co-located in the same segments (a conceptual sketch follows the list below):

  • Flush: The DocumentsWriter will assign the same DocumentsWriterPerThread (DWPT) to documents with the same predicate evaluation, ensuring related data lands in the same segment during flush.
  • Merge: A new context-aware merge policy will ensure the predicate invariant is maintained during segment merges as well.
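
As an illustration of the flush-side routing, here is a minimal sketch that keeps one writer slot per predicate value. GroupingDocumentRouter and WriterSlot are hypothetical stand-ins introduced for this example; Lucene's actual DWPT pool is internal to its DocumentsWriter.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Conceptual sketch only; these names are illustrative, not Lucene API.
public class GroupingDocumentRouter<D> {

    // Stand-in for a per-thread indexing buffer (a DWPT in Lucene terms).
    static class WriterSlot {
        final String groupKey;
        WriterSlot(String groupKey) { this.groupKey = groupKey; }
    }

    private final Function<D, String> groupingPredicate;
    private final Map<String, WriterSlot> slotsByGroup = new ConcurrentHashMap<>();

    public GroupingDocumentRouter(Function<D, String> groupingPredicate) {
        this.groupingPredicate = groupingPredicate;
    }

    // Documents that evaluate to the same group key share a writer slot,
    // so they end up in the same segment at flush time.
    public WriterSlot route(D doc) {
        return slotsByGroup.computeIfAbsent(groupingPredicate.apply(doc), WriterSlot::new);
    }
}
```

For the request-log use case below, the predicate could simply map a document to its status code class.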

Example Use case

For application request logs, if the anticipated queries will be on status codes, the user can define a predicate as a configuration that groups data by status code (e.g., group all successful (2xx), error (4xx), and fault (5xx) status codes separately). During indexing, the same DWPT then gets assigned to log entries with the same status code group, and the context-aware merge policy ensures segments with the same status codes get merged together. Consequently, search queries like "number of faults in the last hour" or "number of errors in the last three hours" become more efficient, as they only need to process segments with 4xx or 5xx status codes, a much smaller dataset.
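
One possible shape for such a predicate, sketched in plain Java; the StatusCodeGrouping name and the bucketing by status-code class are illustrative assumptions, not the RFC's configuration syntax:

```java
import java.util.function.IntFunction;

public final class StatusCodeGrouping {
    // Hypothetical grouping predicate: buckets documents by status code
    // class so 2xx, 4xx, and 5xx logs flush into separate segment groups.
    static final IntFunction<String> GROUP_BY_STATUS_CLASS =
        statusCode -> (statusCode / 100) + "xx";

    public static void main(String[] args) {
        System.out.println(GROUP_BY_STATUS_CLASS.apply(200)); // 2xx
        System.out.println(GROUP_BY_STATUS_CLASS.apply(404)); // 4xx
        System.out.println(GROUP_BY_STATUS_CLASS.apply(503)); // 5xx
    }
}
```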

Merging segment groups

The context-aware merge policy employs a hierarchical merging strategy to merge segments that evaluate to the same group (based on the configured predicate). The strategy orchestrates merging across multiple hierarchical levels, in a way that reflects the natural hierarchy of data attributes. In this approach (a sketch follows the list):

  • During indexing, data is initially flushed at the lowest hierarchical level. For example, for time series data with day-based grouping, data could initially be flushed at the hourly level.
  • Merges into a broader hierarchical level are triggered when the total size of segments within a particular group exceeds a predefined threshold. Continuing the example, the context-aware merge policy could start merging hourly segments into daily segments once the total size of a day's segments surpasses, say, 1 GB.
  • When segments are elevated to a higher hierarchical level, they are merged in ascending order of their subordinate hierarchical attribute. In the example above, hourly segments would be merged into daily segments in ascending order of the hour.
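
The following sketch illustrates this threshold-triggered promotion, assuming a hypothetical HourlySegment record for segment metadata; the real policy would operate on Lucene's SegmentCommitInfo:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HierarchicalMergeSketch {

    // Stand-in for a segment's metadata, tagged with its group attributes.
    record HourlySegment(String day, int hour, long sizeBytes) {}

    static final long DAY_MERGE_THRESHOLD_BYTES = 1L << 30; // 1 GB, as in the example

    // Returns, per day, the hourly segments to merge into one daily segment.
    // Segments are ordered by ascending hour so the merged daily segment
    // stays nearly sorted on the subordinate attribute.
    static Map<String, List<HourlySegment>> findDailyMerges(List<HourlySegment> segments) {
        Map<String, List<HourlySegment>> byDay = new TreeMap<>();
        for (HourlySegment s : segments) {
            byDay.computeIfAbsent(s.day(), d -> new ArrayList<>()).add(s);
        }
        Map<String, List<HourlySegment>> merges = new TreeMap<>();
        byDay.forEach((day, hourly) -> {
            long total = hourly.stream().mapToLong(HourlySegment::sizeBytes).sum();
            if (total > DAY_MERGE_THRESHOLD_BYTES) { // promotion trigger
                hourly.sort(Comparator.comparingInt(HourlySegment::hour));
                merges.put(day, hourly);
            }
        });
        return merges;
    }
}
```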

This approach keeps data within segments nearly ordered, improving query performance since skipping non-competitive documents via the BKD tree works best when data is sorted. Moreover, the strategy reduces the frequency of merges, as merges to a higher level are only executed upon reaching the threshold, thereby improving indexing performance. The trade-off, however, is an increased number of initial segments.

Considerations

  • In the initial implementation, the grouping criterion will be a user-defined predicate. Future work could explore automatic criteria selection based on the workload.
  • Selecting an appropriate grouping criterion is crucial. Groups that are too small can increase the number of DWPTs required in the DocumentsWriter, regressing indexing performance, and can also lead to many small segments. Conversely, groups that are too large can regress query performance. A guardrail around the grouping criterion can prevent excessively small or large groupings.
  • OpenSearch queries with multiple clauses can be rewritten so that the clause on the grouping entity is evaluated first, yielding a more efficient query execution order (see the sketch after this list).
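
As a rough illustration of that last point, the sketch below reorders clauses so that the one on the grouping field comes first; Clause and reorder are hypothetical stand-ins, not OpenSearch's actual query-rewriting API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ClauseReorderSketch {

    // Stand-in for a parsed query clause.
    record Clause(String field, String predicate) {}

    // Moves clauses on the grouping field to the front so segment groups
    // can be pruned before other clauses are evaluated; List.sort is
    // stable, so the relative order of the remaining clauses is preserved.
    static List<Clause> reorder(List<Clause> clauses, String groupingField) {
        List<Clause> ordered = new ArrayList<>(clauses);
        ordered.sort(Comparator.comparingInt(
            (Clause c) -> c.field().equals(groupingField) ? 0 : 1));
        return ordered;
    }

    public static void main(String[] args) {
        List<Clause> query = List.of(
            new Clause("timestamp", ">= now-1h"),
            new Clause("status", "= 5xx"));
        System.out.println(reorder(query, "status")); // status clause first
    }
}
```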

POC Links

Grouping criteria: day-based grouping with a 300 MB threshold
OpenSearch changes: main...RS146BIJAY:OpenSearch:data-aware-merge-policy
Lucene changes: apache/lucene@main...RS146BIJAY:lucene:grouping-segments

Next Steps

We would appreciate any feedback on the overall idea and proposal. We are in the process of benchmarking memory usage, disk space usage, throughput, and latency with this optimization, and will compile and publish the results soon.
