
ES|QL change point #119458

Draft · jan-elastic wants to merge 11 commits into main from esql-changepoint-poc
Conversation

@jan-elastic (Contributor) commented Jan 2, 2025

This PR adds the change point aggregation to ES|QL.

Note: the current output format is a JSON string, which looks like:

change_point
------------
{"type":{"spike":{"p_value":1.1498834457157642E-54,"change_point":7}}}

This should be replaced by multiple columns or something similar, but that still needs discussion.
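For reference, a minimal sketch of how a blob with that shape could be serialized, assuming Elasticsearch's XContentBuilder; the field names simply mirror the sample above and this helper is illustrative, not the PR's actual code:

import java.io.IOException;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.xcontent.XContentBuilder;
import org.elasticsearch.xcontent.XContentFactory;

final class ChangePointJson {
    // Build {"type":{"spike":{"p_value":...,"change_point":...}}} as a BytesRef.
    static BytesRef toJson(String type, double pValue, int changePoint) throws IOException {
        XContentBuilder builder = XContentFactory.jsonBuilder();
        builder.startObject();
        builder.startObject("type");
        builder.startObject(type); // e.g. "spike"
        builder.field("p_value", pValue);
        builder.field("change_point", changePoint);
        builder.endObject();
        builder.endObject();
        builder.endObject();
        return BytesReference.bytes(builder).toBytesRef();
    }
}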

@jan-elastic marked this pull request as draft on January 2, 2025, 14:04
@jan-elastic added the >feature, Team:ML, and :ml (Machine learning) labels on Jan 2, 2025
@elasticsearchmachine (Collaborator) commented:

Hi @jan-elastic, I've created a changelog YAML for you.

@jan-elastic force-pushed the esql-changepoint-poc branch from c699481 to 6980738 on January 2, 2025, 14:05
@elasticsearchmachine (Collaborator) commented:

Hi @jan-elastic, I've created a changelog YAML for you.

@jan-elastic force-pushed the esql-changepoint-poc branch 2 times, most recently from d4fa597 to 4ace296 on January 3, 2025, 09:37
@jan-elastic force-pushed the esql-changepoint-poc branch from 4ace296 to 7e9a3a5 on January 6, 2025, 09:29
@jan-elastic requested a review from nik9000 on January 7, 2025, 09:21
@nik9000 (Member) left a comment:

I'd be happy to get this in soon and iterate around what we have. I'm super interested in how to make this interact nicely with time series data.

I think @martijnvg should probably have a look, because he's been working on time series aggregations, and this is certainly one of them.

For anyone following along - this really wants data in time series order, so it buffers it all up and puts it in that order. In TSDB we can read data in time series order straight from the disk. No buffering required.

Maybe important question - do we need the @timestamp to be passed to the agg if the data was guaranteed to be in sorted order?

builder.beginControlFlow("if (timestampsVector == null) ");
builder.addStatement("throw new IllegalStateException($S)", "expected @timestamp vector; but got a block");
builder.endControlFlow();
}
@nik9000 (Member):

First arity 2 aggregation function!

I suppose it's ok to make it timestamp-specific at this point. At some point we'll rework this when we add more correlation or something, but it's all good.

@jan-elastic (Contributor, author):

The RATE aggregation already takes two args (also with the timestamp as the 2nd arg), but that's still in snapshot mode. Unfortunately, that's just a GroupingAggregator. So I basically ported the includeTimestamps handling from that.

builder.addStatement("$T timestampsVector = timestampsBlock.asVector()", LONG_VECTOR);
builder.beginControlFlow("if (timestampsVector == null) ");
builder.addStatement("throw new IllegalStateException($S)", "expected @timestamp vector; but got a block");
builder.endControlFlow();
@nik9000 (Member):

I think this is fine. But it's worth saying out loud: Things like this usually should become null and a warning. They'd put the agg in an "I'm broken" state and produce a warning on output. Like sum should do if it overflows. It doesn't do that now, but it should.

Anyway, I think this is fine to just hard fail here - just like this - because we're going to want to build machinery around the agg to make sure that its input is a time series. Which will have the constraint that the timestamp is always single valued. And, probably, descending.
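For anyone unfamiliar with that pattern, a generic sketch of the "broken state plus warning" idea described above, using made-up names rather than the real ES|QL aggregator/warning APIs:

import java.util.function.Consumer;

// Illustrative only: once the state is "broken", keep accepting input but
// ignore it, and at output time emit null plus a warning instead of a value.
final class OverflowAwareSum {
    private long sum = 0;
    private boolean failed = false;

    void add(long value) {
        if (failed) {
            return; // already broken; ignore further input
        }
        try {
            sum = Math.addExact(sum, value);
        } catch (ArithmeticException e) {
            failed = true; // overflow puts the agg in the broken state
        }
    }

    Long result(Consumer<String> addWarning) {
        if (failed) {
            addWarning.accept("sum overflowed; result is null");
            return null;
        }
        return sum;
    }
}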

add(timestamps, values, 0);
}

void add(LongBlock timestamps, DoubleBlock values, int otherPosition) {
@nik9000 (Member):

LongVector, right?

@jan-elastic (Contributor, author):

I don't know exactly what needs to be changed for that.

The AggregatorImplementer generates something that wants this method.

@nik9000 (Member):

Got it. I'd be ok leaving a TODO on that one.

private final BigArrays bigArrays;
private int count;
private LongArray timestamps;
private DoubleArray values;
@nik9000 (Member):

Looks to me like you are doing this assuming the timestamps are in any order. That's how it works now and probably what we should merge today, but I'm hopeful we can make a system where you can rely on them being in ascending order. That's mostly the machinery we've talked about for time series systems.

I'm not sure that this code will survive forever - we may add the "in sorted order" path and never use this after that. OTOH, let's get something in that works today. If we change it then we change it.

@jan-elastic (Contributor, author):

Indeed, that's the current assumption.

I'm totally fine with this code getting trashed in the future, once we have better time series support.
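For readers following along, a minimal sketch of the buffer-as-you-go shape implied by those fields, assuming Elasticsearch's BigArrays / LongArray / DoubleArray; the names here are illustrative, not the PR's exact implementation:

import org.elasticsearch.common.util.BigArrays;
import org.elasticsearch.common.util.DoubleArray;
import org.elasticsearch.common.util.LongArray;

// Illustrative sketch: append (timestamp, value) pairs in arrival order,
// growing the backing big arrays as needed. Sorting by timestamp happens
// later, when the final result is built. Real code must also release the
// arrays, since they are Releasable.
class ChangePointBuffer {
    private final BigArrays bigArrays;
    private int count;
    private LongArray timestamps;
    private DoubleArray values;

    ChangePointBuffer(BigArrays bigArrays) {
        this.bigArrays = bigArrays;
        this.timestamps = bigArrays.newLongArray(16);
        this.values = bigArrays.newDoubleArray(16);
    }

    void append(long timestamp, double value) {
        timestamps = bigArrays.grow(timestamps, count + 1);
        values = bigArrays.grow(values, count + 1);
        timestamps.set(count, timestamp);
        values.set(count, value);
        count++;
    }
}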


// TODO: this needs to output multiple columns or a composite object, not a JSON blob.
private BytesRef getChangePoint() {
// TODO: probably reuse ES|QL sort/orderBy to get results in order
@nik9000 (Member):

This is partly what I mean by "make this a time series". ESQL would stick a sort before the aggregation operation. Or, well, iterate the documents in time series order. Which it totally can do for time series indices.

// TODO: this needs to output multiple columns or a composite object, not a JSON blob.
private BytesRef getChangePoint() {
// TODO: probably reuse ES|QL sort/orderBy to get results in order
// TODO: this copying/sorting doesn't account for memory
@nik9000 (Member):

No, it super doesn't. If we're going to keep this code forever, I'd prefer to give LongArray a sortBetween method that sorts in place. Or something like that.
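To make that concrete, a hypothetical sketch of what an in-place co-sort could look like, using get/set on the big arrays and a simple insertion sort; a real sortBetween would presumably be smarter:

import org.elasticsearch.common.util.DoubleArray;
import org.elasticsearch.common.util.LongArray;

// Illustrative only: sort timestamps[from, to) in place and keep values aligned,
// without copying everything into on-heap arrays first.
final class TimeSeriesSort {
    static void coSortByTimestamp(LongArray timestamps, DoubleArray values, long from, long to) {
        for (long i = from + 1; i < to; i++) {
            long t = timestamps.get(i);
            double v = values.get(i);
            long j = i - 1;
            while (j >= from && timestamps.get(j) > t) {
                timestamps.set(j + 1, timestamps.get(j));
                values.set(j + 1, values.get(j));
                j--;
            }
            timestamps.set(j + 1, t);
            values.set(j + 1, v);
        }
    }
}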

Labels: >feature, :ml (Machine learning), Team:ML, v9.1.0