
ES|QL change point #119458

Draft · jan-elastic wants to merge 11 commits into main from esql-changepoint-poc
Conversation

@jan-elastic (Contributor) commented Jan 2, 2025

This PR adds the change point aggregation to ES|QL.

Note: the current output format is a JSON string, which looks like:

change_point
------------
{"type":{"spike":{"p_value":1.1498834457157642E-54,"change_point":7}}}

This should be replaced by multiple columns or something similar, but that still needs discussion.
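For reference, a minimal sketch of how a blob with that shape could be serialized, assuming Elasticsearch's XContentBuilder; the field names simply mirror the sample above and this helper is illustrative, not the PR's actual code:

import java.io.IOException;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.xcontent.XContentBuilder;
import org.elasticsearch.xcontent.XContentFactory;

final class ChangePointJson {
    // Build {"type":{"spike":{"p_value":...,"change_point":...}}} as a BytesRef.
    static BytesRef toJson(String type, double pValue, int changePoint) throws IOException {
        XContentBuilder builder = XContentFactory.jsonBuilder();
        builder.startObject();
        builder.startObject("type");
        builder.startObject(type); // e.g. "spike"
        builder.field("p_value", pValue);
        builder.field("change_point", changePoint);
        builder.endObject();
        builder.endObject();
        builder.endObject();
        return BytesReference.bytes(builder).toBytesRef();
    }
}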

@jan-elastic marked this pull request as draft on January 2, 2025, 14:04
@jan-elastic added the >feature, Team:ML, and :ml (Machine learning) labels on Jan 2, 2025
@elasticsearchmachine (Collaborator) commented:

Hi @jan-elastic, I've created a changelog YAML for you.

@jan-elastic force-pushed the esql-changepoint-poc branch from c699481 to 6980738 on January 2, 2025, 14:05
@elasticsearchmachine (Collaborator) commented:

Hi @jan-elastic, I've created a changelog YAML for you.

@jan-elastic force-pushed the esql-changepoint-poc branch 2 times, most recently from d4fa597 to 4ace296 on January 3, 2025, 09:37
@jan-elastic force-pushed the esql-changepoint-poc branch from 4ace296 to 7e9a3a5 on January 6, 2025, 09:29
@jan-elastic requested a review from nik9000 on January 7, 2025, 09:21
@nik9000 (Member) left a comment:

I'd be happy to get this in soon and iterate around what we have. I'm super interested in how to make this interact nicely with time series data.

I think @martijnvg should probably have a look, because he's been working on time series aggregations, and this is certainly one of them.

For anyone following along - this really wants data in time series order, so it buffers it all up and puts it in that order. In TSDB we can read data in time series order straight from the disk. No buffering required.

Maybe important question - do we need the @timestamp to be passed to the agg if the data was guaranteed to be in sorted order?

builder.beginControlFlow("if (timestampsVector == null) ");
builder.addStatement("throw new IllegalStateException($S)", "expected @timestamp vector; but got a block");
builder.endControlFlow();
}
@nik9000 (Member):

First arity 2 aggregation function!

I suppose it's ok to make it timestamp-specific at this point. At some point we'll rework this when we add more correlation or something, but it's all good.

@jan-elastic (Contributor, author):

The RATE aggregation already takes two args (also with the timestamp as the 2nd arg), but that's still in snapshot mode. Unfortunately, that's just a GroupingAggregator. So I basically ported the includeTimestamps handling from that.

builder.addStatement("$T timestampsVector = timestampsBlock.asVector()", LONG_VECTOR);
builder.beginControlFlow("if (timestampsVector == null) ");
builder.addStatement("throw new IllegalStateException($S)", "expected @timestamp vector; but got a block");
builder.endControlFlow();
@nik9000 (Member):

I think this is fine. But it's worth saying out loud: Things like this usually should become null and a warning. They'd put the agg in an "I'm broken" state and produce a warning on output. Like sum should do if it overflows. It doesn't do that now, but it should.

Anyway, I think this is fine to just hard fail here - just like this - because we're going to want to build machinery around the agg to make sure that its input is a time series. Which will have the constraint that the timestamp is always single valued. And, probably, descending.
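For anyone unfamiliar with that pattern, a generic sketch of the "broken state plus warning" idea described above, using made-up names rather than the real ES|QL aggregator/warning APIs:

import java.util.function.Consumer;

// Illustrative only: once the state is "broken", keep accepting input but
// ignore it, and at output time emit null plus a warning instead of a value.
final class OverflowAwareSum {
    private long sum = 0;
    private boolean failed = false;

    void add(long value) {
        if (failed) {
            return; // already broken; ignore further input
        }
        try {
            sum = Math.addExact(sum, value);
        } catch (ArithmeticException e) {
            failed = true; // overflow puts the agg in the broken state
        }
    }

    Long result(Consumer<String> addWarning) {
        if (failed) {
            addWarning.accept("sum overflowed; result is null");
            return null;
        }
        return sum;
    }
}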

add(timestamps, values, 0);
}

void add(LongBlock timestamps, DoubleBlock values, int otherPosition) {
@nik9000 (Member):

LongVector, right?

@jan-elastic (Contributor, author):

I don't know exactly what needs to be changed for that.

The AggregatorImplementer generates something that wants this method.

@nik9000 (Member):

Got it. I'd be ok leaving a TODO on that one.

private final BigArrays bigArrays;
private int count;
private LongArray timestamps;
private DoubleArray values;
@nik9000 (Member):

Looks to me like you are doing this assuming the timestamps are in any order. That's how it works now and probably what we should merge today, but I'm hopeful we can make a system where you can rely on them being in ascending order. That's mostly the machinery we've talked about for time series systems.

I'm not sure that this code will survive forever - we may add the "in sorted order" path and never use this after that. OTOH, let's get something in that works today. If we change it then we change it.

@jan-elastic (Contributor, author):

Indeed, that's the current assumption.

I'm totally fine with this code getting trashed in the future, once we have better time series support.
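For readers following along, a minimal sketch of the buffer-as-you-go shape implied by those fields, assuming Elasticsearch's BigArrays / LongArray / DoubleArray; the names here are illustrative, not the PR's exact implementation:

import org.elasticsearch.common.util.BigArrays;
import org.elasticsearch.common.util.DoubleArray;
import org.elasticsearch.common.util.LongArray;

// Illustrative sketch: append (timestamp, value) pairs in arrival order,
// growing the backing big arrays as needed. Sorting by timestamp happens
// later, when the final result is built. Real code must also release the
// arrays, since they are Releasable.
class ChangePointBuffer {
    private final BigArrays bigArrays;
    private int count;
    private LongArray timestamps;
    private DoubleArray values;

    ChangePointBuffer(BigArrays bigArrays) {
        this.bigArrays = bigArrays;
        this.timestamps = bigArrays.newLongArray(16);
        this.values = bigArrays.newDoubleArray(16);
    }

    void append(long timestamp, double value) {
        timestamps = bigArrays.grow(timestamps, count + 1);
        values = bigArrays.grow(values, count + 1);
        timestamps.set(count, timestamp);
        values.set(count, value);
        count++;
    }
}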


// TODO: this needs to output multiple columns or a composite object, not a JSON blob.
private BytesRef getChangePoint() {
// TODO: probably reuse ES|QL sort/orderBy to get results in order
@nik9000 (Member):

This is partly what I mean by "make this a time series". ESQL would stick a sort before the aggregation operation. Or, well, iterate the documents in time series order. Which it totally can do for time series indices.

// TODO: this needs to output multiple columns or a composite object, not a JSON blob.
private BytesRef getChangePoint() {
// TODO: probably reuse ES|QL sort/orderBy to get results in order
// TODO: this copying/sorting doesn't account for memory
@nik9000 (Member):

No, it super doesn't. If we're going to keep this code forever, I'd prefer to give LongArray a sortBetween method that sorts in place. Or something like that.
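To make that concrete, a hypothetical sketch of what an in-place co-sort could look like, using get/set on the big arrays and a simple insertion sort; a real sortBetween would presumably be smarter:

import org.elasticsearch.common.util.DoubleArray;
import org.elasticsearch.common.util.LongArray;

// Illustrative only: sort timestamps[from, to) in place and keep values aligned,
// without copying everything into on-heap arrays first.
final class TimeSeriesSort {
    static void coSortByTimestamp(LongArray timestamps, DoubleArray values, long from, long to) {
        for (long i = from + 1; i < to; i++) {
            long t = timestamps.get(i);
            double v = values.get(i);
            long j = i - 1;
            while (j >= from && timestamps.get(j) > t) {
                timestamps.set(j + 1, timestamps.get(j));
                values.set(j + 1, values.get(j));
                j--;
            }
            timestamps.set(j + 1, t);
            values.set(j + 1, v);
        }
    }
}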

Labels: >feature, :ml (Machine learning), Team:ML, v9.1.0