Reorganise aggregation by deciding how initial document is fetched #3102

chilagrow · 2023-07-24T10:25:29Z

Description

Closes #2423.

This PR demonstrates the intended design for how the aggregation pipeline handles special stages such as $collStats.

$collStats is a special case in aggregation: the stage itself needs to decide whether to query the DB, and when it does, it uses a query specific to its needs.

In this PR, I intend to remove the special handling for $collStats by allowing a stage to decide how its initial document iterator is fetched. This would also be useful for stages that haven't been added yet, such as $documents, which does not query the DB at all.

Do not merge: this PR only demonstrates the design. Some tests are failing, and that's okay.

Readiness checklist

I added/updated unit tests (and they pass).
I added/updated integration/compatibility tests (and they pass).
I added/updated comments and checked rendering.
I made spot refactorings.
I updated user documentation.
I ran task all, and it passed.
I ensured that PR title is good enough for the changelog.
(for maintainers only) I set Reviewers (@FerretDB/core), Labels, Project and project's Sprint fields.
I marked all done items in this checklist.

codecov · 2023-07-24T10:28:12Z

Codecov Report

Merging #3102 (3f38c30) into main (a71936d) will decrease coverage by 49.29%.
The diff coverage is 0.00%.

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #3102       +/-   ##
===========================================
- Coverage   75.26%   25.98%   -49.29%     
===========================================
  Files         400      400               
  Lines       22386    22436       +50     
===========================================
- Hits        16849     5829    -11020     
- Misses       4563    16050    +11487     
+ Partials      974      557      -417

Files Changed	Coverage Δ
.../handlers/common/aggregations/stages/add_fields.go	`0.00% <0.00%> (-75.00%)`	⬇️
...l/handlers/common/aggregations/stages/collstats.go	`0.00% <0.00%> (-85.25%)`	⬇️
...ernal/handlers/common/aggregations/stages/count.go	`0.00% <0.00%> (-100.00%)`	⬇️
...l/handlers/common/aggregations/stages/documents.go	`0.00% <0.00%> (ø)`
...ernal/handlers/common/aggregations/stages/group.go	`0.00% <0.00%> (-82.30%)`	⬇️
...ernal/handlers/common/aggregations/stages/limit.go	`0.00% <0.00%> (-78.58%)`	⬇️
...ernal/handlers/common/aggregations/stages/match.go	`0.00% <0.00%> (-100.00%)`	⬇️
...nal/handlers/common/aggregations/stages/project.go	`0.00% <0.00%> (-100.00%)`	⬇️
...nternal/handlers/common/aggregations/stages/set.go	`0.00% <0.00%> (-75.00%)`	⬇️
...ternal/handlers/common/aggregations/stages/skip.go	`0.00% <0.00%> (-78.58%)`	⬇️
... and 6 more

... and 209 files with indirect coverage changes

Flag	Coverage Δ
hana	`?`
integration	`5.24% <0.00%> (-66.46%)`	⬇️
mongodb	`5.24% <0.00%> (-0.02%)`	⬇️
pg	`∅ <ø> (∅)`
shard-1	`5.24% <0.00%> (-51.50%)`	⬇️
shard-2	`∅ <ø> (∅)`
shard-3	`∅ <ø> (∅)`
sqlite	`∅ <ø> (∅)`
unit	`24.42% <0.00%> (-0.09%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

rumyantseva · 2023-07-24T13:16:08Z

internal/handlers/common/aggregations/aggregations.go

+	// The returned document iterator may be originated from querying database or
+	// from in-memory value iterator.
+	// This allows first stage of the pipeline to decide how to create initial document iterator.
+	FetchDocuments(ctx context.Context, closer *iterator.MultiCloser) (types.DocumentsIterator, error)


If this function is only used for the first stage, do we need it in the stage interface at all? If something doesn't describe a typical stage, it shouldn't be a part of it.

Maybe it's part of another interface that is called something like FirstStage, or not an interface at all...

Or maybe it's part of a "pipeline" interface. E.g. we first "feed" the pipeline with some document iterator, and then apply stages to it. How these documents are fetched from the DB is then the pipeline's responsibility/implementation.

After some thinking... Actually, the current approach looks good to me too. I was confused by the phrase "it is only used if this is the first stage of the pipeline", but actually any stage could be the first stage, so it's correct that this method exists for each stage.

On the other hand, the implementation of this method is exactly the same in most of the cases, so that's another sign to me that maybe this function shouldn't be here...

This function seems reasonable if some stages fetch documents differently when they are the first stage, although I cannot name any such stage.

Valid point about the implementation of FetchDocuments being mostly the same. Among upcoming stages, the ones with a different implementation are $documents, $listLocalSessions and $listSessions, plus the existing $collStats. It would be nice to have a base implementation and override it for those special cases.

It cannot be part of the pipeline because that would expose stage-specific details to the pipeline. Let me try a few other things to see if it's possible to make it not an interface. Yes, thanks, we should rename this to something better 🤗

@chilagrow yeah, I agree with your point. My thought was to somehow "identify" the type of pipeline based on the first stage and then fetch documents based on it, but it doesn't seem better (because then we need to come up with a "pipeline" type which will depend on the first stage, so the implementation doesn't look clear and simple).

So, it looks to me that the current function "How documents are fetched if this stage is the first in the pipeline" is a better solution.

I tried your suggestion in another PR, #3110.
That approach exposes $collStats-related logic at the aggregation package level, but it removes the first-stage logic from the stages. Let me know what you think 🤗
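
The "base implementation with overrides" idea from this thread could be sketched with struct embedding. This is only an illustration, not the code from this PR: Document, Iterator, Querier, baseStage, limitStage, documentsStage, memQuerier and run are hypothetical stand-ins, and the real signatures also take an *iterator.MultiCloser, which is omitted here.

```go
package main

import (
	"context"
	"fmt"
)

// Document and Iterator are simplified stand-ins for types.Document and
// types.DocumentsIterator (assumptions made for this sketch).
type Document map[string]any

type Iterator []Document

// Querier is a hypothetical handle that returns every document of a collection.
type Querier interface {
	QueryAll(ctx context.Context) (Iterator, error)
}

// Stage mirrors the shape discussed above: every stage can say how documents
// are fetched if it happens to be the first stage of the pipeline.
type Stage interface {
	FetchDocuments(ctx context.Context) (Iterator, error)
	Process(ctx context.Context, iter Iterator) (Iterator, error)
}

// baseStage holds the default FetchDocuments shared by most stages.
type baseStage struct{ q Querier }

func (b baseStage) FetchDocuments(ctx context.Context) (Iterator, error) {
	return b.q.QueryAll(ctx)
}

// limitStage embeds baseStage and inherits the default fetching.
type limitStage struct {
	baseStage
	n int
}

func (l limitStage) Process(_ context.Context, iter Iterator) (Iterator, error) {
	if len(iter) > l.n {
		return iter[:l.n], nil
	}
	return iter, nil
}

// documentsStage overrides FetchDocuments and never touches the database.
type documentsStage struct {
	baseStage
	docs Iterator
}

func (d documentsStage) FetchDocuments(context.Context) (Iterator, error) {
	return d.docs, nil
}

func (d documentsStage) Process(_ context.Context, iter Iterator) (Iterator, error) {
	return iter, nil
}

// memQuerier is an in-memory Querier used only for this demonstration.
type memQuerier struct{ docs Iterator }

func (m memQuerier) QueryAll(context.Context) (Iterator, error) { return m.docs, nil }

// run lets the first stage fetch the documents, then applies every stage in order.
func run(ctx context.Context, stages []Stage) (Iterator, error) {
	if len(stages) == 0 {
		return nil, nil
	}
	iter, err := stages[0].FetchDocuments(ctx)
	if err != nil {
		return nil, err
	}
	for _, s := range stages {
		if iter, err = s.Process(ctx, iter); err != nil {
			return nil, err
		}
	}
	return iter, nil
}

// demo runs one pipeline that starts from the "database" and one that starts
// from in-memory documents, and returns the resulting document counts.
func demo() (fromDB, fromMemory int, err error) {
	ctx := context.Background()
	base := baseStage{q: memQuerier{docs: Iterator{{"_id": 1}, {"_id": 2}, {"_id": 3}}}}

	a, err := run(ctx, []Stage{limitStage{baseStage: base, n: 2}})
	if err != nil {
		return 0, 0, err
	}
	b, err := run(ctx, []Stage{
		documentsStage{docs: Iterator{{"x": 1}}},
		limitStage{baseStage: base, n: 5},
	})
	if err != nil {
		return 0, 0, err
	}
	return len(a), len(b), nil
}

func main() {
	fmt.Println(demo())
}
```

The trade-off raised above still applies: every stage carries FetchDocuments even though only the first stage's implementation is ever called.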

rumyantseva · 2023-07-25T06:05:40Z

internal/handlers/common/aggregations/aggregations.go

 	// Process applies an aggregate stage on documents from iterator.
 	Process(ctx context.Context, iter types.DocumentsIterator, closer *iterator.MultiCloser) (types.DocumentsIterator, error)
 }
+
+// AggregateQuery is a common interface for fetching from database.
+type AggregateQuery interface {


So, this interface will grow when we have listSessions and other stages that need different types of documents. I think it's fine. This way, we have a unified interface to work with different backends.

The approach looks similar to the Backend, Collection, and other interfaces that we have in the new handler.
Maybe we can have similar naming too? Then this could be just Aggregate/Aggregation. But I think the naming is not so critical.

rumyantseva · 2023-07-25T06:07:09Z

internal/handlers/common/aggregations/aggregations.go

+}
+
+// CollStats describes collection statistics retrieved from the database.
+type CollStats struct {


I think if we want naming similar to the handler's, this should be called CollStatsResult.

rumyantseva

The approach looks good to me, and I don't see a better interface/approach/implementation, so I'm ready to approve.

Maybe the only question is how similar to internal/backends this new interface should be. If we want more similarity, maybe some naming/structuring needs to be changed. At some point, it looks reasonable to make them similar, because the idea is quite the same - we have a common implementation of aggregation pipelines and all the backends should implement a particular interface to support it.

noisersup

This approach makes sense to me!

AlekSi · 2023-07-26T04:11:50Z

internal/handlers/common/aggregations/aggregations.go

+	// FirstStage fetches document iterator, it is only used if this is
+	// the first stage of the pipeline.
+	// The returned document iterator may be originated from querying database or
+	// from in-memory value iterator.
+	// This allows first stage of the pipeline to decide how to create initial document iterator.
+	FirstStage(ctx context.Context, closer *iterator.MultiCloser) (types.DocumentsIterator, error)


I don't think we need that method at all. First of all, it looks suspiciously like a constructor; it is almost as if the io.Reader interface had NewReader method. Second, the implication there is that Stage has access to the data source, but then it is not clear why Process can't use it.

One option could be to fit everything into the existing Stage interface with only the Process method. $collStats implementation, for example, could ignore incoming iter and use the data source instead. Then the code that arranges the pipeline (a sequence of stages in this case) could check that the order is correct.

Another option is to say that pipelines are not just a sequence of stages but a combination of zero or one Producer stage and zero or more Processor stages. That way, $collStats would implement only the Producer interface, $count would implement only the Processor interface, and $facet would implement both. Then the compiler itself could check the order.

Both are definitely better options. The first option is simpler; the second seems efficient 👍

I think using ProducerStage and ProcessorStage works quite well, thanks for the suggestion. Updated implementation a bit 🤗
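
A minimal sketch of the producer/processor split suggested in this thread, assuming simplified types. Document, Iterator, Pipeline, staticProducer, countStage and demoCount are hypothetical names for illustration; the interfaces in the PR also take an *iterator.MultiCloser and return types.DocumentsIterator.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Document and Iterator are simplified stand-ins for the real document types.
type Document map[string]any

type Iterator []Document

// ProducerStage creates the initial documents of a pipeline.
type ProducerStage interface {
	Produce(ctx context.Context) (Iterator, error)
}

// ProcessorStage transforms documents it receives from the previous stage.
type ProcessorStage interface {
	Process(ctx context.Context, iter Iterator) (Iterator, error)
}

// Pipeline encodes the ordering in its field types: only a ProducerStage can
// sit at the head, and only ProcessorStages can follow it.
type Pipeline struct {
	Producer   ProducerStage
	Processors []ProcessorStage
}

// Run produces the initial documents and then applies each processor in order.
// In this sketch a missing producer is an error; a real handler could instead
// fall back to a producer that returns all documents.
func (p Pipeline) Run(ctx context.Context) (Iterator, error) {
	if p.Producer == nil {
		return nil, errors.New("pipeline has no producer stage")
	}
	iter, err := p.Producer.Produce(ctx)
	if err != nil {
		return nil, err
	}
	for _, s := range p.Processors {
		if iter, err = s.Process(ctx, iter); err != nil {
			return nil, err
		}
	}
	return iter, nil
}

// staticProducer returns fixed documents without querying anything.
type staticProducer struct{ docs Iterator }

func (s staticProducer) Produce(context.Context) (Iterator, error) { return s.docs, nil }

// countStage is a processor that replaces its input with a single count document.
type countStage struct{}

func (countStage) Process(_ context.Context, iter Iterator) (Iterator, error) {
	return Iterator{{"count": len(iter)}}, nil
}

// Compile-time checks: each type implements exactly the role it is meant to.
// Writing Pipeline{Producer: countStage{}} would fail to compile.
var (
	_ ProducerStage  = staticProducer{}
	_ ProcessorStage = countStage{}
)

// demoCount runs a three-document producer through countStage.
func demoCount() (int, error) {
	p := Pipeline{
		Producer:   staticProducer{docs: Iterator{{"a": 1}, {"a": 2}, {"a": 3}}},
		Processors: []ProcessorStage{countStage{}},
	}
	res, err := p.Run(context.Background())
	if err != nil {
		return 0, err
	}
	return res[0]["count"].(int), nil
}

func main() {
	fmt.Println(demoCount())
}
```

A stage that can play both roles would simply implement both interfaces, which is how the $facet case mentioned above would fit.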

AlekSi · 2023-07-26T04:13:01Z

internal/handlers/common/aggregations/aggregations.go

 	// Process applies an aggregate stage on documents from iterator.
 	Process(ctx context.Context, iter types.DocumentsIterator, closer *iterator.MultiCloser) (types.DocumentsIterator, error)
 }
+
+// Aggregation is a common interface for fetching from database.


Independently of everything else, that's a bad name given this aggregation does not aggregate anything – it returns original data to be processed and aggregated, and that's pretty much the opposite.

AlekSi · 2023-07-28T08:37:03Z

internal/handlers/common/aggregations/aggregations.go

+
+// ProducerStage is a common interface aggregation stages produce documentiterator.
+type ProducerStage interface {
+	// Produce applies an aggregate stage on documents from iterator.


That's not correct

Indeed thanks, it should return a document iterator.

AlekSi · 2023-07-28T08:47:24Z

internal/handlers/common/aggregations/aggregations.go

+// ProducerStageDataSource is a common interface for fetching from database.
+type ProducerStageDataSource interface {
+	// CollStats fetches collection statistics from the database.
+	CollStats(ctx context.Context, closer *iterator.MultiCloser) (*CollStatsResult, error)


That should be ProducerStage that returns an iterator with one document

So $collStats is a ProducerStage that wants to fetch data from the DB, and this interface allows access to the DB. Maybe I misunderstood, and ProducerStage should be implemented in the pg package? 🤔

You are mixing up stages with what this code previously called Aggregation and now calls a data source.

Data source allows one to run queries.

ProducerStage produces documents. It is not important for the interface how those documents are produced – are they fetched from the database or just made up.

There are two implementations of the ProducerStage. Both would use a data source. The first one would return all documents from the collection. The second one would produce collection stats. (in the future, $match stage could implement both ProcessorStage and ProducerStage for pushdown).

So hopefully I get this. We want to keep ProducerStageDataSource (or DataSource) with its ProducerStage; there is no need to expose it here.

And I'm missing a ProducerStage that returns all documents 😅

There is quite a bit more cleanup I can do, but overall I like it. 🤗
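
The relationship described in this thread, where a data source runs queries and two producer stages are built on top of it, might look roughly like the sketch below. It is an assumption-laden illustration: the Query method, the CountObjects field, the "count" key, and the allDocuments, collStats and memSource types are all hypothetical, and the closer argument is left out.

```go
package main

import (
	"context"
	"fmt"
)

// Document and Iterator are simplified stand-ins for the real document types.
type Document map[string]any

type Iterator []Document

// CollStatsResult carries collection statistics; the field is illustrative only.
type CollStatsResult struct {
	CountObjects int64
}

// DataSource allows one to run queries. Query is a hypothetical addition that
// returns all documents of the collection.
type DataSource interface {
	Query(ctx context.Context) (Iterator, error)
	CollStats(ctx context.Context) (*CollStatsResult, error)
}

// ProducerStage produces documents; how they are obtained is not its concern.
type ProducerStage interface {
	Produce(ctx context.Context) (Iterator, error)
}

// allDocuments is the producer that returns every document from the collection.
type allDocuments struct{ ds DataSource }

func (a allDocuments) Produce(ctx context.Context) (Iterator, error) {
	return a.ds.Query(ctx)
}

// collStats is the producer that returns an iterator with exactly one document.
type collStats struct{ ds DataSource }

func (c collStats) Produce(ctx context.Context) (Iterator, error) {
	st, err := c.ds.CollStats(ctx)
	if err != nil {
		return nil, err
	}
	return Iterator{{"count": st.CountObjects}}, nil
}

// memSource is an in-memory DataSource used only for this demonstration.
type memSource struct{ docs Iterator }

func (m memSource) Query(context.Context) (Iterator, error) { return m.docs, nil }

func (m memSource) CollStats(context.Context) (*CollStatsResult, error) {
	return &CollStatsResult{CountObjects: int64(len(m.docs))}, nil
}

// demo runs both producers against the same data source and reports how many
// documents each produced, plus the count stored in the stats document.
func demo() (all, stats int, count int64, err error) {
	ctx := context.Background()
	ds := memSource{docs: Iterator{{"_id": 1}, {"_id": 2}}}

	docs, err := allDocuments{ds: ds}.Produce(ctx)
	if err != nil {
		return 0, 0, 0, err
	}
	st, err := collStats{ds: ds}.Produce(ctx)
	if err != nil {
		return 0, 0, 0, err
	}
	return len(docs), len(st), st[0]["count"].(int64), nil
}

func main() {
	fmt.Println(demo())
}
```

Because the data source is a field of each producer, the aggregations package only needs to know about ProducerStage, which matches the point that the data source does not have to be exposed there.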

AlekSi · 2023-07-28T10:28:09Z

internal/handlers/common/aggregations/stages/collstats.go

 //
-// Processing consists of modification of the input document, so it contains all the necessary fields
+// Producing consists of modification of the input document, so it contains all the necessary fields


That's not correct

Thanks, updated 🙏

AlekSi · 2023-07-28T10:28:14Z

internal/handlers/common/aggregations/stages/collstats.go

@@ -29,12 +31,29 @@ import (
 	"github.com/FerretDB/FerretDB/internal/util/must"
 )

+// CollStatsDataSource fetches collection statistics from the database.
+type CollStatsDataSource interface {


Let's call it DataSource. It has only CollStats for now, it will have Query or something soon and will be replaced with a backend interface later

Sounds good, updated this name.

rumyantseva

I like this new approach with two interfaces! We no longer duplicate any code, and the idea's easy to understand. I have some minor questions about the usage (I asked one question), but it's not blocking us, we can clarify it later.

rumyantseva · 2023-07-31T06:10:38Z

internal/handlers/common/aggregations/aggregations.go

 	// Process applies an aggregate stage on documents from iterator.
 	Process(ctx context.Context, iter types.DocumentsIterator, closer *iterator.MultiCloser) (types.DocumentsIterator, error)
 }
+
+// ProducerStage is a common interface for aggregation stages that produce document iterator.
+type ProducerStage interface {


I like this new naming for the interfaces!

rumyantseva · 2023-07-31T06:15:10Z

internal/handlers/pg/msg_aggregate.go

-		iter, err = processStagesStats(ctx, closer, &stagesStatsParams{
-			dbPool, db, collection, statistics, collStatsDocuments,
-		})
+		// handle case where there are multiple producer stages


Is it a possible situation - multiple producer stages?

For example, having two consecutive $collStats stages should return an error.

That's correct, but that's not what this code does, is it?

I think we should return an error and exit early waaay up when we iterate over aggregationStages

AlekSi · 2023-07-31T07:57:11Z

internal/handlers/pg/msg_aggregate.go

@@ -163,10 +162,27 @@ func (h *Handler) MsgAggregate(ctx context.Context, msg *wire.OpMsg) (*wire.OpMs
 	}

 	aggregationStages := must.NotFail(iterator.ConsumeValues(pipeline.Iterator()))
-	stagesDocuments := make([]aggregations.Stage, 0, len(aggregationStages))
-	collStatsDocuments := make([]aggregations.Stage, 0, len(aggregationStages))
+	producerStages := make([]aggregations.ProducerStage, 0, len(aggregationStages))


var producerStage aggregations.ProducerStage

As there could be only zero (nil) or one (non-nil value)

AlekSi · 2023-07-31T07:57:56Z

internal/handlers/pg/msg_aggregate.go

@@ -176,28 +192,27 @@ func (h *Handler) MsgAggregate(ctx context.Context, msg *wire.OpMsg) (*wire.OpMs
 			)
 		}

-		var s aggregations.Stage
+		if _, ok := stages.ProducerStages[d.Command()]; ok {
+			var s aggregations.ProducerStage


if producerStage != nil { return nil, errors.New("oops, already have one") }
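
Put together, the two suggestions above (a single producerStage variable and an early error on a second producer) could be sketched as follows. The producerStages registry, splitStages and errMultipleProducers are hypothetical names for this illustration, and stages are represented by their command names only.

```go
package main

import (
	"errors"
	"fmt"
)

// producerStages is a hypothetical registry of stage names that produce documents.
var producerStages = map[string]struct{}{
	"$collStats": {},
}

// errMultipleProducers is returned when a pipeline names more than one producer.
var errMultipleProducers = errors.New("pipeline has more than one producer stage")

// splitStages walks the stage names once, keeping at most one producer and
// failing as soon as a second producer is seen.
func splitStages(commands []string) (producer string, processors []string, err error) {
	for _, c := range commands {
		if _, ok := producerStages[c]; !ok {
			processors = append(processors, c)
			continue
		}
		if producer != "" {
			return "", nil, fmt.Errorf("%w: %s", errMultipleProducers, c)
		}
		producer = c
	}
	return producer, processors, nil
}

func main() {
	fmt.Println(splitStages([]string{"$collStats", "$match"}))
	fmt.Println(splitStages([]string{"$collStats", "$collStats"}))
}
```

An empty producer name here plays the role of the nil producerStage: the pipeline has no explicit producer and the caller would fall back to fetching all documents.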

chilagrow · 2023-07-31T10:06:53Z

Pausing work on this PR for now

noisersup

Leaving a comment as this PR is on hold

mergify · 2023-10-17T07:50:43Z

@chilagrow this pull request has merge conflicts.

chilagrow · 2023-10-30T02:40:46Z

Close for now

init

e79d81d

chilagrow added code/chore do not merge labels Jul 24, 2023

chilagrow self-assigned this Jul 24, 2023

chilagrow changed the title ~~Reorganise aggregation~~ Reorganise aggregation by each stage deciding how initial document is fetched Jul 24, 2023

update comment

90884a6

chilagrow marked this pull request as ready for review July 24, 2023 10:45

chilagrow requested a review from a team as a code owner July 24, 2023 10:45

chilagrow requested review from AlekSi, rumyantseva, a team, quasilyte and noisersup July 24, 2023 10:45

rumyantseva reviewed Jul 24, 2023

rumyantseva reviewed Jul 25, 2023

rumyantseva previously approved these changes Jul 25, 2023

renaming

6f81f95

chilagrow dismissed rumyantseva’s stale review via 6f81f95 July 25, 2023 08:30

chilagrow added 2 commits July 25, 2023 18:05

update comment

56f3981

try moving query logic

8c94e02

noisersup previously approved these changes Jul 25, 2023

chilagrow mentioned this pull request Jul 26, 2023

Reorganise aggregation by initialising pipeline at handler level #3110

Closed

9 tasks

AlekSi reviewed Jul 26, 2023

use producer stage and processor stage

9d0912d

chilagrow dismissed noisersup’s stale review via 9d0912d July 28, 2023 08:10

just some tidy up

b9a2cf8

chilagrow requested review from noisersup and rumyantseva July 28, 2023 08:28

chilagrow requested a review from AlekSi July 28, 2023 08:28

chilagrow changed the title ~~Reorganise aggregation by each stage deciding how initial document is fetched~~ Reorganise aggregation by deciding how initial document is fetched Jul 28, 2023

AlekSi reviewed Jul 28, 2023

some fixing

8633cc2

AlekSi removed the request for review from quasilyte July 28, 2023 09:26

datasource is not part of aggregations interface, add documents producer

ecf3ada

AlekSi reviewed Jul 28, 2023

chilagrow added 5 commits July 31, 2023 10:32

Merge branch 'main' into issue-2423-analysis-design

9ccc0d7

renaming and updating comment

98602fd

renaming

d04f301

update comment

86e768b

use DataSource interface for fetching documents

310100e

chilagrow requested a review from AlekSi July 31, 2023 02:14

chilagrow mentioned this pull request Jul 31, 2023

Move $collStats logic to its stage implementation #2423

Closed

5 tasks

rumyantseva previously approved these changes Jul 31, 2023

AlekSi reviewed Jul 31, 2023

chilagrow marked this pull request as draft July 31, 2023 10:06

merge conflict

3f38c30

chilagrow dismissed rumyantseva’s stale review via 3f38c30 August 28, 2023 08:54

noisersup reviewed Sep 28, 2023

mergify bot added the conflict label Oct 17, 2023

chilagrow closed this Oct 30, 2023

mergify bot removed the conflict label Oct 30, 2023
