
Datagen fixes #2599

Merged: 6 commits merged into main from datagen-fixes on Sep 28, 2024
Conversation

Fixes #2528.

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
Start from 0 for array values and increment on every generated value.

Fixes #2545.

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
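For illustration, here is a minimal sketch of the intended behavior, not the actual datagen code (`SequenceField` and its methods are made up for this example): a sequential generator for array elements keeps a single counter that starts at 0 and advances on every generated value, rather than restarting for each record.

```rust
/// Hypothetical stand-in for datagen's sequential value generator.
struct SequenceField {
    next: u64,
}

impl SequenceField {
    fn new() -> Self {
        Self { next: 0 }
    }

    /// Returns the current counter value and advances it by one.
    fn generate(&mut self) -> u64 {
        let v = self.next;
        self.next += 1;
        v
    }
}

fn main() {
    let mut field = SequenceField::new();
    // Two records with a 3-element array each: the counter continues across
    // records ([0, 1, 2] then [3, 4, 5]) instead of resetting to 0.
    for record in 0..2 {
        let array: Vec<u64> = (0..3).map(|_| field.generate()).collect();
        println!("record {record}: {array:?}");
    }
}
```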
If datagen has to create very large records that take a long time
to construct, creating a fixed number of records determined by
min(rate, 10_000) before inserting them as a batch is problematic,
as reported in #2558.

The fix adds a timeout to batch creation: if building a batch takes
longer than one second, we insert what we have so far and start a
new batch.

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
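A minimal sketch of the flush-on-timeout idea, assuming a one-second budget per batch; `generate_record` and `push_batch` are placeholders rather than the real datagen API:

```rust
use std::time::{Duration, Instant};

const MAX_BATCH_FILL_TIME: Duration = Duration::from_secs(1);

/// Placeholder for the (potentially slow) record constructor.
fn generate_record() -> String {
    "record".to_string()
}

/// Placeholder for handing a finished batch to the input endpoint.
fn push_batch(batch: Vec<String>) {
    println!("pushed {} records", batch.len());
}

fn fill_and_push(batch_size: usize) {
    let started = Instant::now();
    let mut batch = Vec::with_capacity(batch_size);
    while batch.len() < batch_size {
        batch.push(generate_record());
        // Records can be expensive to build, so don't wait for a full batch:
        // after one second, push whatever has accumulated and start over.
        if started.elapsed() >= MAX_BATCH_FILL_TIME {
            break;
        }
    }
    push_batch(batch);
}

fn main() {
    fill_and_push(10_000);
}
```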
We reduce the default from 10 * batch_size to batch_size.
I ran some microbenchmarks and the change did not lead to measurable
contention for a reasonably sized record (TestStruct2). I also added a
configuration option so the user can tune this, especially when lateness
has to be taken into account.

Fixes #2558.

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
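For reference, a simplified sketch of the resulting sizing logic; it mirrors the diff further down, with `worker_chunk_size` being the new configuration option and everything else reduced to a free function:

```rust
use std::cmp::min;

/// Simplified version of the per-thread chunk sizing after this change.
fn per_thread_chunk(rate: Option<u32>, worker_chunk_size: Option<usize>) -> usize {
    // With a low rate, shrink the batch so inserts become visible sooner.
    let batch_size = min(rate.unwrap_or(u32::MAX) as usize, 10_000);
    // The default is now one batch per thread (previously 10 * batch_size),
    // unless the user overrides it explicitly.
    worker_chunk_size.unwrap_or(batch_size)
}

fn main() {
    assert_eq!(per_thread_chunk(None, None), 10_000);
    assert_eq!(per_thread_chunk(Some(100), None), 100);
    // Explicit override, e.g. to keep workers close together for lateness.
    assert_eq!(per_thread_chunk(Some(100), Some(1)), 1);
}
```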
Fixes #2585.

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
crates/adapters/src/transport/datagen.rs (outdated, resolved)
@@ -410,12 +432,17 @@ impl InputGenerator {
 // If we have a low rate, it's necessary to adjust the batch-size down otherwise no inserts might be visible
 // for a long time (until a batch is full).
 let batch_size: usize = min(plan.rate.unwrap_or(u32::MAX) as usize, 10_000);
-let per_thread_chunk: usize = 10 * batch_size;
+let per_thread_chunk: usize = plan.worker_chunk_size.unwrap_or(batch_size);
Contributor:
I realize now that we still won't be able to use multiple workers if we want to control lateness, since there's no upper bound on how far apart workers can get, but I still think this is better without the 10x factor.

@gz (Collaborator, Author), Sep 27, 2024:
You can set worker_chunk_size to a very low number, e.g., 1. Of course, this assumes that all worker threads take roughly the same amount of time to create a batch, which in turn assumes you don't have a config (e.g., one using zipf) that leads to huge variability in record creation time when creating nested values. Maybe we can think of better strategies that coordinate the times produced across all workers.

crates/adapters/src/transport/datagen.rs (outdated, resolved)
Otherwise a user can accidentally specify a date that can't be represented.

Fixes #2594.
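The shape of that check is roughly as below. This is a hedged illustration using the chrono crate rather than the exact code in datagen.rs, and `parse_bound` is invented for the example:

```rust
use chrono::NaiveDate;

/// Reject a user-supplied date bound that cannot be represented, instead of
/// failing later while generating records.
fn parse_bound(year: i32, month: u32, day: u32) -> Result<NaiveDate, String> {
    NaiveDate::from_ymd_opt(year, month, day)
        .ok_or_else(|| format!("{year}-{month:02}-{day:02} is not a representable date"))
}

fn main() {
    assert!(parse_bound(2024, 9, 28).is_ok());
    // A wildly out-of-range year is reported up front.
    assert!(parse_bound(1_000_000, 1, 1).is_err());
}
```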
@gz added this pull request to the merge queue on Sep 28, 2024
Merged via the queue into main with commit 0c9bd40 on Sep 28, 2024
5 of 6 checks passed
@gz deleted the datagen-fixes branch on September 28, 2024 at 03:05