Datagen fixes #2599
Conversation
Fixes #2528. Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
Start from 0 for array values, increment on every generated value. Fixes #2545. Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
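A minimal sketch of the fixed behavior (hypothetical names, not the actual datagen types): the increment counter starts at 0 and advances on every generated value, so array elements each get their own increment instead of repeating one value per record.

```rust
// Hypothetical sketch of an `increment` strategy; not the real datagen code.
struct IncrementStrategy {
    next: u64,
}

impl IncrementStrategy {
    fn new() -> Self {
        // Start from 0, per the fix.
        IncrementStrategy { next: 0 }
    }

    // Called once per generated value.
    fn next_value(&mut self) -> u64 {
        let v = self.next;
        self.next += 1;
        v
    }

    // Generating an array of `len` values advances the counter `len` times.
    fn next_array(&mut self, len: usize) -> Vec<u64> {
        (0..len).map(|_| self.next_value()).collect()
    }
}

fn main() {
    let mut s = IncrementStrategy::new();
    println!("{:?}", s.next_array(3)); // [0, 1, 2]
    println!("{:?}", s.next_array(2)); // [3, 4]
}
```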
If datagen has to create very large records that take a long time to construct, creating a fixed amount determined by min(rate, 10_000) before inserting it as a batch is problematic, as reported in #2558. The fix is to add a timeout during batch creation: if building a batch takes longer than one second, we insert what we have so far and start a new batch. Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
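The timeout logic can be sketched roughly as follows (a simplified illustration with hypothetical names such as `fill_batch` and `make_record`, not the actual `InputGenerator` code):

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch of the batch-creation loop with a one-second timeout.
// `make_record` stands in for the (possibly slow) record generator.
fn fill_batch(batch_size: usize, mut make_record: impl FnMut() -> String) -> Vec<String> {
    let deadline = Instant::now() + Duration::from_secs(1);
    let mut batch = Vec::with_capacity(batch_size);
    while batch.len() < batch_size {
        batch.push(make_record());
        // If building the batch takes longer than a second, flush what we
        // have so far instead of delaying inserts until the batch is full.
        if Instant::now() >= deadline {
            break;
        }
    }
    batch
}

fn main() {
    // Fast records: the batch fills completely before the timeout fires.
    let batch = fill_batch(4, || "record".to_string());
    println!("{}", batch.len()); // 4
}
```

With slow records, the same loop returns a partial batch after roughly one second, so inserts stay visible to the pipeline.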
We reduce the default from 10 * batch_size to batch_size. I ran some microbenchmarks and it didn't lead to measurable contention for a reasonably sized record (TestStruct2). I also added a configuration option so the user can tune this, especially when lateness has to be considered. Fixes #2558. Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
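Extracted as a standalone function (a sketch mirroring the one-line change in the diff, with `per_thread_chunk` as a hypothetical helper name), the new default looks like this:

```rust
use std::cmp::min;

// Sketch of the new default: the per-thread chunk size falls back to
// batch_size (previously 10 * batch_size) unless overridden by the user
// via the new worker_chunk_size option.
fn per_thread_chunk(rate: Option<u32>, worker_chunk_size: Option<usize>) -> usize {
    // Low rates shrink the batch so inserts become visible promptly.
    let batch_size = min(rate.unwrap_or(u32::MAX) as usize, 10_000);
    worker_chunk_size.unwrap_or(batch_size)
}

fn main() {
    println!("{}", per_thread_chunk(Some(500), None)); // 500
    println!("{}", per_thread_chunk(None, Some(1)));   // 1
    println!("{}", per_thread_chunk(None, None));      // 10000
}
```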
Fixes #2585. Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
@@ -410,12 +432,17 @@ impl InputGenerator {
     // If we have a low rate, it's necessary to adjust the batch-size down otherwise no inserts might be visible
     // for a long time (until a batch is full).
     let batch_size: usize = min(plan.rate.unwrap_or(u32::MAX) as usize, 10_000);
-    let per_thread_chunk: usize = 10 * batch_size;
+    let per_thread_chunk: usize = plan.worker_chunk_size.unwrap_or(batch_size);
I realize now that we still won't be able to use multiple workers if we want to control lateness, since there's no upper bound on how far apart workers can be, but I still think this is better without the 10x factor.
You can set worker_chunk_size to a very low number, e.g., 1. This assumes that all worker threads take a roughly equal amount of time to create a batch, which in turn assumes you don't have a config (e.g., zipf) that leads to huge variability in record creation time when creating nested values. Maybe we can think of better strategies that coordinate the times that get produced among all workers.
Otherwise a user can accidentally specify a date that can't be represented. Fixes #2594.
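A sketch of this kind of validation (hypothetical function name and representation; the commit doesn't show the actual code): a user-specified date is checked against the representable range instead of silently wrapping.

```rust
// Hypothetical sketch: validate a user-specified date, given as days since
// the Unix epoch, against what an i32-based date type can represent.
fn validate_date_days(days: i64) -> Result<i32, String> {
    i32::try_from(days).map_err(|_| format!("date out of representable range: {days}"))
}

fn main() {
    println!("{:?}", validate_date_days(19_000));         // Ok(19000)
    println!("{:?}", validate_date_days(10_000_000_000)); // Err(..)
}
```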
Fixes: increment strategy doesn't work correctly on arrays (#2545)