Use thread-local RNG to generate IDs #839
Conversation
We've been exploring splitting our dataset into multiple batches and training those batches in parallel. I noticed that performance did not scale with core count, and after some digging, found that this was mainly due to the Mutex being used to generate IDs. With the following change, training across 16 cores went from 21s to 4.2s. thread_rng was previously discussed in tracel-ai#703, but I don't believe that applies here, as this is just used for UUID creation.
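As a rough sketch of the before/after idea (this is not the actual Burn code, and the names below are made up for illustration): the shared, mutex-guarded StdRng is what forces the threads to serialize, while rand's thread_rng gives each thread its own generator.

```rust
use rand::{rngs::StdRng, Rng, SeedableRng};
use std::sync::Mutex;

// Before: one RNG shared behind a mutex, so every ID generation across all
// training threads contends for the same lock.
static SHARED_RNG: Mutex<Option<StdRng>> = Mutex::new(None);

fn gen_id_shared() -> u128 {
    let mut guard = SHARED_RNG.lock().unwrap();
    let rng = guard.get_or_insert_with(StdRng::from_entropy);
    rng.gen()
}

// After: each thread uses its own thread-local RNG, so there is no contention.
fn gen_id_thread_local() -> u128 {
    rand::thread_rng().gen()
}

fn main() {
    println!("shared:       {:032x}", gen_id_shared());
    println!("thread-local: {:032x}", gen_id_thread_local());
}
```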
Hi @dae
Thank you for this contribution. If there is a lot of ID generation, then I can well see how the mutex may affect performance.
I think I can summarize the discussion in #703 as:
- use StdRng if you sometimes need to set the seed deterministically (e.g. for testing)
- use ThreadRng for better performance (no shared mutex), but it will be impossible to set the seed
There are no tests within Burn that rely on ID generation being deterministic, but I wonder whether not being able to set the seed would affect users in the future.
That said, your code is fine; I'm not explicitly approving simply because I'll wait for @nathanielsimard's input.
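A small hypothetical illustration of that trade-off (these functions are not Burn's API):

```rust
use rand::{rngs::StdRng, Rng, SeedableRng};

// StdRng can be seeded, so a test can reproduce the exact same sequence of IDs.
fn deterministic_id(seed: u64) -> u64 {
    StdRng::seed_from_u64(seed).gen()
}

// thread_rng avoids any shared lock, but there is no way to set its seed.
fn fast_id() -> u64 {
    rand::thread_rng().gen()
}

fn main() {
    assert_eq!(deterministic_id(42), deterministic_id(42)); // reproducible
    println!("{}", fast_id()); // fast, but different on every run
}
```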
Looks good from my end. I had worked on a statically generated seed for no-std use cases; I think your changes still work for that use case.
I was also going to switch to a thread-local RNG, but I deferred it for later. Glad you took care of it.
Standard: Distribution<T>,
{
    use crate::stub::Mutex;
    static RNG: Mutex<Option<StdRng>> = Mutex::new(None);
Good move to bring the static variable inside the function.
@nathanielsimard and I were discussing shortening the IDs if possible. One thing to keep in mind: these IDs are exported and visible in the JSON file.
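For a rough sense of the size difference being discussed (the values and output below are invented, not Burn's actual JSON):

```rust
fn main() {
    // A UUID rendered as a string costs 36 characters per occurrence in the
    // exported file, while e.g. a u64 rendered as hex costs at most 16.
    let uuid_id = "a1b2c3d4-e5f6-7890-abcd-ef1234567890";
    let short_id = format!("{:x}", 0x1b2c_3d4e_5f60_7890_u64);
    println!("uuid form:  {uuid_id} ({} chars)", uuid_id.len());
    println!("short form: {short_id} ({} chars)", short_id.len());
}
```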
That seems right: we don't care about deterministic ID generation; it has nothing to do with weights and won't impact reproducibility!
I had in mind to change the ID generation game in Burn. The more efficient, the better. We could even have more strategies! UUID is very good when we need to serialize the data (like with parameter ID), but it's way less important for tensor IDs that are only runtime-dependent.
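One such alternative strategy, sketched purely hypothetically (not something this PR adds), is a plain atomic counter for IDs that only need to be unique within a running process:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Unique within a single run, meaningless across runs, and never serialized:
// good enough for purely runtime-dependent IDs.
static NEXT_TENSOR_ID: AtomicU64 = AtomicU64::new(0);

fn next_tensor_id() -> u64 {
    // Relaxed ordering suffices: each call only needs to return a distinct value.
    NEXT_TENSOR_ID.fetch_add(1, Ordering::Relaxed)
}

fn main() {
    println!("{} {} {}", next_tensor_id(), next_tensor_id(), next_tensor_id());
}
```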
@dae I have the same NaN problem with embedding backward; I'm looking into it!
Using string UUIDs here is also not the most efficient - would there be any interest in a follow-up PR that uses the bytes directly instead of converting them to a string?
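A hedged sketch of what that follow-up could look like (the type and function names here are hypothetical):

```rust
use rand::RngCore;

// Keep the 16 random bytes as-is instead of building a 36-character
// hyphenated string for every ID.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct RawId([u8; 16]);

fn new_raw_id() -> RawId {
    let mut bytes = [0u8; 16];
    rand::thread_rng().fill_bytes(&mut bytes);
    RawId(bytes)
}

fn main() {
    println!("{:?}", new_raw_id());
}
```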
Pull Request Template
Checklist
The run-checks script has been executed.
I'm getting intermittent flakes with burn_wgpu, which happened even before this change. Maybe it's my GPU/driver?