Avoid workers waiting for training of surrogate model to finish #1170
Description
Motivation
Here's what a CPU load graph looks like for a multi-objective optimization session using the multi-fidelity facade that ran for about 46 hours on a 64-core machine (no hyperthreading, `n_workers=64`), finishing almost 20k trials on a bit over 16k distinct configurations (two rungs).
One can see that CPU utilization decreases to less than 50% after the first 12 hours. It then drops to under 40% after another 10 hours (by this time, 12.6k trials had finished in total).
Previous Discussion
I thought that another cause for this degradation in performance might be Hyperband, and that using ASHA (#1169) instead would help eliminate that hypothesis. However, after @eddiebergman's #1169 (comment), I understand the problem is caused by workers waiting to get another suggestion from the surrogate model.
Potential solution
- train the random forest in a different thread / process
- replace a version of the RF with a newly trained one only when the training is done (similar to double buffering; a rough sketch of this idea follows the list below)
- workers should always get configs using the currently available RF, even if a new RF is training in the background
- optionally: use an occupancy threshold, e.g., 90%, and allow worker threads to wait for training to finish only if the percentage of workers that are idle waiting for the new RF version is below 10%.
- optionally: add gpu support to accelerate training of Random Forest
- optionally: perhaps add the option to decrease the number of workers running the target function by 1 once the RF trainer occupies a CPU core for more than 50% of the time
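For illustration only, here is a minimal sketch of the double-buffering idea from the list above, assuming a generic `model_factory` plus caller-supplied `suggest_from` / `random_config` helpers (none of these are SMAC APIs):

```python
import threading

# Rough sketch (not SMAC code): a background thread refits the surrogate while
# workers keep querying the last finished model; the reference is swapped
# atomically once the new fit is done ("buffer flip").
class DoubleBufferedSurrogate:
    def __init__(self, model_factory):
        self._model_factory = model_factory
        self._current = None               # model currently served to workers
        self._lock = threading.Lock()

    def ask(self, suggest_from, random_config):
        # Never block on training: use the latest finished model,
        # or fall back to a random configuration if none exists yet.
        with self._lock:
            model = self._current
        return random_config() if model is None else suggest_from(model)

    def retrain_async(self, X, y):
        # Fit a fresh model in the background and flip the buffer when done.
        def _fit():
            new_model = self._model_factory()
            new_model.fit(X, y)            # long-running; native code typically releases the GIL
            with self._lock:
                self._current = new_model  # atomic swap
        threading.Thread(target=_fit, daemon=True).start()
```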
Activity
dengdifan commented on Nov 27, 2024
Hi @bbudescu,
Thanks for raising this issue and providing potential solutions. As far as I can see, the potential solutions that you proposed are mostly related to the RF. We are also planning to replace the current RF packages: #1116
However, the new RF model might still rely on the existing packages (e.g., sklearn). Since we are only a small team maintaining SMAC, we might not have enough manpower to customize a new random forest package. If you have any good ideas on how to implement this efficiently (in Python) or would like to create a new PR regarding the RF replacement, we are happy to help you (e.g., with how to integrate it into SMAC).
bbudescu commented on Nov 27, 2024
Hi @dengdifan,
Thanks for your reply above. Yes, I am aware of #1116; however, the main point I was trying to get across in this issue was making the `ask` operation asynchronous w.r.t. RF model training, i.e., being able to query an existing RF model for a new configuration to try at any time (even if the model is lagging behind, i.e., it hasn't been updated with the absolute latest results), rather than being forced to wait until training finishes, because it's better to use the CPU cores than to keep them unoccupied more than half of the time. I mean, even if it's not looking in the best places, it still explores the config space, and that's preferable to not doing anything at all.
Now, I haven't looked into the code, but I assume this doesn't have anything to do with the choice of random forest package, but just with running RF training in a separate thread. Perhaps one of the optional suggestions I made, namely the one about adding GPU support, might be relevant to the choice of RF implementation.
dengdifan commented on Dec 4, 2024
Hi @bbudescu
Thanks for the clarification. However, first, because of the global interpreter lock (GIL), it is not easy to implement multi-threading in a pure Python environment; if you have any idea how to do this efficiently, a PR is welcome.
Second, SMAC is based on BO, an algorithm designed for optimizing expensive black-box functions, such as training a deep neural network. Usually, the expense of evaluating the function is much more significant than that of training the surrogate model. It is still unclear whether directly starting a new configuration is better than waiting until the current surrogate model provides a new candidate.
Additionally, making the `ask` operation asynchronous is equivalent to running one core with BO and the other cores with random search (since the probability that two runs end at exactly the same time is 0, meaning that once a core running a random configuration (core A) is free, the core running BO (core B) should still be occupied, and core A will continue with random sampling since B cannot provide A any useful information). In that case, you could set the `probability` argument of the `get_random_design` function to `(num_cores - 1) / num_cores`: https://github.com/automl/SMAC3/blob/main/smac/facade/hyperparameter_optimization_facade.py#L172
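For reference, a minimal sketch of what that suggestion might look like with the SMAC3 2.x API (hypothetical example; the target function and config space are placeholders, and exact signatures may differ between versions):

```python
from ConfigSpace import ConfigurationSpace
from smac import HyperparameterOptimizationFacade, Scenario

# Placeholder target function, not part of the issue.
def train(config, seed: int = 0) -> float:
    return (config["x"] - 1.0) ** 2

n_workers = 64
scenario = Scenario(ConfigurationSpace({"x": (-5.0, 5.0)}),
                    n_trials=1000, n_workers=n_workers)
smac = HyperparameterOptimizationFacade(
    scenario,
    train,
    # interleave random configurations with probability (n - 1) / n
    random_design=HyperparameterOptimizationFacade.get_random_design(
        scenario, probability=(n_workers - 1) / n_workers
    ),
)
incumbent = smac.optimize()
```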
bbudescu commented on Dec 5, 2024
Hi @dengdifan
I've been working on something, and I just added #1178 to share it here. It's not yet tested, but it's a start, I guess.
That's quite right :). I had to do a bit of reading before finally deciding at least on what to try.
Now, while looking into this I've taken a few ideas into consideration, namely `threading`, `multiprocessing`, `concurrent.futures`, `dask` futures and `asyncio`. Each has its pros and cons, and I finally decided to have a stab at it with plain threads.

Other than the fact that I've used `threading` more than the others, so I'm familiar with the framework and don't have to spend as much time learning another platform, I thought threads would be a good choice w.r.t. IPC for passing datasets and models to and from the training thread:

- `threading` is the most straightforward to use, i.e., it's just plain Python, basically (of course, with proper synchronization)
- `multiprocessing`: the data is `numpy`, so one could back it by a shared array buffer of sorts (example), or by `SharedMemory` (example), and pass the `ascii_string_representation` of the random forest back to the main thread (a rough sketch follows below)
- `dask` has the upside that it's already being used within SMAC3, but the downside that, if we're scheduling the training session the same way as we do every worker process, it might be executed on a different machine, so there might be some overhead transferring the data and model (maybe scatter data)

Of course, because of the GIL, the main optimization loop gets less CPU time, but although I haven't tested it, I expect that the main thread would spend most of its time in the training section anyway. And I don't think the other parts in the loop are very CPU intensive anyway. Or are they? Do you know if, perhaps, the inference or something takes really long?
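For reference, a rough sketch of the `SharedMemory`-backed variant mentioned in the `multiprocessing` bullet above (the `trainer` function and its fitting step are placeholders, not SMAC code):

```python
import numpy as np
from multiprocessing import Process
from multiprocessing.shared_memory import SharedMemory

def trainer(shm_name: str, shape, dtype):
    # Attach to the shared buffer and view it as a numpy array (zero-copy).
    shm = SharedMemory(name=shm_name)
    X = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    # ... fit the forest on X here and send the trained model back, e.g. its
    # ascii_string_representation, through a Queue/Pipe ...
    shm.close()

if __name__ == "__main__":
    data = np.random.rand(100_000, 32)
    shm = SharedMemory(create=True, size=data.nbytes)
    shared = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
    shared[:] = data  # one-time copy into shared memory

    p = Process(target=trainer, args=(shm.name, data.shape, data.dtype))
    p.start()
    p.join()

    shm.close()
    shm.unlink()
```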
Well, other than expanding SMAC's use cases to things like database knob tuning and other tasks that also have a low evaluation time for the target cost function, it's also impeding SMAC's scalability. E.g., let's say you want to train a neural net that takes, on average, 2 hours on a GPU to fit to a dataset and you need to find the optimal parameters fast. In this case you might opt to do something like renting 256 machines on AWS so that you increase the chance of getting a decent neural net within a tight budget like 24 hours. You will then face the same problem, because you'll get new results every 30 seconds anyway, not because it takes little to train neural nets, but because of the number of training processes.
If you're paying for an AWS instance with a GPU, you'd hate to see it stay idle. You'd prefer it to run even a random config rather than do nothing at all.
Well, I'm not sure I understand this correctly, but if I do, then I don't think that is the case other than in the beginning. So, here's an example at the opposite end: let's say that you already finished 1,000,000 trials, and trained the RF on that data. Let's now say that a worker process finishes running the 1,000,001st trial, reports results and asks for a new configuration to try. What happens in the current setting is that the worker needs to wait for the model to retrain taking into consideration that last result, so it can benefit from the extra information added by that extra sample in the RF training set.
Now, my hypothesis is that there is only a minor benefit to doing this, as I don't think the expected improvement in the cost function from making the suggestion based on 1,000,001 rather than 1,000,000 results is going to be that different anyway. Also, when setting `retrain_after` to a value greater than 1, most of the trial configs will be sampled from outdated models anyway.

So, to be clear, I'm not suggesting using random configurations, but rather adapting the `retrain_after` parameter dynamically, based on resource availability and demands for retraining.

LE: Here is a comparison between the performance of some IPC methods when sharing numpy data.
dengdifan commented on Dec 6, 2024
Hi @bbudescu
Now I am confused by your two examples.
In your first example (the neural network training one), you suggest sampling a new configuration randomly:
This is exactly the case for training an RF in a separate thread. Assuming that you have two cores, one core has already finished its configuration evaluation. Since your RF is still under training (you cannot expect RF training and configuration evaluation to end at exactly the same clock time on both CPUs), you cannot know how long it would take to finish the RF training on the other core (thread). In that case you simply start a new configuration randomly (as suggested by your AWS example), even if the RF training will end just 1 CPU clock later. Therefore, in this case, you always have core 1 running BO + evaluation and core 2 running random evaluation.
Here you suggest adapting the `retrain_after` parameter dynamically instead of sampling new configurations randomly. However, how should this value be scheduled? To what degree should we increase it? Should we always apply this strategy, or only when the number of budgets is higher than a threshold? Since we are aiming at arbitrary budgets and numbers of evaluations, all of this needs to be carefully designed to avoid performance degeneration.
bbudescu commented on Dec 6, 2024
Sorry, I tried to make a comparison. Conceptually, training in a different thread would be somewhat similar to when `retrain_after` is greater than 1, because you'd use the same surrogate model, without any retraining, to generate multiple trial configs. I wasn't particularly clear about this.

Yes, the key word here is "even". I think even a random config is better than nothing, but what I propose is to use a trained random forest to suggest new configs; it's just that we shouldn't care so much whether it was trained on the absolute latest results, just as in the case when `retrain_after` is greater than 1.

Ok, so look, let's say we have 2 cores. One core always trains (because training takes longer than evaluation), and one core always evaluates the cost function (i.e., it's a worker that runs trials):
1. ... (per the `n_configs` parameter of the `InitialDesign` constructor).
...
5. ... just as when `retrain_after` > 1.
NOTE: Indeed, at step 5 above, it's also possible for thread A to start running another random trial, but this is NOT the subject of this feature request; rather, it's a potential further optimization similar to what you said above.
NOTE 2: actually, what I think would be more relevant and more in line with what you asked (and I replied to) is that at step 4 above thread B could just start a random config without waiting for the first model to finish training. However, again, this would be just an optimization, and NOT the point of this issue and the associated PR.
dengdifan commented on Dec 11, 2024
Hi @bbudescu,
As you said, we could simply set `retrain_after` greater than 1 or add a scheduler for the `retrain_after` argument to solve this problem. In my opinion, there is no need for us to train the RF in a separate thread, which might result in potential conflicts or problems.
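For reference, a hedged sketch of what the fixed-`retrain_after` route might look like with the SMAC3 2.x API (the target function and config space are placeholders, and the exact `get_config_selector` signature may differ between versions):

```python
from ConfigSpace import ConfigurationSpace
from smac import HyperparameterOptimizationFacade, Scenario

# Placeholder target function, not part of the issue.
def train(config, seed: int = 0) -> float:
    return (config["x"] - 1.0) ** 2

scenario = Scenario(ConfigurationSpace({"x": (-5.0, 5.0)}), n_trials=500)
smac = HyperparameterOptimizationFacade(
    scenario,
    train,
    # refit the surrogate only every 16 suggestions instead of the default
    config_selector=HyperparameterOptimizationFacade.get_config_selector(
        scenario, retrain_after=16
    ),
)
incumbent = smac.optimize()
```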
bbudescu commented on Dec 11, 2024
Well, the problem with setting `retrain_after` to something fixed and constant is that it's a one-size-fits-all approach which might even make the worst of both worlds:

A tradeoff is hard to find, and even the best tradeoff will be suboptimal.
And, after all, I've just pushed some code to implement this feature. I know it's a potential risk, but, as far as I have come to understand, you guys are ok with at least offering some support for integration (I'm not talking about you implementing it yourself).
LE: It's not done yet, i.e., there are no tests, I haven't added a way to switch back to the old behavior, and I still have a few questions about, e.g., what to do with the `AbstractModel._rng`, which gets overwritten within `RandomForest.__init__`, but I think the implementation is at least in a stage where it can be tested until it works fine.

bbudescu commented on Dec 11, 2024
This should also be done dynamically because, as a user, it's kind of hard to anticipate how long training will take on the current number of finished trials, or how long running a trial will take, as that also depends on the hyperparameters in the suggested config, which evolve over time, etc. I'm thinking there would be value in adding estimators for all of these, but I feel that just running a second thread on the main CPU whenever there's new data available is easier.
LE: The alternative for the user would be to do trial and error until the schedule is right, but that defeats the whole purpose of making the session faster / more efficient.
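A minimal illustration of the dynamic idea described above, assuming a generic `fit_fn` callable (this is not SMAC code, just a sketch of the scheduling logic):

```python
import threading

# Instead of a fixed retrain_after count, start a refit whenever the trainer is
# idle and a new result has arrived, so the effective retrain interval stretches
# automatically to however long a refit currently takes.
class AdaptiveRetrainScheduler:
    def __init__(self, fit_fn):
        self._fit_fn = fit_fn          # callable that refits the surrogate
        self._busy = False             # True while a refit is in flight
        self._lock = threading.Lock()

    def on_result(self, X, y):
        # Called every time a worker reports a finished trial.
        with self._lock:
            if self._busy:
                return                 # a refit is already running; skip for now
            self._busy = True
        threading.Thread(target=self._refit, args=(X, y), daemon=True).start()

    def _refit(self, X, y):
        try:
            self._fit_fn(X, y)         # long-running; workers keep asking meanwhile
        finally:
            with self._lock:
                self._busy = False
```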
bbudescu commented on Dec 13, 2024
Hi @dengdifan,
I think I was able to bring the code to a state in which it can be tested.
As per my #1178 (comment), could this, perhaps, be solved by a compromise like merging the pull request and exposing the option of using concurrent background training as an experimental feature, at the user's (documented) risk?