Lazy and Dask Improvements #627
Thank you for taking the time to raise these concerns, @CSSFrancis. I gave a brief summary of my own use of lazy computations with kikuchipy here (based on your good questions): hyperspy/hyperspy#3107 (comment). All your points are reasonable, but I have different views on how they should be addressed.
I find a warning raised every time a bit too drastic. But adding this information to the docstring Notes of relevant methods would be highly welcome!
I use HyperSpy's zarr format myself by "casting" the signal. An aside: I'm reluctant to add our own zarr format because I'm not that big of a fan of our h5ebsd (HDF5) format. Maintaining an always-consistent file format specification is a real hassle when our needs change (e.g. when new "states" are added to a signal). For example, our h5ebsd format was written so that any dataset saved could be read by EMsoft using its EMEBSD format, which has a flattened navigation dimension. We initially added some metadata as well, but this metadata has since changed a lot to accommodate added functionality (mainly handling of the detector-sample geometry).
Sorry, could you explain what you mean here? Do you mean we shouldn't call `compute()` inside kikuchipy's own functions?
I agree that this would be convenient. However, users already have this option by selecting a part of their dataset for indexing with HyperSpy's slicing, right? Dictionary indexing with kikuchipy is arguably not very fast, so having a definite start to the computation is useful. When it comes to refinement (pattern matching), we can pass a navigation mask to control which patterns are indexed. I see this as a powerful slicing operation that can effectively be used instead of HyperSpy's slicing (although I haven't tested this in a workflow).
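As an illustration of how a navigation mask selects patterns for indexing, here is a plain NumPy sketch (the mask convention and parameter names in kikuchipy itself may differ):

```python
import numpy as np

# Hypothetical navigation mask over a 50 x 50 scan: True marks the
# patterns we want to include (kikuchipy's own convention may be inverted).
nav_mask = np.zeros((50, 50), dtype=bool)
nav_mask[10:20, 10:20] = True

# 4D stack of 60 x 60 detector patterns.
patterns = np.random.random((50, 50, 60, 60))

# Boolean indexing over the navigation axes keeps only the masked
# patterns, flattened to shape (n_selected, 60, 60).
selected = patterns[nav_mask]
assert selected.shape == (100, 60, 60)
```

In this sense the mask is a slicing operation: only the selected patterns ever enter the (expensive) indexing step.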
As an orix user, arrays in my crystal maps fit comfortably in memory. What I'd like to see, though, is more operations using Dask on the fly. Many crystallographic computations involve finding the lowest disorientation angle after applying many symmetrically equivalent operations to each point (rotation or vector), which can be memory intensive. I think this is more important than allowing a lazy crystal map.
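A sketch of the kind of on-the-fly Dask reduction meant here, with random numbers standing in for per-point angles (this is not orix's API, just the memory pattern):

```python
import dask.array as da

# Toy stand-in for angles after applying n_sym symmetry operations to
# n_points rotations. Materialising the full (n_sym, n_points) intermediate
# in NumPy is what gets memory intensive; chunking over points bounds it.
n_sym, n_points = 24, 100_000
angles = da.random.random((n_sym, n_points), chunks=(n_sym, 10_000))

# Lowest angle over all symmetric equivalents, reduced chunk by chunk,
# so only one (24, 10_000) block is in memory at a time per worker.
lowest = angles.min(axis=0)
result = lowest.compute()
assert result.shape == (n_points,)
```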
Something in the notes is probably good. I imagine that you don't have major issues in terms of RAM because the datasets are still pretty small, and running on your laptop you are likely not using very many cores. The problems become larger when running larger datasets with more cores. If most people's workflows are similar to yours, then it most likely isn't worth having a warning every time.
Yea that should be easy to do!
Makes sense.
It depends on the instance, but calling compute inside a function takes away some potential workflows: it removes much of a user's ability to interact with the dataset in its lazy state. For example, they cannot rechunk it or chain further lazy operations. Basically you just want to make sure that computing is the last thing you do with the data. Maybe I have too strict ideas about this...
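A minimal sketch of the difference, with hypothetical function names and plain Dask arrays:

```python
import dask.array as da

# Computing inside the function hands back a NumPy array: the caller
# can no longer rechunk or defer any further work.
def mean_pattern_eager(x):
    return x.mean(axis=(-2, -1)).compute()

# Returning the lazy result keeps those choices with the user.
def mean_pattern_lazy(x):
    return x.mean(axis=(-2, -1))

x = da.random.random((8, 8, 16, 16), chunks=(4, 4, 16, 16))
r = mean_pattern_lazy(x)
r = r.rechunk((8, 8))   # still possible on the lazy result
out = r.compute()       # compute is the *last* step in the workflow
assert out.shape == (8, 8)
```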
This is true, but it goes with the point above: it just cleans up the code and makes it more consistent.
This is kind of why lazy operations are nice: you can get a lazy result, check that it looks good, and then test another small region before doing the larger computation. You can also call the ...
I'll have to look more into this...
There are a couple of things that I wanted to bring up and possibly offer solutions. @hakonanes maybe you can tell me if these are reasonable or not.
## Overlap causing RAM Spikes

There are a couple of instances where the `dask.overlap` function is used. In my personal experience this function suffers heavily from the "curse of dimensionality": with a depth of one chunk, each chunk must also load its neighbours along every chunked dimension, i.e. up to 3^n chunks for n chunked dimensions. In the worst-case scenario of a 4D dataset that is chunked in every dimension, that is 3^4 = 81 chunks loaded for every chunk computed. Even with only the x and y dimensions chunked, each chunk is loading 3^2 = 9 chunks. While this is better, and `dask` has recently improved how it handles this situation, more than likely you will still see a large spike in memory usage. I would suggest having a warning pop up for every function that uses `overlap`, to the effect of: "Warning: For optimal performance and RAM usage only one dimension should be chunked. Best practice is to call `s.rechunk(nav_chunks=("auto", -1))`."
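A toy Dask sketch of that chunk arithmetic (assumed shapes, not kikuchipy's API):

```python
import dask.array as da

# 4D dataset chunked in every dimension: with depth 1, an overlapped
# computation can touch up to 3**ndim = 81 chunks per output chunk.
x = da.zeros((8, 8, 16, 16), chunks=(4, 4, 8, 8))
assert 3 ** x.ndim == 81

# Rechunk so the signal (detector) dimensions become single chunks:
# the worst case drops to 3**2 = 9 chunks per output chunk.
y = x.rechunk((4, 4, -1, -1))
assert y.numblocks == (2, 2, 1, 1)

# Constructing the overlapped array is itself lazy and cheap.
g = da.overlap.overlap(y, depth={0: 1, 1: 1}, boundary="none")
```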
Saving and loading the data can also help, as that optimizes the chunks saved on disk.

## Loading Data Lazily
I would highly recommend adding a `zarr` format to kikuchipy. It is probably the easiest way to speed things up a bit. It would also let you load data and use `dask.distributed`.
Otherwise, you can read the binary data in a way that can be applied to distributed computing.

## End to End Lazy Computations
It would be convenient not to call the compute function in lazy computations. The idea here is that you could create a lazy orix `CrystalMap` object that you could slice and compute only part of. That might be difficult, but could potentially bring pretty large improvements.
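A sketch of the end-to-end idea with plain Dask arrays standing in for a lazy `CrystalMap` (orix has no such API today; shapes are made up):

```python
import dask.array as da

# Stand-in for patterns over a 100 x 100 scan: everything below stays
# lazy until .compute() is called at the very end of the workflow.
patterns = da.random.random((100, 100, 60, 60), chunks=(10, 10, 60, 60))
scores = patterns.mean(axis=(2, 3))   # lazy per-pattern "indexing result"

# Slicing the lazy result first means only the chunks covering the
# selected region are ever loaded and computed.
subset = scores[:10, :10].compute()
assert subset.shape == (10, 10)
```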