Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lazy and Dask Improvements #627

Open
CSSFrancis opened this issue Mar 21, 2023 · 2 comments
Open

Lazy and Dask Improvements #627

CSSFrancis opened this issue Mar 21, 2023 · 2 comments
Labels
documentation This relates to the documentation enhancement New feature or request help wanted Would be nice if someone could help

Comments

@CSSFrancis
Copy link
Member

There are a couple of things that I wanted to bring up and possibly offer solutions. @hakonanes maybe you can tell me if these are reasonable or not.

##Overlap causing RAM Spikes

There are a couple of instances where the dask.overlap function is used. In my personal experience this function suffers heavily from "the curse of dimensionality". In the worst case scenario where you have a 4D dataset that is chunked in every dimension each chunk is also loading 81 other chunks or 3^n_chunks! This basically means that:

  1. Your entire dataset is being loaded into RAM
  2. You are spending much of your time transferring chunks from worker to worker. (which is bad)

Even with only the x and y dimensions chunked each chunk is loading 9 chunks. While this is better and dask has recently improved how it handles this situation more than likely you will still see a large memory usage spike. I would suggest having a warning that pops up with every function that uses overlap to the effect of

"Warning: For optimal performance and RAM usage only one dimension should be chunked. Best practice is to call s.rechunk(nav_chunks=("auto",-1)). Saving and loading the data can also help as that optimizes the chunks saved on the disk"

##Loading Data Lazily

I would highly recommend adding a zarr format to to kikchipy. It is probably the easiest way to speed things up a bit. It also would let you load data and use dask_distributed. Otherwise you can read binary data in a way that can be applied to distributed computing.

##End to End Lazy computations

It would be convenient to not call the compute function in lazy computations. The idea here is that you could create a lazy Orix CrystalMap object that you could slice and compute only part of. That might be difficult but could potentially cause pretty large improvements.

@hakonanes
Copy link
Member

hakonanes commented Mar 21, 2023

Thank you for taking the time to raise these concerns, @CSSFrancis. I gave a brief summary of my own use of lazy computations with kikuchipy here (based on your good questions): hyperspy/hyperspy#3107 (comment). All your points are reasonable, but I have different views on how they should be addressed.

I would suggest having a warning that pops up with every function that uses overlap to the effect of

I find a warning raised every time a bit too drastic. But adding this information to the docstring Notes of relevant methods would be highly welcome! map_overlap() is used in EBSD methods average_neighbour_patterns() and get_average_neighbour_dot_product_map(). These methods are not optimized (i.e. we haven't updated them after we initially wrote them), but still, I haven't experienced any major issues in terms of RAM. They run fine on my laptop for larger datasets.

I would highly recommend adding a zarr format to to kikchipy

I use HyperSpy's zarr myself by "casting" EBSD -> Signal2D and using HyperSpy's save() method. I then reload the EBSD signal with hs.load("data.zspy", signal_type="EBSD", lazy=True). We should add this workflow to the user guide. This workflow basically gives kikuchipy a zarr format, and should be sufficient for now. The next step would be to interface better with HyperSpy's load() function by allowing writing to the ZSPY format from EBSD.save().

An aside: I'm reluctant to adding our own zarr format because I'm not that big of a fan of our h5ebsd (HDF5) format. Maintaining an always consistent file format specification is a real hassle when our needs change (e.g. when new "states" are added to a signal). E.g., our h5ebsd format was written so that any dataset saved could be read by EMsoft using its EMEBSD format, which has a flattened navigation dimension. We initially added some metadata as well, but this metadata has since changed a lot to accomodate added functionality (mainly handling of detector-sample geometry via EBSDDetector in the EBSD.detector attribute and orientation data via CrystalMap in the EBSD.xmap attribute). Because of these changes and the anticipation of more changes to the format in the future, I decided to remove our specification of the format from the docs...

It would be convenient to not call the compute function in lazy computations.

Sorry, could you explain what you mean here? Do you mean we shouldn't call compute()?

The idea here is that you could create a lazy Orix CrystalMap object that you could slice and compute only part of.

I agree that this would be convenient. However, users already have this option by selecting a part of their dataset for indexing with HyperSpy's slicing, right?

Dictionary indexing with kikuchipy is arguably not very fast, so having a definite start to the computation, e.g. when calling EBSD.dictionary_indexing(), is a help for the user, I think, since they are forced to make sure all input parameters are appropriate before running.

When it comes to refinement (pattern matching) we can pass a navigation mask to control which patterns are indexed. I see this as a powerful slicing operation, and can effectively be used instead of HyperSpy's slicing (although I haven't tested this in a workflow).

[creating a lazy orix CrystalMap] might be difficult but could potentially cause pretty large improvements.

As an orix user, arrays in my crystal maps fit comfortably in memory. What I'd like to see though is more operations using Dask on-the-fly. Many crystallographic computations involve finding the lowest disorientation angle after applying many symmetrically equivalent operations to each point (rotation or vector), which can be memory intensive. I think this is more important than allowing a lazy crystal map.

@CSSFrancis
Copy link
Member Author

I find a warning raised every time a bit too drastic. But adding this information to the docstring Notes of relevant methods would be highly welcome! map_overlap() is used in EBSD methods average_neighbour_patterns() and get_average_neighbour_dot_product_map(). These methods are not optimized (i.e. we haven't updated them after we initially wrote them), but still, I haven't experienced any major issues in terms of RAM. They run fine on my laptop for larger datasets.

Something in the notes is probably good. I imagine that you don't have major issues in terms of RAM because the datasets are still pretty small and running on your laptop you are likely not using very many cores. The problems become larger when running larger datasets with more cores. If most peoples workflows are similar to yours then it most likely isn't worth it to have a warning every time.

I use HyperSpy's zarr myself by "casting" EBSD -> Signal2D and using HyperSpy's save() method. I then reload the EBSD signal with hs.load("data.zspy", signal_type="EBSD", lazy=True). We should add this workflow to the user guide. This workflow basically gives kikuchipy a zarr format, and should be sufficient for now. The next step would be to interface better with HyperSpy's load() function by allowing writing to the ZSPY format from EBSD.save().

Yea that should be easy to do!

An aside: I'm reluctant to adding our own zarr format because I'm not that big of a fan of our h5ebsd (HDF5) format. Maintaining an always consistent file format specification is a real hassle when our needs change (e.g. when new "states" are added to a signal). E.g., our h5ebsd format was written so that any dataset saved could be read by EMsoft using its EMEBSD format, which has a flattened navigation dimension. We initially added some metadata as well, but this metadata has since changed a lot to accomodate added functionality (mainly handling of detector-sample geometry via EBSDDetector in the EBSD.detector attribute and orientation data via CrystalMap in the EBSD.xmap attribute). Because of these changes and the anticipation of more changes to the format in the future, I decided to remove our specification of the format from the docs...

Makes sense.

It would be convenient to not call the compute function in lazy computations.

Sorry, could you explain what you mean here? Do you mean we shouldn't call compute()?

It depends on the instance but calling compute inside a function takes away some potential workflows. Calling compute inside of a function takes away a lot of a users ability to interact with a dataset in the lazy state. For example they cannot rechunk or call persist if they don't want the data to be transferred back to the RAM. You also have to wait for the data to compute before saving the data rather than saving things chunk by chunk in an embarrassingly parallel way. The map function in hyperspy always uses dask to run operations so when using distributed computing in particular you want to limit the number of transfers from the distributed computing to RAM. It also cleans up the code and makes it more flexible and able to respond to api changes in dask.

Basically you just want to make sure that the compute function is the last thing you do with the data. Maybe I have too strict of ideas about this...

I agree that this would be convenient. However, users already have this option by selecting a part of their dataset for indexing with HyperSpy's slicing, right?

This is true but goes with the point above. It just cleans up the code and makes it more consistent.

Dictionary indexing with kikuchipy is arguably not very fast, so having a definite start to the computation, e.g. when calling EBSD.dictionary_indexing(), is a help for the user, I think, since they are forced to make sure all input parameters are appropriate before running.

This is kind of why lazy operations are nice. You can get a lazy result. Test to see if it looks good and then test another small region before doing the larger computation. You can also call the persist function and then keep working without the "waiting" for some code to finish running. It makes the whole workflow more seamless and I think helps speed up iteration.

When it comes to refinement (pattern matching) we can pass a navigation mask to control which patterns are indexed. I see this as a powerful slicing operation, and can effectively be used instead of HyperSpy's slicing (although I haven't tested this in a workflow).

[creating a lazy orix CrystalMap] might be difficult but could potentially cause pretty large improvements.

As an orix user, arrays in my crystal maps fit comfortably in memory. What I'd like to see though is more operations using Dask on-the-fly. Many crystallographic computations involve finding the lowest disorientation angle after applying many symmetrically equivalent operations to each point (rotation or vector), which can be memory intensive. I think this is more important than allowing a lazy crystal map.

I'll have to look more into this...

@hakonanes hakonanes added enhancement New feature or request discussion documentation This relates to the documentation labels Mar 30, 2023
@hakonanes hakonanes added the help wanted Would be nice if someone could help label Jul 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation This relates to the documentation enhancement New feature or request help wanted Would be nice if someone could help
Projects
None yet
Development

No branches or pull requests

2 participants