Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add median filtering function #85

Merged
merged 10 commits into from
Aug 1, 2024

Conversation

bruno-pannunzio
Copy link
Contributor

@bruno-pannunzio bruno-pannunzio commented Jul 25, 2024

Description

This PR adds the function phasor_filter in utils module. The only supported method at the moment is the median filtering function which uses scipy.ndimage.median_filter function.

The phasor_filter function was implemented in a way to allow easy addition of new methods for filtering (i.e. the wavelet filter, gaussian, etc)

This PR partly addresses some functionality to be added in #21

Release note

Summarize the changes in the code block below to be included in the
release notes:

Add `phasor_filter` function with median method.

Checklist

  • The pull request title, summary, and description are concise.
  • Related issues are linked in the description.
  • New dependencies are explained.
  • The source code and documentation can be distributed under the MIT license.
  • The source code adheres to code standards.
  • New classes, functions, and features are thoroughly tested.
  • New, user-facing classes, functions, and features are documented.
  • New features are covered in tutorials.
  • No files other than source code, documentation, and project settings are added to the repository.

@bruno-pannunzio bruno-pannunzio added the enhancement New feature or request label Jul 25, 2024
@bruno-pannunzio bruno-pannunzio self-assigned this Jul 25, 2024
@bruno-pannunzio
Copy link
Contributor Author

Hi @cgohlke! I opened this PR to start the discussion of a couple of thing related to the filtering methods, before I continue working on this:

  1. I tried implementing the filtering in Cython so that it works faster, and it does marginally better (15-20%) than SciPy's, which is not that much because it's quite fast. For an image of 100 million pixels SciPy's takes 10.31 seconds and this one 8.80 seconds. Do you think this is worth it? The advantage is that this is more under our control and we can modify if necessary certain aspects.

  2. As I am not too familiar with Cython, I may have some ways to optimize the code, so maybe you can have a look and this can help also to improve performance and/or memory handling.

  3. Regarding Cython code formatting, I tried black formatting but is not working. Do you have an alternative or you check manually the format?

  4. I left the median calculation separated from the filtering function for two reasons. One is that we now can have a separate median calculation in case it is needed for some other purpose, and the other is that the same filtering function can be easily modified to apply another method (mean, gaussian, etc) for future implementations.

  5. We think the way SimFCS handles the borders is by keeping the original from the image, so by default the filter does that, but it also allows for the reflection method, which is the default of SciPy's and can be useful.

  6. I added also the num_iter parameter to the median_filter function since it is very common to apply the filter 3 or more times. But if you think this adds complexity I can remove it very easily.

@bruno-pannunzio
Copy link
Contributor Author

I removed the num_iter parameter from the median_filter function. I realized it was not needed and it actually made the performance of the function worse.

@bruno-pannunzio bruno-pannunzio requested a review from cgohlke July 25, 2024 17:30
@cgohlke
Copy link
Member

cgohlke commented Jul 25, 2024

  1. I tried implementing the filtering in Cython so that it works faster, and it does marginally better (15-20%) than SciPy's, which is not that much because it's quite fast. For an image of 100 million pixels SciPy's takes 10.31 seconds and this one 8.80 seconds. Do you think this is worth it? The advantage is that this is more under our control and we can modify if necessary certain aspects.

Good to see you getting familiar with Cython. I think that 20% is not worth the implementation/maintenance overhead, but that can probably be improved. The huge advantage of ndimage's implementation is that it works with any-dimensional arrays, for example of 3D phasors. The median_filter functions should fall back to the scipy or skimage implementation for cases that _apply_2D_filter can't handle. I think there is ongoing work in the scikit-image library to improve the median filter. Also check literature for optimized algorithms, such as https://ieeexplore.ieee.org/document/1163188

  1. As I am not too familiar with Cython, I may have some ways to optimize the code, so maybe you can have a look and this can help also to improve performance and/or memory handling.

In general, use ssize_t instead of int for sizes and indices such that the code can work with > 2GB arrays. You also want to release the GIL in _apply_2D_filter such that the function can be efficiently used in multi-threading cases.

  1. Regarding Cython code formatting, I tried black formatting but is not working. Do you have an alternative or you check manually the format?

Try cython-lint. It's in the project's dev requirements but not used in CI yet.

  1. I added also the num_iter parameter to the median_filter function since it is very common to apply the filter 3 or more times. But if you think this adds complexity I can remove it very easily.

Exactly, and that is probably a huge opportunity for optimizing the Cython function: Move the num_iter (how about renaming it repeat) loop inside the _apply_2D_filter and repeat the filter on the kernel while it is in memory.

In general, how about adding (or replacing with) a function taking phasor coordinates as input for convenience, something like

real, imag = phasor_median(real, imag, repeat=3, **kwargs)
or
real, imag = phasor_filter(real, imag, method='median', repeat=3, **kwargs)
instead of

real = median_filter(real, repeat=3, **kwargs)
imag = median_filter(imag, repeat=3, **kwargs)

@bruno-pannunzio
Copy link
Contributor Author

Great suggestions @cgohlke, I will look a little bit more into the algorithms but these are all good ideas. I will fall on SciPy's for cases we can't handle yet.

I will look into the iterations, I didn't think of including it in the Cython function but it makes a lot of sense.

I will let you know when it's ready again for review, but thanks already for all the suggestions!

@cgohlke
Copy link
Member

cgohlke commented Jul 25, 2024

Move the num_iter loop inside the _apply_2D_filter and repeat the filter on the kernel while it is in memory.

I realize that this will not be that trivial with the default algorithm.

@cgohlke
Copy link
Member

cgohlke commented Jul 25, 2024

I will look a little bit more into the algorithms

Don't spend too much time on it though. I think we agree that the purpose of the median filter functionality in the PhasorPy library is not to provide yet another median_filter function as an alternative to scipy, scikit_image, or opencv, but to provide the user with a convenience function that fits well into the library's (hopefully) consistent workflow, such as phasorplot.hist2d(*phasor_median(*phasor_calibrate(*phasor_from_signal(signal, ...)[1:], ...), ...), ...).

@cgohlke
Copy link
Member

cgohlke commented Jul 25, 2024

tutorial are missing

It's probably enough for to update the introduction tutorial.

I noticed that the Cython code works on signal_t instead of float_t. Do you plan/expect to apply median filter on the signal or phasor coordinates?

@bruno-pannunzio
Copy link
Contributor Author

I noticed that the Cython code works on signal_t instead of float_t. Do you plan/expect to apply median filter on the signal or phasor coordinates?

You are right, I was testing that. If it uses float_t it doesn't work with arrays of dtype int, that's why I used signal_t which contains also int types. I know the median filter is intended to be used with phasor coordinates which are float but I thought it may be useful to allow for the use of int, but maybe we leave it only for float. If not we can convert int to float, but I think memory-wise this is not optimal right?

Maybe for now it's better to work only with float since I can't think of now of cases where int is going to be used. Let me know if you believe this is the best option and I will make the changes.

@cgohlke
Copy link
Member

cgohlke commented Jul 25, 2024

Maybe for now it's better to work only with float since I can't think of now of cases where int is going to be used. Let me know if you believe this is the best option and I will make the changes.

The issue with using signal_t is that Cython will create separate C code for each function for each of the types in signal_t, which quickly results in large compiled extension sizes. In some cases, such as the _phasor_from_signal function, it is worth. In the case of the median function probably not unless your implementation offers significant advantages over what's in scipy, skimage, or opencv.

… to use `scipy.ndimage.median_filter` instead of previous cython implementation.

Added tests for `phasor_filter` function.
Updated `tutorials/phasorpy_introduction.py` to reflect changes in `utils` module.
@bruno-pannunzio bruno-pannunzio changed the title WIP: Add median filtering function Add median filtering function Jul 30, 2024
@bruno-pannunzio
Copy link
Contributor Author

Hi @cgohlke, I have been researching as we talked with the cython implementation of the median filter and tested moving the iteration inside the cython function for memory and execution efficency, but It didn't have much of an impact in performance, mainly compared to scipy's and skimage filters, so I rolled back and made a new implementation using scipy's median filter instead. I think it's the best approach for now as it is already very efficent and maybe it's not worth spending time now trying to improve this to only have a very minor impact.

I used scipy's instead of skimage's filtering function because skimage actually uses scipy's and altough it seems to be more efficent with large images, I tested and actually were quite similar. I think scipy is a little more personalizable with the parameters and also can handle more than 3D, which may be of use in the case of multidimensional phasors.

As you may see, I am having some error when building the cibuildwheel because it doesn't seem to have scipy installed, although it is already listed as a dependence. Do you know what can be the problem?

@cgohlke
Copy link
Member

cgohlke commented Jul 30, 2024

I am having some error when building the cibuildwheel because it doesn't seem to have scipy installed, although it is already listed as a dependence. Do you know what can be the problem?

Scipy is in the development dependencies. Add it to the install dependencies in pyproject.toml too:

# "scipy",

@cgohlke
Copy link
Member

cgohlke commented Jul 30, 2024

A few general thoughts:

  • How about moving phasor_filter to the phasor module. We should also discuss again moving all functions in that module to the main package namespace.
  • Since scipy is a heavy dependency and is currently not used elsewhere, how about moving the import statement from scipy.ndimage import median_filter inside the _median_filter function?
  • How does ndimage.median_filter handle NaN values, which are, for example, common in phasor coordinates returned from phasor_from_signal_fft? We should probably review the whole library for NaN handling.

src/phasorpy/utils.py Outdated Show resolved Hide resolved
src/phasorpy/utils.py Outdated Show resolved Hide resolved
src/phasorpy/utils.py Outdated Show resolved Hide resolved
src/phasorpy/utils.py Outdated Show resolved Hide resolved
src/phasorpy/utils.py Outdated Show resolved Hide resolved
src/phasorpy/utils.py Outdated Show resolved Hide resolved
src/phasorpy/utils.py Outdated Show resolved Hide resolved
src/phasorpy/utils.py Outdated Show resolved Hide resolved
src/phasorpy/utils.py Outdated Show resolved Hide resolved
src/phasorpy/utils.py Outdated Show resolved Hide resolved
@bruno-pannunzio
Copy link
Contributor Author

Thanks @cgohlke for all your comments, I will go through them and let you know when it is ready for review.

  • How about moving phasor_filter to the phasor module. We should also discuss again moving all functions in that module to the main package namespace.

I agree, this is something we can discuss in the next meeting as I believe is a big change that should be done (if we agree to do it) before our first release.

  • How does ndimage.median_filter handle NaN values, which are, for example, common in phasor coordinates returned from phasor_from_signal_fft? We should probably review the whole library for NaN handling.

This is a very good issue you raised. It is also something I was thinking about in the thresholding function implementation also. Maybe this is something we can discuss in it's corresponding PR, but I was thinking of masking below the threshold as NaN, if it makes sense and all downstream function work well with `NaN, but is something we have to check first.

Regarding the filtering, how do you think it should handle NaN values? I am thinking of two alternatives:

  1. NaN values are always kept out of the filtering and remain as NaN.
  2. NaN values are filtered with the surrounding values.

Maybe we can discuss this or @schutyb and @lmalacrida can give some insight into this. I am inclined to the first alternative, but I don't know what is usually done in phasor processing in this case.

@cgohlke
Copy link
Member

cgohlke commented Jul 30, 2024

Regarding the filtering, how do you think it should handle NaN values?

Good question. What does ndimage.median_filter do?

Using NaN is tempting, for example, to indicate masked out values in a potential phasor_threshold function. NaNs are already used internally in the pseudo_color function to indicate masked out values. Matplotlib can handle NaN I think. There are some functions in the plot module that currently choke on NaN. I will include a fix in my next PR. Probably phasor_from_signal should return NaN instead of zero to be compatible with the FFT implementation...

@cgohlke cgohlke mentioned this pull request Jul 30, 2024
9 tasks
@bruno-pannunzio
Copy link
Contributor Author

Good question. What does ndimage.median_filter do?

It seems that the topic of NaN handling when filtering is still debated in the imaging forums. I tested with a small array with NaN and ndimage.median_filter seems to fill the NaN with the median of the kernel. But looking the discussions from the community regarding NaN handling, it seems that both scipy's and skimage's filters have undefined behaviour with NaNs and this has been going for a while without a fix. The discussion seems to center in the compromise between being NaN safe at the expense of performance, which is something they don't seem to want to sacrifice.

Considering this I think we have to discuss if implementing our filter makes sense to be able to control what happens to NaN. If we consider NaN are not going to be too frequent, maybe we can stick with ndimage of skimage ones, but I think it will not be consistent.

@bruno-pannunzio bruno-pannunzio requested a review from cgohlke August 1, 2024 12:46
@bruno-pannunzio bruno-pannunzio merged commit 02c3f81 into phasorpy:main Aug 1, 2024
16 checks passed
@bruno-pannunzio bruno-pannunzio deleted the median_filter branch August 1, 2024 13:20
@cgohlke cgohlke mentioned this pull request Aug 2, 2024
13 tasks
schutyb pushed a commit to schutyb/phasorpy that referenced this pull request Aug 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants