Performance regression in mean filter (v0.18.0 onwards) #6028

Open

soyers opened this issue Nov 9, 2021 · 7 comments
@soyers

soyers commented Nov 9, 2021

Description

Our performance tests detected a regression in the mean filter (skimage.filters.rank.mean) after upgrading from scikit-image 0.17.2 to 0.18.3.
I was not able to reproduce this issue in Linux-based Colab notebooks, so it seems to be specific to the Windows builds of skimage.
After some profiling I found that the skimage.filters.rank.generic_cy._mean function is more than 30% slower in skimage v0.18.3 than in v0.17.2.

Way to reproduce

Run the following code in an environment with skimage v0.17.2, then run the same code with skimage v0.18.3:

import numpy as np
import skimage
from skimage.morphology import disk
from skimage.filters.rank.generic_cy import _mean
from skimage.filters.rank.generic import _preprocess_input
import time

print(skimage.__version__)

inputs = (np.random.random((1000, 1000)) * 255.0).astype("uint8")
footprint = disk(8)
inputs, footprint, out, mask, n_bins = _preprocess_input(
    inputs,
    footprint,
    None,
    None,
    None,
    1
)


def timing_fn():
    _mean(inputs, footprint, out=out, mask=mask, shift_x=False, shift_y=False, n_bins=n_bins)


NUM_RUNS = 1000

start = time.time()
for _ in range(NUM_RUNS):
    timing_fn()
end = time.time()

print((end-start) / NUM_RUNS)

The output of the script run with v0.18.3 is (first line is the skimage version, second line is the average execution time of the _mean function in seconds):

0.18.3
0.4662466173171997

The output of the script run with v0.17.2 is:

0.17.2
0.2975685405731201

Version information

# Paste the output of the following python commands
from __future__ import print_function
import sys; print(sys.version)
import platform; print(platform.platform())
import skimage; print(f'scikit-image version: {skimage.__version__}')
import numpy; print(f'numpy version: {numpy.__version__}')

Output from v0.18.3 environment:

3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)]
Windows-10-10.0.18362-SP0
scikit-image version: 0.18.3
numpy version: 1.21.4

Output from v0.17.2 environment:

3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)]
Windows-10-10.0.18362-SP0
scikit-image version: 0.17.2
numpy version: 1.21.4
@jni
Member

jni commented Nov 15, 2021

As far as I can tell, the only thing that's changed is #5399, which removes the GIL from the core loop of the rank filters. I can imagine that this adds a small overhead when running single-threaded in some situations, e.g. maybe releasing and reacquiring the GIL is more expensive on Windows? However, the change results in very significant (multi-x) speedups in multithreaded environments, so I'm not sure whether much can be done here... (i.e. we are not going to revert #5399.) Perhaps @soyers you can try to benefit from the multithreading improvements using skimage.util.apply_parallel to offset the 30% slowdown you're seeing?
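A minimal sketch of that apply_parallel route, assuming dask is installed (apply_parallel needs it) and that the chunk size and depth are only illustrative starting points; this is not a drop-in replacement for the private _mean call in the benchmark above:

import numpy as np
from skimage.filters.rank import mean
from skimage.morphology import disk
from skimage.util import apply_parallel

image = (np.random.random((1000, 1000)) * 255.0).astype("uint8")
footprint = disk(8)

# depth must cover the filter radius (8 here) so pixels near chunk borders see
# their full neighbourhood; the footprint is passed positionally because its
# keyword name changed between releases.
result = apply_parallel(
    mean,
    image,
    chunks=(250, 250),
    depth=8,
    extra_arguments=(footprint,),
)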

@scikit-image/core unless others have ideas about how to fix this, I suggest (unfortunately!) closing this as wont-fix...

@soyers do you have any thoughts on the above? I'm disappointed to not be able to help but I don't see another way forward...

@JDWarner
Contributor

It would be really useful to know how many cores were available to the Win10 (virtual?) environment in question @soyers.

Beyond the code changes, did anything get tweaked globally in how the Windows builds are compiled and optimized around 0.18? It's been a while, though, and I would have expected many more complaints since then if something systematic were going on, especially with nogil. The number of single-threaded Windows users may be so small that this flies under the radar.

@jni
Member

jni commented Nov 15, 2021

@JDWarner what do you mean by "single-threaded Windows users"? I would expect that most users work with single threads. I think it's more likely that most people, especially desktop users (vs. HPC users running Linux), wouldn't notice a 30% performance decrease.

@soyers
Author

soyers commented Nov 15, 2021

Hi @jni and @JDWarner,

thank you for the detailed feedback and your thoughts on this!

@JDWarner I am using a Windows 10 system (native, no VM) with a (mini)conda environment. The scikit-image library is installed from the Anaconda repository. The machine executing the tests is a Dell Precision 7520 with an Intel Core i7-6920HQ (4C/8T) and 32 GB RAM. The experiment above is not parallelized (it runs on a single core).

I am currently struggling to understand which version of scikit-image actually introduced the kernels that release the GIL. #5399 was merged in May 2017, while scikit-image 0.17.2 was released in 2020. I was therefore assuming that version 0.17.2 already contains the changes for releasing the GIL. As far as I understand, this commit should be the one that was released as v0.17.2: https://github.com/scikit-image/scikit-image/blob/1188c33c3defefdc0f672870b5bcc36beaec967f/skimage/filters/rank/bilateral_cy.pyx
The definition of the mean kernel (_kernel_mean) already seems to have the nogil tag. However, this version does not show the performance regression on my machine.
Am I missing some important point here?

Using the release tags to narrow down the time frame in which the regression must have been introduced, I would argue that the following PRs most likely introduced the regression:

I just skimmed over the changes and noticed that this PR introduces a couple of changes in how parameters are passed to the Cython part of the code. cd5c1a7#diff-11036dd1acc33e83cc65e4a480cb1eb4d56877c10ebbf93c5e5e974191d92988 shows that a lot of Py_ssize_t* turned into Py_ssize_t[::1] (unfortunately I don't have experience with this, so probably it boils down to being the same).
Also, the _core function that executes the kernels (see cd5c1a7#diff-9fce4d855510db2956d712c46bb402fd6b541fd49fa109746751bb826d0c2526) seems to have changed the way memory is allocated.
For example:

cdef Py_ssize_t [::1] se_e_r = np.empty(se_size, dtype=np.intp)

changed to

se_e_r = <Py_ssize_t*>malloc(se_size)

Unfortunately, I don't have the knowledge to judge whether some of these changes could potentially cause regressions, but maybe you have more insight into this topic!

I'm grateful for any input 😄

@grlee77
Contributor

grlee77 commented Nov 15, 2021

If your application allows relaxing the disk footprint to a square one of similar area (e.g. a 14x14 square instead of a radius-8 disk), you can get better performance using scipy.ndimage.uniform_filter. This is because the square footprint is separable across axes, so it gets computed using a 14x1 rectangle and then the output of that is filtered by a 1x14 footprint. uniform_filter also uses a sliding-window approach for efficiency.
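A rough sketch of that alternative, assuming a 14x14 box is an acceptable approximation of the disk(8) neighbourhood used above:

import numpy as np
from scipy import ndimage as ndi

image = (np.random.random((1000, 1000)) * 255.0).astype("uint8")

# Box (uniform) filter: separable, so it runs as two cheap 1-D passes.
smoothed = ndi.uniform_filter(image, size=14)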

@grlee77
Contributor

grlee77 commented Nov 15, 2021

a lot of Py_ssize_t* turned into Py_ssize_t[::1]

This might be related. I think sometimes Cython memoryviews can be a little slower than using pointers, but we would have to benchmark it to be sure.

@soyers
Author

soyers commented Nov 15, 2021

@grlee77 Thank you very much for the hint! I hadn't thought about using a uniform filter.

Concerning the memoryviews vs pointers issue, I can imagine that it's hard to tell whether this is really an issue. I am not sure when I will find the time to investigate this, but if somebody else already has a development stack running and is trying to reproduce this, I will do my best to help!
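In case it helps anyone trying to reproduce this, here is a rough timing sketch against the public skimage.filters.rank.mean rather than the private _mean/_preprocess_input functions (whose signatures may differ between releases); the run count is arbitrary:

import timeit

import numpy as np
import skimage
from skimage.filters.rank import mean
from skimage.morphology import disk

image = (np.random.random((1000, 1000)) * 255.0).astype("uint8")
footprint = disk(8)  # passed positionally; the keyword name changed between releases

print(skimage.__version__)
# average seconds per call of the public mean filter
print(timeit.timeit(lambda: mean(image, footprint), number=100) / 100)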
