-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for NaN input #21
Comments
EDIT: I accidentally sent this out too early, please excuse the edits to my draft. Interesting suggestion. But I'm not sure what the expected outcome would be with these missing/NaN values. Option 1: Poison Whole Window
using Statistics
julia> median([1,2,NaN])
NaN
julia> median([1,2,missing])
missing Is this what you would expect? That a single NaN/missing value "poisons" the entire window? That is what julia> using RollingFunctions
julia> runmedian([1,2,3,4], 3)
4-element Vector{Float64}:
1.0
1.5
2.0
3.0
julia> runmedian([1,2,NaN,4], 3)
4-element Vector{Float64}:
1.0
1.5
NaN
NaN
julia> runmedian([missing,2,3,4], 3)
4-element Vector{Union{Missing, Float64}}:
missing
missing
missing
3.0 Option 2: Use Sorting Convention From Base
julia> sort([NaN,1,2,Inf,-Inf])
5-element Vector{Float64}:
-Inf
1.0
2.0
Inf
NaN I have not tested this exhaustively (there are MANY pitfalls with comparisons in the algorithm), but that might be what julia> using FastRunningMedian
julia> running_median([1,2,NaN,4], 3, :asymmetric)
6-element Vector{Float64}:
1.0
1.5
2.0
4.0
NaN
4.0 However, there's no support for julia> running_median([missing,2,3,4], 3, :asymmetric)
ERROR: MethodError: no method matching running_median(::Vector{Union{Missing, Int64}}, ::Int64, ::Symbol)
Closest candidates are:
running_median(::AbstractVector{T}, ::Integer, ::Any) where T<:Real
@ FastRunningMedian ~/.julia/packages/FastRunningMedian/TyoOP/src/FastRunningMedian.jl:28
running_median(::AbstractVector{T}, ::Integer) where T<:Real
@ FastRunningMedian ~/.julia/packages/FastRunningMedian/TyoOP/src/FastRunningMedian.jl:28
Stacktrace:
[1] top-level scope
@ REPL[78]:1 Option 3: Use Inverse Sorting Convention From BaseThis is what you get with julia> sort([NaN,1,2,Inf,-Inf], lt=<)
5-element Vector{Float64}:
NaN
-Inf
1.0
2.0
Inf I think this might be used by julia> using SortFilters
julia> movsort([1,2,3,4], 3, .5)
4-element Vector{Float64}:
1.0
1.0
2.0
3.0
julia> movsort([1,2,NaN,4], 3, .5)
4-element Vector{Float64}:
1.0
1.0
1.0
2.0
julia> movsort([4,3,NaN,1], 3, .5)
4-element Vector{Float64}:
4.0
4.0
3.0
NaN
julia> movsort([missing,2,3,4], 3, .5)
ERROR: TypeError: non-boolean (Missing) used in boolean context
Stacktrace:
[1] movsort!(out::Vector{Float64}, in::Vector{Union{Missing, Int64}}, window::Int64, p::Float64)
@ SortFilters ~/.julia/packages/SortFilters/lSAIo/src/SortFilters.jl:39
[2] movsort(in::Vector{Union{Missing, Int64}}, window::Int64, p::Float64)
@ SortFilters ~/.julia/packages/SortFilters/lSAIo/src/SortFilters.jl:17
[3] top-level scope
@ REPL[56]:1 Option 4: Ignore NaN/missing Unless They are the Only Thing in WindowThere's another approach, which would be to skip all NaN/missing values if there's another one in the window and return NaN/missing otherwise: julia> nonexistent_running_median([1,2,3,4], 3, :asymmetric)
6-element Vector{Float64}:
1.0
1.5
2.0
3.0
3.5
4.0
julia> nonexistent_running_median([1,2,NaN,4], 3, :asymmetric)
6-element Vector{Float64}:
1.0
1.5
1.5
3.0
4.0
4.0
julia> nonexistent_running_median([missing,2,3,4], 3, :asymmetric)
6-element Vector{Union{Missing, Float64}}:
missing
2.0
2.5
3.0
3.5
4.0 The behavior here is that the window still uses the elements from its supporting points but the actual number of used elements varies depending on how many NaN/missing are in there. In consequence, with this option I would also expect: julia> nonexistent_running_median([missing,NaN,3,4], 3, :asymmetric)
6-element Vector{Float64}:
missing
missing
3.0
3.5
3.5
4.0 ConclusionFor me, Option 4 sounds the most promising right now, though also probably the hardest to implement. Option 2/3 I dislike because whether to put NaN as lowest or highest number is a completely arbitrary choice. Option 1 is the only one that makes any mathematical sense, but is probably less useful in practice than the other options, even if the other options introduce subtle errors (Option 2 biases the median upwards, Option 3 biases the median downwards and Option 3 biases the median left/right depending on whether the missing value is to the right/left of the window center). Which option would you like? |
Thanks for your response. Option 4 is what I need. |
I wrote a simple implementation for Option 4:
However, the performance can be improved and NaN will return when there are more than half of NaN inside a window.
|
In an earlier version of your answer (I got it via mail) you mentioned the behavior of MATLAB. I looked at it and happen to really like it: https://www.mathworks.com/help/matlab/ref/movmedian.html.
That's elegant, but not quite right. Inserting Inf or -Inf pushes the median in those windows up or down by some amount that strongly depends on the distribution of the data, and is not guaranteed to average out afterwards. It is not the same as completely omitting the NaN. Anyway, I'd like to copy MATLAB's behavior. Here's the steps I'd like to take:
The biggest hurdle will be an elegant implementation that works well with both options. Option 1 will probably want to track the index of the newest NaN in the window, and output NaN until it is out again. Option 4 will likely need the ability to roll with less elements than the maximum window size, which currently is impossible. I'll start with the reference implementation and I'll also do the Fuzz tests in the near future. I'll also split out an issue regarding more data types, like Union{Float64, Missing} or Dates. |
Alright, reference implementation for Option 4 is done: julia> using RollingFunctions, Statistics
julia> function median_ignorenan(x)
mask = @. !isnan(x)
masked = @view x[mask]
masked |> isempty && return NaN
median(masked)
end
median_ignorenan (generic function with 1 method)
julia> running(median_ignorenan, [NaN,2,3,4], 3)
4-element Vector{Float64}:
NaN
2.0
2.5
3.0
julia> running(median_ignorenan, [1,2,NaN,4], 3)
4-element Vector{Float64}:
1.0
1.5
1.5
3.0 Until FastRunningMedian supports it, the above (maybe replace |
Yes, that approah misbehaves when encounter too many NaNs inside a running window. Yet, I still use it for now because the performance of running median is critical in my task, and your package is faster than calling the pandas package by times. Good luck with your future update~ |
Hi, could it work to just replace the |
Hi @berjine 👋 Unfortunately I haven’t had much time for programming the NaN handling. The testing is in place on the ‘master’ branch though, so if someone wants to give the implementation a go, feel free to do so and put in a PR. If you start working on it, let us know in this issue. Last time I started working on this I remember being so fed up with modifying my complicated algorithm that I explored SkipLists some more instead 😂. I’ll see whether I can carve out some more time for this issue in the future, but it’s a hobby project, so no guarantees. |
Sure ! A is With 1M points, 1% random NaN and 1001-points window : Same data but 101-points window : With 10k points, 1% random NaN and 101-points window : It confirms your workaround works great, and that FastRunningMedian is fantastic haha A possible workaround is to interpolate the NaNs to suppress them beforehand This works quite well I think, as long as there are no NaNs at the beginning or end of the data vector x) |
Thanks for the comparison. Interpolating NaNs might be the best workaround yet in terms of performance 👍
That's because RollingFunctions.jl uses different boundary conditions than FastRunningMedian.jl. If I remeber correctly (untested code ahead), you can match the two with either:
Consult the Excel sheet in the README or the docstrings for a detailed overview of the boundary conditions in this library. |
I got around to the implementation of Option 1 (default) and Option 4 today. 🎉 It is available on the
Default behavior is I'd appreciate if both of you could give it a try and compare it to your existing solutions. Also, if you have Feedback on Performance or the API, please feel free to write what's on your mind. |
Fantastic ! It works with the same awesome performance ! Maybe you would want to detect the presence of NaN or missing values and autoselect the |
See above:
MATLAB also does it like this and I‘m pretty convinced it is the right approach. That‘s why nan=:include is the default. |
Very quick benchmark comparison of how much the NaN handling slowed down the performance: Before (0.2.0): julia> @benchmark running_median($x, 101)
BenchmarkTools.Trial: 88 samples with 1 evaluation.
Range (min … max): 54.663 ms … 67.524 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 56.883 ms ┊ GC (median): 0.00%
Time (mean ± σ): 57.382 ms ± 1.701 ms ┊ GC (mean ± σ): 0.19% ± 0.47%
▁▁▂█▄ ▁▁
▃▁▁▁▁▃▃▃▁▅████████▅▅█▅▃▅█▁▃▅▃▅▁▆▁▁▅▃▃▃▁▁▁▁▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
54.7 ms Histogram: frequency by time 62.8 ms <
Memory estimate: 7.65 MiB, allocs estimate: 25. After (9b17e62): julia> @benchmark running_median($x, 101)
BenchmarkTools.Trial: 85 samples with 1 evaluation.
Range (min … max): 57.673 ms … 72.718 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 58.444 ms ┊ GC (median): 0.00%
Time (mean ± σ): 58.995 ms ± 1.753 ms ┊ GC (mean ± σ): 0.17% ± 0.43%
▂█ ▂
▄█▇▄▁▇▅██▇▇▅█▄▄▁▁▄▁▁█▅▅▄▅▅▅▄▁▅▅▁▁▁▁▄▁▄▁▄▇▁▁▄▇▄▁▁▁▄▁▅▇▄▁▄▁▁▄ ▁
57.7 ms Histogram: frequency by time 60.9 ms <
Memory estimate: 7.65 MiB, allocs estimate: 64. That's a slowdown of 3 %. It's not great but NaNs need to be handled for correctness. I'll ship it now and we can include possible performance improvements in the next patch release. You should soon be able to update FastRunningMedian to v0.2.1 and get the new NaN handling feature. |
As of v0.2.1, FastRunningMedian doesn't output garbage with NaNs anymore 🎉 (tracking issue: Firionus/FastRunningMedian.jl#21). I've adjusted the README slightly to include this information.
There are many missing/corrupted data in real world cases, adding support for NaN in inputs would be great.
The text was updated successfully, but these errors were encountered: