Big performance regression in subsetting in v0.11 #251
Comments
Thinking about it some more, it looks like I should be using a merge instead, as in the second expression benchmarked here:
library(xts)
library(microbenchmark)
ts <- .xts(1:1e5, 1:1e5)
dates <- .POSIXct(1:1e5)
microbenchmark(ts[dates], merge.xts(ts, xts(, dates), join="right", retside=c(TRUE,FALSE)), times=1) # fewer iterations, otherwise it takes too long
Unit: milliseconds
                                                                  expr          min           lq         mean       median           uq          max
                                                             ts[dates] 18403.239120 18403.239120 18403.239120 18403.239120 18403.239120 18403.239120
 merge.xts(ts, xts(, dates), join = "right", retside = c(TRUE, FALSE))     2.835482     2.835482     2.835482     2.835482     2.835482     2.835482
A quick profiling session indicates that the bottleneck is in this block:
if(usr_idx && !is.null(firstlast)) {
  # Translate from user .index to xts index
  # We get back upper bound of index as per findInterval
  tmp <- base_idx[firstlast]
  # Iterate in reverse to grab all matches
  # We have to do this to handle duplicate dates in the xts index.
  tmp <- rev(tmp)
  res <- NULL
  for(i in tmp) {
    dt <- idx[i]
    j <- i
    repeat {
      res <- c(res, j)
      j <- j - 1
      if(j < 1 || idx[j] != dt) break
    }
  }
  firstlast <- rev(res)
}
I'd like a solution that is more performant with duplicate values, but a quick fix for the non-duplicate case would be to add an
Let's see:
# Test index parameter with repeated dates in xts series
idx <- sort(rep(1:5, 5))
x <- xts(1:length(idx), as.Date("1999-12-31")+idx)
bin <- window(x, index = as.Date("1999-12-31")+c(1,3,5))
So here bin gives the duplicated dates in the xts series as expected. Now we did have a discussion about this because there are two aspects:
Now, back to the performance of this routine. I agree it is slow. It is trying to do a fast subset of the dates, but making it work with duplicated dates in the xts index slows everything down.
Looking further, you could say that 3) is already implemented by
x <- xts(4:10, Sys.Date()+c(4,4,4, 7:10))
y <- xts(1:6, Sys.Date()+1:6)
merge(x,y, join='inner')
#            x y
# 2018-07-23 4 4
Instead, the inner join should return 3 records matching the values in x. (Here is a more correct version of a sort-merge-join that allows for duplicate values: http://www.dcs.ed.ac.uk/home/tz/phd/thesis/node20.htm).
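For comparison, the duplicate-aware inner join described above can be illustrated with base R's merge() on data frames. This is only a sketch: the data frames `x_df` and `y_df` and the `key` column are stand-ins invented for illustration (integer day offsets replace the dates), and it says nothing about how merge.xts() is or should be implemented.

```r
# Stand-ins for the xts objects above: integer day offsets replace the dates.
# x has the key 4 three times; y has each key once.
x_df <- data.frame(key = c(4, 4, 4, 7:10), x = 4:10)
y_df <- data.frame(key = 1:6, y = 1:6)

# Inner join on the key: every duplicate key in x_df is matched against y_df,
# so the shared key 4 yields one row per occurrence in x_df.
joined <- merge(x_df, y_df, by = "key")
print(joined)
```

Here the join keeps all three x values that share the key, which is the behaviour argued for in the comment above.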
I'm not at all familiar with the working of xts subsetting, so I may have misunderstood what you are discussing in which case I apologise. I just wanted to point out that in the example above there are no duplicates in the index of
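Thanks for your detailed comments, @corwinjoy!
It looks like window.xts() currently behaves differently from window.zoo() when index. contains duplicate values. window.zoo() seems to ignore the duplicates, while window.xts() returns all the matching .index() values multiple times. For example:
xidx <- as.Date("1999-12-31") + rep(1:3, each = 2)
# duplicate values in xts index, and duplicate values in window index.
x <- xts(seq_along(xidx), xidx)
widx <- as.Date("1999-12-31") + c(1, 1)
window(x, widx)
#            [,1]
# 2000-01-01    1
# 2000-01-01    2
# 2000-01-01    1
# 2000-01-01    2
zoo:::window.zoo(x, widx)
#            [,1]
# 2000-01-01    1
# 2000-01-01    2
I lean more toward consistency with zoo, but I do realize that this is a bit of an edge case.
I agree with your assessment of the result of inner joins performed by merge.xts(). I created issue #106 to document it, but have not attempted a fix. I plan to refactor the merge C code and address the issue afterward.
@TomAndrews the slowdown is caused by re-allocating the res vector inside the for and repeat loops. You don't need to worry about the details of xts subsetting. The benchmark tests you provided are helpful by themselves.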
Sounds good Josh. Hopefully you can just refactor the merge routine like
what you did for binsearch. That is, just create an == operation that works
over multiple types.
For subsetting, it is not too clear what the behavior should be. Instead of
looking at zoo, I modelled standard R behavior. E.g.
x<- 1:5
x[c(1,1)]
Gives
1, 1
This is in line with what standard inner join does.
Either way, the behavior should be documented.
Best,
Corwin
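The two candidate behaviours discussed here can be contrasted without xts or zoo. The sketch below uses plain vectors, and the names `by_request` and `by_membership` are made up for illustration. It mirrors the outputs shown in this thread but is not how either package implements window().

```r
# Index with duplicate values, mirroring the xts example in this thread.
idx  <- rep(1:3, each = 2)   # 1 1 2 2 3 3
vals <- seq_along(idx)       # 1 2 3 4 5 6
want <- c(1, 1)              # requested index values, with a duplicate

# "Standard R" semantics, like x[c(1, 1)]: each requested value returns all
# of its matches, so a repeated request repeats the matches.
by_request <- unlist(lapply(want, function(w) vals[idx == w]))

# Membership semantics: a row is kept at most once if its index value was
# requested, so duplicates in the request are ignored.
by_membership <- vals[idx %in% want]

print(by_request)
print(by_membership)
```

The first result has four elements and the second has two, matching the window.xts() and window.zoo() outputs quoted in this thread respectively.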
…On Sat, Jul 21, 2018, 7:26 AM Joshua Ulrich ***@***.***> wrote:
Thanks for your detailed comments, @corwinjoy
<https://github.com/corwinjoy>!
It looks like window.xts() currently behaves different from window.zoo()
when index. contains duplicate values. window.zoo() seems to ignore the
duplicates, while window.xts() returns all the matching .index() values
multiple times. For example:
xidx <- as.Date("1999-12-31") + rep(1:3, each = 2)
# duplicate values in xts index, and duplicate values in window index.
x <- xts(seq_along(xidx), xidx)
widx <- as.Date("1999-12-31") + c(1, 1)
window(x, widx)
#            [,1]
# 2000-01-01    1
# 2000-01-01    2
# 2000-01-01    1
# 2000-01-01    2
zoo:::window.zoo(x, widx)
#            [,1]
# 2000-01-01    1
# 2000-01-01    2
I lean more toward consistency with zoo, but I do realize that this is a
bit of an edge case.
I agree with your assessment of the result of inner joins performed by
merge.xts(). I created issue #106
<#106> to document it, but have
not attempted a fix. I plan to refactor the merge C code and address the
issue afterward.
@TomAndrews <https://github.com/TomAndrews> the slowdown is caused by
re-allocating the res vector inside the for and repeat loops. You don't
need to worry about the details of xts subsetting. The benchmark tests you
provided are helpful by themselves.
This basically moves the R code into C, with minor changes. The C code pre-(and over-)allocates the result and loops over the locations in reverse order. Corwin suggested better long-term solutions (e.g. fix merge.xts() to handle duplicate index values correctly), but this commit addresses the immediate performance regression. Also add more benchmarks. See #251.
@TomAndrews, thanks again for the report! My timings suggest this solution is 2-3x faster than 0.10-2. Hopefully you see similar results.
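The difference between growing `res` with c() and pre-allocating it can be sketched in plain R. This is an illustration of the idea only, not the C code from the commit: the function names `expand_dups_grow()` and `expand_dups_prealloc()` are hypothetical, `idx` is assumed to be a sorted index, and `last_pos` is assumed to hold distinct positions of the last match for each requested value.

```r
# Growing version, modeled on the R loop quoted earlier in this thread:
# every c(res, j) copies the whole result accumulated so far.
expand_dups_grow <- function(idx, last_pos) {
  res <- NULL
  for (i in rev(last_pos)) {
    dt <- idx[i]
    j <- i
    repeat {
      res <- c(res, j)
      j <- j - 1
      if (j < 1 || idx[j] != dt) break
    }
  }
  rev(res)
}

# Pre-allocating version: allocate the worst case once, fill from the back
# while walking in reverse, then keep only the filled tail.
expand_dups_prealloc <- function(idx, last_pos) {
  last_pos <- as.integer(last_pos)
  # Assumption: no repeated positions, so at most length(idx) matches exist.
  stopifnot(!anyDuplicated(last_pos))
  n <- length(idx)
  res <- integer(n)   # over-allocated result buffer
  k <- n              # next slot to fill, moving backwards
  for (i in rev(last_pos)) {
    dt <- idx[i]
    j <- i
    repeat {
      res[k] <- j
      k <- k - 1L
      j <- j - 1L
      if (j < 1L || idx[j] != dt) break
    }
  }
  res[seq_len(n - k) + k]   # slots k+1..n were filled
}

idx <- rep(1:3, each = 2)   # 1 1 2 2 3 3
last_pos <- c(2L, 6L)       # last positions of the values 1 and 3
grown <- expand_dups_grow(idx, last_pos)
prealloc <- expand_dups_prealloc(idx, last_pos)
print(grown)
print(prealloc)
```

Both functions return the same positions. The second one writes into a buffer allocated once, rather than copying the result on every match.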
Moving the duplicate index value handling to C created an error when the index type was integer. This was not caught in the unit tests. Add a loop around the window/subset unit tests that run them on both index types (double and integer). Add support for both index types to the fill_window_dups_rev() C function. Change the scalar '1' to '1L' in the pmax() call after findInterval() to ensure that 'base_index' is not coerced to double from integer. See #251.
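The coercion mentioned in the commit message is easy to reproduce in base R. The snippet below is a minimal sketch with made-up input values. It only demonstrates how pmax() picks its result type and is not taken from the xts source.

```r
# findInterval() returns an integer vector of interval positions.
base_idx <- findInterval(c(0.5, 2.5), c(1, 2, 3))
print(typeof(base_idx))

# A double scalar promotes the result to double...
as_double <- pmax(base_idx, 1)
print(typeof(as_double))

# ...while an integer scalar keeps the result integer.
as_integer <- pmax(base_idx, 1L)
print(typeof(as_integer))
```

Using `1L` therefore keeps the positions integer, which matters when downstream C code expects a specific storage type.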
Description
There seems to be a big performance regression when subsetting an xts object using a vector of dates. In the example below it goes from 41 milliseconds to 17 seconds to perform the subsetting.
Using v0.10-2:
Session Info
Using v0.11:
Session Info
Obviously this is a bit of a stupid example since I'm just fetching all the dates in ts. In my actual problem, dates is arbitrary but large. Should I be re-writing the subsetting in a different way?