-
-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for window function #100
Conversation
I propose that a window function be added to xts so that users can have binary lookup over a range of dates. This function is the same as in the base package zoo, that is window.zoo, however we use binary search over the start and end date.
Also, window_bin probably can't be easily reused in .subset.xts - too much overhead relative to a two line binary search. If I wasn't lazy - it could actually be done as follows: 1. Have window_bin take a .index parameter like window.zoo and return a set of index numbers (possibly NULL for no match). Then .subset.xts could re-use. Even better, subset.xts should be rewritten to do binary search when it is passed a set of dates. How? Actually, R has a nice built-in function for this findInterval. This uses binary search, but even better, it starts at the last matched value on subsequent searches. This ends up being O(n) on the number of dates to lookup. Much faster than O(n*N) with the current match logic.
We need to make sure we are accessing the internal POSIXct index, so switch the call to .index. The index function (potentially) converts to an end-user index.
Thank you very much for taking the initiative on this! A couple things I would like to see done (either by you or me; let me know your preference):
|
@joshuaulrich wrote a number of nice comments as reproduced below. Basically I'm open to all of these ideas but would like to iron out specifics. Here are my thoughts:
The .subset.xts function has the following code: if (timeBased(i)) {
if(inherits(i, "POSIXct")) {
i <- which(!is.na(match(.index(x), i)))
} else if(inherits(i, "Date")) {
i <- which(!is.na(match(.index(x), as.POSIXct(as.character(i),tz=indexTZ(x)))))
} else {
# force all other time classes to be POSIXct
i <- which(!is.na(match(.index(x), as.POSIXct(i,tz=indexTZ(x)))))
}
i[is.na(i)] <- 0
} Instead, I simply ask for a conversion to POSIXct and fail if there is not an unambiguous conversion. I think this is cleaner, but let's look at the cases. First, imagine they pass in a POSIXct date with a different time zone. I think this is OK because the internal representation will be the same. For example: x <- as.POSIXct("2015-06-05", tz = "GMT")
y <- x
attr(y, "tzone") = ""
# The timezone changes the pretty printing
> x
[1] "2015-06-05 GMT"
# My default time zone is PDT
> y
[1] "2015-06-04 17:00:00 PDT"
# The time zone attribute does not change the underlying representation because this is always in UTC.
# From the docs for POSIXct:
# Class "POSIXct" represents the (signed) number of seconds since the beginning of 1970 (in the UTC timezone) as a numeric vector.
> unclass(x)
[1] 1433462400
attr(,"tzone")
[1] "GMT"
> unclass(y)
[1] 1433462400
attr(,"tzone")
[1] ""
Conversion from a Date and conversion from other time based classes tries to trust the conversion operator. Here I think start <- as.POSIXct(start, tz=indexTZ(x)) is safer than else if(inherits(i, "Date")) {
i <- which(!is.na(match(.index(x), as.POSIXct(as.character(i),tz=indexTZ(x)))))
} What if as.character(i) converts the date to one time zone (the default) then as.POSIXct reads this as another? Then I think we get the wrong date. Really this time zone, indexTZ, may only be used if start is a string. Finally, another option would be to emulate window.zoo and call the .binsearch function. Looking at the code this compares index > key, index < key and index != key. This would be a lower standard, but I don't like it. These comparisons may be slow and if we can't get an unambiguous conversion to our index type I think we should make the user clarify. |
window.xts. Make .index efficient by using findInterval. 2. Get unit tests running again. Had to disable the reclass tests for now because the attributes don't match in checkIdentical. Disabled with the prefix no_test. 3. Fixed class order for POSIXct in timeBasedSeq.R so the result matches POSIXct (and passes unit tests). 4. Added window.xts to the namespace.
when the xts series has repeated dates in series. 2. Add support for dates of the form I(character dates). The default conversion did not work. 3. Get a clean R CMD check result. To do this I had to fix the line endings for the source files (via dos2unix) and makefiles. Check also complained about missing Author and Maintainer fields in the DESCRIPTION file. 4. Add documentation for window.xts
Josh,
I think the window code is pretty solid now and ready for you to review > x <- xts(1:1000, as.Date("2000-01-01")+1:1000)
> x[c(3,2,1),] # Result is sorted
[,1]
2000-01-02 1
2000-01-03 2
2000-01-04 3
> x[c(1,1,2,3),] # Duplicates are allowed
[,1]
2000-01-02 1
2000-01-02 1
2000-01-03 2
2000-01-04 3
Now, you are probably thinking "Of course, you idiot, the returned subset Thanks again. I ended up touching a bit more code than I planned on for Corwin |
Rewrite binary search routine to make the code clearer and more maintainable.
Ugh, I think I closed this because I was working on merging it... then got distracted and completely lost track of it. Reopening because these are useful contributions. |
Awesome! Hopefully some of these ideas can make it into the project, I
certainly found them useful in some of the code I have. At any rate, I
think the idea of generalizing the window function and cleaning up the
binary search function that was in there is the right direction.
…On Fri, Apr 20, 2018 at 8:16 AM, Joshua Ulrich ***@***.***> wrote:
Reopened #100 <#100>.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#100 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AMJeasl9C93RfctIBAOFMhqJPBYzLhUMks5tqfvqgaJpZM4E3OE1>
.
|
Lines 62-74, 94-104, and 114-116 had no test coverage, and even basic coverage is better than nothing when refactoring. These tests pass on the current code, so they should indicate any serious errors with Corwin's binary search/subset refactor. See #100.
There were conflicts in basically every file, largely due to the time between the last branch commit and the merge. The branch also contained LF (Windows) line endings, plus the archival of the 'its' package, and the xts PR that ensured all headers had "GPL 2" contributed to number of conflicts. I also reverted all the changes to unit tests that were made to stop them from running. The unit tests were not being run at all when Corwin started his patch, but are running successfully now. Other than what's mentioned above, I merged the branch as-is, so this merge commit will be followed by some cleanup commits. Fixes #100.
Bummer... looks like there's performance regression in at least one subsetting case. It looks like the bulk of it is from not having an integer binary search branch (i.e. the I will investigate and patch, unless anyone else wants to get to it quickly. I would also appreciate any contributed benchmark cases (not just for subsetting)! > cat xts/inst/benchmarks/benchmark.subset.R
x <- xts::.xts(1:2e7, 1.0*seq(86400*365*20, 86400*365*40, length.out = 2e7))
# warmup
for(i in 1:2) {
x["1990/2000"]
}
# profile
Rprof()
for(i in 1:25) {
x["1990/2000"]
}
Rprof(NULL)
(srp <- summaryRprof())
> # Using current master (1716103af1d3c7ede69e6abda986daf2bc74e90c)
> Rscript xts/inst/benchmarks/benchmark.subset.R
$by.self
self.time self.pct total.time total.pct
".Call" 1.86 47.69 1.86 47.69
"c" 1.68 43.08 1.68 43.08
"seq.int" 0.34 8.72 0.34 8.72
"strptime" 0.02 0.51 0.02 0.51
$by.total
total.time total.pct self.time self.pct
"[" 3.90 100.00 0.00 0.00
"[.xts" 3.90 100.00 0.00 0.00
".Call" 1.86 47.69 1.86 47.69
"c" 1.68 43.08 1.68 43.08
"isOrdered" 0.76 19.49 0.00 0.00
"seq.int" 0.34 8.72 0.34 8.72
"strptime" 0.02 0.51 0.02 0.51
"<Anonymous>" 0.02 0.51 0.00 0.00
"as.POSIXct" 0.02 0.51 0.00 0.00
"as.POSIXlt" 0.02 0.51 0.00 0.00
"do.call" 0.02 0.51 0.00 0.00
"ISOdatetime" 0.02 0.51 0.00 0.00
".parseISO8601" 0.02 0.51 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 3.9
> # Using current window-xts HEAD (bfedeb00cc5c35e31fb7f71519c840a36fd85437)
> Rscript xts/inst/benchmarks/benchmark.subset.R
self.time self.pct total.time total.pct
"as.double" 3.00 34.97 3.00 34.97
"[.xts" 2.90 33.80 8.58 100.00
".Call" 1.46 17.02 1.46 17.02
"c" 0.80 9.32 0.80 9.32
"seq.int" 0.38 4.43 0.38 4.43
"attr" 0.02 0.23 0.02 0.23
"gsub" 0.02 0.23 0.02 0.23
$by.total
total.time total.pct self.time self.pct
"[.xts" 8.58 100.00 2.90 33.80
"[" 8.58 100.00 0.00 0.00
"index_bsearch" 3.38 39.39 0.00 0.00
"as.double" 3.00 34.97 3.00 34.97
"binsearch" 3.00 34.97 0.00 0.00
".Call" 1.46 17.02 1.46 17.02
"c" 0.80 9.32 0.80 9.32
"seq.int" 0.38 4.43 0.38 4.43
"isOrdered" 0.38 4.43 0.00 0.00
".parseISO8601" 0.04 0.47 0.00 0.00
"attr" 0.02 0.23 0.02 0.23
"gsub" 0.02 0.23 0.02 0.23
"as.list" 0.02 0.23 0.00 0.00
"as_numeric" 0.02 0.23 0.00 0.00
"as.POSIXct" 0.02 0.23 0.00 0.00
"as.POSIXct.POSIXlt" 0.02 0.23 0.00 0.00
"as.POSIXlt" 0.02 0.23 0.00 0.00
"do.call" 0.02 0.23 0.00 0.00
"parse.side" 0.02 0.23 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 8.58 |
Yeah, I would agree. It looks like the cast to as.double is a big hit here
so maybe we need an int version of the binsearch which should be
straightforward. Also something is up with the "[.xts" call, I'm guessing
that the date constructor here is very inefficient.
…On Mon, Apr 23, 2018 at 10:03 AM, Joshua Ulrich ***@***.***> wrote:
Bad news... looks like there's performance regression in at least one
subsetting case. It looks like the bulk of it is from not having an integer
binary search branch (i.e. the as.double() calls).
I will investigate and patch, unless anyone else wants to get to it
quickly.
I would also appreciate any contributed benchmark cases (not just for
subsetting)!
> cat xts/inst/benchmarks/benchmark.subset.R
x <- xts::.xts(1:2e7, 1.0*seq(86400*365*20, 86400*365*40, length.out = 2e7))# warmup
for(i in 1:2) {
x["1990/2000"]
}# profileRprof()
for(i in 1:25) {
x["1990/2000"]
}
Rprof(NULL)
(srp <- summaryRprof())
> # Using current master (1716103)> Rscript xts/inst/benchmarks/benchmark.subset.R$by.self
self.time self.pct total.time total.pct".Call" 1.86 47.69 1.86 47.69"c" 1.68 43.08 1.68 43.08"seq.int" 0.34 8.72 0.34 8.72"strptime" 0.02 0.51 0.02 0.51
$by.total
total.time total.pct self.time self.pct"[" 3.90 100.00 0.00 0.00"[.xts" 3.90 100.00 0.00 0.00".Call" 1.86 47.69 1.86 47.69"c" 1.68 43.08 1.68 43.08"isOrdered" 0.76 19.49 0.00 0.00"seq.int" 0.34 8.72 0.34 8.72"strptime" 0.02 0.51 0.02 0.51"<Anonymous>" 0.02 0.51 0.00 0.00"as.POSIXct" 0.02 0.51 0.00 0.00"as.POSIXlt" 0.02 0.51 0.00 0.00"do.call" 0.02 0.51 0.00 0.00"ISOdatetime" 0.02 0.51 0.00 0.00".parseISO8601" 0.02 0.51 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 3.9
> # Using current window-xts HEAD (bfedeb0)> Rscript xts/inst/benchmarks/benchmark.subset.R
self.time self.pct total.time total.pct"as.double" 3.00 34.97 3.00 34.97"[.xts" 2.90 33.80 8.58 100.00".Call" 1.46 17.02 1.46 17.02"c" 0.80 9.32 0.80 9.32"seq.int" 0.38 4.43 0.38 4.43"attr" 0.02 0.23 0.02 0.23"gsub" 0.02 0.23 0.02 0.23
$by.total
total.time total.pct self.time self.pct"[.xts" 8.58 100.00 2.90 33.80"[" 8.58 100.00 0.00 0.00"index_bsearch" 3.38 39.39 0.00 0.00"as.double" 3.00 34.97 3.00 34.97"binsearch" 3.00 34.97 0.00 0.00".Call" 1.46 17.02 1.46 17.02"c" 0.80 9.32 0.80 9.32"seq.int" 0.38 4.43 0.38 4.43"isOrdered" 0.38 4.43 0.00 0.00".parseISO8601" 0.04 0.47 0.00 0.00"attr" 0.02 0.23 0.02 0.23"gsub" 0.02 0.23 0.02 0.23"as.list" 0.02 0.23 0.00 0.00"as_numeric" 0.02 0.23 0.00 0.00"as.POSIXct" 0.02 0.23 0.00 0.00"as.POSIXct.POSIXlt" 0.02 0.23 0.00 0.00"as.POSIXlt" 0.02 0.23 0.00 0.00"do.call" 0.02 0.23 0.00 0.00"parse.side" 0.02 0.23 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 8.58
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#100 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AMJeauGgsE9Gmyb4TxxNwf4NmuLcm6HDks5trglKgaJpZM4E3OE1>
.
|
Appears we were both wrong. Removing the |
Bah! That's performance profiling for you! Sometimes the most innocuous
looking lines are the stinkers.... Thanks...
…On Tue, Apr 24, 2018 at 3:16 PM, Joshua Ulrich ***@***.***> wrote:
Appears we were both wrong. Removing the as.double() call and adding an
integer branch only took ~2s off the run time. The main culprit was always
calling i <- i[i != 0] instead of only calling it if there was at least
one zero in i.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#100 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AMJeaq8xIHQ55qsMTBLDdHNCuuHDQ23vks5tr6RagaJpZM4E3OE1>
.
|
The binsearch C function now branches on double and integer types, so no conversion is done if 'key' and 'vec' are the same type. Cast both to double if they're not the same type. Create a struct that contains int and double fields, so we can pass it to a generic compare function. Create 4 compare functions, one for each type/start combination. All these components are static inline, with the hope the compiler can optimize them. Removing the as.double() call means ISO-8601 range-based subsetting is nearly twice as fast for integer vectors and keys. See #100.
The binsearch C function now branches on double and integer types, so no conversion is done if 'key' and 'vec' are the same type. Cast both to double if they're not the same type. Remove the bound() function and call lower_bound() and upper_bound() directly from binsearch(). Also move the edge case checks into the respective *_bound() functions, since we need to do the checks inside the switch() statement. Removing the as.double() call means ISO-8601 range-based subsetting is nearly twice as fast for integer vectors and keys. See #100.
Commit 988d513 removed this check. It isn't strictly necessary, but it's considerably faster than always trying to remove 'i' equal to zero at any location in 'i'. Also add benchmark for subsetting by ISO8601 range. The benchmark results before this commit are: Unit: milliseconds expr min lq mean median uq max neval x[rng, ] 371.8408 384.7454 409.4091 399.3384 426.9196 486.1258 10 After restoring the check, the benchmark results are: Unit: milliseconds expr min lq mean median uq max neval x[rng, ] 280.5653 287.1048 304.1365 303.6579 316.7202 333.2147 10 See #100.
The binsearch C function now branches on double and integer types, so no conversion is done if 'key' and 'vec' are the same type. Cast both to double if they're not the same type. Create a struct that contains int and double fields, so we can pass it to a generic compare function. Create 4 compare functions, one for each type/start combination. All these components are static inline, with the hope the compiler can optimize them. Removing the as.double() call means ISO-8601 range-based subsetting is nearly twice as fast for integer vectors and keys. See #100.
Progress update. Restoring the zero-search before trying to remove > # from commit 7463da7a636e87f0958add9e64ef27f6c929d077
> Rscript xts/inst/benchmarks/benchmark.subset.R
Loading required package: microbenchmark
$by.self
self.time self.pct total.time total.pct
"as.double" 1.42 56.35 1.42 56.35
".Call" 0.60 23.81 0.60 23.81
"c" 0.34 13.49 0.34 13.49
"seq.int" 0.16 6.35 0.16 6.35
$by.total
total.time total.pct self.time self.pct
"[" 2.52 100.00 0.00 0.00
"[.xts" 2.52 100.00 0.00 0.00
"as.double" 1.42 56.35 1.42 56.35
"binsearch" 1.42 56.35 0.00 0.00
"index_bsearch" 1.30 51.59 0.00 0.00
".Call" 0.60 23.81 0.60 23.81
"c" 0.34 13.49 0.34 13.49
"seq.int" 0.16 6.35 0.16 6.35
"isOrdered" 0.14 5.56 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 2.52 Adding the integer branch gets it down to ~1s. > # from branch 'one' or 'two' (results are similar)
> Rscript xts/inst/benchmarks/benchmark.subset.R
Loading required package: microbenchmark
$by.self
self.time self.pct total.time total.pct
".Call" 0.58 53.70 0.58 53.70
"c" 0.34 31.48 0.34 31.48
"seq.int" 0.14 12.96 0.14 12.96
"as.numeric" 0.02 1.85 0.02 1.85
$by.total
total.time total.pct self.time self.pct
"[" 1.08 100.00 0.00 0.00
"[.xts" 1.08 100.00 0.00 0.00
".Call" 0.58 53.70 0.58 53.70
"c" 0.34 31.48 0.34 31.48
"isOrdered" 0.16 14.81 0.00 0.00
"seq.int" 0.14 12.96 0.14 12.96
"index_bsearch" 0.14 12.96 0.00 0.00
"as.numeric" 0.02 1.85 0.02 1.85
"as.list" 0.02 1.85 0.00 0.00
"as_numeric" 0.02 1.85 0.00 0.00
"as.POSIXlt" 0.02 1.85 0.00 0.00
"do.call" 0.02 1.85 0.00 0.00
".parseISO8601" 0.02 1.85 0.00 0.00
"parse.side" 0.02 1.85 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 1.08 I have implemented the integer branch two different ways. One took @corwinjoy's code and essentially moved stuff around so the integer/double logic could live in one place. The other is a much heavier refactor, but I wanted to try and reduce the code duplication inside the I would appreciate any reviews, thoughts, comments, etc. |
HIi @joshuaulrich . I really like what you did with the branch with the
comparison functions. I think this cleans up the code very nicely. I think
that the only part I would change is the end part of the binary search
function which handles the edge cases. It has been several months since I
looked at this code, so I found it hard to remember / follow the logic.
Here is what I propose which simplifies and explains more of what is going
on:
From line 87ff (
https://github.com/joshuaulrich/xts/blob/421366cb264b333313d0bada20f4eb32746f28a3/src/binsearch.c#L87
)
/* lo is the least index for which cmp_func is true */
/* handle edge cases where item may be at the lo/hi end of array */
if (use_start) {
/* cmp_func is: vec[lo] >= k */
/* if start == true, return lowest index s.t. vals[index] >= key */
if (!cmp_func(data, lo)) {
/* entire array < key */
return NA_INTEGER;
}
} else {
/* cmp_func is: vec[lo] > k */
/* lo is the least index for which vec[index] > k */
/* if start == false, return highest index s.t. vals[index] <= key */
if (lo > 0 && cmp_func(data, lo)) {
// ?? Do we really need another call to cmp_func above?? I think
this probably should be: if(lo>0)
lo--;
}
if (cmp_func(data, lo)) {
/* entire array > key */
return NA_INTEGER;
}
}
/* Convert from 0 based index to 1 based index */
lo++;
return ScalarInteger(lo);
Note that I haven't compiled this, so I may have made an error. However, I
think this clarifies the last section for anyone else using this code.
Thanks for cleaning up my previous code and taking the time to review this!
Cheers!
Corwin
…On Mon, Apr 30, 2018 at 1:41 PM, Joshua Ulrich ***@***.***> wrote:
Progress update. Restoring the zero-search before trying to remove i
equal to zero gets the run time down to ~2.5 seconds.
> Rscript xts/inst/benchmarks/benchmark.subset.RLoading required package: microbenchmark$by.self
self.time self.pct total.time total.pct"as.double" 1.42 56.35 1.42 56.35".Call" 0.60 23.81 0.60 23.81"c" 0.34 13.49 0.34 13.49"seq.int" 0.16 6.35 0.16 6.35
$by.total
total.time total.pct self.time self.pct"[" 2.52 100.00 0.00 0.00"[.xts" 2.52 100.00 0.00 0.00"as.double" 1.42 56.35 1.42 56.35"binsearch" 1.42 56.35 0.00 0.00"index_bsearch" 1.30 51.59 0.00 0.00".Call" 0.60 23.81 0.60 23.81"c" 0.34 13.49 0.34 13.49"seq.int" 0.16 6.35 0.16 6.35"isOrdered" 0.14 5.56 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 2.52
Adding the integer branch gets it down to ~1s.
> Rscript xts/inst/benchmarks/benchmark.subset.RLoading required package: microbenchmark$by.self
self.time self.pct total.time total.pct".Call" 0.58 53.70 0.58 53.70"c" 0.34 31.48 0.34 31.48"seq.int" 0.14 12.96 0.14 12.96"as.numeric" 0.02 1.85 0.02 1.85
$by.total
total.time total.pct self.time self.pct"[" 1.08 100.00 0.00 0.00"[.xts" 1.08 100.00 0.00 0.00".Call" 0.58 53.70 0.58 53.70"c" 0.34 31.48 0.34 31.48"isOrdered" 0.16 14.81 0.00 0.00"seq.int" 0.14 12.96 0.14 12.96"index_bsearch" 0.14 12.96 0.00 0.00"as.numeric" 0.02 1.85 0.02 1.85"as.list" 0.02 1.85 0.00 0.00"as_numeric" 0.02 1.85 0.00 0.00"as.POSIXlt" 0.02 1.85 0.00 0.00"do.call" 0.02 1.85 0.00 0.00".parseISO8601" 0.02 1.85 0.00 0.00"parse.side" 0.02 1.85 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 1.08
I have implemented the integer branch two different ways. One took
@corwinjoy <https://github.com/corwinjoy>'s code and essentially moved
stuff around so the integer/double logic could live in one place. The other
is a much heavier refactor, but I wanted to try and reduce the code
duplication inside the switch() statements and while() loops.
I would appreciate any reviews, thoughts, comments, etc.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#100 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AMJeapDjTmz7VtkFD7wXA_u1bSSM4vdrks5tt3b4gaJpZM4E3OE1>
.
|
Thanks for the feedback! I agree that section was quite terse and not particularly easy to follow. I've incorporated your changes and comments. I also forgot to include a bunch of specific binsearch unit tests that I wrote. I'm going to amend the commit and add them, so here's the diff for posterity: diff --git a/src/binsearch.c b/src/binsearch.c
index 67c0632..cb3b2eb 100644
--- a/src/binsearch.c
+++ b/src/binsearch.c
@@ -84,23 +91,35 @@ SEXP binsearch(SEXP key, SEXP vec, SEXP start)
}
}
- /* handle edge cases where item may be at the lo/hi end of array */
+ /* 'lo' contains the smallest index where cmp_func() is true, but we need
+ * to handle edge cases where 'lo' is at the max/min end of the vector.
+ */
if (use_start) {
- /* cmp_func is: vec[lo] >= k */
- if (cmp_func(data, lo)) {
- lo++;
- } else {
- lo = NA_INTEGER;
+ /* cmp_func() := vector[index] >= key when start == true, and we need
+ * to return the smallest index subject to vector[index] >= key.
+ */
+ if (!cmp_func(data, length(vec)-1)) {
+ /* entire vector < key */
+ return ScalarInteger(NA_INTEGER);
}
} else {
- if (lo > 0 && cmp_func(data, lo)) lo--;
- /* cmp_func is: vec[lo] > k */
+ /* cmp_func() := vector[index] > key when start == false, and we need
+ * to return the largest index subject to vector[index] <= key.
+ */
if (cmp_func(data, lo)) {
- lo = NA_INTEGER;
- } else {
- lo++;
+ /* previous index value must satisfy vector[index] <= key, unless
+ * current index value is zero.
+ */
+ lo--;
+ if (lo < 0) {
+ /* entire vector > key */
+ return ScalarInteger(NA_INTEGER);
+ }
}
}
+ /* Convert from 0-based index to 1-based index */
+ lo++;
+
return ScalarInteger(lo);
} |
Thanks for this! This looks good. The only line I had to really think
about was this:
if (!cmp_func(data, length(vec)-1)) {...}
Originally, I had written
if (!cmp_func(data, lo)) {...}
The idea is that in your binary search, it could happen that your predicate
(cmp_func) is never true. So, you have to verify that you have a valid
value for lo before you return it.
The test you have is a bit less direct, but I can see where you are coming
from, it directly reflects the comment on the next line
/* entire vector < key */
so the test you have should also be true.
I guess which test you want is a matter of taste, I will leave it up to you.
Thanks again for looking at this, hopefully this improved binary search and
indexing gives a good speedup in the end. It certainly made a big
difference for the code I was running where I used subsets quite a bit.
Corwin
…On Tue, May 1, 2018 at 9:28 AM, Joshua Ulrich ***@***.***> wrote:
Thanks for the feedback! I agree that section was quite terse and not
particularly easy to follow. I've incorporated your changes and comments. I
also forgot to include a bunch of specific binsearch unit tests that I
wrote. I'm going to amend the commit and add them, so here's the diff for
posterity:
diff --git a/src/binsearch.c b/src/binsearch.c
index 67c0632..cb3b2eb 100644--- a/src/binsearch.c+++ b/src/binsearch.c@@ -84,23 +91,35 @@ SEXP binsearch(SEXP key, SEXP vec, SEXP start)
}
}
- /* handle edge cases where item may be at the lo/hi end of array */+ /* 'lo' contains the smallest index where cmp_func() is true, but we need+ * to handle edge cases where 'lo' is at the max/min end of the vector.+ */
if (use_start) {- /* cmp_func is: vec[lo] >= k */- if (cmp_func(data, lo)) {- lo++;- } else {- lo = NA_INTEGER;+ /* cmp_func() := vector[index] >= key when start == true, and we need+ * to return the smallest index subject to vector[index] >= key.+ */+ if (!cmp_func(data, length(vec)-1)) {+ /* entire vector < key */+ return ScalarInteger(NA_INTEGER);
}
} else {- if (lo > 0 && cmp_func(data, lo)) lo--;- /* cmp_func is: vec[lo] > k */+ /* cmp_func() := vector[index] > key when start == false, and we need+ * to return the largest index subject to vector[index] <= key.+ */
if (cmp_func(data, lo)) {- lo = NA_INTEGER;- } else {- lo++;+ /* previous index value must satisfy vector[index] <= key, unless+ * current index value is zero.+ */+ lo--;+ if (lo < 0) {+ /* entire vector > key */+ return ScalarInteger(NA_INTEGER);+ }
}
}
+ /* Convert from 0-based index to 1-based index */+ lo++;+
return ScalarInteger(lo);
}
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#100 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AMJeapUKzWUv_eiddPGOWofIgjzzYTXzks5tuI04gaJpZM4E3OE1>
.
|
The binsearch C function now branches on double and integer types, so no conversion is done if 'key' and 'vec' are the same type. Cast both to double if they're not the same type. Create a struct that contains int and double fields, so we can pass it to a generic compare function. Create 4 compare functions, one for each type/start combination. All these components are static inline, with the hope the compiler can optimize them. Removing the as.double() call means ISO-8601 range-based subsetting is nearly twice as fast for integer vectors and keys. And the integer branch is faster still. See #100.
I waffled back and forth between those two alternatives. I think the former is clearer for that specific case, but the latter is more general. I couldn't think of a test that would fail the former and pass the latter. I may end up changing it back anyway. Speaking of tests, quite a few of the binsearch unit tests are commented because the new function doesn't do what the old function did in various edge cases. I'm more inclined to the new function's results, because the old results seem unusual and/or wrong. Any feedback you have there would be phenomenal, as would be any tests you add. There is no need to hope that your changes have improved things. I've already seen at least one case (ISO8601 subsetting) where this has improved performance. And your thoughts and review are very valuable! It's extremely hard to write good code without input from others (at least for me). |
Sure, I can take a look at the binsearch tests and see what is going on /
see if the edge cases are being properly tested. If you don't mind, just to
clarify:
1. Where are the binsearch tests?
2. What branch are you working in at this point? Is it window-xts? Then I
guess I can just branch off of that for any proposed changes.
Thanks again!
Corwin
…On Tue, May 1, 2018 at 11:09 AM, Joshua Ulrich ***@***.***> wrote:
I waffled back and forth between those two alternatives. I think the
former is clearer for that specific case, but the latter is more general. I
couldn't think of a test that would fail the former and pass the latter. I
may end up changing it back anyway.
Speaking of tests, quite a few of the binsearch unit tests are commented
because the new function doesn't do what the old function did in various
edge cases. I'm more inclined to the new function's results, because the
old results seem unusual and/or wrong. Any feedback you have there would be
phenomenal, as would be any tests you add.
There is no need to hope that your changes have improved things. I've
already seen at least one case (ISO8601 subsetting) where this has improved
performance. And your thoughts and review are very valuable! It's extremely
hard to write good code without input from others (at least for me).
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#100 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AMJeavevw0d765a91HfjHSuqCcg5XZ3Jks5tuKTIgaJpZM4E3OE1>
.
|
The latest is in the window-xts branch, commit 6ef8014. The tests are in inst/unitTests/runit.binsearch.R. All the cases where Some cases where the key is outside the vector are also commented, depending on the value of There are several tests with FIXME / TODO because it isn't clear what the correct / best behavior is (e.g. what should happen if either the key or vector are zero-length?). One thing to keep in mind is that |
Fix up edge cases in binsearch unit tests Remove the start = NULL test cases, since it will no longer be supported. Always return NA if key is NA or zero-length. Fix compiler warning in binsearch.c ('static' is unnecessary as the struct is already local to the file). See joshuaulrich#100.
Remove the start = NULL test cases, since it will no longer be supported. Always return NA if key is NA or zero-length. Fix compiler warning in binsearch.c ('static' is unnecessary as the struct is already local to the file). See joshuaulrich#100.
Thanks for the fixes in #240. Now that it's merged, I'm going to run reverse dependency checks to ensure nothing downstream is broken. I don't usually do this until preparing for a CRAN release, but these changes are fairly significant. I'd prefer to address any issues now, while the code is still fresh in my mind. Assuming those reverse dependency checks go well, I think this is ready to merge. |
And merged! You win the award for most patient PR ever. ;-) |
I propose that a window function be added to xts so that users can have binary lookup over a range of dates.
This function is the same as in the base package zoo, that is window.zoo, however we use binary search over the start and end date.