Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add replace & replace! for collections #22324

Merged
merged 1 commit into from
Dec 22, 2017
Merged

Add replace & replace! for collections #22324

merged 1 commit into from
Dec 22, 2017

Conversation

rfourquet
Copy link
Member

This adds methods to replace for AbstractArray, AbstractSet and Associative and a corresponding inplace replace! function.
Did I miss already existing similar functionality in Base?

The order in which to put the 2 function arguments pred and f can be discussed.

Also, replace!(f::Function, A, old) is not included as it would be ambiguous with replace!(pred, A, new) (and a new function name would be necessary, e.g. replaceif!); But it is assumed that in most of the cases, the replacement value f(x) can be computed by f(old) and then use replace!(A, old, f(old)). Otherwise, replace!(pred, f, A) can be used.

@rfourquet rfourquet added the collections Data structures holding multiple items, e.g. sets label Jun 10, 2017
@rfourquet rfourquet force-pushed the rf/replace branch 2 times, most recently from 30e97aa to d0c2d57 Compare June 10, 2017 15:16
@@ -0,0 +1,28 @@

@testset "replace! & replace" begin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new test files would need to be added to test/choosetests.jl to actually run, but aren't there existing files with related tests this would fit with?

@@ -0,0 +1,119 @@
module Algorithms
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while reorganizing some of base into modules is a good idea, I don't see why these two functions make sense as the only thing in an Algorithms module

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I forgot to mention that in the OP. It's just that I didn't know to put those functions, but I open to suggestions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems they'd simply belong in the files corresponding to the types they're implemented for? That is, base/abstractarray.jl, base/set.jl, and base/associative.jl. I agree it doesn't make sense to separate these into their own module.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But where to put the general methods which work on Union{AbstractArray,Associative,AbstractSet} ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

find* methods are defined in array.jl even when they apply to any iterator, which isn't ideal but I guess OK.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just saw that the unique methods are defined in set.jl and multidimensional; the reason seems to be that they are defined right after the needed tools (e.g. Set or multidim tools) are defined. This would make much more sense to me put those functions also in an "algorithm.jl" file. E.g. I would expect to find in "set.jl" only the functions implementing the interface of Set, and possibly methods specialized for Set, but functions merely using a set in their implementation have nothing to do there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then what's are the boundaries of this file? Any function uses an "algorithm". Maybe we'd need a collections.jl file with all generic functions which apply to any collection. Anyway, better use an existing file for this PR, and open another one to regroup functions in a subsequent PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree "algorithm" is a bit vague (I "algorihtms" like in the c++ STL...), but no problem I will update and follow existing practices here.

base/exports.jl Outdated
@@ -1351,6 +1351,9 @@ export
nzrange,
nnz,

# Algorithms module re-exports
replace!,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs to be added to stdlib doc index

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

julia> replace!(x->x.first=>3, Dict(1=>2, 3=>4), 1) do (k, v)
v < 3
end

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there shouldn't be an empty line between input and output, should there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No indeed.


Return a copy of collection `A` where all occurrences `x` for which
`pred(x)` is true are replaced by `new` or `f(x)` (exactly one among
`f` and `new` must be specified).
Copy link
Contributor

@tkelman tkelman Jun 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to list the different signatures instead of trying to explain constraints on non-syntactic square brackets here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall I put the 2 stacked signatures (one below the other), followed by a common doctstring?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be clearer, yes

@iamed2
Copy link
Contributor

iamed2 commented Jun 10, 2017

Given this is new behaviour not needed by Base itself, could it live in https://github.com/JuliaCollections/DataStructures.jl or another package instead?

@rfourquet
Copy link
Member Author

Why not, but this is very basic functionality... For example, C++ which is quite barebones in its standard library, has a replace function.

@iamed2
Copy link
Contributor

iamed2 commented Jun 10, 2017

A lot of what is in the C++ standard library (e.g. deque, sorted sets/maps, etc.) lives in other packages such as DataStructures.jl. Some functionality which exists in the C++ standard library was moved from Base out to other packages (e.g. priority queues, combinatorics, etc.). Code in packages can be updated more frequently and doesn't contribute to Base size or the time it takes to run Julia's tests.

Just thought I'd suggest it, since this has come up for discussion before in cases where the code is not as easily decoupled from Base.

@ararslan
Copy link
Member

Technically since replace is defined in Base and this is extending it with Base types, having it in a package would be committing type piracy. Also, FWIW, I've found myself wanting functionality like this often.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds useful, thanks!

Replace all occurrences `x` in collection `A` for which `pred(x)` is true
by `new` or `f(x)` (exactly one among `f` and `new` must be specified).
If `count` is specified, then replace at most `count` occurrences.
This is the in-place version of [`replace`](@ref).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we usually refer to copying versions of the functions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thougth it was a good thing to document more related functions in docstrings.

!!! note
When `A` is an `Associative` or `AbstractSet` collection, if
there are collisions among old and newly created keys, the result
can be unexpected:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"can be unexpected" is a bit vague: the last insert key wins, which can be undefined results when the iteration order is undefined. For Set, is the order defined or not?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I believe the text in Documenter !!! note blocks has to be indented four spaces.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order is not defined for Set, but reproducible, so I chose values which work to illustrate my point. Will try to find a better wording.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better now?

`pred(x)` is true are replaced by `new` or `f(x)` (exactly one among
`f` and `new` must be specified).
If `count` is given, then replace at most `count` occurrences.
See the in-place version [`replace!`](@ref) for examples.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better repeat examples (that's what we do e.g. for sum).

replace!(pred, A::ReplaceCollection, new, n::Integer=-1) = replace!(pred, y->new, A, n)

"""
replace(pred, [f::Function], A, [new], [count])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count is called n for the AbstractString method. Better stay consistent (in one direction or the other).

```jldoctest
julia> a = [1, 2, 3, 1];

julia> replace!(isodd, a, 0, 2); a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ; a? Doesn't the function return it already? BTW, maybe it would be useful to return count instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was to clarify that a is modified in-place, but this may not be necessary. As to return count, I see that it could be useful, but I'm uncomfortable with the inconsistency of having the non-mutating replace returning the collection instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inconsistency with replace doesn't sound a big issue to me, but I would be more concerned about consistency with other in-place functions.

If `count` is given, then replace at most `count` occurrences.
See the in-place version [`replace!`](@ref) for examples.
"""
replace(pred, new::Function, A::ReplaceCollection, n::Integer=-1) = replace!(pred, new, copy(A), n)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrap line at 92 chars.

```
"""
function replace!(pred, new::Function, A::AbstractArray, n::Integer=-1)
n == 0 && return A
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AbstractString method uses n=0 by default, so it means "unlimited". Your choice sounds better since it allows using 0 for a no-op, which can be useful if that's the result of a computation, but better adapt the existing method then. Would also make sense to check that n > -1 to prevent unexpected things.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I try propose to update the existing method in #22325. I will document the behavior for negatibe inputs, but I'm not sure it's useful to prevent n < -1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm converting to a default value of typemax(Int), like for the string method.

A
end

const ReplaceCollection = Union{AbstractArray,Associative,AbstractSet}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you should leave the signature unrestricted until we have a MutableCollection type? Else, it won't be extensible for custom package types.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid ambiguities, then pred has to be constrained to be a Function (or Base.Callable), which seems OK.


# Examples
```jldoctest
julia> replace!(Set([1, 2, 3]), 1, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not illustrate this with a plain Array first?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because a previous method of the same method (hence in the same docstring) was already having an example with an Array, so I tried to vary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine to show examples using more complex structures, but I would start with the simplest example. Else it's not immediately obvious whether Set is particularly associated with that method or whether that's an arbitrary variation.

@nalimilan
Copy link
Member

FWIW, I have found out for CategoricalArrays.recode that pairs offer a simple, flexible and efficient API for this kind of operation. For example, recode(x, 1=>2, 3=>4) can perform two replacement operations and the generated code is really fast (i.e. the loop over the pairs is completely unrolled).

Maybe something to consider to make replace more powerful while keeping the interface relatively simple. Also the difference between Pair and Function could help dispatch when it's not clear what's the meaning of an argument.

@iamed2
Copy link
Contributor

iamed2 commented Jun 10, 2017

Technically since replace is defined in Base and this is extending it with Base types, having it in a package would be committing type piracy.

The string replace doesn't obey the same conventions (replaces subsets, not elements) and could actually cause confusion or method conflicts if there was an iterator version of replace defined (and I'll probably write one if it doesn't make it into this PR).

@rfourquet
Copy link
Member Author

recode(x, 1=>2, 3=>4) can perform two replacement operations and the generated code is really fast

I don't know CategoricalArrays yet, but for the replace!(a, old, new) method can be optimized greatly for Associative and AbstractSet (or sorted arrays).

The string replace doesn't obey the same conventions (replaces subsets, not elements)

It may be possible to replace subsets for collection too... (but than may require to be more strict type for the old value for single element replacement, e.g. typeof(A))

@nalimilan
Copy link
Member

I don't know CategoricalArrays yet, but for the replace!(a, old, new) method can be optimized greatly for Associative and AbstractSet (or sorted arrays).

Sorry, I guess my comment wasn't clear. What I mean is that using pairs allows recode(x, 1=>2, 3=>4) to do the equivalent of replace(replace(x, 1, 2), 3, 4), but in a single pass, and with quite efficient code. This has nothing categorical-specific.

The element vs. subset debate reminds me of issues discussed in the Search & Find Julep, with findfirst vs. (the proposed) findseq. But I don't really see the problem with strings: there is no ambiguity since replace("qwerty" ,"qw", "") isn't ambiguous. As @rfourquet it could be made to work for general arrays with some complications; might be worth using a separate replaceseq function for that. But it's so simple and natural for strings that it would be too bad not to support that feature.

@iamed2
Copy link
Contributor

iamed2 commented Jun 11, 2017

Seems like that concern does fit in with the Julep. If there was a replaceseq function, the String version would definitely fit better there, though the naming suffers.

Lately I've been feeling that the string methods don't really jive with their iterator/array counterparts. But that's a general issue for somewhere else.

But I don't really see the problem with strings: there is no ambiguity since replace("qwerty" ,"qw", "") isn't ambiguous.

Suppose there was an iterator version of replace replace{T}(itr{T}, old, new) -> ReplaceIterator{eltype(T)} and replace{T}(pred, itr{T}, new) -> ReplaceIterator{eltype{T}}. If the first format is called on a String, the existing String method would take precedence. But if the user changed their call to call to the second format, the iterator version would be called and they would get a ReplaceIterator{Char} instead.

@nalimilan
Copy link
Member

Suppose there was an iterator version of replace replace{T}(itr{T}, old, new) -> ReplaceIterator{eltype(T)} and replace{T}(pred, itr{T}, new) -> ReplaceIterator{eltype{T}}. If the first format is called on a String, the existing String method would take precedence. But if the user changed their call to call to the second format, the iterator version would be called and they would get a ReplaceIterator{Char} instead.

Yes, but at that point we would also add a specialized String method for the second form, returning a string. Anyway, replace isn't going to start returning an iterator, that would be Base.Iterators.replace.

@iamed2
Copy link
Contributor

iamed2 commented Jun 11, 2017

Anyway, replace isn't going to start returning an iterator, that would be Base.Iterators.replace.

Ah, right, I forgot about that deprecation, sorry.

@rfourquet
Copy link
Member Author

I think I addressed all the comments. I still have to know how to set the link to replace from the doc of replace! point to the collection version, not to the string version; and to not have a duplicate docstring for the collection version...

On the design level:

  1. I think @nalimilan's example of using replace(A, old=>new) is interesting; but Allowing multiple pairs would not be compatible with a "count" argument (n): replace(A, p::Pair..., n=typemax(Int)) is not valid... As using a Dict of those pairs internally would probably be necessary for efficiency, it could be required to pass a Dict directly instead: replace(A, repl::Dict, n=typemax(Int)).

  2. there are two forms when a predicate is given:

replace(pred, A, new, [n])
replace(pred, f::Callable, A, [n])

It would be cleaner to have instead replace(pred, A, f::Callable, [n]) for the second, to keep the same argument order, but this would collide when A can store callables... Do we care enough for this case? (I guess yes)

@nalimilan
Copy link
Member

I think I addressed all the comments. I still have to know how to set the link to replace from the doc of replace! point to the collection version, not to the string version; and to not have a duplicate docstring for the collection version...

Can you elaborate? I think generally we point the docstring to the function, which concatenates docstrings for all methods. That's not ideal, but at least it includes the expected docstring.

On the design level:

I think @nalimilan's example of using replace(A, old=>new) is interesting; but Allowing multiple pairs would not be compatible with a "count" argument (n): replace(A, p::Pair..., n=typemax(Int)) is not valid... As using a Dict of those pairs internally would probably be necessary for efficiency, it could be required to pass a Dict directly instead: replace(A, repl::Dict, n=typemax(Int)).

Actually, no, the pairs should be passed as a varargs so that the compiler is able to unroll the checks rather than using a loop. With a small number of pairs, a dict isn't faster than repeated equality tests.

But you're right that the handling of n would be problematic in that case. We could make it a keyword argument. I guess it really depends on whether we want to make replace very general or rather keep it as simple as possible.

there are two forms when a predicate is given:

replace(pred, A, new, [n])
replace(pred, f::Callable, A, [n])

It would be cleaner to have instead replace(pred, A, f::Callable, [n]) for the second, to keep the same argument order, but this would collide when A can store callables... Do we care enough for this case? (I guess yes)

Yes, that's a tough call. The API based on pairs would solve this.

base/array.jl Outdated

## replace/replace! ##

# NOTE: to implement replace! and replace for a new type T, it's enough to define:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the place to document this. Though I don't really know where it could go. Also, I don't like recommending people to override an unexported function whose name starts with an underscore. Really, they can reimplement the three lines in replace that appear before the call to _replace, that's a negligible repetition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will replace _replace by replace in the recommandation. Until a better place is suggested, I will keep it here though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not have this at all, AFAIK we don't generally do this, and people are not supposed to read this file to find out...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This note is intended also for people adding a new method in Base... What do you think about keeping/removing this note @tkelman ?

@rfourquet
Copy link
Member Author

Can you elaborate? I think generally we point the docstring to the function, which concatenates docstrings for all methods. That's not ideal, but at least it includes the expected docstring.

By specifying the signatures in .md files, it's possible to show only the docstring for the string version in "strings.md", and the general collections version in "collection.md". But unfortunately, the link from replace! choses to point at the string version. In the docs of Documenter, I didn't see a way to address that, someone has an idea? otherwise I will simply remove the link.

Actually, no, the pairs should be passed as a varargs so that the compiler is able to unroll the checks rather than using a loop. With a small number of pairs, a dict isn't faster than repeated equality tests.

OK, I see. Then indeed, why not set the count argument as a keyword. It may be clearer actually (and I expect it not to be used so often), and I understood that keyword args are meant to be faster eventually.

The API based on pairs would solve this.

I don't see that. Assuming there is replace(A, oldnew::Pair...; [count]), how do you propose the version with a predicate to be? (for which old is not necessary anymore).
One possibility is to remove replace(pred, A, new), and to merge merge f and pred for the last version, which can be more powerful:
replace(pred_f, A), with pred_f returning Nullable for the replacement value. This would allow to implement the pair version in terms of this pred_f (finding whether a pair matches (in the pred part) is work which can be reused for the f part).

@nalimilan
Copy link
Member

By specifying the signatures in .md files, it's possible to show only the docstring for the string version in "strings.md", and the general collections version in "collection.md". But unfortunately, the link from replace! choses to point at the string version. In the docs of Documenter, I didn't see a way to address that, someone has an idea? otherwise I will simply remove the link.

I see. Maybe something like [`replace`](@ref replace(pred, A, new, [n])) would work (adapting the signature)? Else you'll have to ask @mortenpi.

I don't see that. Assuming there is replace(A, oldnew::Pair...; [count]), how do you propose the version with a predicate to be? (for which old is not necessary anymore).

Mmm, actually I was thinking about the possible ambiguity with the replace(A, old, new) form, but that doesn't help with the case you highlighted.

One possibility is to remove replace(pred, A, new), and to merge merge f and pred for the last version, which can be more powerful:
replace(pred_f, A), with pred_f returning Nullable for the replacement value. This would allow to implement the pair version in terms of this pred_f (finding whether a pair matches (in the pred part) is work which can be reused for the f part).

Interesting! It makes sense to do everything from a single function indeed. (Regarding the implementation of the pair version, I'm not sure what's your suggestion but you'd have to be very careful that everything is optimized.)

@rfourquet
Copy link
Member Author

So I implemented your idea, and I like it very much:

  1. to have a keyword arg for n is way clearer (it should be changed to count or something more explicit though)
  2. having old=>new also is more clear
  3. when the keyword is not specified, the speed are on par with the previous version (even faster, I don't understand why yet)
  4. I kept the version replace(pred, A, new) as it doesn't hurt after all

Compare:

julia> replace(A, 2, 3, 8) # old version

julia> replace(A, 2=>3, count=8) # new version

PS: your trick to use [replace](@ref replace(pred, A, new, [n])) for the doc didn't work :(

@nalimilan
Copy link
Member

Cool. I'd rename n now too since we won't be able to change it later (keyword argument). Maybe "maxcount"? Should check for consistency with other functions, if any. Too bad for the docstrings, maybe somebody else can help.

@nalimilan
Copy link
Member

@rfourquet I guess this won't be in 0.7, but maybe we should deprecate replace(s, pattern, replacement) in favor of replace(s, pattern => replacement) now to avoid ambiguity issues in the future when adding more powerful functions? I also find it more explicit.

@StefanKarpinski
Copy link
Member

I like that idea. It would also allow giving multiple replacements, which can potentially be done more efficiently in a single pass.

@rfourquet
Copy link
Member Author

That sounds good indeed. I thought we didn't reach a decision yet on this design with pairs, but I still like it. It's true that I didn't plan to get it into 0.7, but mostly because it's not breaking and I didn't want to overload triage, so this could wait for 1.1. But if we go for this design and deprecate the existing replace, then given all the review which has gone into this PR already, I would propose to merge it too. I will rebase to get an updated feel of how it looks with your Union{Some{T}, Void} PR merged.

@rfourquet
Copy link
Member Author

Let me recap the API proposed here, if it can help:

  1. replace(A, 1=>2, 2=>3; count=3) to replace in A 1 by 2 and 2 by 3, with at most 3 replacements (count available for all methods).
  2. replace(isodd, A, 0) to replace odd numbers by 0 in A
  3. replace(x -> isodd(x) ? 2x : nothing) to replace odd numbers x by 2x. Here, 2x can be written as Some(2x) (Some will be useful when you want to replace a value by nothing).

Also included the mutating variant, replace!.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! If @StefanKarpinski is fine with merging this now, I think we should go ahead. Most of my comments could be addressed later as they don't affect the API, but I think the AbstractString-specific method should be deprecated (see below).

base/set.jl Outdated

# we use this wrapper because using directly eltype(A) as the type
# parameter below for Some degrades performance
function _replace!(A, ::Type{K}, count::Integer, old_new #=::Pair...=#) where {K}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mark old_new with ::Tuple{Vararg{Pair}}?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

base/set.jl Outdated
## replace/replace! ##

# NOTE: to implement replace! and replace for a new type T, it's enough to define:
# function _replace!(pred::Callable, A::T, count::Int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I don't really like recommending people to overload an internal function. Is there a way to make this replace!(pred::Callable, A::T; count::Int)? BTW, keyword arguments are supposed to be faster now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, I didn't re-evaluate this choice in light of the supposedly faster keyword args. I would be willing to benchmark that, but as I'm very short on time now, I will accept your offer to address that later by giving it a lower priority.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not go with the simplest solution of defining only the keyword argument version for now, and see later whether performance can be improved?

base/set.jl Outdated

# Examples
```jldoctest
julia> replace!(Dict(1=>2, 3=>4)) do kv
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a simpler example which doesn't use the do syntax, and an example using Some?

base/set.jl Outdated
replace(prednew::Callable, A; count::Integer=typemax(Int)) = _replace!(prednew, copy(A), count)

# Handle ambiguities
replace!(a::Callable, b::Pair; count::Integer=-1) = throw(MethodError(replace!, a, b))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be (a, b). Else you get a MethodError, but it happens when failing to create the MethodError. ;-)

But should these really be errors? It's perfectly valid to have an array of pairs and to replace some pairs with others.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes right! Those were not tested, just inserted to silence some ambiguities. You are right that what you said is perfectly valid, but this method doesn't have an array as parameter! This is just a silly signature, but Julia needs it!

base/set.jl Outdated
### _replace! for AbstractDict/AbstractSet

askey(k, ::AbstractDict) = k.first
askey(k, ::AbstractSet) = k
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two convenience functions could interest @andyferris, who tried to unify iteration over sets and dicts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe that would justify defining pairs(::AbstractSet).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How so? That seems backwards from the problem. The issue here seems to be that pop!(::AbstractDict, key) / delete! is inconsistent with the way the rest of the way the Associative interface is defined (e.g. it should remove an element, not a key). For consistency with our other indexed containers, these methods should perhaps be renamed splice! or deleteat!. (later in v1.x we can then re-add pop!(dict, element))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, deleteat! vs delete! has been discussed at #20531.

base/set.jl Outdated
askey(k, ::AbstractDict) = k.first
askey(k, ::AbstractSet) = k

function _replace_update_hash!(repl::Vector{<:Pair}, x, y::Some)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why it's called "hash". Maybe "dict" instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just had to quicly find a name, and both Dict and Set use a hash, but it's a wrong name indeed as this is a helper function for Union{AbstractDict,AbstractSet} in general

base/set.jl Outdated
repl = Pair{eltype(A),eltype(A)}[]
c = 0
for x in A
c +=_replace_update_hash!(repl, x, prednew(x))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it possible to do the replacement in a single pass?

Missing space after +=.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's possible for particular containers, but in general we don't know if mutating the container while iterating is valid (e.g. for a hash-table, there could be a rehash after one insertion, and the rest of the iteration may not visit all non-visited elements. As a first step I wanted something always safe, but it's likely that this can be improved at least for Set and Dict.

c = 0
for i in eachindex(A)
c += _replace_update!(A, i, prednew(A[i]))
c == count && break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably an overkill optimization, but I wonder whether using a special loop for count == typemax(Int) could be worth it in some cases where the function is so simple that SIMD might be enabled without the break call. Maybe leave that to future improvements after some benchmarking.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add that on my todo list to not forget :)

@@ -38,7 +38,7 @@ Base.searchindex
Base.rsearchindex
Base.contains(::AbstractString, ::AbstractString)
Base.reverse(::Union{String,SubString{String}})
Base.replace
Base.replace(s::AbstractString, pat, f)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This old method is inconsistent with the new ones. Maybe deprecate it in favor of pat => f?

Copy link
Member Author

@rfourquet rfourquet Dec 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in #25165 :)

test/sets.jl Outdated
@@ -344,3 +344,37 @@ end
@test typeof(cssset) == Set{String}
@test cssset == Set(["foo", "bar"])
end

@testset "replace! & replace" begin
maybe1(v, p) = if p Some(v) end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of using Some is to be able to distinguish nothing. Can you add a test with Some(nothing) to ensure this works?

@nalimilan nalimilan changed the title [WIP] add replace & replace! for collections Add replace & replace! for collections Dec 18, 2017
@rfourquet
Copy link
Member Author

I addressed most of comments @nalimilan thanks!
I just kept the deprecation for replace(::String, ...) for another PR.
I ended up doing few benchmarks concerning the _replace! methods which avoids forwarding keywords arguments, and it seems there is basically no overhead now with that (except 1 allocation, which doens't affect the performance), so I followed your suggestion to get rid of this _replace!.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looks good, though it needs a NEWS entry.

The deprecation of the other method should also happen quite quickly since we're theoretically past feature freeze.

c == count && break
end
for oldnew in repl
pop!(A, askey(first(oldnew), A))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this do all deletions before doing all insertions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it this way to avoid 2 passes, and as a result had to put a warning in the docstring. But you are definitely right. It's much more predictable, and the basic benchmarks I just did don't show any overhead :) thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

splitting the oldnew replacements list into old and new lists might even help performance? (for non-isbits Pair{Key, Value})

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did a benchmark with strings, and it doesn't seem to make a difference. I think I will not bother now, and will investigate when I have time again, but thanks.

@nalimilan
Copy link
Member

@rfourquet I've tried rebasing via GitHub but I apparently failed. Would you do it, so that the PR can be merged ASAP?

@rfourquet
Copy link
Member Author

Would you do it, so that the PR can be merged ASAP?

Sure, but I thought it would need a more clear decision? I put earlier today the triage label on the replace PR for strings, I guess the outcome of it can apply here too.

@nalimilan
Copy link
Member

Sure, but there seems to be a broad consensus that it's a positive change, so...

@rfourquet
Copy link
Member Author

Triage said yes to this API for strings, so I will take it we go with this one too! CI is remarkably all green and there are not yet conflicts, I would say this is a good oppotunity window for someone to click the green button.

@nalimilan
Copy link
Member

@rfourquet Can you remind me why we use nothing to mean "no replacement"/"keep existing value"? Couldn't we just require the user function to return the original value in that case? Then nothing could mean "drop value".

@rfourquet
Copy link
Member Author

Can you remind me why we use nothing to mean "no replacement"/"keep existing value"?

Because I initially used Nullable, where isnull means "no replacement", I hadn't thought of alternatives.

Couldn't we just require the user function to return the original value in that case? Then nothing could mean "drop value".

Sounds like an excellent idea! then it's a bit more than purely replacing values, but after all, replace for strings already allows deleting some parts (by providing an empty string as the replacement). I started playing with your idea and this seems to work well. Unless you like to work on this yourself, I will look at the performance tomorrow and prepare a PR to discuss further this API.

@nalimilan
Copy link
Member

Cool. We don't necessarily need to allow dropping some elements, but if we stop using nothing for that purpose we can always allow this later if we want.

@rfourquet
Copy link
Member Author

My preliminary benchmarks seem to indicate mainly no performance regression :)

We don't necessarily need to allow dropping some elements, but if we stop using nothing for that purpose we can always allow this later if we want.

indeed, except when the replacement value is nothing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
collections Data structures holding multiple items, e.g. sets
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants