Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add rand(::String) #22224

Merged
merged 7 commits into from
Jun 15, 2017
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
use O(1) algo for generic AbstractString (StefanKarpinski)
  • Loading branch information
rfourquet committed Jun 13, 2017
commit 533dfb3d021dd4f9c35c27dd59778b3ee65705a8
32 changes: 11 additions & 21 deletions base/random.jl
Original file line number Diff line number Diff line change
Expand Up @@ -284,11 +284,6 @@ julia> rand(MersenneTwister(0), Dict(1=>2, 3=>4))
`Set` and `IntSet`. For more than a few calls, use `rand(rng,
collect(s))` instead, or either `rand(rng, Dict(s))` or `rand(rng,
Set(s))` as appropriate.

!!! note
the complexity of `rand(rng, s::AbstractString)` is linear
in the length of `s` if `s` is not of type `String`. If called more
than a few times, you should use `rand(rng, collect(s))` instead.
"""
@inline rand() = rand(GLOBAL_RNG, CloseOpen)
@inline rand(T::Type) = rand(GLOBAL_RNG, T)
Expand Down Expand Up @@ -717,32 +712,27 @@ rand(rng::AbstractRNG, r::AbstractArray, dims::Integer...) = rand(rng, r, conver

# rand from a string

rand(rng::AbstractRNG, s::AbstractString) = rand_iter(rng, s, Base.iteratorsize(s))
rand(s::AbstractString) = rand(GLOBAL_RNG, s)

## optimization for String

isvalid_unsafe(s::String, i) = !Base.is_valid_continuation(unsafe_load(pointer(s), i))

function rand(rng::AbstractRNG, s::String)
rg = RangeGenerator(1:s.len)
function rand(rng::AbstractRNG, s::String)::Char
g = RangeGenerator(1:s.len)
while true
pos = rand(rng, rg)
pos = rand(rng, g)
isvalid_unsafe(s, pos) && return s[pos]
end
end

## rand from an iterator

rand_iter(rng, s, ::Base.IteratorSize) = throw(ArgumentError("iterator must have a known length"))

function rand_iter(rng::AbstractRNG, s, ::Union{Base.HasShape,Base.HasLength})::eltype(s)
pos = rand(rng, 1:length(s))
for (i, c) in enumerate(s)
i == pos && return c
function rand(rng::AbstractRNG, s::AbstractString)::Char
g = RangeGenerator(1:endof(s))
while true
try # the generic `isvalid` includes an equivalent try/catch statement
return s[rand(rng, g)]
end
end
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little skeptical of the rand_iter function and AbstractString support. People using rand with other string types will probably not expect this function to be O(n); it seems preferable to throw a MethodError rather than to use this implementation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I agree, I had the same hesitation and had planned to update the doc today to warn about the complexity. Then I think that with a proper warning those convenience methods (similarly for AbstractSet and Associative at #22228) are not harmful and can be handy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also do rejection sampling: randomly choose and index from start(s):endof(s) and retry if it's not a valid start of a character – i.e. the same thing you do for String.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh great! for some reason, I skimmed too fast on the generic code of isvalid and thought it was O(n).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I was confused not about isvalid, but about the fact that I thought I needed to compute the length somehow, which is O(n) in the generic method. Thanks again @StefanKarpinski, I updated the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@StefanKarpinski would you mind having a look at this updated generic implementation, to confirm it complies with AbstractString API?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to do i = rand(rng, g); isvalid(s, i) && return i since various string types are expected to implement that without the try/catch. The try/catch only exists as a last resort fall-back, and I believe it is more targeted and only catches UnicodeError exceptions (if it's not already, it should be).

Copy link
Member Author

@rfourquet rfourquet Jun 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thanks (I believe the idea being that try/catch an exception is expected to occur, and slow down the computation when the isvalid call would generally be faster). I will update. Note that the generic isvalid catches all exceptions (not only UnicodeError). I'm not familiar with strings, but I can open a corresponding simple PR if it helps.


rand(s::AbstractString) = rand(GLOBAL_RNG, s)

## rand from a string for arrays
# we use collect(str), which is most of the time more efficient than specialized methods
# (except maybe for very small arrays)
Expand Down