This repository has been archived by the owner on May 4, 2019. It is now read-only.

We need more Nullable types for promotion and conversion #95

Open
tshort opened this issue Dec 16, 2015 · 22 comments

Comments

@tshort

tshort commented Dec 16, 2015

Nullable{Float64} doesn't act like a proper number. You can't do 3.5 + Nullable(4). You also can't do log(Nullable(4.0)). The rest of the Julia ecosystem has great conversion and promotion, and Nullables should, too. The existing lift mechanism for NullableArrays is also kludgey because we don't have promotion.

The main problem is that you can't have both of the following. I haven't heard of any planned enhancements to the type system that would allow that sort of relationship.

Nullable{T <: AbstractFloat} <: AbstractFloat
Nullable{T <: Integer} <: Integer

Given that, I think we need more Nullable types, so each type can fit in at appropriate places in the type hierarchy to allow promotion and conversion, so 3.5 + NullableInt(4) == NullableFloat{Float64}(7.5).

I'm running into the same quandary in the PooledElements package. But there, I think there's less expectation that the user needs to perform arbitrary numerics on a PooledElementArray.

The big drawback is the amount of code needed to implement this. The other is deciding how far to take this concept.

@johnmyleswhite
Member

I don't really think this is the right way to go. One of the benefits of Nullable's is that they force explicit handling of missing values by preventing things like 1 + Nullable(2). I don't think of Nullable's as being proper numbers, so these examples don't really trouble me. The Nullable{T} type hierarchy is essentially meant to be a doppelganger of the core type hierarchy that mirrors it without integrating into it, except for a mutual ancestor at Any.

This restriction, of course, means that Nullable's are much less abstract than R's solution to missing values, but I see that as a strength rather than a weakness. R's approach makes sense when you're working at a much higher level of abstraction than I wanted NullableArrays to live at. For me, something like 1 + Nullable(2) is a construct in a DSL that enforces a particular semantics for Nullable's that should be opt-in rather than opt-out.

@nalimilan
Member

That's the eternal debate regarding the semantics of Nullable... The one that always makes me wonder whether we shouldn't have both Nullable and Option, one with automatic lifting and the other without.

@davidagold
Contributor

@tshort can you give a specific use case where this is giving you grief?

@tshort
Author

tshort commented Dec 16, 2015

@davidagold, my main concerns are clumsiness associated with use of individual Nullables. This could arise when indexing into a NullableVector or operating on a DataFrame row by row. Here are some examples that I think are clumsy:

x = NullableArray([1,2,3])
x[1] = 9  # works
## None of the following work
x[1] += 1
2 * x
log(x)
log(x[1])
f(x) = x + 1
f(x[1]) 
g(x,y) = x + y + 1
g(x[1], x[2]) 
g(1, x[2]) 
g(x[1], 2)

Here's what I think I need to get the broken statements to work:

x = NullableArray([1,2,3])
x[1] = 9  # works
## Rewritten versions of the statements above (working unless noted)
x[1] += Nullable(1)
# 2 * x    ## I can't figure this one out
## Nullable(2) * x   # doesn't work
map(z -> 2 * z, x, lift=true)  # best way?
map(log, x, lift=true)
isnull(x[1]) ? Nullable{Float64}() : Nullable(log(get(x[1])))
f(x) = x + 1
f(x::Nullable) = x + Nullable(1)
f(x[1]) 
g(x,y) = x + y + 1
g(x::Nullable,y) = g(x,Nullable(y))
g(x,y::Nullable) = g(Nullable(x),y)
g(x::Nullable,y::Nullable) = x + y + Nullable(1)
g(x[1], x[2]) 
g(1, x[2]) 
g(x[1], 2)

Having to write Nullable(1) seems odd. There's nothing missing or null about 1. It's just a constant.

@tshort
Author

tshort commented Dec 16, 2015

You can also do a lot just by defining promotions and conversions. Here is a bit of code that gets almost all of my first set of examples to work:

Base.promote_rule{T <: Number}(::Type{Nullable{T}}, ::Type{T}) = Nullable{T}
Base.promote_rule{T <: Number, V <: Number}(::Type{Nullable{T}}, ::Type{V}) = Nullable{promote_type(T,V)}
Base.convert{T}(::Type{Nullable{T}}, x::T) = Nullable(x)
for op in (:+, :-, :*, :/, :^, :%, :.*)
    @eval function Base.$op(x::Nullable,y::Nullable)
        if isnull(x) || isnull(y)
            return promote_type(typeof(x), typeof(y))()  # empty Nullable of the promoted type
        else 
            return Nullable($op(get(x), get(y)))
        end
    end
    @eval Base.$op(x::Nullable,y::Number) = $op(x, Nullable(y))
    @eval Base.$op(x::Number,y::Nullable) = $op(Nullable(x), y)
end
for op in (:sin, :log)
    @eval function Base.$op{T <: Number}(x::Nullable{T})
        if isnull(x)
            return Nullable{T}()  # empty Nullable; Nullable(T) would wrap the type itself
        else 
            return Nullable($op(get(x)))
        end
    end
end

The downside is that a Nullable still isn't participating in Julia's type hierarchy.

Anyway, I get the pushback, so feel free to close with a "wontfix" label.

@johnmyleswhite
Member

Using promotion more is an interesting approach. It just feels like it opens you up to lots of missing cases. Why should sin work out of the box, but not digamma?

My personal feeling is that we should nail the low-level semantics and then get some syntactic sugar for doing stuff like 1? + Nullable(1) or sin?(Nullable(1)).

@tshort
Author

tshort commented Dec 16, 2015

It seems like you've already started down the path with math operations in https://github.com/JuliaStats/NullableArrays.jl/blob/master/src/operators.jl.

@tshort
Author

tshort commented Dec 18, 2015

Also, #85 addresses similar issues (missed that).

@davidagold
Contributor

Okay, finally have a bit of time to address these points.

Over the summer I developed a "lift" macro @^ to provide call-site lifting of arbitrary methods over exclusively nullable arguments. That is, given a method with signature f(x::T, y::U), one can call @^ f(x, y) V, where x::Nullable{T} and y::Nullable{U}, to call f on the wrapped values of x and y if neither is null, or to return an empty Nullable{V} if either is null. The macro was originally designed for lifting methods defined on signatures of exclusively non-Nullable arguments over signatures of exclusively Nullable arguments. It has proved relatively simple to extend this so that a method defined for exclusively non-Nullable arguments can be lifted over mixtures of Nullable and non-Nullable arguments, as in the situations you raise above. Here's the gist. This requires extending Base.isnull and Base.get to provide generic fall-back methods for non-Nullable types. Here's an instance of how the macro expands:

julia> macroexpand(:( @^ f(x, y) + g(x, h(z)) Int )) 
quote  # /Users/David/.julia/v0.5/Lift/src/liftmacro.jl, line 24:
    if (isnull(y) || isnull(z)) || isnull(x) # /Users/David/.julia/v0.5/Lift/src/liftmacro.jl, line 25:
        Nullable{Int}()
    else  # /Users/David/.julia/v0.5/Lift/src/liftmacro.jl, line 27:
        Nullable(f(get(x),get(y)) + g(get(x),h(get(z))))
    end
end
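
The expansion above follows a simple pattern: check every argument for null, otherwise unwrap the arguments, call the function, and rewrap the result. Below is a minimal sketch of that pattern as an ordinary higher-order function rather than a macro. It is not the @^ implementation. It assumes Julia 1.x, where Nullable is no longer in Base, so it defines a hypothetical stand-in type `Maybe`; the names `isnullvalue`, `unwrap`, and `lift` are likewise made up for illustration.

```julia
# Hypothetical stand-in for Nullable{T}; not the Base or NullableArrays type.
struct Maybe{T}
    hasvalue::Bool
    value::T
    Maybe{T}() where {T} = new{T}(false)      # empty ("null") value, `value` left unset
    Maybe{T}(v) where {T} = new{T}(true, v)   # wrapped value, converted to T
end
Maybe(v::T) where {T} = Maybe{T}(v)

# Generic fallbacks let plain (non-Maybe) values pass through unchanged,
# analogous to the fall-back isnull/get methods mentioned above.
isnullvalue(x::Maybe) = !x.hasvalue
isnullvalue(x) = false
unwrap(x::Maybe) = x.value
unwrap(x) = x

# Lift `f` over any mix of Maybe and plain arguments.
# `T` is the element type of the result, needed to build the empty case.
function lift(f, ::Type{T}, args...) where {T}
    any(isnullvalue, args) && return Maybe{T}()
    return Maybe{T}(f(map(unwrap, args)...))
end

r = lift(*, Float64, 2.0, Maybe(3.0))         # mixed plain and wrapped arguments
n = lift(*, Float64, 2.0, Maybe{Float64}())   # one null argument gives a null result
```

As with the macro, the caller supplies the result type because there is no value to infer it from when an argument is null.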

A very simple performance test shows that the macro performs comparably to defining a "semi-lifted method" -- i.e. a method designed to handle a mixture of Nullable and non-Nullable arguments:

using NullableArrays
srand(1)
A = rand(5_000_000)
B = rand(5_000_000)
M = rand(Bool, 5_000_000)
X = NullableArray(A)
Y = NullableArray(A, M)
Z = similar(X)

@inline function _g(b::Float64, y::Nullable{Float64})
    if y.isnull
        return Nullable{Float64}()
    else
        return Nullable(b * y.value)
    end
end

function f(Z, B, Y)
    for i in eachindex(Z)
        Z[i] = @^ B[i] * Y[i] Float64
    end
end

function g(Z, B, Y)
    for i in eachindex(Z)
        Z[i] = _g(B[i], Y[i])
    end
end

f(Z, B, Y);
f(Z, X, Y);
g(Z, B, Y);
@time f(Z, B, Y)
@time f(Z, X, Y)
@time g(Z, B, Y)

yields

  0.042598 seconds (4 allocations: 160 bytes)
  0.064166 seconds (4 allocations: 160 bytes)
  0.050128 seconds (4 allocations: 160 bytes)

If this sort of thing would be useful, I'll continue to flesh it out and make it available via something like a NullableUtilities package. Right now there's rudimentary support for control flow, but it's a bit buggy and questions remain about what should be the specified behavior. Should we assume that calls in the condition and the body ought to be lifted? Just the body? Also, there's currently only support for lifting functions over variable arguments and no support yet for doing something like @^ f(Nullable(5), y) Int. (EDIT: Although, if one can lift over mixed signatures, then there doesn't seem to be any reason to do @^ f(Nullable(5), y) Int as opposed to @^ f(5, y) Int, which does currently work.)

@tshort
Author

tshort commented Dec 31, 2015

I like the idea of @^ and a NullableUtilities package.

There are cases where lifting may be confusing. For some code, the generic lifting won't work right. Consider this:

f(x,y) = x > 3 ? 4x : -y

When lifted, if either x or y is null, it'll return null. But if x is greater than three, the result 4x doesn't depend on y, so a null y shouldn't make the result null. Another issue is that because lifting is done as a macro, you can't use code specialized for Nullable numbers. Well, you can, but you have to make sure you don't lift expressions that include methods specialized for Nullable numbers. The combination of these issues may make it hard to write generic code that works with regular numbers and Nullable numbers (and their array equivalents).
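
A minimal sketch of the conditional concern, assuming Julia 1.x and using `nothing` as a hypothetical stand-in for an empty Nullable. The names `lifted_f` and `branchwise_f` are made up for illustration and are not part of any package discussed here.

```julia
f(x, y) = x > 3 ? 4x : -y

# Whole-call lifting: the result is null as soon as any argument is null.
lifted_f(x, y) = (x === nothing || y === nothing) ? nothing : f(x, y)

# Branch-aware version: y is only checked on the branch that actually uses it.
function branchwise_f(x, y)
    x === nothing && return nothing
    x > 3 && return 4x
    return y === nothing ? nothing : -y
end

a = lifted_f(5, nothing)       # null, even though 4x does not depend on y
b = branchwise_f(5, nothing)   # 4 * 5, because the y branch is never taken
c = branchwise_f(2, nothing)   # null, because this branch needs y
```

The two versions only disagree when the unused argument is the null one, which is exactly the case a generic call-site lift cannot see.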

@davidagold
Contributor

These are definitely good details to point out. I don't expect to produce a macro that can single-handedly lift every possible expression to every possible end. But I do think it should be possible to cover a range of the most common patterns. I think the macro can be made to lift smartly over conditionals such as in your first concern.

It seems the second concern requires community standards to minimize the sorts of situations in which one needs to lift Nullable-specialized methods over mixed or entirely non-Nullable arguments. I'm having trouble envisioning the cases in which this would really come up. It would have to be a case in which one only has a Nullable-specialized method. When does that occur?

@tshort
Author

tshort commented Jan 1, 2016

A Nullable-only use case is likely to be rare. One I can think of is a method that does sampling-type replacements on a Nullable number or array.

I'm curious how you plan to support the conditional case. I can envision how it could work if the conditional is in the lifted expression. If it's in a function called by the lifted expression, I don't see how that would work.

@davidagold
Contributor

There would be no general way to make this work if the conditional were hidden inside a method body. I don't think this is a drawback. The case you've described seems essentially like that of three-valued logic. It's not clear that being able to call on such semantics hidden within a method body would be a good thing. The standard semantics for Nullables that we've adopted are that if an expression depends on Nullable objects, then nullity in any of those objects returns an empty Nullable. We probably oughtn't encourage people to rely on alternative semantics unless they are plainly visible.

I think discussions about how best to provide lifting facilities need to be accompanied by discussions about what we expect/intend users actually to be doing with data structures built on top of NullableArrays. And this in turn requires some insight into the direction that DataFrames, DataStreams and friends will be taking in the near future. @quinnj @simonbyrne

@davidagold
Contributor

@tshort In case you're still interested, I've thrown up a working prototype for the lift macro: https://github.com/davidagold/NullableUtils.jl. Please do let me know if you find it helpful, or if you can think of other utilities/features for @lift that would make working with NullableArrays easier.

@nalimilan
Member

Good to see some movement in this area! I wonder about the return type though: since not all functions return the same type as their inputs (which might even be of different types), maybe return Nullable{Union{}}?

@davidagold
Contributor

The user specifies the type parameter T for the empty Nullable{T} object to be returned in case any arguments are null: @lift f(x, y) T. Not specifying such a T currently incurs an error:

julia> @lift f(x, y)
ERROR: MethodError: no method matching @lift(::Expr)
Closest candidates are:
  @lift(::ANY, ::ANY)
 in eval(::Module, ::Any) at ./boot.jl:267

However, I could make it so that Nullable{Union{}} is the default return type for empty Nullables if the user doesn't specify a T. I haven't been keeping up on the consequences that this would have for performance, though.

@johnmyleswhite
Member

The consequences are still pretty bad inside of a tight loop. More changes have to happen in the compiler to get that stuff to work well.

@davidagold
Contributor

Aye, that's what I thought. So we can add this to the list of decisions about what Julia statistics functionality ought to do automatically for the user (removal of nulls, etc.).

Speaking of compiler changes, how far is Union{Nothing, T} from being as amenable to static analysis/optimization as Nullable{T} is?

@johnmyleswhite
Member

Sadly I have no idea.

@datnamer

datnamer commented Mar 2, 2016

Is that sort of optimization even possible given the fundamental ambiguity of each element's type?

@davidagold
Contributor

@datnamer Simon Byrne's latest post in https://groups.google.com/forum/#!topic/julia-stats/29l5yA87Qss suggests there are strategies that may be able to handle this.

@datnamer

datnamer commented Mar 2, 2016

Interesting, thanks for the pointer.


5 participants