-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Make apply_schema work within the arguments to a FunctionalTerm #117
Conversation
There's a lot of whitespace changes here; can you revert those? |
Oh gross Atom must do this automatically. |
I have not gone through and reverted the white-space changes, yet. But here is the code mostly working. There are a bunch of TODOs. Also this is a breaking change. Note the broken test in modelmatrix.jl We are going to need a way to say not to apply a schema to a thing. e.g also we have no way to do I think that is right though, that is good (edited) |
I think back ticks would work well.
I am thinking that such a region is parses at least as much like normal Julia as the arguments of a Possibly even to the extent of not autobroardcasting. consider the following
Vs
|
@kleinschmidt I started to remove the whitespace changes. https://github.blog/2018-05-01-ignore-white-space-in-code-review/ |
Ok the protect stuff still is not quiet working right, and this at least lets us see what this idea looks like. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very cool! I've said a lot of this on slack but want to preserve the conversation here:
First, I'm surprised how elegant it is to get rid of the anonymous function magic from FunctionTerm. That's really the last bit of "macro magic" (otherwise all the macro does is apply some algebra rules and wrap symbols in Term
s) and if we could do it at runtime then that would be seriously cool.
Second, I'm a bit worried that there would be a performance cost. The advantage of creating an anonymous function that evaluates the body of the call (e.g., turning log(1 + a + a*b)
into (a,b) -> log(1 + a + a*b)
) is that the body is "fused"; the proposed solution would create a temporary array for each node in the AST and this might be expensive. However, it might be a small enough cost relative to the computational cost of the whole modeling pipeline that the increased flexibility could be worth it.
Third, this introduces a major change in the semantics of function calls. In original Terms 2.0, anonymous functions consume columns from the data. So f = (a,b) -> log(1 + a*b)
is evaluated like f.(data.a, data.b)
, with no transformations applied. This proposal changes that so that functions now consume terms, so now we're calling modelcols
on the terms generated by a
and b
(which could be continuous or categorical) and then calling f
on the values returned by modelcols
. This change is necessary to get the desired behavior here (that is, support for things like lag(x)^2
). But I'm a bit worried that this semantics might open the door to some kind of unpleasantness which I'm not quite clear on but still want to flag my discomfort :)
Finally, the larger issue in my mind is that there's a tradeoff between the flexibility we provide "power users" and the clarity of the semantics of a formula for "normal users". Currently, the semantics are very clear, but with limited flexibility in how custom terms are applied: anything that's not special cased is applied one row at a time, just as if it's normal julia code. With this proposal, as far as I can tell the semantics get a LOT more complicated, since non-special functions are by default consuming the output of potentially magical special terms. However, this proposal also provides a potential way out of this bind, which is the protect/unprotect mechanism. With the right defaults, I think we can strike a good balance between clear semantics for normal users while still allowing power users to do all the fancy feature engineering they want.
Finally finally, @oxinabox has mentioned that he also would like to explore an alternative approach that involves using macros to generate a lot of the boilerplate code necessary to create "special terms", which I think is a useful thing to explore. I don't think it'll solve all the things that this proposal does but it might make many of them a lot less annoying.
modelcols(t::CategoricalTerm, d::NamedTuple) = t.contrasts[d[t.sym], :] | ||
|
||
# This can exist in a formula after apply_schema if it was protected | ||
modelcols(t::ConstantTerm, d::ColumnTable) = fill(t.n, length(d)) | ||
modelcols(t::ConstantTerm, d::Tables.RowTable) = fill(t.n, length(first(d))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
length(first(d))
gives you the number of fields in the named tuple, not really what you want here right? this should just return a scalar for a single row.
In fact, both methods should maybe just return a scalar. Except that hcat(1, [1, 2,3])
doesn't work, so maybe not. But if the only reason that constant term is actually returning a value is to make it work as arguments to FunctionCallTerm
wrapped functions which are broadcasted then filling an array with values seems like an unnecessary allocation...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think rowtable and column table are switched actually.
(that would explain my dimension mismatch errors)
A RowTable isn't a Row though. that would be length 1.
Our RowTable handling is AFAICT kinda not real, but this isn't the place for that discussion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Just do the right thing for column table and leave row table handling for the future. And add a method for NamedTuple
that just returns the scalar so that it works for a single row?
# so the function will be called in `modelcols` | ||
return apply_schema(ft, schema, Mod) | ||
end | ||
apply_schema(ct::CallTerm, schema, Mod::Type) = call_fallback_apply_schema(ct, schema, Mod) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this indirection necessary? why not just put the body of call_fallback_apply_schema
in this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reasons of making it easy to write things that avoid ambiguities,
when defining protected methods.
Though I am not sure it is used in any code I have committed,
I may have backed the other stuff that used it out,
or not written it yet.
names = Tuple(termvars(arg_terms)) | ||
ex = Expr(:call, nameof(op), names...) | ||
ct = CallTerm{typeof(op), names}(+, ex, t) | ||
return call_fallback_apply_schema(ct, schema, Mod) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here as below...why not just call apply_schema(ct, schema, Mod)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, probably not needed.
unprotect(t::CallTerm{typeof(protect)}) = t.args_parsed[1] | ||
unprotect(t) = t | ||
function StatsModels.apply_schema(t::CallTerm{typeof(unprotect)}, sch, Mod::Type) | ||
throw(DomainError("`unprotect` used outside a protected context.")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using a DomainError seems a little punny.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but for once ArgumentError
seems wrong.
Could be a FormulaSyntaxError
maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or even ParseError
or LoadError
or whatever the error is that's thrown for invalid syntax? (but that's maybe too punny as well)
function direct_call(op, arg_terms::Tuple) | ||
names = Tuple(termvars(arg_terms)) | ||
ex = Expr(:call, nameof(op), names...) | ||
ct = CallTerm{typeof(op), names}(+, ex, t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be op
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
Yeah, I suspect one such option might be to make CallTerms default to I also think we can hold of on resolving this for Terms 3.0: Grandson of Terms, |
Agreed about this not blocking a release. For syntax, like we discussed on slack maybe you'd do |
closing this; it is built on in #183 |
This is to close #114
it is pretty involved.