Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Make apply_schema work within the arguments to a FunctionalTerm #117

Closed
wants to merge 6 commits into from

Conversation

oxinabox
Copy link
Contributor

This is to close #114

it is pretty involved.

@oxinabox oxinabox changed the title Make apply_schema work within the arguments to a FunctionalTerm [WIP] Make apply_schema work within the arguments to a FunctionalTerm Jun 14, 2019
@kleinschmidt
Copy link
Member

There's a lot of whitespace changes here; can you revert those?

@oxinabox
Copy link
Contributor Author

There's a lot of whitespace changes here; can you revert those?

Oh gross Atom must do this automatically.

@oxinabox
Copy link
Contributor Author

I have not gone through and reverted the white-space changes, yet.

But here is the code mostly working.

There are a bunch of TODOs.
And I haven't updated the docs,
or the extension tests.

Also this is a breaking change. Note the broken test in modelmatrix.jl
and it's workaround.
Because: it works too well
Some things assume that wrapping stuff in a function causes schema not to be applied.

We are going to need a way to say not to apply a schema to a thing.
Maybe backticks?

e.g 1|x outside of special context
the 1 gets pared into an intercept term
the x becomes a Float64

also we have no way to do + or * or &

I think that is right though, that is good (edited)
I rember thinking at juliacon that it was a bit weird that the way to do this before was the wrap expressions in identity

@oxinabox
Copy link
Contributor Author

I think back ticks would work well.

y ~ `x|1`

I am thinking that such a region is parses at least as much like normal Julia as the arguments of a FunctionTerm are before this PR.

Possibly even to the extent of not autobroardcasting.
Though that might be odd as it itself must endup the right size

consider the following

y~a+x^`-1`

Vs

y~a+`x.^-1`

@oxinabox
Copy link
Contributor Author

@kleinschmidt I started to remove the whitespace changes.
But it is annoying and is making the code worse.
(Also I worry Atom is just going to introduce them again)
You can set Github to not display whitepace changes in the diff.

https://github.blog/2018-05-01-ignore-white-space-in-code-review/

@oxinabox
Copy link
Contributor Author

Ok the protect stuff still is not quiet working right,
but the gist of it is there.
I am going to leave this alone for a while as I have other things to be doing.

and this at least lets us see what this idea looks like.

Copy link
Member

@kleinschmidt kleinschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very cool! I've said a lot of this on slack but want to preserve the conversation here:

First, I'm surprised how elegant it is to get rid of the anonymous function magic from FunctionTerm. That's really the last bit of "macro magic" (otherwise all the macro does is apply some algebra rules and wrap symbols in Terms) and if we could do it at runtime then that would be seriously cool.

Second, I'm a bit worried that there would be a performance cost. The advantage of creating an anonymous function that evaluates the body of the call (e.g., turning log(1 + a + a*b) into (a,b) -> log(1 + a + a*b)) is that the body is "fused"; the proposed solution would create a temporary array for each node in the AST and this might be expensive. However, it might be a small enough cost relative to the computational cost of the whole modeling pipeline that the increased flexibility could be worth it.

Third, this introduces a major change in the semantics of function calls. In original Terms 2.0, anonymous functions consume columns from the data. So f = (a,b) -> log(1 + a*b) is evaluated like f.(data.a, data.b), with no transformations applied. This proposal changes that so that functions now consume terms, so now we're calling modelcols on the terms generated by a and b (which could be continuous or categorical) and then calling f on the values returned by modelcols. This change is necessary to get the desired behavior here (that is, support for things like lag(x)^2). But I'm a bit worried that this semantics might open the door to some kind of unpleasantness which I'm not quite clear on but still want to flag my discomfort :)

Finally, the larger issue in my mind is that there's a tradeoff between the flexibility we provide "power users" and the clarity of the semantics of a formula for "normal users". Currently, the semantics are very clear, but with limited flexibility in how custom terms are applied: anything that's not special cased is applied one row at a time, just as if it's normal julia code. With this proposal, as far as I can tell the semantics get a LOT more complicated, since non-special functions are by default consuming the output of potentially magical special terms. However, this proposal also provides a potential way out of this bind, which is the protect/unprotect mechanism. With the right defaults, I think we can strike a good balance between clear semantics for normal users while still allowing power users to do all the fancy feature engineering they want.

Finally finally, @oxinabox has mentioned that he also would like to explore an alternative approach that involves using macros to generate a lot of the boilerplate code necessary to create "special terms", which I think is a useful thing to explore. I don't think it'll solve all the things that this proposal does but it might make many of them a lot less annoying.

modelcols(t::CategoricalTerm, d::NamedTuple) = t.contrasts[d[t.sym], :]

# This can exist in a formula after apply_schema if it was protected
modelcols(t::ConstantTerm, d::ColumnTable) = fill(t.n, length(d))
modelcols(t::ConstantTerm, d::Tables.RowTable) = fill(t.n, length(first(d)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

length(first(d)) gives you the number of fields in the named tuple, not really what you want here right? this should just return a scalar for a single row.

In fact, both methods should maybe just return a scalar. Except that hcat(1, [1, 2,3]) doesn't work, so maybe not. But if the only reason that constant term is actually returning a value is to make it work as arguments to FunctionCallTerm wrapped functions which are broadcasted then filling an array with values seems like an unnecessary allocation...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think rowtable and column table are switched actually.
(that would explain my dimension mismatch errors)

A RowTable isn't a Row though. that would be length 1.
Our RowTable handling is AFAICT kinda not real, but this isn't the place for that discussion

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Just do the right thing for column table and leave row table handling for the future. And add a method for NamedTuple that just returns the scalar so that it works for a single row?

# so the function will be called in `modelcols`
return apply_schema(ft, schema, Mod)
end
apply_schema(ct::CallTerm, schema, Mod::Type) = call_fallback_apply_schema(ct, schema, Mod)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this indirection necessary? why not just put the body of call_fallback_apply_schema in this method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reasons of making it easy to write things that avoid ambiguities,
when defining protected methods.

Though I am not sure it is used in any code I have committed,
I may have backed the other stuff that used it out,
or not written it yet.

names = Tuple(termvars(arg_terms))
ex = Expr(:call, nameof(op), names...)
ct = CallTerm{typeof(op), names}(+, ex, t)
return call_fallback_apply_schema(ct, schema, Mod)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here as below...why not just call apply_schema(ct, schema, Mod)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, probably not needed.

unprotect(t::CallTerm{typeof(protect)}) = t.args_parsed[1]
unprotect(t) = t
function StatsModels.apply_schema(t::CallTerm{typeof(unprotect)}, sch, Mod::Type)
throw(DomainError("`unprotect` used outside a protected context."))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using a DomainError seems a little punny.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but for once ArgumentError seems wrong.
Could be a FormulaSyntaxError maybe?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even ParseError or LoadError or whatever the error is that's thrown for invalid syntax? (but that's maybe too punny as well)

function direct_call(op, arg_terms::Tuple)
names = Tuple(termvars(arg_terms))
ex = Expr(:call, nameof(op), names...)
ct = CallTerm{typeof(op), names}(+, ex, t)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be op?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@oxinabox
Copy link
Contributor Author

With the right defaults, I think we can strike a good balance between clear semantics for normal users while still allowing power users to do all the fancy feature engineering they want.

Yeah, I suspect one such option might be to make CallTerms default to protected, which has almost the same semantics as current.
Then one would do something like unleash(sin)(lag(x)) to bring on the full power.
Or sin(unprotect(lag(x))) would actually work.

I also think we can hold of on resolving this for Terms 3.0: Grandson of Terms,
and a terms 2. release could happen first.

@kleinschmidt
Copy link
Member

Agreed about this not blocking a release.

For syntax, like we discussed on slack maybe you'd do sin($(lag(x))) to get the "unprotected" LagTerm interpretation and :(x + y) to get addition (e.g. protect). The @formula macro could do the necessary transformation of this punny syntax into calls to protect and unprotect.

@kleinschmidt
Copy link
Member

closing this; it is built on in #183

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants