[WIP] Make apply_schema work within the arguments to a FunctionalTerm #117

oxinabox · 2019-06-14T15:52:59Z

This is to close #114

it is pretty involved.

kleinschmidt · 2019-06-14T17:23:43Z

There's a lot of whitespace changes here; can you revert those?

oxinabox · 2019-06-14T18:02:53Z

There's a lot of whitespace changes here; can you revert those?

Oh gross Atom must do this automatically.

oxinabox · 2019-06-14T20:53:24Z

I have not gone through and reverted the white-space changes, yet.

But here is the code mostly working.

There are a bunch of TODOs.
And I haven't updated the docs,
or the extension tests.

Also this is a breaking change. Note the broken test in modelmatrix.jl
and it's workaround.
Because: it works too well
Some things assume that wrapping stuff in a function causes schema not to be applied.

We are going to need a way to say not to apply a schema to a thing.
Maybe backticks?

e.g 1|x outside of special context
the 1 gets pared into an intercept term
the x becomes a Float64

also we have no way to do + or * or &

I think that is right though, that is good (edited)
I rember thinking at juliacon that it was a bit weird that the way to do this before was the wrap expressions in identity

oxinabox · 2019-06-14T22:24:17Z

I think back ticks would work well.

y ~ `x|1`

I am thinking that such a region is parses at least as much like normal Julia as the arguments of a FunctionTerm are before this PR.

Possibly even to the extent of not autobroardcasting.
Though that might be odd as it itself must endup the right size

consider the following

y~a+x^`-1`

Vs

y~a+`x.^-1`

oxinabox · 2019-06-17T13:13:45Z

@kleinschmidt I started to remove the whitespace changes.
But it is annoying and is making the code worse.
(Also I worry Atom is just going to introduce them again)
You can set Github to not display whitepace changes in the diff.

https://github.blog/2018-05-01-ignore-white-space-in-code-review/

oxinabox · 2019-06-17T17:50:33Z

Ok the protect stuff still is not quiet working right,
but the gist of it is there.
I am going to leave this alone for a while as I have other things to be doing.

and this at least lets us see what this idea looks like.

kleinschmidt

This is very cool! I've said a lot of this on slack but want to preserve the conversation here:

First, I'm surprised how elegant it is to get rid of the anonymous function magic from FunctionTerm. That's really the last bit of "macro magic" (otherwise all the macro does is apply some algebra rules and wrap symbols in Terms) and if we could do it at runtime then that would be seriously cool.

Second, I'm a bit worried that there would be a performance cost. The advantage of creating an anonymous function that evaluates the body of the call (e.g., turning log(1 + a + a*b) into (a,b) -> log(1 + a + a*b)) is that the body is "fused"; the proposed solution would create a temporary array for each node in the AST and this might be expensive. However, it might be a small enough cost relative to the computational cost of the whole modeling pipeline that the increased flexibility could be worth it.

Third, this introduces a major change in the semantics of function calls. In original Terms 2.0, anonymous functions consume columns from the data. So f = (a,b) -> log(1 + a*b) is evaluated like f.(data.a, data.b), with no transformations applied. This proposal changes that so that functions now consume terms, so now we're calling modelcols on the terms generated by a and b (which could be continuous or categorical) and then calling f on the values returned by modelcols. This change is necessary to get the desired behavior here (that is, support for things like lag(x)^2). But I'm a bit worried that this semantics might open the door to some kind of unpleasantness which I'm not quite clear on but still want to flag my discomfort :)

Finally, the larger issue in my mind is that there's a tradeoff between the flexibility we provide "power users" and the clarity of the semantics of a formula for "normal users". Currently, the semantics are very clear, but with limited flexibility in how custom terms are applied: anything that's not special cased is applied one row at a time, just as if it's normal julia code. With this proposal, as far as I can tell the semantics get a LOT more complicated, since non-special functions are by default consuming the output of potentially magical special terms. However, this proposal also provides a potential way out of this bind, which is the protect/unprotect mechanism. With the right defaults, I think we can strike a good balance between clear semantics for normal users while still allowing power users to do all the fancy feature engineering they want.

Finally finally, @oxinabox has mentioned that he also would like to explore an alternative approach that involves using macros to generate a lot of the boilerplate code necessary to create "special terms", which I think is a useful thing to explore. I don't think it'll solve all the things that this proposal does but it might make many of them a lot less annoying.

kleinschmidt · 2019-06-18T23:28:05Z

src/terms.jl

 modelcols(t::CategoricalTerm, d::NamedTuple) = t.contrasts[d[t.sym], :]

+# This can exist in a formula after apply_schema if it was protected
+modelcols(t::ConstantTerm, d::ColumnTable) = fill(t.n, length(d))
+modelcols(t::ConstantTerm, d::Tables.RowTable) = fill(t.n, length(first(d)))


length(first(d)) gives you the number of fields in the named tuple, not really what you want here right? this should just return a scalar for a single row.

In fact, both methods should maybe just return a scalar. Except that hcat(1, [1, 2,3]) doesn't work, so maybe not. But if the only reason that constant term is actually returning a value is to make it work as arguments to FunctionCallTerm wrapped functions which are broadcasted then filling an array with values seems like an unnecessary allocation...

I think rowtable and column table are switched actually.
(that would explain my dimension mismatch errors)

A RowTable isn't a Row though. that would be length 1.
Our RowTable handling is AFAICT kinda not real, but this isn't the place for that discussion

Right. Just do the right thing for column table and leave row table handling for the future. And add a method for NamedTuple that just returns the scalar so that it works for a single row?

kleinschmidt · 2019-06-18T23:41:10Z

src/schema.jl

+    # so the function will be called in `modelcols`
+    return apply_schema(ft, schema, Mod)
+end
+apply_schema(ct::CallTerm, schema, Mod::Type) = call_fallback_apply_schema(ct, schema, Mod)


is this indirection necessary? why not just put the body of call_fallback_apply_schema in this method?

For reasons of making it easy to write things that avoid ambiguities,
when defining protected methods.

Though I am not sure it is used in any code I have committed,
I may have backed the other stuff that used it out,
or not written it yet.

kleinschmidt · 2019-06-18T23:53:21Z

src/protection.jl

+    names = Tuple(termvars(arg_terms))
+    ex = Expr(:call, nameof(op), names...)
+    ct = CallTerm{typeof(op), names}(+, ex, t)
+    return call_fallback_apply_schema(ct, schema, Mod)


same here as below...why not just call apply_schema(ct, schema, Mod)?

Yeah, probably not needed.

kleinschmidt · 2019-06-18T23:58:54Z

src/protection.jl

+unprotect(t::CallTerm{typeof(protect)}) = t.args_parsed[1]
+unprotect(t) = t
+function StatsModels.apply_schema(t::CallTerm{typeof(unprotect)}, sch, Mod::Type)
+    throw(DomainError("`unprotect` used outside a protected context."))


using a DomainError seems a little punny.

Yes, but for once ArgumentError seems wrong.
Could be a FormulaSyntaxError maybe?

Or even ParseError or LoadError or whatever the error is that's thrown for invalid syntax? (but that's maybe too punny as well)

kleinschmidt · 2019-06-19T00:00:59Z

src/protection.jl

+function direct_call(op, arg_terms::Tuple)
+    names = Tuple(termvars(arg_terms))
+    ex = Expr(:call, nameof(op), names...)
+    ct = CallTerm{typeof(op), names}(+, ex, t)


should be op?

oxinabox · 2019-06-19T10:02:20Z

With the right defaults, I think we can strike a good balance between clear semantics for normal users while still allowing power users to do all the fancy feature engineering they want.

Yeah, I suspect one such option might be to make CallTerms default to protected, which has almost the same semantics as current.
Then one would do something like unleash(sin)(lag(x)) to bring on the full power.
Or sin(unprotect(lag(x))) would actually work.

I also think we can hold of on resolving this for Terms 3.0: Grandson of Terms,
and a terms 2. release could happen first.

kleinschmidt · 2019-06-19T15:21:10Z

Agreed about this not blocking a release.

For syntax, like we discussed on slack maybe you'd do sin($(lag(x))) to get the "unprotected" LagTerm interpretation and :(x + y) to get addition (e.g. protect). The @formula macro could do the necessary transformation of this punny syntax into calls to protect and unprotect.

kleinschmidt · 2023-03-13T22:01:24Z

closing this; it is built on in #183

oxinabox added 2 commits June 13, 2019 17:40

Prove JuliaStats#114 occurs

ea7ba60

First pass at readoing FunctionTerms into two parts

e99a244

oxinabox changed the title ~~Make apply_schema work within the arguments to a FunctionalTerm~~ [WIP] Make apply_schema work within the arguments to a FunctionalTerm Jun 14, 2019

2nd pass at redoing Function Term to expand all

093abba

oxinabox added 3 commits June 17, 2019 16:04

redefine poly term in tests

274e220

Add Protection [WIP]

45126c1

Further work on Protection

c289bdc

kleinschmidt reviewed Jun 19, 2019

View reviewed changes

kleinschmidt mentioned this pull request Jun 19, 2019

An explictly overloaded FunctionalTerm used in the arguments to julia function used in a formula is always treated as a julia function. #114

Open

kleinschmidt mentioned this pull request Jul 1, 2019

Formula transfer updated to StatsModels v0.6.0 JuliaInterop/RCall.jl#317

Merged

kleinschmidt mentioned this pull request Jul 11, 2019

Programmatic formulas with function calls #130

Closed

kleinschmidt mentioned this pull request Aug 1, 2019

Make modelcols(::Term, d) an error #137

Merged

kleinschmidt mentioned this pull request Nov 11, 2019

Speed creating formula #164

Open

kleinschmidt mentioned this pull request May 23, 2020

Use some character other than * for "main effects and interactions" #99

Open

kleinschmidt mentioned this pull request Jul 11, 2020

FunctionTerm is dead, long live FunctionTerm #183

Merged

kleinschmidt closed this Mar 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Make apply_schema work within the arguments to a FunctionalTerm #117

[WIP] Make apply_schema work within the arguments to a FunctionalTerm #117

oxinabox commented Jun 14, 2019

kleinschmidt commented Jun 14, 2019

oxinabox commented Jun 14, 2019

oxinabox commented Jun 14, 2019

oxinabox commented Jun 14, 2019

oxinabox commented Jun 17, 2019

oxinabox commented Jun 17, 2019

kleinschmidt left a comment

kleinschmidt Jun 18, 2019

oxinabox Jun 19, 2019

kleinschmidt Jun 19, 2019

kleinschmidt Jun 18, 2019

oxinabox Jun 19, 2019

kleinschmidt Jun 18, 2019

oxinabox Jun 19, 2019

kleinschmidt Jun 18, 2019

oxinabox Jun 19, 2019

kleinschmidt Jun 19, 2019

kleinschmidt Jun 19, 2019

oxinabox Jun 19, 2019

oxinabox commented Jun 19, 2019

kleinschmidt commented Jun 19, 2019

kleinschmidt commented Mar 13, 2023

[WIP] Make apply_schema work within the arguments to a FunctionalTerm #117

[WIP] Make apply_schema work within the arguments to a FunctionalTerm #117

Conversation

oxinabox commented Jun 14, 2019

kleinschmidt commented Jun 14, 2019

oxinabox commented Jun 14, 2019

oxinabox commented Jun 14, 2019

oxinabox commented Jun 14, 2019

oxinabox commented Jun 17, 2019

oxinabox commented Jun 17, 2019

kleinschmidt left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

oxinabox commented Jun 19, 2019

kleinschmidt commented Jun 19, 2019

kleinschmidt commented Mar 13, 2023