Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unified framework for dealing with Nulls? #17

Open
mkborregaard opened this issue Mar 22, 2017 · 12 comments
Open

Unified framework for dealing with Nulls? #17

mkborregaard opened this issue Mar 22, 2017 · 12 comments

Comments

@mkborregaard
Copy link
Contributor

Currently, lm seems to deal with NAs in DataFrames by ignoring the rows where they appear:

df = DataFrame(a = randn(10), b = randn(10));
df[:a][2] = NA
mod = lm(a~b, df)
mod.model.pp.X

This is great, but is there a general syntax for this behaviour? I'm asking, among other things, because I want to add a keyword for this behaviour to plotting with DataFrames in StatPlots.

In R, this is a complete mess. mean uses the keyword na.rm = T, lm uses na.action = "na.omit", cor uses `use = "complete.obs", which is just the sort of thing that guarantees that half your coding time in R is spent reading man pages also after 10 years of use.

@nalimilan
Copy link
Member

nalimilan commented Mar 22, 2017

For functions like mean, for now has to do mean(dropnull(x)), which is consistent but makes a copy. A keyword argument skipnull=true (like for reduce in NullableArrays) or skipna=true (like in DataArrays) would work, but we could also imagine a more generic system where a dropnull or skipnull function would return a wrapper iterator skipping nulls.

Anyway it's not clear whether this system would be the same with statistical models. If the question is as simple as for mean, the same keyword argument could be used. But maybe it's more complex. R's na.action system allows four choices: skip nulls (default), skip nulls but use null in the predict output for corresponding rows, keep nulls (i.e. no-op), throw and error if there is a null. I'm not convinced this is needed:

  • predict can take a separate argument to decide how to handle nulls (I think the R design comes from the fact that a global option allows choosing the default, so it's useful to have different effects on different functions, but we don't need that)
  • no-op does not make sense for model matrices
  • we don't really need an option to throw an error if missing values are present since we have non-nullable vectors, and it's easy to call anynull(df) beforehand

So overall, maybe we just need to standardize a keyword argument name, like skipnull? We should review what other software does.

EDIT: Forgot to address the case of pairwise functions like cor and cov, where more choices are possible. I think we've discussed them in the past, need to find the link.

@mkborregaard
Copy link
Contributor Author

Yes, sounds very sensible. Ah, and what's the current policy for predict? I believe I should not have to worry about it for JuliaData/DataFrames.jl#1160.

@nalimilan
Copy link
Member

nalimilan commented Mar 22, 2017

I forgot to address the case of pairwise functions like cor and cov, where more choices are possible: skip all rows with at least in null in one of the columns (cor(dt[completecases(dt), :])), or only skip those with a null in one of the two variables considered for each computation. This has been discussed before on julia-stats. I would say these functions could still take the skipnull=true argument for consistency with others, but possibly combined with another argument like pairwise_complete=true to consider only the relevant columns when skipping.

But this actually has really little or no relation with modeling functions. :-)

Regarding predict, AFAIK the current policy is simple: since only complete observations are kept in the model matrix, no missing values can appear in the output of predict. I think it would be useful to store the list of removed cases so that a missing value that can be returned for them when needed. Could be as simple as storing a list of indices (not sure whether storing a second model matrix with nulls where information is unavailable would be needed for some applications).

@mkborregaard
Copy link
Contributor Author

mkborregaard commented Mar 22, 2017

Cool. Yes, e.g. df = DataFrame(x = x, y = y); mod = lm(y~x, df); plot(df, :y, predict(mod)) would be nice to be able to do even in the presence of NAs (of course you can already do plot(df, :y, predict(mod, df)) so maybe not the most illustrative example).

@mkborregaard
Copy link
Contributor Author

mkborregaard commented Mar 22, 2017

It is a bit tricky perhaps to get the skipnull = true and pairwise_complete=true combination to be intuitive (would setting pairwise_complete=true lead to fewer or more observations to be included? what if skipnull = false?). Perhaps allowing skipnull to take true, false and, where applicable, :pairwise_complete would be an option (with :none an alias for false and :all an alias for true? Plots.jl uses this kind of logic in many places but maybe it would appear too magical.

@nalimilan
Copy link
Member

Yes, there's still the option of allowing both Bool and Symbol. Or we could find a better name for the argument. Anyway, better discuss this in the right packages (DataArrays and NullableArrays).

@andreasnoack
Copy link
Member

There is some iterator functionality for handling this in DataArrays. I don't know if it practical but I'd love to see some alternative way to specify this that isn't a keyword argument. Keyword arguments are so R like. I don't have a specific proposal though. Just wanted to mentioned it.

@nalimilan
Copy link
Member

For reduction functions like mean, an iterator can easily be used. For pairwise functions like cor, it would be tricky to handle the "pairwise missing deletion" since the result cannot be a data table-like structure: deletion needs to happen inside the function itself. Though for this case (which might not be a so frequent need), we could have a generic pairwise function (like the one in Distances.jl) which would take the arguments we talked about. That way at least we wouldn't need to reimplement them several times. Standard incomplete case deletion could be done by users via cor(dropnull(dt)) or cor(dt[complete_cases(dt),:]).

@mkborregaard
Copy link
Contributor Author

mkborregaard commented Mar 22, 2017

I actually recently put a generic pairwise function in VectorizedRoutines.jl if that's relevant https://github.com/ChrisRackauckas/VectorizedRoutines.jl/pull/9/files .
You'd be welcome to copy over any elements of it (given that I'm pretty sure you don't want a dependency). I got a lot of useful feedback on the PR from Steven Johnson, so the worst elements should have been ironed out.

@nalimilan
Copy link
Member

Interesting, but that pairwise function operates on individual elements of a vector, not on columns or rows. Looks like it could be generalized by passing the dimension to slice as the third argument. pairwise(f, a, 2) would behave like pairwise(dist, a) from Distances.jl, and the two functions could be merged (with a definition in StatsBase?).

@mkborregaard
Copy link
Contributor Author

mkborregaard commented Mar 25, 2017

That is a good idea - the function has always been intended to be extended to multiple dimensions. Which design do you think would be preferable - one where f always returns a scalar and the return type is a Matrix or one that allows return types of any dimension N and concatenates it to an N+2 Array? I think the first, just making sure.

@nalimilan
Copy link
Member

I would start assuming a scalar. We can always make it even more general if that turns out to be useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants