Document contrasts and formulas etc. (#6)
Documentation for functionality pulled from DataFrames: contrast coding, formulas, modelframe/matrix.
1 parent
1c758ae
commit 2203992
Showing 10 changed files with 417 additions and 47 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,4 @@ | ||
*.jl.cov | ||
*.jl.*.cov | ||
*.jl.mem | ||
docs/build |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
using Documenter, StatsModels | ||
|
||
makedocs( | ||
format = :html, | ||
sitename = "StatsModels.jl", | ||
pages = [ | ||
"Introduction" => "index.md", | ||
"Modeling tabular data" => "formula.md", | ||
"Contrast coding categorical variables" => "contrasts.md" | ||
] | ||
) | ||
|
||
deploydocs( | ||
repo = "github.com/JuliaStats/StatsModels.jl.git", | ||
target = "build", | ||
deps = nothing, | ||
make = nothing | ||
) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,103 @@ | ||
```@meta | ||
CurrentModule = StatsModels | ||
``` | ||
|
||
# Modeling categorical data | ||
|
||
To convert categorical data into a numerical representation suitable for | ||
modeling, `StatsModels` implements a variety of **contrast coding systems**. | ||
Each contrast coding system maps a categorical vector with $k$ levels onto | ||
$k-1$ linearly independent model matrix columns. | ||
|
||
The following contrast coding systems are implemented: | ||
|
||
* [`DummyCoding`](@ref) | ||
* [`EffectsCoding`](@ref) | ||
* [`HelmertCoding`](@ref) | ||
* [`ContrastsCoding`](@ref) | ||
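
As a rough illustration of what "$k$ levels onto $k-1$ columns" means, the sketch below writes out textbook forms of three of these contrast matrices by hand for a 3-level variable, using only Julia's standard library. The names `lvls`, `dummy`, `effects`, and `helmert` are hypothetical and chosen for this example; the matrices the package actually generates may differ in base level or column order.

```julia
using LinearAlgebra

# Hypothetical 3-level variable; rows of each matrix correspond to these levels
lvls = ["a", "b", "c"]
k = length(lvls)

# Dummy (treatment) coding: the first level is the reference (all zeros)
dummy = [0 0;
         1 0;
         0 1]

# Effects (sum) coding: the reference level is coded -1 in every column
effects = [-1 -1;
            1  0;
            0  1]

# Helmert coding: each level is compared with the mean of the preceding levels
helmert = [-1 -1;
            1 -1;
            0  2]

# Map a categorical vector onto model matrix columns by looking up each row
x = ["a", "c", "b", "c"]
rows = [findfirst(isequal(v), lvls) for v in x]
X = dummy[rows, :]
```

Each matrix has $k - 1 = 2$ linearly independent columns, which can be checked with `rank(dummy)`, `rank(effects)`, and `rank(helmert)`.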
|
||
## How to specify contrast coding | ||
|
||
The default contrast coding system is `DummyCoding`. To override this, use | ||
the `contrasts` argument when constructing a `ModelFrame`: | ||
|
||
```julia | ||
mf = ModelFrame(y ~ 1 + x, df, contrasts = Dict(:x => EffectsCoding())) | ||
``` | ||
|
||
To change the contrast coding for one or more variables in place, use | ||
|
||
```@docs | ||
setcontrasts! | ||
``` | ||
|
||
## Interface | ||
|
||
```@docs | ||
AbstractContrasts | ||
ContrastsMatrix | ||
``` | ||
|
||
## Contrast coding systems | ||
|
||
```@docs | ||
DummyCoding | ||
EffectsCoding | ||
HelmertCoding | ||
ContrastsCoding | ||
``` | ||
|
||
### Special internal contrasts | ||
|
||
```@docs | ||
FullDummyCoding | ||
``` | ||
|
||
## Further details | ||
|
||
### Categorical variables in `Formula`s | ||
|
||
Generating model matrices from multiple variables, some of which are | ||
categorical, requires special care. The reason for this is that rank-$k-1$ | ||
contrasts are appropriate for a categorical variable with $k$ levels when it | ||
*aliases* other terms, making it *partially redundant*. Using rank-$k$ for such | ||
a redundant variable will generally result in a rank-deficient model matrix and | ||
a model that can't be identified. | ||
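
The rank deficiency can be seen in a small hand-built example. This is only a sketch using the `LinearAlgebra` standard library; the variable names are hypothetical and no package functions are involved.

```julia
using LinearAlgebra

# Level index of a 3-level categorical variable for six observations
a = [1, 2, 3, 1, 2, 3]

# Full-rank (k-column) indicator coding: one column per level
full = [Int(a[i] == j) for i in 1:6, j in 1:3]
intercept = ones(Int, 6)

# Intercept plus all k indicator columns: the indicators sum to the intercept,
# so one column is a linear combination of the others
X_redundant = hcat(intercept, full)

# Intercept plus k-1 columns (first level dropped as the reference)
X_ok = hcat(intercept, full[:, 2:end])

rank(X_redundant)   # 3, although the matrix has 4 columns
rank(X_ok)          # 3, equal to the number of columns
```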
|
||
A categorical variable in a term *aliases* the term that remains when that | ||
variable is dropped. For example, with categorical `a`: | ||
|
||
* In `a`, the sole variable `a` aliases the intercept term `1`. | ||
* In `a&b`, the variable `a` aliases the main effect term `b`, and vice versa. | ||
* In `a&b&c`, the variable `a` aliases the interaction term `b&c` (regardless of | ||
whether `b` and `c` are categorical). | ||
|
||
If a categorical variable aliases another term that is present elsewhere in the | ||
formula, we call that variable *redundant*. A variable is *non-redundant* when | ||
the term that it aliases is _not_ present elsewhere in the formula. For | ||
categorical `a`, `b`, and `c`: | ||
|
||
* In `y ~ 1 + a`, the `a` in the main effect of `a` aliases the intercept `1`, which is present, so `a` is *redundant*. | ||
* In `y ~ 0 + a`, `a` does not alias any other terms and is *non-redundant*. | ||
* In `y ~ 1 + a + a&b`: | ||
* The `b` in `a&b` is redundant because it aliases the main effect `a`: | ||
dropping `b` from `a&b` leaves `a`. | ||
* The `a` in `a&b` is *non-redundant* because it aliases `b`, which is not | ||
present anywhere else in the formula. | ||
|
||
When constructing a `ModelFrame` from a `Formula`, each term is checked for | ||
non-redundant categorical variables. Any such non-redundant variables are | ||
"promoted" to full rank in that term by using [`FullDummyCoding`](@ref) instead | ||
of the contrasts used elsewhere for that variable. | ||
|
||
One additional complexity is introduced by promoting non-redundant variables to | ||
full rank. For the purpose of determining redundancy, a full-rank dummy coded | ||
categorical variable _implicitly_ introduces the term that it aliases into the | ||
formula. Thus, in `y ~ 1 + a + a&b + b&c`: | ||
|
||
* In `a&b`, `a` aliases the main effect `b`, which is not explicitly present in | ||
the formula. This makes it non-redundant and so its contrast coding is | ||
promoted to `FullDummyCoding`, which _implicitly_ introduces the main effect | ||
of `b`. | ||
* Then, in `b&c`, the variable `c` is now _redundant_ because it aliases the main | ||
effect of `b`, and so it keeps its original contrast coding system. |
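
The redundancy rules above can be sketched in a few lines of plain Julia. This is an illustration of the description in this section, not the package's implementation: the function name `nonredundant` and the representation of terms as vectors of symbols (with the intercept as an empty term) are assumptions made for the example, and terms are processed in the order they appear.

```julia
# Return the (term, variable) pairs that would be promoted to full rank.
# `terms` lists the formula's terms in order; `Symbol[]` stands for the intercept.
function nonredundant(terms::Vector{Vector{Symbol}}, categorical::Set{Symbol})
    seen = Set{Set{Symbol}}()                 # terms present so far (explicit or implicit)
    promoted = Tuple{Vector{Symbol},Symbol}[]
    for term in terms
        for v in term
            v in categorical || continue
            # The aliased term is what remains when `v` is dropped from `term`
            aliased = Set(filter(!=(v), term))
            if !(aliased in seen)
                push!(promoted, (term, v))    # non-redundant: promote to full rank
                push!(seen, aliased)          # promotion implicitly introduces the aliased term
            end
        end
        push!(seen, Set(term))
    end
    return promoted
end

cats = Set([:a, :b, :c])

# y ~ 1 + a + a&b + b&c
nonredundant([Symbol[], [:a], [:a, :b], [:b, :c]], cats)

# y ~ 0 + a
nonredundant([[:a]], cats)
```

For `y ~ 1 + a + a&b + b&c`, this sketch promotes `a` in `a&b`, which implicitly introduces `b`, and leaves `c` in `b&c` with its original contrasts, matching the walk-through above. It also promotes `b` in `b&c`, because the main effect `c` that it aliases appears nowhere else.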
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,89 @@ | ||
```@meta | ||
CurrentModule = StatsModels | ||
DocTestSetup = quote | ||
using StatsModels | ||
end | ||
``` | ||
|
||
# Modeling tabular data | ||
|
||
Most statistical models require that data be represented as a `Matrix`-like | ||
collection of a single numeric type. Much of the data we want to model, | ||
however, is **tabular data**, where data is represented as a collection of | ||
fields with possibly heterogeneous types. One of the primary goals of | ||
`StatsModels` is to make it simpler to transform tabular data into matrix format | ||
suitable for statistical modeling. | ||
|
||
At the moment, "tabular data" means an `AbstractDataFrame`. Ultimately, the | ||
goal is to support any tabular data format that adheres to a minimal API, | ||
**regardless of backend**. | ||
|
||
## The `Formula` type | ||
|
||
The basic conceptual tool for this is the `Formula`, which has a left side and a | ||
right side, separated by `~`: | ||
|
||
```jldoctest | ||
julia> y ~ 1 + a | ||
Formula: y ~ 1 + a | ||
``` | ||
|
||
The left side of a formula conventionally represents *dependent* variables, and | ||
the right side *independent* variables (or regressors). *Terms* are separated | ||
by `+`. Basic terms are the integers `1` or `0`—evaluated as the presence or | ||
absence of a constant intercept term, respectively—and variables like `x`, | ||
which will evaluate to the data source column with that name as a symbol (`:x`). | ||
|
||
Individual variables can be combined into *interaction terms* with `&`, as in | ||
`a&b`, which will evaluate to the product of the columns named `:a` and `:b`. | ||
If either `a` or `b` is categorical, then the interaction term `a&b` generates | ||
the product of each pair of columns of `a` and `b`. | ||
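
A minimal sketch of how such pairwise column products could be computed is shown below. The helper `interaction` is hypothetical and the column ordering is an assumption; the package may order interaction columns differently.

```julia
# Row-wise products of every pair of columns taken from two column blocks
function interaction(A::AbstractMatrix, B::AbstractMatrix)
    size(A, 1) == size(B, 1) || throw(DimensionMismatch("row counts differ"))
    cols = [A[:, i] .* B[:, j] for j in axes(B, 2) for i in axes(A, 2)]
    return hcat(cols...)
end

# Two dummy-coded columns of a 3-level categorical variable (three observations)
A = [1 0;
     0 1;
     0 0]

# A continuous variable, stored as a one-column block
b = reshape([2.0, 3.0, 4.0], :, 1)

interaction(A, b)   # 3×2 matrix: each dummy column scaled by `b`
interaction(A, A)   # 3×4 matrix: one column per pair of columns
```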
|
||
It's often convenient to include main effects and interactions for a number of | ||
variables. The `*` operator does this, expanding in the following way: | ||
|
||
```jldoctest | ||
julia> Formula(StatsModels.Terms(y ~ 1 + a*b)) | ||
Formula: y ~ 1 + a + b + a & b | ||
``` | ||
|
||
(We trigger parsing of the formula using the internal `Terms` type to show how | ||
the `Formula` expands.) | ||
|
||
This applies to higher-order interactions, too: `a*b*c` expands to the main | ||
effects, all two-way interactions, and the three-way interaction `a&b&c`: | ||
|
||
```jldoctest | ||
julia> Formula(StatsModels.Terms(y ~ 1 + a*b*c)) | ||
Formula: y ~ 1 + a + b + c + a & b + a & c + b & c + &(a,b,c) | ||
``` | ||
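
The expansion performed by `*` amounts to enumerating every non-empty subset of the variables, ordered by interaction degree. The sketch below reproduces that ordering in plain Julia; `expand_star` is a hypothetical helper written for this illustration and is not part of the package.

```julia
# Expand a*b*... into main effects and all interactions, ordered by degree
function expand_star(vars::Vector{Symbol})
    n = length(vars)
    # Each bit mask from 1 to 2^n - 1 selects one non-empty subset of `vars`
    subsets = [[vars[i] for i in 1:n if ((mask >> (i - 1)) & 1) == 1]
               for mask in 1:(2^n - 1)]
    # A stable sort keeps the original order within each interaction degree
    return sort(subsets, by = length, alg = MergeSort)
end

expand_star([:a, :b, :c])
```

For `[:a, :b, :c]` this yields the three main effects, then `a&b`, `a&c`, `b&c`, and finally the three-way interaction, in the same order as the doctest output above.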
|
||
Both the `*` and the `&` operators act like multiplication, and are distributive | ||
over addition: | ||
|
||
```jldoctest | ||
julia> Formula(StatsModels.Terms(y ~ 1 + (a+b) & c)) | ||
Formula: y ~ 1 + c & a + c & b | ||
julia> Formula(StatsModels.Terms(y ~ 1 + (a+b) * c)) | ||
Formula: y ~ 1 + a + b + c + c & a + c & b | ||
``` | ||
|
||
## The `ModelFrame` and `ModelMatrix` types | ||
|
||
The main use of `Formula`s is for fitting statistical models based on tabular | ||
data. From the user's perspective, this is done by `fit` methods that take a | ||
`Formula` and a `DataFrame` instead of numeric matrices. | ||
|
||
Internally, this is accomplished in three stages: | ||
|
||
1. The `Formula` is parsed into [`Terms`](@ref). | ||
2. The `Terms` and the data source are wrapped in a [`ModelFrame`](@ref). | ||
3. A numeric [`ModelMatrix`](@ref) is generated from the `ModelFrame` and passed to the | ||
model's `fit` method. | ||
|
||
```@docs | ||
ModelFrame | ||
ModelMatrix | ||
Terms | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
# StatsModels Documentation | ||
|
||
This package provides common abstractions and utilities for specifying, fitting, | ||
and evaluating statistical models. The goal is to provide an API for package | ||
developers implementing different kinds of statistical models (see | ||
the [GLM](https://www.github.com/JuliaStats/GLM.jl) package | ||
for example), and utilities that are generally useful for both users and | ||
developers when dealing with statistical models and tabular data. | ||
|
||
* Formula notation for specifying models based on tabular data | ||
|
||
* `Formula` | ||
* `ModelFrame` | ||
* `ModelMatrix` | ||
|
||
* Contrast coding for categorical data | ||
|
||
* Abstract model types | ||
|
||
* `StatisticalModel` | ||
* `RegressionModel` | ||
|
||
Much of this package was formerly part | ||
of [`DataFrames`](https://www.github.com/JuliaStats/DataFrames.jl) | ||
and [`StatsBase`](https://www.github.com/JuliaStats/StatsBase.jl). |