Document contrasts and formulas etc. (#6)

Documentation for functionality pulled from DataFrames: contrast coding, formulas, modelframe/matrix.
JuliaStats · Nov 20, 2016 · 2203992
1 parent 1c758ae
commit 2203992

Showing 10 changed files with 417 additions and 47 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,4 @@
 *.jl.cov
 *.jl.*.cov
 *.jl.mem
+docs/build
diff --git a/.travis.yml b/.travis.yml
@@ -13,6 +13,8 @@ script:
   - if [[ -a .git/shallow ]]; then git fetch --unshallow; fi
   - julia -e 'Pkg.clone(pwd()); Pkg.checkout("DataFrames", "dfk/statsmodel-purge"); Pkg.build("StatsModels"); Pkg.test("StatsModels"; coverage=true)'
 after_success:
+  # build and deploy documentation with Documenter.jl
+  - julia -e 'cd(Pkg.dir("StatsModels")); Pkg.add("Documenter"); include(joinpath("docs", "make.jl"))'
   # push coverage results to Coveralls
   - julia -e 'cd(Pkg.dir("StatsModels")); Pkg.add("Coverage"); using Coverage; Coveralls.submit(Coveralls.process_folder())'
   # push coverage results to Codecov

diff --git a/docs/make.jl b/docs/make.jl
@@ -0,0 +1,18 @@
+using Documenter, StatsModels
+
+makedocs(
+    format = :html,
+    sitename = "StatsModels.jl",
+    pages = [
+        "Introduction" => "index.md",
+        "Modeling tabular data" => "formula.md",
+        "Contrast coding categorical variables" => "contrasts.md"
+    ]
+)
+
+deploydocs(
+    repo = "github.com/JuliaStats/StatsModels.jl.git",
+    target = "build",
+    deps = nothing,
+    make = nothing
+)
diff --git a/docs/src/contrasts.md b/docs/src/contrasts.md
@@ -0,0 +1,103 @@
+```@meta
+CurrentModule = StatsModels
+```
+
+# Modeling categorical data
+
+To convert categorical data into a numerical representation suitable for
+modeling, `StatsModels` implements a variety of **contrast coding systems**.
+Each contrast coding system maps a categorical vector with $k$ levels onto
+$k-1$ linearly independent model matrix columns.
+
+The following contrast coding systems are implemented:
+
+* [`DummyCoding`](@ref)
+* [`EffectsCoding`](@ref)
+* [`HelmertCoding`](@ref)
+* [`ContrastsCoding`](@ref)
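+
+As a rough illustration (worked out from the textbook definitions of these
+codings rather than copied from package output), a variable with $k = 3$ levels
+would get the following $3 \times 2$ contrast matrices, assuming the first
+level is the base level.  Rows correspond to levels and columns to model
+matrix columns:
+
+```math
+\text{Dummy: }
+\begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}
+\qquad
+\text{Effects: }
+\begin{bmatrix} -1 & -1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}
+\qquad
+\text{Helmert: }
+\begin{bmatrix} -1 & -1 \\ 1 & -1 \\ 0 & 2 \end{bmatrix}
+```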
+
+## How to specify contrast coding
+
+The default contrast coding system is `DummyCoding`.  To override this, use
+the `contrasts` argument when constructing a `ModelFrame`:
+
+```julia
+mf = ModelFrame(y ~ 1 + x, df, contrasts = Dict(:x => EffectsCoding()))
+```
+
+To change the contrast coding for one or more variables in place, use
+
+```@docs
+setcontrasts!
+```
+
+## Interface
+
+```@docs
+AbstractContrasts
+ContrastsMatrix
+```
+
+## Contrast coding systems
+
+```@docs
+DummyCoding
+EffectsCoding
+HelmertCoding
+ContrastsCoding
+```
+
+### Special internal contrasts
+
+```@docs
+FullDummyCoding
+```
+
+## Further details
+
+### Categorical variables in `Formula`s
+
+Generating model matrices from multiple variables, some of which are
+categorical, requires special care.  The reason for this is that rank-$k-1$
+contrasts are appropriate for a categorical variable with $k$ levels when it
+*aliases* other terms, making it *partially redundant*.  Using rank-$k$ for such
+a redundant variable will generally result in a rank-deficient model matrix and
+a model that can't be identified.
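+
+A minimal sketch of this rank deficiency in plain Julia, without any
+`StatsModels` calls.  The data and the names `full`, `reduced`, `X_bad`, and
+`X_good` are made up for illustration, and `using LinearAlgebra` assumes
+Julia 1.x:
+
+```julia
+using LinearAlgebra  # provides `rank` on Julia 1.x (assumption about the Julia version)
+
+a = ["x", "y", "z", "x", "y", "z"]   # hypothetical categorical variable, k = 3 levels
+lvls = unique(a)
+
+# Rank-k coding: one indicator column per level.
+full = Float64[ai == l for ai in a, l in lvls]
+# Rank-(k-1) coding: drop the column of the first (base) level.
+reduced = full[:, 2:end]
+
+X_bad  = hcat(ones(length(a)), full)     # intercept + k columns
+X_good = hcat(ones(length(a)), reduced)  # intercept + k-1 columns
+
+(rank(X_bad), size(X_bad, 2))    # (3, 4): rank-deficient
+(rank(X_good), size(X_good, 2))  # (3, 3): full column rank
+```
+
+The indicator columns in `full` sum to the intercept column, so `X_bad` has
+four columns but only rank three.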
+
+A categorical variable in a term *aliases* the term that remains when that
+variable is dropped.  For example, with categorical `a`:
+
+* In `a`, the sole variable `a` aliases the intercept term `1`.
+* In `a&b`, the variable `a` aliases the main effect term `b`, and vice versa.
+* In `a&b&c`, the variable `a` aliases the interaction term `b&c` (regardless of
+  whether `b` and `c` are categorical).
+
+If a categorical variable aliases another term that is present elsewhere in the
+formula, we call that variable *redundant*.  A variable is *non-redundant* when
+the term that it aliases is _not_ present elsewhere in the formula.  For
+categorical `a`, `b`, and `c`:
+
+* In `y ~ 1 + a`, the `a` in the main effect of `a` aliases the intercept `1`,
+  which is present, so `a` is *redundant*.
+* In `y ~ 0 + a`, `a` does not alias any other terms and is *non-redundant*.
+* In `y ~ 1 + a + a&b`:
+    * The `b` in `a&b` is redundant because it aliases the main effect `a`:
+      dropping `b` from `a&b` leaves `a`.
+    * The `a` in `a&b` is *non-redundant* because it aliases `b`, which is not
+      present anywhere else in the formula.
+
+When constructing a `ModelFrame` from a `Formula`, each term is checked for
+non-redundant categorical variables.  Any such non-redundant variables are
+"promoted" to full rank in that term by using [`FullDummyCoding`](@ref) instead
+of the contrasts used elsewhere for that variable.
+
+One additional complexity is introduced by promoting non-redundant variables to
+full rank.  For the purpose of determining redundancy, a full-rank dummy coded
+categorical variable _implicitly_ introduces the term that it aliases into the
+formula.  Thus, in `y ~ 1 + a + a&b + b&c`:
+
+* In `a&b`, `a` aliases the main effect `b`, which is not explicitly present in
+  the formula.  This makes it non-redundant and so its contrast coding is
+  promoted to `FullDummyCoding`, which _implicitly_ introduces the main effect
+  of `b`.
+* Then, in `b&c`, the variable `c` is now _redundant_ because it aliases the main
+  effect of `b`, and so it keeps its original contrast coding system.
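+
+The promotion rule described above can be sketched in a few lines of plain
+Julia.  This is a simplified illustration, not the package's implementation:
+terms are modelled as sets of variable names, the intercept is the empty set,
+terms are assumed to be processed left to right against the terms seen so far,
+and the function name `promoted_variables` is hypothetical.
+
+```julia
+# Return the (variable, term) pairs that would be promoted to full rank.
+function promoted_variables(terms, categorical)
+    seen = Set{Set{Symbol}}()                  # terms present so far, explicit or implicit
+    promoted = Tuple{Symbol,Set{Symbol}}[]
+    for term in terms
+        for v in sort(collect(term))
+            v in categorical || continue
+            alias = setdiff(term, [v])         # the term left when `v` is dropped
+            if !(alias in seen)                # aliased term absent: `v` is non-redundant
+                push!(promoted, (v, term))
+                push!(seen, alias)             # full-rank coding implicitly adds the alias
+            end
+        end
+        push!(seen, term)
+    end
+    return promoted
+end
+
+# y ~ 1 + a + a&b + b&c, with a, b, and c all categorical
+terms = [Set{Symbol}(), Set([:a]), Set([:a, :b]), Set([:b, :c])]
+result = promoted_variables(terms, Set([:a, :b, :c]))
+# `a` is promoted in `a&b` and `b` is promoted in `b&c`; `c` in `b&c` is not.
+```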
diff --git a/docs/src/formula.md b/docs/src/formula.md
@@ -0,0 +1,89 @@
+```@meta
+CurrentModule = StatsModels
+DocTestSetup = quote
+    using StatsModels
+end
+```
+
+# Modeling tabular data
+
+Most statistical models require that data be represented as a `Matrix`-like
+collection of a single numeric type.  Much of the data we want to model,
+however, is **tabular data**, where data is represented as a collection of
+fields with possibly heterogeneous types.  One of the primary goals of
+`StatsModels` is to make it simpler to transform tabular data into matrix format
+suitable for statistical modeling.
+
+At the moment, "tabular data" means an `AbstractDataFrame`.  Ultimately, the
+goal is to support any tabular data format that adheres to a minimal API,
+**regardless of backend**.
+
+## The `Formula` type
+
+The basic conceptual tool for this is the `Formula`, which has a left side and a
+right side, separated by `~`:
+
+```jldoctest
+julia> y ~ 1 + a
+Formula: y ~ 1 + a
+```
+
+The left side of a formula conventionally represents *dependent* variables, and
+the right side *independent* variables (or regressors).  *Terms* are separated
+by `+`.  Basic terms are the integers `1` or `0`—evaluated as the presence or
+absence of a constant intercept term, respectively—and variables like `x`,
+which will evaluate to the data source column with that name as a symbol (`:x`).
+
+Individual variables can be combined into *interaction terms* with `&`, as in
+`a&b`, which will evaluate to the product of the columns named `:a` and `:b`.
+If either `a` or `b` is categorical, then the interaction term `a&b` generates
+the product of each pair of columns of `a` and `b`.
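+
+For categorical variables, the pairwise column products can be sketched in
+plain Julia as follows.  The matrices `A` and `B` are made-up contrast columns
+for two three-level variables, and the column order shown here is an
+assumption, not necessarily the order `StatsModels` uses:
+
+```julia
+# Hypothetical dummy-coded columns for two categorical variables (4 observations).
+A = [0.0 0.0; 1.0 0.0; 0.0 1.0; 1.0 0.0]
+B = [1.0 0.0; 0.0 1.0; 0.0 1.0; 1.0 0.0]
+
+# One interaction column per pair (i, j): the elementwise product of
+# column i of A and column j of B.
+cols = [A[:, i] .* B[:, j] for j in 1:size(B, 2) for i in 1:size(A, 2)]
+AB = reduce(hcat, cols)
+
+size(AB)  # (4, 4): 2 columns of A times 2 columns of B
+```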
+
+It's often convenient to include main effects and interactions for a number of
+variables.  The `*` operator does this, expanding in the following way:
+
+```jldoctest
+julia> Formula(StatsModels.Terms(y ~ 1 + a*b))
+Formula: y ~ 1 + a + b + a & b
+```
+
+(We trigger parsing of the formula using the internal `Terms` type to show how
+the `Formula` expands.)
+
+This applies to higher-order interactions, too: `a*b*c` expands to the main
+effects, all two-way interactions, and the three-way interaction `a&b&c`:
+
+```jldoctest
+julia> Formula(StatsModels.Terms(y ~ 1 + a*b*c))
+Formula: y ~ 1 + a + b + c + a & b + a & c + b & c + &(a,b,c)
+```
+
+Both the `*` and the `&` operators act like multiplication, and are distributive
+over addition:
+
+```jldoctest
+julia> Formula(StatsModels.Terms(y ~ 1 + (a+b) & c))
+Formula: y ~ 1 + c & a + c & b
+
+julia> Formula(StatsModels.Terms(y ~ 1 + (a+b) * c))
+Formula: y ~ 1 + a + b + c + c & a + c & b
+```
+
+## The `ModelFrame` and `ModelMatrix` types
+
+The main use of `Formula`s is for fitting statistical models based on tabular
+data.  From the user's perspective, this is done by `fit` methods that take a
+`Formula` and a `DataFrame` instead of numeric matrices.
+
+Internally, this is accomplished in three stages:
+
+1. The `Formula` is parsed into [`Terms`](@ref).
+2. The `Terms` and the data source are wrapped in a [`ModelFrame`](@ref).
+3. A numeric [`ModelMatrix`](@ref) is generated from the `ModelFrame` and passed to the
+   model's `fit` method.
+
+```@docs
+ModelFrame
+ModelMatrix
+Terms
+```
diff --git a/docs/src/index.md b/docs/src/index.md
@@ -0,0 +1,25 @@
+# StatsModels Documentation
+
+This package provides common abstractions and utilities for specifying, fitting,
+and evaluating statistical models.  The goal is to provide an API for package
+developers implementing different kinds of statistical models (see
+the [GLM](https://www.github.com/JuliaStats/GLM.jl) package
+for example), and utilities that are generally useful for both users and
+developers when dealing with statistical models and tabular data.
+
+* Formula notation for specifying models based on tabular data
+
+    * `Formula`
+    * `ModelFrame`
+    * `ModelMatrix`
+
+* Contrast coding for categorical data
+
+* Abstract model types
+
+    * `StatisticalModel`
+    * `RegressionModel`
+
+Much of this package was formerly part
+of [`DataFrames`](https://www.github.com/JuliaStats/DataFrames.jl)
+and [`StatsBase`](https://www.github.com/JuliaStats/StatsBase.jl).