
Commit

Document contrasts and formulas etc. (#6)
Documentation for functionality pulled from DataFrames: contrast coding, formulas, modelframe/matrix.
kleinschmidt authored Nov 20, 2016
1 parent 1c758ae commit 2203992
Showing 10 changed files with 417 additions and 47 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,3 +1,4 @@
*.jl.cov
*.jl.*.cov
*.jl.mem
docs/build
2 changes: 2 additions & 0 deletions .travis.yml
@@ -13,6 +13,8 @@ script:
- if [[ -a .git/shallow ]]; then git fetch --unshallow; fi
- julia -e 'Pkg.clone(pwd()); Pkg.checkout("DataFrames", "dfk/statsmodel-purge"); Pkg.build("StatsModels"); Pkg.test("StatsModels"; coverage=true)'
after_success:
# build and deploy documentation with Documenter.jl
- julia -e 'cd(Pkg.dir("StatsModels")); Pkg.add("Documenter"); include(joinpath("docs", "make.jl"))'
# push coverage results to Coveralls
- julia -e 'cd(Pkg.dir("StatsModels")); Pkg.add("Coverage"); using Coverage; Coveralls.submit(Coveralls.process_folder())'
# push coverage results to Codecov
18 changes: 18 additions & 0 deletions docs/make.jl
@@ -0,0 +1,18 @@
using Documenter, StatsModels

makedocs(
    format = :html,
    sitename = "StatsModels.jl",
    pages = [
        "Introduction" => "index.md",
        "Modeling tabular data" => "formula.md",
        "Contrast coding categorical variables" => "contrasts.md"
    ]
)

deploydocs(
    repo = "github.com/JuliaStats/StatsModels.jl.git",
    target = "build",
    deps = nothing,
    make = nothing
)
103 changes: 103 additions & 0 deletions docs/src/contrasts.md
@@ -0,0 +1,103 @@
```@meta
CurrentModule = StatsModels
```

# Modeling categorical data

To convert categorical data into a numerical representation suitable for
modeling, `StatsModels` implements a variety of **contrast coding systems**.
Each contrast coding system maps a categorical vector with $k$ levels onto
$k-1$ linearly independent model matrix columns.
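
To make the $k$ levels to $k-1$ columns mapping concrete, here is a minimal sketch in plain Julia. The helpers `dummy_contrasts`, `effects_contrasts`, and `expand` are hypothetical names invented for this illustration and are not part of the `StatsModels` API. The sketch also assumes the first level is the baseline and that levels are already encoded as integer codes.

```julia
# Hypothetical helpers for illustration only (not the StatsModels API).
# A contrast matrix has one row per level and k-1 columns.

# Dummy coding: the baseline level (assumed to be level 1) is all zeros,
# and every other level gets its own indicator column.
function dummy_contrasts(k::Integer)
    return [Float64(i == j + 1) for i in 1:k, j in 1:(k - 1)]
end

# Effects coding: same as dummy coding, except the baseline row is all -1,
# so each column sums to zero across levels.
function effects_contrasts(k::Integer)
    m = dummy_contrasts(k)
    m[1, :] .= -1.0
    return m
end

# Turn a vector of integer level codes into model matrix columns by
# looking up the contrast matrix row for each observation.
expand(codes::Vector{Int}, contrasts::Matrix{Float64}) = contrasts[codes, :]

codes = [1, 2, 3, 2, 1]                 # five observations, three levels
X = expand(codes, dummy_contrasts(3))   # 5×2 matrix
```

Each observation contributes the row of the contrast matrix that belongs to its level, so a three-level variable yields two model matrix columns whichever coding is chosen.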

The following contrast coding systems are implemented:

* [`DummyCoding`](@ref)
* [`EffectsCoding`](@ref)
* [`HelmertCoding`](@ref)
* [`ContrastsCoding`](@ref)

## How to specify contrast coding

The default contrast coding system is `DummyCoding`. To override this, use
the `contrasts` argument when constructing a `ModelFrame`:

```julia
mf = ModelFrame(y ~ 1 + x, df, contrasts = Dict(:x => EffectsCoding()))
```

To change the contrast coding for one or more variables in place, use

```@docs
setcontrasts!
```

## Interface

```@docs
AbstractContrasts
ContrastsMatrix
```

## Contrast coding systems

```@docs
DummyCoding
EffectsCoding
HelmertCoding
ContrastsCoding
```

### Special internal contrasts

```@docs
FullDummyCoding
```

## Further details

### Categorical variables in `Formula`s

Generating model matrices from multiple variables, some of which are
categorical, requires special care. The reason is that rank-$k-1$ contrasts
are appropriate for a categorical variable with $k$ levels when it *aliases*
other terms, making it *partially redundant*. Using rank-$k$ contrasts for such
a redundant variable will generally result in a rank-deficient model matrix and
a model that can't be identified.

A categorical variable in a term *aliases* the term that remains when that
variable is dropped. For example, with categorical `a`:

* In `a`, the sole variable `a` aliases the intercept term `1`.
* In `a&b`, the variable `a` aliases the main effect term `b`, and vice versa.
* In `a&b&c`, the variable `a` aliases the interaction term `b&c` (regardless of
  whether `b` and `c` are categorical).

If a categorical variable aliases another term that is present elsewhere in the
formula, we call that variable *redundant*. A variable is *non-redundant* when
the term that it aliases is _not_ present elsewhere in the formula. For
categorical `a`, `b`, and `c`:

* In `y ~ 1 + a`, the `a` in the main effect of `a` aliases the intercept `1`,
  which is present in the formula, so `a` is redundant.
* In `y ~ 0 + a`, `a` does not alias any other terms and is *non-redundant*.
* In `y ~ 1 + a + a&b`:
    * The `b` in `a&b` is redundant because it aliases the main effect `a`:
      dropping `b` from `a&b` leaves `a`.
    * The `a` in `a&b` is *non-redundant* because it aliases `b`, which is not
      present anywhere else in the formula.

When constructing a `ModelFrame` from a `Formula`, each term is checked for
non-redundant categorical variables. Any such non-redundant variables are
"promoted" to full rank in that term by using [`FullDummyCoding`](@ref) instead
of the contrasts used elsewhere for that variable.

One additional complexity is introduced by promoting non-redundant variables to
full rank. For the purpose of determining redundancy, a full-rank dummy coded
categorical variable _implicitly_ introduces the term that it aliases into the
formula. Thus, in `y ~ 1 + a + a&b + b&c`:

* In `a&b`, `a` aliases the main effect `b`, which is not explicitly present in
the formula. This makes it non-redundant and so its contrast coding is
promoted to `FullDummyCoding`, which _implicitly_ introduces the main effect
of `b`.
* Then, in `b&c`, the variable `c` is now _redundant_ because it aliases the main
effect of `b`, and so it keeps its original contrast coding system.
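
The redundancy rule above, including the implicit introduction of aliased terms, can be sketched in a few lines of plain Julia. This is an illustration, not the `StatsModels` implementation: the function `promoted_variables`, the representation of a term as a `Set{Symbol}` (with the intercept as the empty set), and the order in which variables within a term are visited are all assumptions.

```julia
# Hypothetical sketch of the redundancy check described above; it is not the
# StatsModels implementation. A term is a Set of variable names and the
# intercept is the empty set.
function promoted_variables(terms::Vector{Set{Symbol}}, categorical::Set{Symbol})
    seen = Set{Set{Symbol}}()                # terms present so far, explicitly or implicitly
    promoted = Tuple{Set{Symbol},Symbol}[]   # (term, variable) pairs promoted to full rank
    for term in terms
        # Assumption: variables within a term are visited in reverse sorted order.
        for v in sort(collect(term), rev = true)
            v in categorical || continue
            aliased = setdiff(term, [v])     # the term that remains when v is dropped
            if !(aliased in seen)
                push!(promoted, (term, v))   # non-redundant: promote to full rank
                push!(seen, aliased)         # promotion implicitly introduces the aliased term
            end
        end
        push!(seen, term)
    end
    return promoted
end

# y ~ 1 + a + a&b + b&c, with a, b, and c all categorical
terms = [Set{Symbol}(), Set([:a]), Set([:a, :b]), Set([:b, :c])]
promoted = promoted_variables(terms, Set([:a, :b, :c]))
```

Under these assumptions, the `a` in `a&b` is promoted while the `c` in `b&c` is not, because promoting `a` has already added the main effect `b` to the set of seen terms.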
89 changes: 89 additions & 0 deletions docs/src/formula.md
@@ -0,0 +1,89 @@
```@meta
CurrentModule = StatsModels
DocTestSetup = quote
    using StatsModels
end
```

# Modeling tabular data

Most statistical models require that data be represented as a `Matrix`-like
collection of a single numeric type. Much of the data we want to model,
however, is **tabular data**, where data is represented as a collection of
fields with possibly heterogeneous types. One of the primary goals of
`StatsModels` is to make it simpler to transform tabular data into a matrix
format suitable for statistical modeling.

At the moment, "tabular data" means an `AbstractDataFrame`. Ultimately, the
goal is to support any tabular data format that adheres to a minimal API,
**regardless of backend**.

## The `Formula` type

The basic conceptual tool for this is the `Formula`, which has a left side and a
right side, separated by `~`:

```jldoctest
julia> y ~ 1 + a
Formula: y ~ 1 + a
```

The left side of a formula conventionally represents *dependent* variables, and
the right side *independent* variables (or regressors). *Terms* are separated
by `+`. Basic terms are the integers `1` or `0`—evaluated as the presence or
absence of a constant intercept term, respectively—and variables like `x`,
which will evaluate to the data source column with that name as a symbol (`:x`).

Individual variables can be combined into *interaction terms* with `&`, as in
`a&b`, which will evaluate to the product of the columns named `:a` and `:b`.
If either `a` or `b` is categorical, then the interaction term `a&b` generates
the products of every pair of columns of `a` and `b`.
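
As a rough illustration of this pairwise product, the following plain-Julia sketch builds the interaction columns from two already-coded column blocks. The helper `interaction_columns` is a hypothetical name, the column ordering is an assumption, and both inputs are assumed to have at least one column.

```julia
# Hypothetical helper (not the StatsModels API): given the model matrix
# columns of `a` and of `b`, return every elementwise product of one column
# of `a` with one column of `b`.
function interaction_columns(A::AbstractMatrix, B::AbstractMatrix)
    size(A, 1) == size(B, 1) || throw(DimensionMismatch("row counts differ"))
    # Assumed ordering: columns of A vary fastest.
    cols = [A[:, i] .* B[:, j] for j in 1:size(B, 2) for i in 1:size(A, 2)]
    return reduce(hcat, cols)
end

# `a` is categorical with three levels, dummy coded into two columns;
# `b` is continuous and therefore a single column.
A = [1.0 0.0; 0.0 1.0; 0.0 0.0]
B = reshape([2.0, 3.0, 4.0], :, 1)
AB = interaction_columns(A, B)
```

A categorical variable coded into $m$ columns interacting with one coded into $n$ columns therefore contributes $m \times n$ columns, and a continuous variable counts as a single column.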

It's often convenient to include main effects and interactions for a number of
variables. The `*` operator does this, expanding in the following way:

```jldoctest
julia> Formula(StatsModels.Terms(y ~ 1 + a*b))
Formula: y ~ 1 + a + b + a & b
```

(We trigger parsing of the formula using the internal `Terms` type to show how
the `Formula` expands.)

This applies to higher-order interactions, too: `a*b*c` expands to the main
effects, all two-way interactions, and the three-way interaction `a&b&c`:

```jldoctest
julia> Formula(StatsModels.Terms(y ~ 1 + a*b*c))
Formula: y ~ 1 + a + b + c + a & b + a & c + b & c + &(a,b,c)
```
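
The expansion rule can be mimicked in plain Julia by enumerating every non-empty subset of the variables and ordering the subsets by size. The function `expand_star` below is a hypothetical illustration and not how `StatsModels` parses formulas.

```julia
# Hypothetical sketch (not the StatsModels parser): expand a*b*c*... into main
# effects and interactions, i.e. every non-empty subset of the variables.
function expand_star(vars::Vector{Symbol})
    n = length(vars)
    terms = Vector{Vector{Symbol}}()
    for mask in 1:(2^n - 1)
        # Bit i of `mask` decides whether vars[i] is part of this term.
        push!(terms, [vars[i] for i in 1:n if (mask >> (i - 1)) & 1 == 1])
    end
    # A stable sort by interaction order puts main effects first and keeps
    # the enumeration order among terms of the same order.
    return sort(terms, by = length, alg = MergeSort)
end

terms = expand_star([:a, :b, :c])
```

For three variables this yields $2^3 - 1 = 7$ terms: three main effects, three two-way interactions, and one three-way interaction.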

Both the `*` and the `&` operators act like multiplication, and are distributive
over addition:

```jldoctest
julia> Formula(StatsModels.Terms(y ~ 1 + (a+b) & c))
Formula: y ~ 1 + c & a + c & b
julia> Formula(StatsModels.Terms(y ~ 1 + (a+b) * c))
Formula: y ~ 1 + a + b + c + c & a + c & b
```

## The `ModelFrame` and `ModelMatrix` types

The main use of `Formula`s is for fitting statistical models based on tabular
data. From the user's perspective, this is done by `fit` methods that take a
`Formula` and a `DataFrame` instead of numeric matrices.

Internally, this is accomplished in three stages:

1. The `Formula` is parsed into [`Terms`](@ref).
2. The `Terms` and the data source are wrapped in a [`ModelFrame`](@ref).
3. A numeric [`ModelMatrix`](@ref) is generated from the `ModelFrame` and passed to the
model's `fit` method.

```@docs
ModelFrame
ModelMatrix
Terms
```
25 changes: 25 additions & 0 deletions docs/src/index.md
@@ -0,0 +1,25 @@
# StatsModels Documentation

This package provides common abstractions and utilities for specifying, fitting,
and evaluating statistical models. The goal is to provide an API for package
developers implementing different kinds of statistical models (see
the [GLM](https://www.github.com/JuliaStats/GLM.jl) package
for example), and utilities that are generally useful for both users and
developers when dealing with statistical models and tabular data.

* Formula notation for specifying models based on tabular data

    * `Formula`
    * `ModelFrame`
    * `ModelMatrix`

* Contrast coding for categorical data

* Abstract model types

    * `StatisticalModel`
    * `RegressionModel`

Much of this package was formerly part
of [`DataFrames`](https://www.github.com/JuliaStats/DataFrames.jl)
and [`StatsBase`](https://www.github.com/JuliaStats/StatsBase.jl).