Carlos Omar Pardo Gomez (cop2108@columbia.edu)
R Package. Bayesian nonparametric quantile regression, using Gaussian Processes to model the trend and Dirichlet Processes for the error. An MCMC algorithm works behind the scenes to fit the model.
Let y be a dependent random variable and x a vector of independent variables, so that P(y|x) is well defined. Let qp be the function that returns the p-quantile of a random variable. The model assumes that, for any probability p (between 0 and 1) at which you want to estimate the p-quantile of y|x, an observation y comes from the sum
y = fp(x) + epsilon,
where fp is the function that links x to y, and epsilon is a dispersion random variable whose p-quantile is 0. Then, since adding the constant fp(x) shifts every quantile of epsilon by fp(x), we get qp(y|x) = fp(x) + qp(epsilon) = fp(x). This package focuses on estimating fp.
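The identity above can be checked numerically. The sketch below uses an illustrative trend f(x) = sin(x) and a Gamma error shifted so that its p-quantile is exactly 0; none of these names are part of the package's API.

```r
# Numerical check of q_p(y|x) = f_p(x), with an illustrative trend and error
set.seed(1)
p <- 0.25
f <- function(x) sin(x)
# Shift a Gamma(2, 1) error so its p-quantile is exactly 0
epsilon <- rgamma(1e5, 2, 1) - qgamma(p, 2, 1)
x0 <- 1.3
y <- f(x0) + epsilon
# The empirical p-quantile of y given x0 should be close to f(x0)
unname(quantile(y, p))  # approximately sin(1.3)
f(x0)
```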
We address this problem with the GPDP model, described below. (To fully understand it, some previous knowledge of Gaussian Processes (Bagnell's lecture) and Dirichlet Processes (Teh 2010) is recommended, particularly the stick-breaking representation.)
Where:
- p is the probability for which you want to estimate the quantile.
- ALp is the Asymmetric Laplace Distribution (ALD), with density function given by f(y; mu, sigma, p) = (p(1 - p) / sigma) exp(-rho_p((y - mu) / sigma)), where rho_p(u) = u(p - I(u < 0)) is the check function.
- GP is a Gaussian process, with mean (m) and covariance (k) functions.
- GI is the Inverse-Gamma distribution.
- Multinf is the Multinomial distribution as the number of categories tends to infinity.
- GEM (for Griffiths, Engen and McCloskey) is a distribution used in the Dirichlet process literature, as described in Teh (2010).
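The ALD density mentioned above is simple to write down. The sketch below is an illustrative implementation (the function name `dald` is not part of the GPDPQuantReg API); its key property is that the p-quantile of ALD(mu, sigma, p) is mu, so the probability mass below mu is exactly p.

```r
# Illustrative ALD density: f(y) = p(1-p)/sigma * exp(-rho_p((y - mu)/sigma)),
# with check function rho_p(u) = u * (p - I(u < 0))
dald <- function(y, mu = 0, sigma = 1, p = 0.5) {
  u <- (y - mu) / sigma
  rho <- u * (p - (u < 0))
  (p * (1 - p) / sigma) * exp(-rho)
}

# Mass below mu = 0 is p, confirming mu is the p-quantile
integrate(dald, -Inf, 0, p = 0.25)$value  # approximately 0.25
```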
An MCMC algorithm, specifically a Gibbs sampler, is used to find fp's posterior distribution via simulation.
Since the theoretical model is nonparametric, it involves infinitely many parameters, particularly for the Dirichlet process. Because we cannot estimate and allocate infinitely many values, this package uses the slice sampling algorithm proposed by Kalli et al. to truncate their number dynamically. The results approximate those we would obtain with the full, untruncated model.
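The object being truncated is the stick-breaking (GEM) construction of the Dirichlet process weights. The sketch below shows a fixed truncation at N atoms, just to illustrate the construction; the slice sampler in the package chooses the number of atoms adaptively, and `alpha` and `N` here are illustrative values.

```r
# Stick-breaking construction of Dirichlet process weights, truncated at N
set.seed(1)
alpha <- 1   # concentration parameter (illustrative)
N <- 50      # fixed truncation level (the slice sampler adapts this)
v <- rbeta(N, 1, alpha)             # stick-breaking proportions v_k
w <- v * cumprod(c(1, 1 - v[-N]))   # weights w_k = v_k * prod_{j<k} (1 - v_j)
sum(w)  # close to 1; the leftover mass is prod(1 - v)
```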
First, you must install the package directly from GitHub. (I hope eventually you will be able to do it from CRAN.)
library(devtools)
install_github("opardo/GPDPQuantReg")
library(GPDPQuantReg)
Then, you can create some artificial data to start familiarizing yourself with the package's workflow. In this case, I'm using a complex trend function and a non-normal error, sampling ONLY 20 POINTS.
set.seed(201707)
f_x <- function(x) return(0.5 * x * cos(x) - exp(0.1 * x))
error <- function(m) rgamma(m, 2, 1)
m <- 20
x <- sort(sample(seq(-15, 15, 0.005), m))
sample_data <- data.frame(x = x, y = f_x(x) + error(m))
Now it's time to fit the model with an MCMC algorithm for a specific probability p.
GPDP_MCMC <- GPDPQuantReg(y ~ x, sample_data, p = 0.250)
Since it is a nonparametric model, the focus is on prediction, together with a credible interval.
predictive_data <- data.frame(x = seq(-15, 15, 0.25))
credibility <- 0.90
prediction <- predict(GPDP_MCMC, predictive_data, credibility)
And for a complex trend function, non-normal error and only 20 sampled points... we get AWESOME RESULTS!
Some diagnostics (ergodicity, autocorrelation, cross-correlation and trace plots) for the Markov chains are available too.
diagnose(GPDP_MCMC)
TODO