[Feature Request] Metrics for GP regression/classification #1857
🚀 Feature Request
It would be great to have a set of probabilistic/Bayesian metrics to evaluate GP regression/classification models.
Motivation
Scikit-learn has sklearn.metrics, which provides standard metrics for regression and classification models. However, conventional metrics such as RMSE or R^2 are insufficient for evaluating GP regression models (and Bayesian regression models in general): they cannot take the predictive variance, let alone the entire posterior distribution, into account. The GP community has instead been using metrics such as NLPD (Negative Log Predictive Density) and MSLL (Mean Standardized Log Loss) (please see this issue for references supporting this argument). It would therefore be a good idea to standardize these metrics and ship them in a mature GP library such as GPyTorch. They would also help newcomers to the community who currently struggle to find, compute, compare, and benchmark these widely used metrics.
Pitch
Describe the solution you'd like
Something like the following would be very helpful to the community:
```python
from gpytorch.metrics import neg_log_predictive_density, mean_standardized_log_loss

nlpd = neg_log_predictive_density(model, likelihood, test_x, test_y)
msll = mean_standardized_log_loss(model, likelihood, test_x, test_y)
```
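Until such helpers exist, here is a minimal sketch of what `neg_log_predictive_density` could look like on top of the current API; the function body, the per-point normalization, and the use of `fast_pred_var` are my assumptions, not an existing GPyTorch interface:

```python
import torch
import gpytorch

def neg_log_predictive_density(model, likelihood, test_x, test_y):
    # Hypothetical implementation: average negative log density of the
    # held-out targets under the predictive distribution p(y* | x*, D).
    model.eval()
    likelihood.eval()
    with torch.no_grad(), gpytorch.settings.fast_pred_var():
        pred_dist = likelihood(model(test_x))  # predictive distribution over y*
        # log_prob is the joint log density of test_y; dividing by the number
        # of test points gives a per-point value that is comparable across
        # test sets of different sizes.
        return -pred_dist.log_prob(test_y) / test_y.shape[-1]
```

Lower is better here: a small NLPD means the model assigns high probability to the held-out data, so it penalizes both over- and under-confident predictive variances, which RMSE cannot do.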
A list of metrics I am currently aware of (definitions are sketched after the list):
- NLPD (Negative Log Predictive Density)
- MSLL (Mean Standardized Log Loss)
- CE (Coverage Error): Absolute difference between the nominal coverage of an x% confidence interval and the fraction of ground-truth samples that actually fall within that interval.
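For concreteness, the definitions I have in mind (NLPD and MSLL following Rasmussen & Williams, Gaussian Processes for Machine Learning, 2006, §2.5; the CE formula simply restates the description above, and the notation is mine):

```latex
% Negative log predictive density over n test points (x_i, y_i):
\mathrm{NLPD} = -\frac{1}{n} \sum_{i=1}^{n} \log p(y_i \mid x_i, \mathcal{D})

% MSLL standardizes the per-point log loss against a trivial Gaussian
% fitted to the training targets (mean \bar{y}, variance s_y^2):
\mathrm{MSLL} = \frac{1}{n} \sum_{i=1}^{n}
    \left[ -\log p(y_i \mid x_i, \mathcal{D})
           + \log \mathcal{N}(y_i \mid \bar{y}, s_y^2) \right]

% Coverage error at nominal level \alpha (e.g. \alpha = 0.95):
\mathrm{CE}(\alpha) = \left| \alpha
    - \frac{1}{n} \sum_{i=1}^{n}
      \mathbf{1}\left[ y_i \in \mathrm{CI}_\alpha(x_i) \right] \right|
```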
Describe alternatives you've considered
Manually calculating these metrics, which is error-prone and time-consuming.
Are you willing to open a pull request? (We LOVE contributions!!!)
Definitely!
Additional context
Please see this issue