MLTK is a collection of various supervised machine learning algorithms, which is designed for directly training models and further development. For questions or suggestions with the code, please email machinelearningtoolkit@gmail.com.
Currently MLTK supports:
- Generalized Linear Models
- Ridge
- Lasso
- Elastic Net
- Group Lasso
- Generalized Additive Models
- Regression Trees
- Tree Ensembles
- Random Forests
- Boosted Trees
- Additive Groves
Typical input to MLTK is a text file containing the data matrix. An optional attribute file may also be provided to specify the target attribute. Datasets should be provided in separate white-space-delimited text files without any headers. MLTK supports continuous, nominal and binned attributes. All dense datasets should have the same number and order of columns. The structure of the attribute description is the following:
attribute_name: type [(class)]
There are two types of binned attributes. One is specified using the number of bins, and the other is specified using number of bins, upper bounds and medians for each bin.
f1: cont
f2: {a, b, c}
f3: binned (256)
f4: binned (3;[1, 5, 6];[0.5, 2.5, 3])
label: cont (class)
0.1 1 2 0 5
-2.3 0 255 1 2
3.1 2 128 2 -3
5 1 0 1 0.2
0.1 1 37 0 0.1
MLTK uses the following structure for sparse input format:
target feature:value ... feature:value
Feature/value pairs must be ordered by increasing feature number. MLTK does not skip zero valued features. For classification problems, make sure the target is in {0, ..., K - 1}, where K is the number of classes.
0 1:0.2 3:0.5
0 2:-0.4
1 1:3.2 5:-3
MLTK provides two classes to perform reading/writing of datasets: mltk.core.io.InstancesReader
and mltk.core.io.InstancesWriter
.
Instances instances = InstancesReader.read(< attr file path >, < dataset file path >)
It reads a dataset from attribute file and data file. Attribute file can be null
. When attribute file is not provided, if the data file follows dense format, no class attribute will be assigned, if the data file follows sparse format, the class attribute will be nominal if all class values are integers, or continuous attribute otherwise.
Instances instances = InstancesReader.read(< dataset file path >, < class index >)
It reads a dense dataset from data file and a specified class index. A negative class index (e.g., -1) means no target is specified.
All learning algorithms in MLTK inherits mltk.predictor.Learner class
. To build models from command line, use the following command (using LassoLearner
as example):
$ java mltk.predictor.glm.LassoLearner
It should output a message like this:
Usage: LassoLearner
-t train set path
[-r] attribute file path
[-o] output model path
[-g] task between classification (c) and regression (r) (default: r)
[-m] maximum num of iterations (default: 0)
[-l] lambda (default: 0)
Using LogitBoostLearner
as example, the following code builds a boosted tree model:
Instances trainSet = InstancesReader.read(...)
LogitBoostLearner learner = new LogitBoostLearner();
learner.setLearningRate(0.1);
learner.setMaxNumIters(10000);
learner.setMaxNumLeaves(2);
learner.setVerbose(true);
BRT brt = learner.build(trainSet);
MLTK uses mltk.predictor.evaluation.Evaluator
class. To evaluate models from command line, use the following command:
$ java mltk.predictor.evaluation.Evaluator
It should output a message like this:
Usage: Evaluator
-d data set path
-m model path
[-r] attribute file path
[-e] AUC (a), Error (c), RMSE (r) (default: r)
Currently MLTK supports area-under-curve (AUC), classification error (Error) and root-mean-squared error (RMSE).
The following code builds a boosted tree model and evaluate the classification error on a held-out test set:
Instances trainSet = InstancesReader.read(...)
Instances testSet = InstancesReader.read(...)
LogitBoostLearner learner = new LogitBoostLearner();
learner.setLearningRate(0.1);
learner.setMaxNumIters(10000);
learner.setMaxNumLeaves(2);
learner.setVerbose(true);
BRT brt = learner.build(trainSet);
double error = Evaluator.evalError(brt, testSet);