DISCORD: ollstar41
A program that transpiles decision tree models built with the scikit-learn library into Leo code for the Aleo project.
- 🛠 Installation
- 🧠 Technical Solutions Implemented in This Project
- 🔄 Workflow of the Program
- 📝 Example of the Program's Work
- 📊 How Metrics Are Calculated
- 📈 Model Performance Results
- 🔒 License
- 📘 Theoretical Information on Decision Trees
- 📮 Contact Information
- Install Python 3.11
- Install the Leo compiler
- In the terminal, run `python3 -m pip install -r requirements.txt` to install all dependencies
- Start the program with `python3 py_transpiler/main.py`
The Leo language does not support functions with more than 16 arguments, yet some datasets contain far more than 16 features; the Digits dataset, for example, has 64. To address this, the following approach was used:
- First, the base-16 logarithm of the total number of features is calculated and rounded up.
- The resulting value is the number of levels into which the features must be grouped.
- If this value is greater than one, the function works recursively, creating several levels of groups, each containing 16 or fewer members.
This logic is also retained when running the program with the `leo run predict` command and when filling in the `.in` file.
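As a rough illustration (not the project's actual code), the grouping arithmetic can be sketched in Python; the helper names below are hypothetical:

```python
import math

MAX_ARGS = 16  # Leo's per-function argument limit

def grouping_levels(n_features: int) -> int:
    # Number of nested group levels needed so that no function
    # receives more than MAX_ARGS arguments.
    return max(1, math.ceil(math.log(n_features, MAX_ARGS)))

def chunk(items, size=MAX_ARGS):
    # Split a flat list of features (or groups) into chunks of at most `size`.
    return [items[i:i + size] for i in range(0, len(items), size)]

# The 64-feature Digits dataset needs two levels: 64 features -> 4 groups
# of 16, and those 4 groups fit into a single top-level call.
print(grouping_levels(64))           # 2
print(len(chunk(list(range(64)))))   # 4
```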
Another problem is the absence of floating-point numbers in the Leo language. To solve this, the following approach was used:
- After training on a dataset, a multiplier is chosen that will be applied to all values.
- The program generates code using the `i128` type for maximum precision, and the maximum multiplier is `2^64` to avoid overflowing the `i128` type on the test dataset.
This does not guarantee 100% accuracy for the model, but accuracy still remains at a very high level.
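For instance, the scaling step can be pictured like this (an illustrative sketch, not the transpiler's code; the multiplier value here is an arbitrary example):

```python
SCALE = 2 ** 32  # example multiplier; the program caps it at 2^64

def to_fixed(x: float) -> int:
    # Scale a float into an integer that the generated Leo code can store
    # in an i128; comparisons against tree thresholds remain valid because
    # features and thresholds are multiplied by the same constant.
    v = round(x * SCALE)
    assert -(2 ** 127) <= v < 2 ** 127, "value would overflow i128"
    return v

print(to_fixed(3.14159))
```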
- Immediately after launch, you are prompted to select one of the classification datasets available in the scikit-learn library:
- After selecting the data, the program asks for the private key of your Aleo wallet, which Leo needs in order to run.
- The model is then trained in Python and translated into Leo code.
- Next, the program asks whether you want to evaluate the model. If you choose yes, it runs the test dataset through the Leo program and compares the results with those produced by the Python model (see the sketch after this list).
- After checking the model, the program will display classification reports:
- The first compares the results of the Leo model with the true values.
- The second compares the Python model with the Leo model: if accuracy is 1, the models match.
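A minimal sketch of such an evaluation loop, assuming each test row has already been encoded into Leo-compatible arguments (the `inputs` value below is a placeholder, not the real encoding):

```python
import subprocess

def predict_with_leo(inputs: list[str]) -> str:
    # Run the generated Leo program for one test row and return its raw
    # output; parsing the predicted class out of stdout is omitted here.
    proc = subprocess.run(
        ["leo", "run", "predict", *inputs],
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout

# Placeholder arguments; the real values are grouped, scaled i128 inputs.
print(predict_with_leo(["1234i128", "5678i128"]))
```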
In this example, we consider the most technically interesting dataset, Digits, which contains 64 features and 10 classes.
- A dataset selection dialog appears; we choose the third dataset.
- A field for entering the Aleo wallet private key appears.
- Model training starts in Python (invisible to the user).
- A dialog appears asking whether to evaluate the model.
- We select yes, and the model is evaluated.
- Classification reports appear:
The following metrics are used in this project:
- Accuracy
- Precision
- Recall
- F1 score
First, the program runs the `leo run predict` command for each row in the test dataset, reads the response, and records it. The true values and the collected responses are then passed to the `sklearn.metrics.classification_report` function, which returns all the metrics that are printed to the console.
The same `classification_report` function is then given the Leo model's responses and the Python model's responses, and the resulting metrics are likewise printed to the console.
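The comparison boils down to two `classification_report` calls, roughly like this (the label lists below are made-up placeholders):

```python
from sklearn.metrics import classification_report

# Placeholder labels; in the program these come from the test split.
y_true = [0, 1, 2, 2, 1, 0]   # ground-truth classes
y_leo  = [0, 1, 2, 1, 1, 0]   # predictions collected from `leo run predict`
y_py   = [0, 1, 2, 1, 1, 0]   # predictions from the scikit-learn model

# Report 1: Leo model vs. the true labels.
print(classification_report(y_true, y_leo))

# Report 2: Python model vs. Leo model; accuracy 1.0 means the models match.
print(classification_report(y_py, y_leo))
```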
The following table shows the results of the decision tree in Leo for all of the datasets mentioned above.
| Dataset | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|
| Iris | 0.868 | 0.887 | 0.868 | 0.869 |
| Wine | 0.844 | 0.862 | 0.844 | 0.846 |
| Digits | 0.878 | 0.880 | 0.878 | 0.878 |
| Breast cancer | 0.930 | 0.929 | 0.930 | 0.930 |
The results are excellent, and every Leo model matches its Python counterpart, confirming the program's correctness. Note that the data was obtained by testing on a randomly selected 25% of the dataset.
If you want to verify the models yourself, please adapt the program code accordingly and open an Issue with your results, which will be added to this table.
This project is distributed under the MIT License. See `LICENSE` for more information. The MIT License is a permissive license that is short and to the point. It lets people do anything they want with your code as long as they provide attribution back to you and don’t hold you liable.
Decision Trees are a type of Supervised Machine Learning where the data is continuously split according to a certain parameter. The tree can be explained by two entities, namely decision nodes and leaves. The leaves are the decisions or the final outcomes. And the decision nodes are where the data is split.
Here's a simple way to understand the mechanism of a Decision Tree:
- Start at the tree's root node as the parent node.
- Select the best feature using Attribute Selection Measures (ASM) to split the records.
- Make that feature the decision node and break the dataset into smaller subsets.
- Build the tree by repeating this process recursively for each child until one of the following conditions is met:
- All the tuples belong to the same attribute value.
- There are no more remaining attributes.
- There are no more instances.
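To see decision nodes and leaves concretely, here is a small, self-contained scikit-learn example (not part of this project's code):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Train a shallow tree on Iris so the printed structure stays readable.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each "feature <= threshold" line is a decision node; each "class: ..."
# line is a leaf holding the final outcome.
print(export_text(tree))
```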
For any inquiries or issues with the software, please open an issue in the GitHub repository. Additionally, contributions are always welcome! If you would like to contribute, please fork the repository and use a feature branch. Pull requests are warmly welcome.