Advancing Cybersecurity with AI: This project fortifies phishing defense using cutting-edge models, trained on a diverse dataset of 737,000 URLs. It was the final project for the AI for Cybersecurity course in my Master's at uOttawa in 2023.
- Required libraries: scikit-learn, pandas, matplotlib.
- Execute cells in a Jupyter Notebook environment.
- The uploaded code has been executed and tested successfully within the Google Colab environment.
The task is to classify each URL as Phishing or Benign.
The independent variables in the provided dataset can be categorized into three groups:
- Length and Count Features: These include measures related to the length and count of different components in a URL, such as domain length, URL length, and counts of digits, letters, path components, and various symbols.
- Boolean Features: These are binary indicators representing the presence or absence of certain characteristics in a URL, such as whether it contains an IP address (ip), has redirection (redirection), uses IPv notation (ipv), is a shortened URL (short), is encoded (is_encoded), or has a suspicious top-level domain (sus).
- Calculation-Based Features: These involve values computed from the URL, including a malicious probability score (malicious_probability), the entropy of its characters (entropy), and the ratio of special characters and digits to total characters (ratio).
- A 'Label' column indicating the classification into two classes: 1 (Phishing) / 0 (Benign).
Data Concatenation:
- Concatenated the three source DataFrames vertically into a single DataFrame.
- PhishStorm-URL dataset: 96,011 rows.
- ISCX-URL2016 dataset: extracted only the Phishing / Legitimate rows from 165,366 total rows.
- Malicious URL dataset: 651,191 rows.
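The vertical concatenation step can be sketched with pandas as below; the DataFrame names and the two-column layout are assumptions for illustration, not the project's actual variable names.

```python
import pandas as pd

# Hypothetical per-source DataFrames, each already reduced to ['url', 'label']
phishstorm = pd.DataFrame({"url": ["http://a.example/login"], "label": [1]})
iscx = pd.DataFrame({"url": ["https://uottawa.ca"], "label": [0]})
malicious = pd.DataFrame({"url": ["http://b.example/verify"], "label": [1]})

# Stack the sources vertically; ignore_index renumbers rows 0..n-1
data = pd.concat([phishstorm, iscx, malicious], ignore_index=True)
```

`ignore_index=True` avoids duplicate index labels that would otherwise carry over from each source frame.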
Feature Extraction:
- Defined a function for extracting features from URLs.
- Extracted various features such as domain, path, first directory length, presence of IP address, URL length, etc.
- Calculated counts and frequencies of characters, entropy, URL decoding, and presence of unusual characters.
- Checked for URL shortening, special characters, and suspicious top-level domains.
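A minimal sketch of such an extraction function is shown below. It covers only a subset of the listed features (lengths, counts, Shannon entropy, and the special-character/digit ratio); the function and key names are assumptions, not the project's actual code.

```python
import math
from collections import Counter
from urllib.parse import urlparse

def extract_features(url: str) -> dict:
    # Illustrative subset of the described features; names are assumed
    parsed = urlparse(url if "://" in url else "http://" + url)
    domain, path = parsed.netloc, parsed.path
    total = len(url)
    counts = Counter(url)
    # Shannon entropy of the URL's character distribution
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "url_length": total,
        "domain_length": len(domain),
        "digit_count": sum(ch.isdigit() for ch in url),
        "letter_count": sum(ch.isalpha() for ch in url),
        "path_components": len([p for p in path.split("/") if p]),
        "entropy": round(entropy, 4),
        # Ratio of special characters and digits to total characters
        "ratio": sum((not ch.isalnum()) or ch.isdigit() for ch in url) / total,
    }
```

Applying this row-wise (e.g. `df["url"].apply(extract_features)`) yields one feature dictionary per URL, which pandas can expand into columns.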
Exploratory Data Analysis (EDA):
Feature Engineering and Data Cleaning:
Feature Selection: Developed a function to identify features with overwhelmingly repeated values.
- Evaluated the percentage of occurrences of the most frequent value in each feature.
- Applied a 90% repetition threshold: features whose most frequent value covered more than 90% of rows were removed as near-constant and uninformative.
- Improved model efficiency and computational performance by reducing redundancy in the dataset.
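The 90%-threshold filter described above might look like the following sketch; the function name and signature are assumptions.

```python
import pandas as pd

def drop_near_constant(df: pd.DataFrame, threshold: float = 0.90) -> pd.DataFrame:
    # Keep a column only if its most frequent value covers <= threshold of rows;
    # value_counts(normalize=True) is sorted descending, so .iloc[0] is the top share
    keep = [
        col for col in df.columns
        if df[col].value_counts(normalize=True).iloc[0] <= threshold
    ]
    return df[keep]
```

For example, a column that is 95% zeros would be dropped, while a column of distinct values would be kept.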
Modeling:
- Model Training: Trained various classification models (Logistic Regression, SVM, Decision Tree, Random Forest, XGBoost, etc.) using LazyClassifier.
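LazyClassifier automates fitting and ranking many models at once; a manual scikit-learn loop equivalent in spirit is sketched below, using synthetic stand-in data rather than the project's URL features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Stand-in data; in the project, X would be the extracted URL features
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit each candidate model and record its held-out accuracy
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
}
scores = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
```

Beyond accuracy, per-model confusion matrices are what make the later true-positive/true-negative comparison possible.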
Stacking and Voting Classifiers:
- Selected the two models with the highest true positive rates and the two with the highest true negative rates.
- Combined each pair of top models using a stacking classifier to create two ensemble models.
- Applied soft voting to the predictions of the two ensemble models.
- Re-evaluated the final integrated model on the test set and compared its performance to the best traditional model.
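The stacking-plus-voting pipeline could be sketched with scikit-learn as below. The specific base models in each pair are assumptions for illustration; in the project they would be the top-TPR and top-TNR models found in the comparison step.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Hypothetical pair of top true-positive-rate models, stacked
stack_tpr = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("dt", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression())

# Hypothetical pair of top true-negative-rate models, stacked
stack_tnr = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression())

# Soft voting averages the two ensembles' predicted class probabilities
final_model = VotingClassifier(
    estimators=[("s1", stack_tpr), ("s2", stack_tnr)], voting="soft")
final_model.fit(X, y)
preds = final_model.predict(X)
```

Soft voting requires every estimator to expose `predict_proba`, which both stacking ensembles do here because their final estimator is a logistic regression.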
Champion Model: