Skip to content

chikeorah/My-Home-Energy

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Machine Learning Analysis Report for Energy Dataset

1. Executive Summary

This report presents the findings of a machine learning analysis conducted on an energy consumption dataset. The analysis aimed to predict energy readings based on various features. Multiple models were evaluated, with Random Forest and XGBoost achieving perfect R-squared scores. After optimization, XGBoost emerged as the best-performing model. The perfect fit and the dominance of the 'Type_Encoded' feature in importance are now better understood given the domain-specific relationship between energy type, price, and consumption.

2. Data Overview

2.1 Selected Features

The following features were selected for the final model:

  1. Duration
  2. Price(£)
  3. Type_Encoded (representing gas or electricity)
  4. Start_Month
  5. Start_Day

2.2 Target Variable

The target variable for prediction was the energy Reading.

2.3 Domain-Specific Feature Relationships

  • The 'Type_Encoded' feature represents whether the energy type is gas or electricity.
  • Electricity and gas have fixed, distinct prices.
  • The type of energy significantly affects both the price and the reading.

3. Model Performance

3.1 Model Comparison

The performance of different models before optimization:

  1. Linear Regression - MSE: 203039.5827, R2: 0.9992
  2. Ridge Regression - MSE: 309823.4066, R2: 0.9988
  3. Lasso Regression - MSE: 202158.4147, R2: 0.9992
  4. Random Forest - MSE: 7882.3646, R2: 1.0000
  5. XGBoost - MSE: 6290.0844, R2: 1.0000

Both Random Forest and XGBoost achieved perfect R-squared scores, with XGBoost having a slightly lower MSE.

3.2 Best Model: Optimized XGBoost

After optimization, XGBoost remained the best-performing model.

3.3 Model Parameters

The optimal parameters for the XGBoost model were:

  • learning_rate: 0.3
  • max_depth: 3
  • n_estimators: 50

These parameters suggest a relatively simple model with shallow trees and a high learning rate.

3.4 Final Performance Metrics

  • Mean Squared Error (MSE): 5936.5632
  • R-squared (R2): 1.0000

4. Feature Importance

The feature importance analysis for the best model (XGBoost) showed:

  • Type_Encoded: 1.0
  • Start_Month, Price, Duration, Start_Day: Approximately 0 (not displayed on the graph)

This distribution, while initially concerning, aligns with the domain-specific knowledge that the type of energy (gas or electricity) is a key determinant of both price and consumption.

5. Interpretation and Insights

5.1 Model Fit

The perfect R-squared scores for both Random Forest and XGBoost, along with very high scores for linear models, can be explained by:

  1. The strong deterministic relationship between energy type, price, and consumption.
  2. The fixed pricing for each energy type, creating clear, distinct patterns in the data.
  3. The ability of tree-based models to capture this categorical distinction perfectly.

5.2 Feature Importance

The dominance of 'Type_Encoded' in feature importance is now understood to be a reflection of the underlying data structure:

  1. Logical Importance: The type of energy (gas or electricity) is indeed the primary factor in determining both price and consumption.
  2. Price as a Proxy: Since prices are fixed for each type, the 'Type_Encoded' feature effectively encapsulates price information.
  3. Simplified Prediction Task: Given fixed prices for each type, determining the reading becomes a much simpler task once the type is known.

5.3 Model Behavior

The models, particularly XGBoost, have essentially learned to:

  1. Primarily use the 'Type_Encoded' feature to distinguish between gas and electricity.
  2. Implicitly use the fixed price information associated with each type.
  3. Make highly accurate predictions based on this clear categorical distinction.

6. Implications for Predictive Modeling

  1. Simplification of the Problem: The fixed pricing and clear distinction between gas and electricity have simplified the prediction task, explaining the exceptionally high performance across models.

  2. Potential for Overfitting: While the model's reliance on 'Type_Encoded' is justified, it may lead to poor generalization if energy prices change or if applied to regions with different pricing structures.

  3. Limited Utility of Other Features: The dominance of 'Type_Encoded' suggests that other features (Duration, Start_Month, Start_Day) are not contributing significantly to predictions in the current model.

  4. Predictive Power vs. Explanatory Insight: While the model achieves high accuracy, it may not provide much insight into factors affecting energy consumption beyond the type of energy used.

7. Recommendations

  1. Separate Models for Gas and Electricity:

    • Consider developing separate models for gas and electricity to capture nuances specific to each energy type.
  2. Feature Engineering:

    • Create interaction terms between 'Type_Encoded' and other features to capture type-specific temporal or duration effects.
    • Develop features that represent deviations from average consumption for each type.
  3. Incorporate Price Variability:

    • If possible, introduce historical price data or price variability to make the model more robust to potential future price changes.
  4. Time Series Analysis:

    • Explore time series models to capture temporal patterns in consumption for each energy type.
  5. Domain-Specific Metrics:

    • Develop evaluation metrics that are meaningful in the energy sector, such as prediction accuracy within specific consumption ranges.
  6. Scenario Testing:

    • Test the model's performance under hypothetical scenarios, such as price changes or introduction of new energy types.
  7. Explainable AI Techniques:

    • Employ techniques like SHAP (SHapley Additive exPlanations) values to provide more nuanced insights into feature contributions.

8. Conclusion

The machine learning models, particularly XGBoost, have demonstrated exceptional performance in predicting energy readings. This performance is now understood to be a result of the clear distinction between gas and electricity in terms of pricing and consumption patterns. While the model's reliance on the 'Type_Encoded' feature is justified, it also highlights the deterministic nature of the current prediction task.

To develop a more robust and insightful model, future work should focus on capturing more nuanced patterns within each energy type, preparing for potential changes in pricing structures, and deriving actionable insights beyond the basic gas-electricity distinction. This approach will ensure that the model remains valuable even as the energy landscape evolves.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published