Page dedicated to data exploratory analysis, preparation, cleaning, pre-processing / wrangling, generation, feature engineering and other related topics
The question to ask ourselves: Do we know our data...?
- Ethics / altruistic motives
- Datasets and sources of raw data
- Data Exploratory Analysis
- Data preparation
- Data Generation
- Feature Selection
- Feature Engineering
- Post model-creation analysis, ML interpretation/explainability
- Statistics
- Visualisation
- Common mistakes when training models (data related)
- Presentations
- Cheatsheets
- Course / books
- Best practices / rules / an unordered list of high level or low level guidelines
- Framework(s) / checklist(s)
- Notebooks
- Programs and Tools
- Databases
- References
- Credits
- Contributing
See Ethics / altruistic motives
- Boston Housing Dataset (archive contains unclean dataset) | Download
- Datasets for Data cleaning practise
- Cleaning up and combining data, a dataset for practice
- Datasets from various themes and domains: retail, government. Datasets with a good mix of incorrect, wrongly-input, missing data
- (Specific) Web Traffic Time Series Forecasting
- Quora: What are some dirty/untidy datasets to clean for data analysis? I have just finished datacamp's course on cleaning data using R. I want to practice
- Boston Housing Dataset (archive contains clean dataset) | Download
- Google Dataset Search
- Kaggle Datasets | Blogs: 1 2 3 4
- Carnegie Mellon University Datasets
- GeoPlatform Data.gov Search
- Data.gov - Data Catalog
- TidyTuesday projects on GitHub
- Enron Email Digest Dataset
- Mathematics Datasets
- Data.world datasets
- Microsoft Research Open Data
- Free Datasets recommended by r-directory
- UC Irvine Machine Learning Repository
- Google Research: A large-scale dataset of manually annotated audio events
- The Ultimate Python Seaborn Tutorial: Gotta Catch ‘Em All
- Exploratory Analysis
- Understand Your Machine Learning Data With Descriptive Statistics in Python
- Visualize Machine Learning Data in Python With Pandas
- 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset
- Exploring and Transforming H2O DataFrame in R and Python
- ML with H2O by Sudalai Rajkumar (slide 20 onwards)
- How to Use Statistics to Identify Outliers in Data
- Fundamentals of Data Visualization
- Helpful Python Code Snippets for Data Exploration in Pandas
- 5 Steps to correctly prepare your data for your machine learning model
- Introduction to Data Analysis and Cleaning presentation by Mark Bell
- Data cleaning
- Spend Less Time Cleaning Data with Machine Learning
- Helpful Python Code Snippets for Data Exploration in Pandas - lots of python snippets to select / clean / prepare
- Working with missing data
- Journal of Statistical Software - TidyData
- 5 Steps to correctly prepare your data for your machine learning model
- Introduction to Data Analysis and Cleaning presentation by Mark Bell
- Data Preprocessing vs. Data Wrangling in Machine Learning Projects
- Improve Model Accuracy with Data Pre-Processing
- 5 Steps to correctly prepare your data for your machine learning model
- Introduction to Data Analysis and Cleaning presentation by Mark Bell
- Pandas
- SQLAlchemy
- See discussion on how data cleaning/preprocessing went wrong resulting in poorly performing model
- Learning with Limited Labeled Data with Shioulin Sam
See Data Generation
- Feature selection with mutual information
- Forward Feature selection: Blog on Towards DS | Scikit learn
- What is dimensionality reduction? What is the difference between feature selection and extraction?
- Feature Engineering and Feature Selection
- Basic Feature Engineering With Time Series Data in Python
- Zillow Prize - EDA, Data Cleaning & Feature Engineering
- Feature-wise transformations
- tsfresh - tsfresh is used to extract characteristics from time series
- featuretools - an open source python framework for automated feature engineering
- 5 Steps to correctly prepare your data for your machine learning model
- scikit learn's SelectKBest
- mlbox's Feature selection
- Chi2 test: Feature selection: Quora | NLP Stanford Group | Learn for Master
- Feature engineering and Dimensionality reduction
- Seven Techniques for Data Dimensionality Reduction
- Feature Engineering and Feature Selection
- ML topics expanded by Chris Albon - look for topics: Feature Engineering • Feature Selection
- Yellowbrick - is a suite of visual diagnostic tools called “Visualizers” that extend the Scikit-Learn API to allow human steering of the model selection process
- Shap - A unified approach to explain the output of any machine learning model
- LIME
- 4 Python Libraries For Getting Better Model Interpretability
- Integrated Gradients: Axiomatic Attribution for Deep Networks | Paper
- Resources on GitHub on interpretability
- Awesome Machine Learning Interpretability - A Curated, but Probably Biased and Incomplete, List of Awesome Machine Learning Interpretability Resources
- DataRobot: Model Interpretability - What is Model Interpretability in Machine Learning?
- Model Interpretability with SHAP
- Interpreting bag of words models with SHAP
- Explain any machine learning model prediction - using SHAP
- Explain ML Models notebooks
- How to explain the prediction of a ML model
- Explaining complex machine learning models with LIME
- Hermeneutic Investigations: ML Interpretation - why?: Video | Slides by Dean Allsopp
- Explaining Explanations: An Overview of Interpretability of Machine Learning
- Explaining Black-Box Machine Learning Models
- Interpretable Machine Learning
- R Machine Learning Projects by Dr. Sunil Kumar Chinnamgari: Model interpretability
- Hands-on Machine Learning with R: Interpretable Machine Learning
- Tree SHAP
- Exact SHAP: A Unified Approach to Interpreting Model Predictions
- Integrated Gradients: Axiomatic Attribution for Deep Networks | GitHub
- Know Data Science
- Understand How to answer Why
- Learning with Explanations by Tim Rocktäschel
- Towards Explainable AI: Slides | Video | Book: A Concise Introduction to Machine Learning by Anita Faul
- Machine Learning Project End to End with Python Code (data science focussed)
- Python Project (Classification) :
- Part A: https://www.youtube.com/watch?v=p0snNMCbvN4&list=PLcQCwsZDEzFkQj3tOV2NDrjJ43iuNY5yC&index=8
- Part B: https://www.youtube.com/watch?v=j4IgXflsZtg&list=PLcQCwsZDEzFkQj3tOV2NDrjJ43iuNY5yC&index=9
- Part C: https://www.youtube.com/watch?v=kHZmFVDm0QQ&list=PLcQCwsZDEzFkQj3tOV2NDrjJ43iuNY5yC&index=10
- An Introduction To Statistical Learning with Applications in R
- Statistical Inferencing
- Understand Your Machine Learning Data With Descriptive Statistics in Python
- How to Use Statistics to Identify Outliers in Data
- Applying Physics functions
- Chapter 2 of Introduction to Statistical Learning
- Naked statistics | Book on Amazon | Naked statistics flash cards | Summary by Daniel Miessler
- Cartoon Guide to Statistics (Cartoon Guide Series)
- Journal of Statistical Software - TidyData
- Statistics courses at Coursera | Udemy | Udacity - search for "Statistics" | Harvard University: Statistics 110 | more videos on their YouTube channel | Stanford University
- 15 Statistical Hypothesis Tests in Python (Cheat Sheet)
- Statistics by Chris Albon - covering Frequentist topics
- For more, see Mathematics, Statistics, Probability & Probabilistic programming
See Visualisation
- Having a lot more training examples of one type of object than the other types
- Accidentally testing the neural network using images that were in the training set
- Training the neural network on data that is easier to recognize or more consistent than the real-world data it will be used to classify later on
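The first two mistakes above can be caught with simple checks before training. The following is a minimal sketch in plain Python, not taken from any of the linked resources: the function names `class_imbalance_ratio` and `leaked_examples` and the toy data are hypothetical and only serve as illustration. The third mistake (training data that is cleaner than real-world data) needs domain knowledge and is not covered here.

```python
from collections import Counter


def class_imbalance_ratio(labels):
    """Return (count of the most common class) / (count of the least common class)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def leaked_examples(train_ids, test_ids):
    """Return the identifiers that appear in both the training and the test set."""
    return set(train_ids) & set(test_ids)


# Toy data, invented for illustration only.
train_labels = ["cat"] * 90 + ["dog"] * 10
train_ids = [f"img_{i}" for i in range(100)]
test_ids = [f"img_{i}" for i in range(95, 120)]

ratio = class_imbalance_ratio(train_labels)
leaked = leaked_examples(train_ids, test_ids)

print(f"imbalance ratio: {ratio:.1f}")
print(f"test examples also in training set: {len(leaked)}")
```

A ratio far above 1 suggests looking at resampling or class weighting, and any non-empty overlap means the test score is likely to be optimistic.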
- A Rubric for ML Production Readiness - by Jiameng Gao from Applied Deep Learning Meetup in Feb 2019 (Paper: https://ai.google/research/pubs/pub46555)
- Introduction to Data Analysis and Cleaning presentation by Mark Bell
- Do we know our data, as good as we know our tools by Jeremie Charlet and Mani Sarkar
See under Cheatsheets
- Data Science Primer
- 27 Amazing Data Science Books Every Data Scientist Should Read
- Coursera course: Getting and Cleaning Data
- Data Science courses on Coursera
- Data courses on Udemy
- Data courses on Udacity
- Latest Machine learning, visualization, data mining techniques. Online Master's in Data Analytics from Penn State
- Data Science Handbook
- 12 Best Practices for Modern Data Ingestion
- A Rubric for ML Production Readiness - by Jiameng Gao from Applied Deep Learning Meetup in Feb 2019 (Paper: https://ai.google/research/pubs/pub46555)
- Rules of Machine Learning: Best Practices for ML Engineering
- Data Science Primer
- How to Prepare Data For Machine Learning
- What is Data Mining and KDD
- The KDD process for extracting useful knowledge from volumes of data
- Data Mining: Practical ML Tools and Techniques by Witten, Frank and Hall, 3rd edition
- Foundational Methodology for Data Science - IBM Analytics Whitepaper
- Coursera Data Science Methodology course
- From Problem to Approach and From Requirements to Collection
- Business Understanding
- Analytic Approach
- Data Requirements
- Data Collection
- From Understanding to Preparation and From Modeling to Evaluation
- Data Understanding
- Data Preparation
- Modeling
- Model Evaluation
- Python Data Science Handbook on Azure git repo
- Python for Data Analysis on Azure git repo
- Python Data Science Handbook
- House prices
- ML End-to-End Tutorial + Pandas notebook
- Regression example notebook
- Classification example notebook
- Some explanations of the above Regression & Classification examples: as a Notebook | as a PDF file
- Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command line tools
- Synthetic features and outliers notebook
- Do we know our data...
- Grakn and Graql - not just graphs, or graph database knowledge graphs | Docs | Quick start | GitHub
- See example in the ../examples/data/databases/graph/grakn folder
- Redis Graph | Blogs | Videos | Skillsmatter: how redis enterprise made redis highly available, scalable, durable and cloudnative
- Neo4j
- Gun: A realtime, decentralized, offline-first, mutable graph database engine
- Cayley: An open-source graph database
- Time-scale
- kdb+ - is a column-based relational time series database (TSDB) with in-memory (IMDB) abilities, developed and marketed by Kx Systems
- How to build a data science project from scratch
- Common mistakes when carrying out machine learning and data science
- A Rubric for ML Production Readiness - Breck et al. 2017 by Jiameng Gao (28 rules to follow, suggested by Google) | Original Paper by Google
- Understanding Data Science Problems - template of questions to ask
- eBook: How to Succeed in Data Science
Big thanks to Jeremie Charlet for his contributions to many of the resources on this page, and to everyone else who has helped build the above resources.
Contributions are very welcome; please share back with the wider community (and get credited for it)!
Please have a look at the CONTRIBUTING guidelines, and also read about our licensing policy.
Back to main page (table of contents)