This page is dedicated to exploratory data analysis, data preparation, cleaning, pre-processing / wrangling, data generation, feature engineering and other related topics.
The question to ask ourselves: Do we know our data...?
- Ethics / altruistic motives
- Datasets and sources of raw data
- Data Exploratory Analysis
- Data preparation
- Data Generation
- Feature Engineering
- Statistics
- Visualisation
- Common mistakes when training models (data related)
- Cheatsheets
- Course / books
- Best practices / rules / an unordered list of high level or low level guidelines
- Framework(s) / checklist(s)
- Notebooks
- Programs and Tools
- Databases
- Credits
- Contributing
See Ethics / altruistic motives
- Boston Housing Dataset (archive contains unclean dataset) | Download
- Datasets for Data cleaning practise
- Cleaning up and combining data, a dataset for practice
- Datasets from various themes and domains: retail, government. Datasets with a good mix of incorrect, wrongly-input, missing data
- (Specific) Web Traffic Time Series Forecasting
- Quora: What are some dirty/untidy datasets to clean for data analysis? I have just finished datacamp's course on cleaning data using R. I want to practice
- Boston Housing Dataset (archive contains clean dataset) | Download
- Google Dataset Search
- Kaggle Datasets | Blogs: 1 2 3 4
- Carnegie Mellon University Datasets
- GeoPlatform Data.gov Search
- Data.gov - Data Catalog
- TidyTuesday projects on GitHub
- Enron Email Digest Dataset
- Mathematics Datasets
- Data.world datasets
- Microsoft Research Open Data
- Free Datasets recommended by r-directory
- UC Irvine Machine Learning Repository
- The Ultimate Python Seaborn Tutorial: Gotta Catch ‘Em All
- Exploratory Analysis
- Understand Your Machine Learning Data With Descriptive Statistics in Python
- Visualize Machine Learning Data in Python With Pandas
- 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset
- Exploring and Transforming H2O DataFrame in R and Python
- ML with H2O by Sudalai Rajkumar (slide 20 onwards)
- How to Use Statistics to Identify Outliers in Data
- Fundamentals of Data Visualization
- Helpful Python Code Snippets for Data Exploration in Pandas
- 5 Steps to correctly prepare your data for your machine learning model
- Introduction to Data Analysis and Cleaning presentation by Mark Bell
- Data cleaning
- Spend Less Time Cleaning Data with Machine Learning
- Helpful Python Code Snippets for Data Exploration in Pandas - lots of python snippets to select / clean / prepare
- Working with missing data
- Journal of Statistical Software - TidyData
- 5 Steps to correctly prepare your data for your machine learning model
- Introduction to Data Analysis and Cleaning presentation by Mark Bell
- Data Preprocessing vs. Data Wrangling in Machine Learning Projects
- Improve Model Accuracy with Data Pre-Processing
- 5 Steps to correctly prepare your data for your machine learning model
- Introduction to Data Analysis and Cleaning presentation by Mark Bell
- See discussion on how data cleaning/preprocessing went wrong resulting in poorly performing model
- Learning with Limited Labeled Data with Shioulin Sam
See Data Generation
- Basic Feature Engineering With Time Series Data in Python
- Zillow Prize - EDA, Data Cleaning & Feature Engineering
- Feature-wise transformations
- tsfresh - used to extract characteristics from time series
- featuretools - an open source python framework for automated feature engineering
- 5 Steps to correctly prepare your data for your machine learning model
- An Introduction To Statistical Learning with Applications in R
- Statistical Inferencing
- Understand Your Machine Learning Data With Descriptive Statistics in Python
- How to Use Statistics to Identify Outliers in Data
- Applying Physics functions
- Chapter 2 of Introduction to Statistical Learning
- Naked Statistics | Book on Amazon | Naked Statistics flash cards | Summary by Daniel Miessler
- Cartoon Guide to Statistics (Cartoon Guide Series)
- Journal of Statistical Software - TidyData
- Statistics courses at Coursera | Udemy | Udacity - search for "Statistics" | Harvard University: Statistics 110 | more videos on their YouTube channel | Stanford University - For more, see Mathematics, Statistics, Probability & Probabilistic programming
See Visualisation
- Having a lot more training examples of one type of object than the other types
- Accidentally testing the neural network using images that were in the training set
- Training the neural network on data that is easier to recognize or more consistent than the real-world data it will be used to classify later on
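The first two mistakes above can be checked before any training starts. The following is a minimal sketch using only the Python standard library; the helper names (`class_balance`, `imbalance_ratio`, `train_test_overlap`) and the toy data are hypothetical and do not come from any of the linked resources. It assumes every example has a label and a stable identifier (for instance a file name or content hash). The third mistake, training data that is cleaner than real-world data, is not covered here.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Ratio of the most common class count to the least common one.

    Assumes `labels` is non-empty.
    """
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def train_test_overlap(train_ids, test_ids):
    """Return identifiers that appear in both the training and the test set."""
    return set(train_ids) & set(test_ids)


# Hypothetical toy data: 90 "cat" and 10 "dog" examples, and identifiers
# where the last five training examples also appear in the test set.
train_labels = ["cat"] * 90 + ["dog"] * 10
train_ids = [f"img_{i}" for i in range(100)]
test_ids = [f"img_{i}" for i in range(95, 120)]

balance = class_balance(train_labels)
ratio = imbalance_ratio(train_labels)
leaked = train_test_overlap(train_ids, test_ids)

print("Class shares:", balance)
print("Imbalance ratio:", ratio)
print("Examples in both train and test:", sorted(leaked))
```

A high imbalance ratio or a non-empty overlap does not prove that a model will fail, but both are worth investigating before trusting any evaluation scores.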
See under Cheatsheets
- Data Science Primer
- 27 Amazing Data Science Books Every Data Scientist Should Read
- Coursera course: Getting and Cleaning Data
- Data Science courses on Coursera
- Data courses on Udemy
- Data courses on Udacity
- Latest Machine learning, visualization, data mining techniques. Online Master's in Data Analytics from Penn State
- Data Science Handbook
- 12 Best Practices for Modern Data Ingestion
- A Rubric for ML Production Readiness - by Jiameng Gao from Applied Deep Learning Meetup in Feb 2019 (Paper: https://ai.google/research/pubs/pub46555)
- Rules of Machine Learning: Best Practices for ML Engineering
- Data Science Primer
- How to Prepare Data For Machine Learning
- What is Data Mining and KDD
- The KDD process for extracting useful knowledge from volumes of data
- Data Mining: Practical ML Tools and Techniques by Witten, Frank and Hall, 3rd edition
- Foundational Methodology for Data Science - IBM Analytics Whitepaper
- Coursera Data Science Methodology course
- From Problem to Approach and From Requirements to Collection
- Business Understanding
- Analytic Approach
- Data Requirements
- Data Collection
- From Understanding to Preparation and From Modeling to Evaluation
- Data Understanding
- Data Preparation
- Modeling
- Model Evaluation
- Python Data Science Handbook on Azure git repo
- Python for Data Analysis on Azure git repo
- Python Data Science Handbook
- House prices
- ML End-to-End Tutorial + Pandas notebook
- Regression example notebook
- Classification example notebook
- Some explanations of the above Regression & Classification examples: as a Notebook | as a PDF file
- Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command line tools
- Synthetic features and outliers notebook
- Do we know our data...
- Grakn and Graql - not just graphs, or graph database knowledge graphs | Docs | Quick start | GitHub
- See example in the ../examples/data/databases/graph/grakn folder
- Redis Graph | Blogs | Videos | Skillsmatter: how redis enterprise made redis highly available, scalable, durable and cloudnative
- Neo4j
- Gun: A realtime, decentralized, offline-first, mutable graph database engine
- Cayley: An open-source graph database
- Time-scale
- kdb+ - a column-based relational time series database (TSDB) with in-memory (IMDB) abilities, developed and marketed by Kx Systems
- How to build a data science project from scratch
- Common mistakes when carrying out machine learning and data science
- A Rubric for ML Production Readiness - Breck et al. 2017 by Jiameng Gao (28 rules to follow, suggested by Google) | Original Paper by Google
Big thanks to Jeremie Charlet for his contributions to many of the resources on this page, and to everyone else who has helped build the resources above.
Contributions are very welcome. Please share back with the wider community (and get credited for it)!
Please have a look at the CONTRIBUTING guidelines, and also read about our licensing policy.
Back to main page (table of contents)