Page dedicated to exploratory data analysis, preparation, cleaning, pre-processing / wrangling, generation, feature engineering and other related topics
The question to ask ourselves: Do we know our data...?
See Ethics / altruistic motives
- Datasets for data cleaning practice
- Cleaning up and combining data, a dataset for practice
- (Specific) Web Traffic Time Series Forecasting
- Quora: What are some dirty/untidy datasets to clean for data analysis? I have just finished datacamp's course on cleaning data using R. I want to practice
- Google Dataset Search
- Kaggle Datasets | Blogs: 1 2 3 4
- Carnegie Mellon University Datasets
- GeoPlatform Data.gov Search
- Data.gov - Data Catalog
- TidyTuesday projects on GitHub
- Enron Email Digest Dataset
- Mathematics Datasets
- Data.world datasets
- Microsoft Research Open Data
- Free Datasets recommended by r-directory
- UC Irvine Machine Learning Repository
- The Ultimate Python Seaborn Tutorial: Gotta Catch ‘Em All
- Exploratory Analysis
- Understand Your Machine Learning Data With Descriptive Statistics in Python
- Visualize Machine Learning Data in Python With Pandas
- 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset
- Exploring and Transforming H2O DataFrame in R and Python
- ML with H2O by Sudalai Rajkumar (slide 20 onwards)
- How to Use Statistics to Identify Outliers in Data
- Fundamentals of Data Visualization
- Helpful Python Code Snippets for Data Exploration in Pandas
- 5 Steps to correctly prepare your data for your machine learning model
- Introduction to Data Analysis and Cleaning presentation by Mark Bell
- Data cleaning
- Spend Less Time Cleaning Data with Machine Learning
- Helpful Python Code Snippets for Data Exploration in Pandas - lots of python snippets to select / clean / prepare
- Working with missing data
- Journal of Statistical Software - TidyData
- 5 Steps to correctly prepare your data for your machine learning model
- Introduction to Data Analysis and Cleaning presentation by Mark Bell
- Data Preprocessing vs. Data Wrangling in Machine Learning Projects
- Improve Model Accuracy with Data Pre-Processing
- 5 Steps to correctly prepare your data for your machine learning model
- Introduction to Data Analysis and Cleaning presentation by Mark Bell
- See discussion on how data cleaning/preprocessing went wrong resulting in poorly performing model
- Synthetic data generation — a must-have skill for new data scientists
- How to Generate Test Datasets in Python with scikit-learn
- Python packages
- R packages
- Distributions generation package in R
- synthpop - an R package that generates data
- Another R package
- Set of R examples
- Very good examples of how R packages can be used to generate datasets and time series, adjust their correlations and visualise them
- Examples of how scikit-learn and R packages can be used to generate synthetic data
Generate random data matching a rule or type (people's names, phone numbers, financial data, etc.)
- Random database/dataframe generator
- Myriad Toolkit (Paper: http://vldb.org/pvldb/vol5/p1890_alexanderalexandrov_vldb2012.pdf) - focuses on generating massive amounts of data that follow a database schema (for example, creating data for a relational database with users, orders, etc.)
- Generating Synthetic Data to Match Data Mining Patterns
- SMOTE with Imbalance Data
- imbalanced-learn library
- SO discussion on using Python libraries
- Simple example of how stock prices can be generated
- Building a simple Generative Adversarial Network (GAN) using TensorFlow
- GENERATIVE ADVERSARIAL NETWORKS (GANS) FOR TEXT USING WORD2VEC: PART 1
- GAN for 2D data generation
- Generative Adversarial Networks (GANs) for Discrete Data
- Boundary-seeking GANs: A new method for adversarial generation of discrete data
- Introductory guide to Generative Adversarial Networks (GANs) and their promise!
- Private Synthetic Data Generation via GANs (Supporting PDF)
- GANs (generative adversarial networks) possible for text as well?
- Basic Feature Engineering With Time Series Data in Python
- Zillow Prize - EDA, Data Cleaning & Feature Engineering
- Feature-wise transformations
- tsfresh - used to extract characteristics from time series
- featuretools - an open source python framework for automated feature engineering
- 5 Steps to correctly prepare your data for your machine learning model
- An Introduction To Statistical Learning with Applications in R
- Statistical Inferencing
- Applying Physics functions
- Chapter 2 of Introduction to Statistical Learning
- For more, see Mathematics, Statistics, Probability & Probabilistic programming
See Visualisation
- Having a lot more training examples of one type of object than the other types
- Accidentally testing the neural network using images that were in the training set
- Training the neural network on data that is easier to recognize or more consistent than the real-world data it will be used to classify later on
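The first two pitfalls in the list above can be checked mechanically before training. The following is a minimal sketch, not taken from any of the linked resources: the function names and the toy file names are hypothetical, and the leak check only catches exact duplicates (near-duplicate images would need hashing or similarity search). The third pitfall, training data that is cleaner than real-world data, is not covered here.

```python
from collections import Counter


def class_ratios(labels):
    """Return each class's share of the dataset as a fraction of all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Ratio of the most common class count to the least common class count."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


def leaked_items(train_items, test_items):
    """Return test items that also appear in the training set (exact matches only)."""
    train_set = set(train_items)
    return [item for item in test_items if item in train_set]


# Hypothetical toy data: file names stand in for images
train_files = ["cat_01.png", "cat_02.png", "cat_03.png", "dog_01.png"]
train_labels = ["cat", "cat", "cat", "dog"]
test_files = ["cat_03.png", "dog_02.png"]

print(class_ratios(train_labels))            # share of each class
print(imbalance_ratio(train_labels))         # how lopsided the classes are
print(leaked_items(train_files, test_files)) # test items seen during training
```

A high imbalance ratio suggests looking at resampling (for example the SMOTE and imbalanced-learn links listed earlier), and any leaked item should be removed from either the training or the test set before evaluation.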
See under Cheatsheets
- Data Science Primer
- 27 Amazing Data Science Books Every Data Scientist Should Read
- Coursera course: Getting and Cleaning Data
- Data Science courses on Coursera
- Data courses on Udemy
- Data courses on Udacity
- Latest Machine learning, visualization, data mining techniques. Online Master's in Data Analytics from Penn State
- Data Science Handbook
- 12 Best Practices for Modern Data Ingestion
- A Rubric for ML Production Readiness - by Jiameng Gao from Applied Deep Learning Meetup in Feb 2019 (Paper: https://ai.google/research/pubs/pub46555)
- Rules of Machine Learning: Best Practices for ML Engineering
- Data Science Primer
- How to Prepare Data For Machine Learning
- What is Data Mining and KDD
- The KDD process for extracting useful knowledge from volumes of data
- Data Mining: Practical ML Tools and Techniques by Witten, Frank and Hall, 3rd edition
- Foundational Methodology for Data Science - IBM Analytics Whitepaper
- Coursera Data Science Methodology course
- From Problem to Approach and From Requirements to Collection
- Business Understanding
- Analytic Approach
- Data Requirements
- Data Collection
- From Understanding to Preparation and From Modeling to Evaluation
- Data Understanding
- Data Preparation
- Modeling
- Model Evaluation
- Python Data Science Handbook on Azure git repo
- Python for Data Analysis on Azure git repo
- Python Data Science Handbook
- House prices
- ML End-to-End Tutorial + Pandas notebook
- Regression example notebook
- Classification example notebook
- Some explanations of the above Regression & Classification examples: as a Notebook | as a PDF file
- Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command line tools
- Synthetic features and outliers notebook
- Do we know our data...
- Grakn and Graql - not just graphs or a graph database, but knowledge graphs | Docs | Quick start | GitHub
- See the example in the ../examples/data/databases/graph/grakn folder
- Redis Graph | Blogs | Videos | Skills Matter: How Redis Enterprise made Redis highly available, scalable, durable and cloud-native
- Neo4j
- Gun: A realtime, decentralized, offline-first, mutable graph database engine
- Cayley: An open-source graph database
- Time-scale
- kdb+ - a column-based relational time series database (TSDB) with in-memory (IMDB) abilities, developed and marketed by Kx Systems
- How to build a data science project from scratch
- Common mistakes when carrying out machine learning and data science
- A Rubric for ML Production Readiness - Breck et al. 2017 by Jiameng Gao (28 rules to follow, suggested by Google) | Original Paper by Google
Big thanks to Jeremie Charlet for his contributions to many of the resources on this page, and to everyone else who has helped build the above resources.
Contributions are very welcome. Please share back with the wider community (and get credited for it)!
Please have a look at the CONTRIBUTING guidelines and read our licensing policy.
Back to main page (table of contents)