Page dedicated to exploratory data analysis, data preparation, cleaning, pre-processing / wrangling, generation, feature engineering and other related topics
See Ethics / altruistic motives
- Google Dataset Search
- Kaggle Datasets | Blogs: 1 2 3 4
- Carnegie Mellon University Datasets
- GeoPlatform Data.gov Search
- Data.gov - Data Catalog
- TidyTuesday projects on GitHub
- Enron Email Digest Dataset
- Mathematics Dataset
- The Ultimate Python Seaborn Tutorial: Gotta Catch ‘Em All
- Exploratory Analysis
- Understand Your Machine Learning Data With Descriptive Statistics in Python
- Visualize Machine Learning Data in Python With Pandas
- 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset
- Exploring and Transforming H2O DataFrame in R and Python
- ML with H2O by Sudalai Rajkumar (slide 20 onwards)
- How to Use Statistics to Identify Outliers in Data
- Fundamentals of Data Visualization
- Helpful Python Code Snippets for Data Exploration in Pandas
- Data cleaning
- Spend Less Time Cleaning Data with Machine Learning
- Helpful Python Code Snippets for Data Exploration in Pandas - lots of python snippets to select / clean / prepare
- Working with missing data
- Journal of Statistical Software - TidyData
- Data Preprocessing vs. Data Wrangling in Machine Learning Projects
- Improve Model Accuracy with Data Pre-Processing
- See discussion on how data cleaning/preprocessing went wrong resulting in poorly performing model
- Synthetic data generation — a must-have skill for new data scientists
- How to Generate Test Datasets in Python with scikit-learn
- Python packages
- R packages
- Distributions generation package in R
- synthpop - an R package that generates data
- Another R package
- Set of R examples
- Very good examples of how R’s packages can be used to generate time series datasets, adjust correlations and visualise them
- Examples of how Scikit and R’s packages can be used to generate synthetic data
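As a minimal illustration of the idea behind the resources above, the sketch below generates a small two-class dataset by sampling each feature from a Gaussian centred on a per-class mean. It uses only the Python standard library rather than the scikit-learn or R packages listed above. The function name `generate_two_class_data`, the class centres, and the spread are assumptions made for this example.

```python
import random


def generate_two_class_data(n_per_class, centres, spread=1.0, seed=0):
    """Sample n_per_class 2D points around each class centre from a Gaussian."""
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    rows = []
    for label, (cx, cy) in enumerate(centres):
        for _ in range(n_per_class):
            x = rng.gauss(cx, spread)
            y = rng.gauss(cy, spread)
            rows.append((x, y, label))
    rng.shuffle(rows)  # mix the classes so row order carries no information
    return rows


# Hypothetical class centres, chosen only for illustration.
data = generate_two_class_data(n_per_class=50, centres=[(0.0, 0.0), (5.0, 5.0)])
print(len(data), data[0])
```

Each row is an `(x, y, label)` tuple. Moving the centres closer together or increasing `spread` makes the classes overlap more, which is a simple way to produce harder test datasets.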
Generate random data matching a rule or type (people’s names, phone numbers, financial data, etc.)
- Random database/dataframe generator
- Myriad Toolkit (Paper: http://vldb.org/pvldb/vol5/p1890_alexanderalexandrov_vldb2012.pdf) - focuses on how to generate massive amounts of data following a database schema (create data for your relational db with users, orders, etc)
- Generating Synthetic Data to Match Data Mining Patterns
- SMOTE with Imbalance Data
- imbalanced-learn library
- SO discussion on using Python libraries
- Simple example of how stock prices can be generated
- Building a simple Generative Adversarial Network (GAN) using TensorFlow
- GENERATIVE ADVERSARIAL NETWORKS (GANS) FOR TEXT USING WORD2VEC: PART 1
- GAN for 2D data generation
- Generative Adversarial Networks (GANs) for Discrete Data
- Boundary-seeking GANs: A new method for adversarial generation of discrete data
- Introductory guide to Generative Adversarial Networks (GANs) and their promise!
- Private Synthetic Data Generation via GANs (Supporting PDF)
- GANs (generative adversarial networks) possible for text as well?
- Basic Feature Engineering With Time Series Data in Python
- Zillow Prize - EDA, Data Cleaning & Feature Engineering
- Feature-wise transformations
- tsfresh - used to extract characteristics from time series
- featuretools - an open source python framework for automated feature engineering
- An Introduction To Statistical Learning with Applications in R
- Statistical Inferencing
- Applying Physics functions
- For more, see Mathematics, Statistics, Probability & Probabilistic programming
See Visualisation
- Having a lot more training examples of one type of object than the other types
- Accidentally testing the neural network using images that were in the training set
- Training the neural network on data that is easier to recognize or more consistent than the real-world data it will be used to classify later on
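The first two pitfalls above can be checked before training. This is a minimal sketch using only the Python standard library. The function names and the toy image identifiers are hypothetical, chosen only for illustration. The third pitfall (training data that is cleaner than real-world data) cannot be detected from the training set alone, so it is not covered here.

```python
from collections import Counter


def class_imbalance_ratio(labels):
    """Return the count of the most common class divided by the least common one."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def train_test_overlap(train_items, test_items):
    """Return the set of items that appear in both the training and the test set."""
    return set(train_items) & set(test_items)


# Toy data (hypothetical): image identifiers and their labels.
train_ids = ["img_001", "img_002", "img_003", "img_004", "img_005"]
train_labels = ["cat", "cat", "cat", "cat", "dog"]
test_ids = ["img_005", "img_006"]

ratio = class_imbalance_ratio(train_labels)
leaked = train_test_overlap(train_ids, test_ids)

print(f"imbalance ratio: {ratio:.1f}")
print(f"leaked test items: {sorted(leaked)}")
```

A ratio far above 1 suggests one class dominates the training set, and a non-empty overlap means test results will be optimistic because the model has already seen those examples.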
See under Cheatsheets
- Data Science Primer
- 27 Amazing Data Science Books Every Data Scientist Should Read
- Coursera course: Getting and Cleaning Data
- Data Science courses on Coursera
- Data courses on Udemy
- Data courses on Udacity
- Latest Machine learning, visualization, data mining techniques. Online Master's in Data Analytics from Penn State
- Data Science Handbook
- 12 Best Practices for Modern Data Ingestion
- A Rubric for ML Production Readiness - by Jiameng Gao from Applied Deep Learning Meetup in Feb 2019 (Paper: https://ai.google/research/pubs/pub46555)
- Rules of Machine Learning: Best Practices for ML Engineering
- Data Science Primer
- How to Prepare Data For Machine Learning
- What is Data Mining and KDD
- The KDD process for extracting useful knowledge from volumes of data
- Data Mining: Practical ML Tools and Techniques by Witten, Frank and Hall, 3rd edition
- Foundational Methodology for Data Science - IBM Analytics Whitepaper
- Coursera Data Science Methodology course
- From Problem to Approach and From Requirements to Collection
- Business Understanding
- Analytic Approach
- Data Requirements
- Data Collection
- From Understanding to Preparation and From Modeling to Evaluation
- Data Understanding
- Data Preparation
- Modeling
- Model Evaluation
- Python Data Science Handbook on Azure git repo
- Python for Data Analysis on Azure git repo
- Python Data Science Handbook
- House prices
- Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command line tools
- Synthetic features and outliers notebook
- Grakn and Graql - not just graphs, or graph database knowledge graphs | Docs | Quick start | GitHub
- See example in the ../examples/data/databases/graph/grakn folder
- Redis Graph | Blogs | Videos | Skillsmatter: how Redis Enterprise made Redis highly available, scalable, durable and cloud-native
- Neo4j
- Gun: A realtime, decentralized, offline-first, mutable graph database engine
- Cayley: An open-source graph database
- How to build a data science project from scratch
- Common mistakes when carrying out machine learning and data science
- A Rubric for ML Production Readiness - Breck et al. 2017 by Jiameng Gao (28 rules to follow, suggested by Google) | Original Paper by Google
Contributions are very welcome, please share back with the wider community (and get credited for it)!
Please have a look at the CONTRIBUTING guidelines, also have a read about our licensing policy.