Page dedicated to exploratory data analysis, data preparation, cleaning, pre-processing / wrangling, generation, feature engineering and other related topics
- The Ultimate Python Seaborn Tutorial: Gotta Catch ‘Em All
- Exploratory Analysis
- Understand Your Machine Learning Data With Descriptive Statistics in Python
- Visualize Machine Learning Data in Python With Pandas
- 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset
- Exploring and Transforming H2O DataFrame in R and Python
- ML with H2O by Sudalai Rajkumar (slide 20 onwards)
- How to Use Statistics to Identify Outliers in Data
- Fundamentals of Data Visualization
- Helpful Python Code Snippets for Data Exploration in Pandas
- Data cleaning
- Spend Less Time Cleaning Data with Machine Learning
- Helpful Python Code Snippets for Data Exploration in Pandas - lots of Python snippets to select / clean / prepare data
- Working with missing data
- Journal of Statistical Software - TidyData
- Data Preprocessing vs. Data Wrangling in Machine Learning Projects
- Improve Model Accuracy with Data Pre-Processing
- See discussion on how data cleaning/preprocessing went wrong resulting in poorly performing model
- Synthetic data generation — a must-have skill for new data scientists
- How to Generate Test Datasets in Python with scikit-learn
- Python packages
- R packages
- Distributions generation package in R
- synthpop - an R package that generates data
- Another R package
- Set of R examples
- Very good examples of how R’s packages can be used to generate time series datasets, adjust correlations and visualise them
- Examples of how Scikit and R’s packages can be used to generate synthetic data
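
The resources above cover library-based generators in depth. As a minimal illustration of the underlying idea, the sketch below draws labelled 2D points from Gaussian clusters using only the Python standard library. The function name `make_blobs_2d`, its parameters and the chosen cluster centers are assumptions made for this example and do not come from any of the linked packages.

```python
import random


def make_blobs_2d(n_per_class, centers, spread=1.0, seed=0):
    """Return (points, labels) drawn from one Gaussian cluster per center.

    Hypothetical helper for illustration only: each class gets
    `n_per_class` points scattered around its (x, y) center.
    """
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    points, labels = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            labels.append(label)
    return points, labels


# Two well-separated classes, 50 points each.
points, labels = make_blobs_2d(n_per_class=50, centers=[(0.0, 0.0), (5.0, 5.0)])
```

Fixing the seed keeps the generated data identical between runs, which helps when a synthetic dataset is used to compare models or to reproduce a bug.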
Generate random data matching a rule or type (people’s names, phone numbers, financial data, etc.)
- Random database/dataframe generator
- Myriad Toolkit (Paper: http://vldb.org/pvldb/vol5/p1890_alexanderalexandrov_vldb2012.pdf) - focuses on how to generate massive amounts of data following a database schema (create data for your relational db with users, orders, etc)
- Generating Synthetic Data to Match Data Mining Patterns
- SMOTE with Imbalance Data
- imbalanced-learn library
- SO discussion on using Python libraries
- Simple example of how stock prices can be generated
- Building a simple Generative Adversarial Network (GAN) using TensorFlow
- GENERATIVE ADVERSARIAL NETWORKS (GANS) FOR TEXT USING WORD2VEC: PART 1
- GAN for 2D data generation
- Generative Adversarial Networks (GANs) for Discrete Data
- Boundary-seeking GANs: A new method for adversarial generation of discrete data
- Introductory guide to Generative Adversarial Networks (GANs) and their promise!
- Private Synthetic Data Generation via GANs (Supporting PDF)
- GANs (generative adversarial networks) possible for text as well?
- Basic Feature Engineering With Time Series Data in Python
- Zillow Prize - EDA, Data Cleaning & Feature Engineering
- Feature-wise transformations
- An Introduction To Statistical Learning with Applications in R
- Statistical Inferencing
- Having a lot more training examples of one type of object than the other types
- Accidentally testing the neural network using images that were in the training set
- Training the neural network on data that is easier to recognize or more consistent than the real-world data it will be used to classify later on
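
The first two pitfalls above (class imbalance and test data leaking from the training set) can be checked before any training starts. The following is a minimal sketch under stated assumptions: the helper names `class_balance` and `leaked_examples`, the label values and the image identifiers are all hypothetical, and it assumes each example has a hashable identifier such as a file name or content hash.

```python
from collections import Counter


def class_balance(labels):
    """Return the fraction of examples belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def leaked_examples(train_ids, test_ids):
    """Return identifiers that appear in both the training and test sets."""
    return set(train_ids) & set(test_ids)


# Hypothetical labels: one class heavily outnumbers the other.
train_labels = ["cat"] * 90 + ["dog"] * 10
balance = class_balance(train_labels)

# Hypothetical identifiers: one image is present in both splits.
train_ids = ["img_001", "img_002", "img_003"]
test_ids = ["img_003", "img_004"]
overlap = leaked_examples(train_ids, test_ids)
```

A skewed `balance` suggests resampling or class weighting is worth considering, and a non-empty `overlap` means the evaluation score will be optimistic until the shared examples are removed from one of the splits.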
See under Data Science and related cheatsheets and also under Tools & Libraries, Cheatsheets, Resources
- Data Science Primer
- 27 Amazing Data Science Books Every Data Scientist Should Read
- Coursera course: Getting and Cleaning Data
- Data Science courses on Coursera
- Data courses on Udemy
- Data courses on Udacity
- Learn Data Science by bitgrit
- Latest machine learning, visualization and data mining techniques. Online Master’s in Data Analytics from Penn State
- Data Science Handbook
- 12 Best Practices for Modern Data Ingestion
- A Rubric for ML Production Readiness - by Jiameng Gao from Applied Deep Learning Meetup in Feb 2019 (Paper: https://ai.google/research/pubs/pub46555)
- Rules of Machine Learning: Best Practices for ML Engineering
- Data Science Primer
- How to Prepare Data For Machine Learning
- What is Data Mining and KDD
- The KDD process for extracting useful knowledge from volumes of data
- Data Mining: Practical ML Tools and Techniques by Witten, Frank and Hall (3rd edition)
- Foundational Methodology for Data Science - IBM Analytics Whitepaper
- Coursera Data Science Methodology course
- From Problem to Approach and From Requirements to Collection
- Business Understanding
- Analytic Approach
- Data Requirements
- Data Collection
- From Understanding to Preparation and From Modeling to Evaluation
- Data Understanding
- Data Preparation
- Modeling
- Model Evaluation
- [Python Data Science Handbook on Azure git repo](https://notebooks.azure.com/jakevdp/projects/PythonDataScienceHandbook/tree/notebooks)
- Python for data analysis
- Python Data Science Handbook
- House prices
- Old example notebook: examples/data/notebooks
- Regression example: https://colab.research.google.com/drive/19uoDyGAxJ0zCwPT6cNb1xkYOfySNZChV
- Classification example: https://colab.research.google.com/drive/1i-fOhU87wWrzgnTV0o54MQyHmRVJK0qt
- Grakn and Graql - not just graphs, or graph database knowledge graphs | Docs | Quick start | GitHub
- See example in the ../examples/data/databases/graph/grakn folder
- Redis Graph | Blogs | Videos
- Neo4j
- Gun: A realtime, decentralized, offline-first, mutable graph database engine
- Cayley: An open-source graph database
- How to build a data science project from scratch
- Common mistakes when carrying out machine learning and data science
- A Rubric for ML Production Readiness - Breck et al. 2017 by Jiameng Gao (28 rules to follow, suggested by Google) | Original Paper by Google
Contributions are very welcome. Please share back with the wider community (and get credited for it)!
Please have a look at the CONTRIBUTING guidelines, and also read about our licensing policy.
Back to main page (table of contents)