Data

This page covers exploratory data analysis, data preparation, cleaning, preprocessing / wrangling, generation, feature engineering and other related topics.

The question to ask ourselves: Do we know our data...?

Ethics / altruistic motives

See Ethics / altruistic motives

Datasets and sources of raw data

Raw / unclean datasets

Boston Housing Dataset (archive contains unclean dataset) | Download
Datasets for data cleaning practice
Cleaning up and combining data, a dataset for practice
Datasets from various themes and domains: retail, government. Datasets with a good mix of incorrect, wrongly-input, missing data
(Specific) Web Traffic Time Series Forecasting
Quora: What are some dirty/untidy datasets to clean for data analysis? I have just finished datacamp's course on cleaning data using R. I want to practice

Clean / ready-to-use datasets

Data Exploratory Analysis

Data preparation

Data cleaning

Data cleaning
Spend Less Time Cleaning Data with Machine Learning
Helpful Python Code Snippets for Data Exploration in Pandas - lots of python snippets to select / clean / prepare
Working with missing data
Journal of Statistical Software - TidyData
5 Steps to correctly prepare your data for your machine learning model
Introduction to Data Analysis and Cleaning presentation by Mark Bell

Data preprocessing / Data wrangling / Data manipulation

Misc

See discussion on how data cleaning/preprocessing went wrong resulting in poorly performing model
Learning with Limited Labeled Data with Shioulin Sam

Data Generation

See Data Generation

Feature Selection

Feature engineering

Basic Feature Engineering With Time Series Data in Python
Zillow Prize - EDA, Data Cleaning & Feature Engineering
Feature-wise transformations
tsfresh - tsfresh is used to extract characteristics from time series
featuretools - an open source python framework for automated feature engineering
5 Steps to correctly prepare your data for your machine learning model
scikit-learn's SelectKBest
mlbox's Feature selection
Chi2 test: Feature selection: Quora | NLP Stanford Group | Learn for Master
Feature engineering and Dimensionality reduction
Seven Techniques for Data Dimensionality Reduction
Feature Engineering and Feature Selection
ML topics expanded by Chris Albon - look for topics: Feature Engineering • Feature Selection
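
As a rough illustration of the chi-squared feature selection idea listed above, the sketch below scores categorical features against a class label and keeps the highest-scoring one. It is a minimal, standard-library version of the textbook test of independence and does not call scikit-learn's SelectKBest. The feature names, the toy labels and the helper `chi2_score` are hypothetical.

```python
from collections import Counter


def chi2_score(feature, labels):
    """Chi-squared statistic of independence between a categorical feature and the labels."""
    n = len(labels)
    observed = Counter(zip(feature, labels))  # joint counts per (feature value, class)
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    score = 0.0
    for f, f_total in feature_totals.items():
        for c, c_total in label_totals.items():
            # Count we would expect in this cell if feature and label were independent.
            expected = f_total * c_total / n
            score += (observed[(f, c)] - expected) ** 2 / expected
    return score


# Hypothetical toy data: two binary features and a spam/ham label.
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
features = {
    "has_link": ["yes", "yes", "yes", "no", "no", "no"],  # follows the label exactly
    "is_long": ["yes", "no", "yes", "no", "yes", "no"],   # only weakly related
}

# Rank features by score (highest first) and keep the top k.
ranked = sorted(features, key=lambda name: chi2_score(features[name], labels), reverse=True)
top_k = ranked[:1]
print(top_k)
```

A higher score means the feature's values are distributed less independently of the class, which is why `has_link` ranks first here. On real data the expected counts should not be too small for the statistic to be meaningful.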

Post model-creation analysis, ML interpretation/explainability

Libraries & packages

Yellowbrick - a suite of visual diagnostic tools called “Visualizers” that extend the scikit-learn API to allow human steering of the model selection process
Shap - A unified approach to explain the output of any machine learning model
LIME
4 Python Libraries For Getting Better Model Interpretability
Integrated Gradients: Axiomatic Attribution for Deep Networks | Paper
Resources on GitHub on interpretability
Awesome Machine Learning Interpretability - A Curated, but Probably Biased and Incomplete, List of Awesome Machine Learning Interpretability Resources

Articles, blog posts, papers, notebooks, books, presentations

Statistics

An Introduction To Statistical Learning with Applications in R
Statistical Inferencing
Understand Your Machine Learning Data With Descriptive Statistics in Python
How to Use Statistics to Identify Outliers in Data
Applying Physics functions
Chapter 2 of Introduction to Statistical Learning
Naked Statistics | Book on Amazon | Naked Statistics flash cards | Summary by Daniel Miessler
Cartoon Guide to Statistics (Cartoon Guide Series)
Journal of Statistical Software - TidyData
Statistics courses at Coursera | Udemy | Udacity - search for Statistics | Harvard University: Statistics 110 | more videos on their YouTube channel | Stanford University
15 Statistical Hypothesis Tests in Python (Cheat Sheet)
Statistics by Chris Albon - covering Frequentist topics
For more, see Mathematics, Statistics, Probability & Probabilistic programming
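
One common way to use statistics to flag outliers, a topic several of the resources above cover, is the interquartile range (IQR) rule. The sketch below is a minimal standard-library version and is not taken from any of the linked articles. The multiplier `k=1.5` is a widely used convention rather than a fixed rule, and the sample values and the helper `iqr_outliers` are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample with one obviously extreme value.
sample = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = iqr_outliers(sample)
print(outliers)
```

Whether a flagged value is an error or a genuine extreme observation is a judgment call that depends on the domain, so the rule is better treated as a prompt for inspection than as an automatic filter.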

Visualisation

See Visualisation

Common mistakes when training models (data related)

Having a lot more training examples of one type of object than the other types
Accidentally testing the neural network using images that were in the training set
Training the neural network on data that is easier to recognize or more consistent than the real-world data it will be used to classify later on
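
The first two mistakes above can often be caught with simple checks before training. The sketch below is a minimal example under assumed conditions: every example has a unique identifier, and the labels are available as a plain list. The identifiers, the labels and the helper names `class_balance` and `leaked_examples` are hypothetical. The third mistake, training data that is cleaner than real-world data, usually needs a held-out sample of real-world data and is not covered here.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the labels, most frequent class first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.most_common()}


def leaked_examples(train_ids, test_ids):
    """Return the identifiers that appear in both the training and the test split."""
    return set(train_ids) & set(test_ids)


# Hypothetical toy data: image identifiers and their labels.
train_labels = ["cat"] * 90 + ["dog"] * 10
train_ids = [f"img_{i}" for i in range(100)]
test_ids = [f"img_{i}" for i in range(95, 120)]

balance = class_balance(train_labels)
overlap = leaked_examples(train_ids, test_ids)

print(balance)          # a heavily skewed split suggests class imbalance
print(sorted(overlap))  # any identifier listed here is in both splits
```

Exact-identifier matching only catches literal duplicates. Near-duplicates, such as resized or cropped copies of the same image, would need a content-based comparison.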

Presentations

Cheatsheets

See under Cheatsheets

Course / books

Best practices / rules / an unordered list of high level or low level guidelines

Framework(s) / checklist(s)

Data Science Primer
How to Prepare Data For Machine Learning
What is Data Mining and KDD
The KDD process for extracting useful knowledge from volumes of data
Data Mining: Practical Machine Learning Tools and Techniques by Witten, Frank and Hall, 3rd edition
Foundational Methodology for Data Science - IBM Analytics Whitepaper
Coursera Data Science Methodology course
- From Problem to Approach and From Requirements to Collection
  - Business Understanding
  - Analytic Approach
  - Data Requirements
  - Data Collection
- From Understanding to Preparation and From Modeling to Evaluation
  - Data Understanding
  - Data Preparation
  - Modeling
  - Model Evaluation

Notebooks

Python Data Science Handbook on Azure git repo
Python for Data Analysis on Azure git repo
Python Data Science Handbook
House prices
- ML End-to-End Tutorial + Pandas notebook
- Regression example notebook
- Classification example notebook
- Some explanations of the above Regression & Classification examples: as a Notebook | as a PDF file
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command line tools
Synthetic features and outliers notebook
Do we know our data...

Programs and Tools

See Programs and Tools

Databases

Grakn and Graql - not just graphs, or graph database knowledge graphs | Docs | Quick start | GitHub
- See example in the ../examples/data/databases/graph/grakn folder
Redis Graph | Blogs | Videos | Skillsmatter: how redis enterprise made redis highly available, scalable, durable and cloudnative
Neo4j
Gun: A realtime, decentralized, offline-first, mutable graph database engine
Cayley: An open-source graph database

Time-series databases

Time-scale
kdb+ - a column-based relational time series database (TSDB) with in-memory (IMDB) abilities, developed and marketed by Kx Systems

References

Credits

Big thanks to Jeremie Charlet for his contributions to many of the resources on this page, and to the others who have also helped build the above resources.

Contributing

Contributions are very welcome. Please share back with the wider community (and get credited for it)!

Please have a look at the CONTRIBUTING guidelines, and also read about our licensing policy.

Back to main page (table of contents)
