Data

This page covers exploratory data analysis, data preparation, cleaning, preprocessing / wrangling, generation, feature engineering and other related topics.

Data Exploratory Analysis

Data preparation

Data cleaning

Data cleaning
Spend Less Time Cleaning Data with Machine Learning
Helpful Python Code Snippets for Data Exploration in Pandas - lots of python snippets to select / clean / prepare
Working with missing data
Journal of Statistical Software - Tidy Data

Data preprocessing / Data Wrangling

Misc

See the discussion on how data cleaning/preprocessing went wrong, resulting in a poorly performing model

Data Generation

Generate numeric data fitting a model/distribution (to fit linear model / ring / etc)
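
As a minimal sketch of what generating numeric data around a model can look like, the snippet below draws noisy points around a straight line and around a ring using only the Python standard library. The function names (`make_linear`, `make_ring`) and all default parameters are hypothetical choices for illustration, not taken from any of the linked resources.

```python
import math
import random


def make_linear(n, slope=2.0, intercept=1.0, noise=0.5, seed=0):
    """Return n (x, y) pairs scattered around y = slope * x + intercept."""
    rng = random.Random(seed)  # seeded so the data is reproducible
    points = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        # Gaussian noise with standard deviation `noise` is added to the line
        y = slope * x + intercept + rng.gauss(0.0, noise)
        points.append((x, y))
    return points


def make_ring(n, radius=1.0, noise=0.05, seed=0):
    """Return n (x, y) pairs scattered around a circle of the given radius."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        # Jitter the radius so points fall near, not exactly on, the circle
        r = radius + rng.gauss(0.0, noise)
        points.append((r * math.cos(angle), r * math.sin(angle)))
    return points


line_points = make_linear(100)
ring_points = make_ring(100, radius=3.0)
```

Setting `noise=0.0` gives points that lie exactly on the line or circle, which can be handy when checking that a fitting routine recovers the known parameters.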

Generate random data matching a rule or type (people’s names / phone numbers / etc, financial data, etc)

Random database/dataframe generator
Myriad Toolkit (Paper: http://vldb.org/pvldb/vol5/p1890_alexanderalexandrov_vldb2012.pdf) - focuses on generating massive amounts of data that follow a database schema (for example, data for a relational db with users, orders, etc)
Generating Synthetic Data to Match Data Mining Patterns
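
For small cases, rule-based random data such as names and phone numbers can be sketched with the standard library alone. The snippet below is an illustration under stated assumptions: the sample name lists, the `555-` phone format, and the function names `fake_person` / `fake_people` are all hypothetical and unrelated to the tools listed above.

```python
import random

# Hypothetical sample values; a real generator would use larger lists
FIRST_NAMES = ["Alice", "Bob", "Chen", "Dana"]
LAST_NAMES = ["Ng", "Okafor", "Silva", "Weber"]


def fake_person(rng):
    """Build one record: a name from the sample lists and a 555-prefixed number."""
    name = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
    # Rule for the phone field: fixed prefix followed by four zero-padded digits
    phone = f"555-{rng.randint(0, 9999):04d}"
    return {"name": name, "phone": phone}


def fake_people(n, seed=0):
    """Return n records; the same seed always yields the same records."""
    rng = random.Random(seed)
    return [fake_person(rng) for _ in range(n)]


people = fake_people(20, seed=42)
```

Passing a seeded `random.Random` instance rather than using the module-level functions keeps generated datasets reproducible across runs.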

Generate data from existing

Generate fake images

(Computer Vision example, see code)

Generate data using GAN

Feature engineering / selection

Statistics

Common mistakes when training models (data related)

Having a lot more training examples of one type of object than the other types
Accidentally testing the neural network using images that were in the training set
Training the neural network on data that is easier to recognize or more consistent than the real-world data it will be used to classify later on
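
The first two mistakes above can be caught with quick checks before training. The following is a minimal sketch, assuming labels are available as a plain list and each example has a unique identifier such as a file name; the helper names `class_balance` and `train_test_overlap` and the sample values are hypothetical. The third mistake (training data that is cleaner than real-world data) needs manual review of samples and is not covered here.

```python
from collections import Counter


def class_balance(labels):
    """Return each label's share of the dataset as a fraction of 1.0."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def train_test_overlap(train_ids, test_ids):
    """Return the identifiers that appear in both the training and test sets."""
    return set(train_ids) & set(test_ids)


# Hypothetical example data
labels = ["cat", "cat", "cat", "dog"]
shares = class_balance(labels)  # a heavily skewed split signals class imbalance

leaked = train_test_overlap(["a.png", "b.png"], ["b.png", "c.png"])
if leaked:
    print(f"Test examples also present in training set: {sorted(leaked)}")
```

An exact-identifier check only finds literal duplicates; near-duplicates (for example, resized copies of the same image) would need a content-based comparison.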

Cheatsheets

See under Data Science and related cheatsheets and also under Tools & Libraries, Cheatsheets, Resources

Course / books

Best practices / rules / an unordered list of high level or low level guidelines

Framework(s) / checklist(s)

Data Science Primer
How to Prepare Data For Machine Learning
What is Data Mining and KDD
The KDD process for extracting useful knowledge from volumes of data
Data Mining: Practical Machine Learning Tools and Techniques by Witten, Frank and Hall, 3rd edition
Foundational Methodology for Data Science - IBM Analytics Whitepaper
Coursera Data Science Methodology course
- From Problem to Approach and From Requirements to Collection
  - Business Understanding
  - Analytic Approach
  - Data Requirements
  - Data Collection
- From Understanding to Preparation and From Modeling to Evaluation
  - Data Understanding
  - Data Preparation
  - Modeling
  - Model Evaluation

Notebooks

[Python Data Science Handbook on Azure git repo](https://notebooks.azure.com/jakevdp/projects/PythonDataScienceHandbook/tree/notebooks)
Python for data analysis
Python Data Science Handbook
House prices
- Old example notebook: examples/data/notebooks
- Regression example: https://colab.research.google.com/drive/19uoDyGAxJ0zCwPT6cNb1xkYOfySNZChV
- Classification example: https://colab.research.google.com/drive/1i-fOhU87wWrzgnTV0o54MQyHmRVJK0qt

Programs and Tools

See Programs and Tools

Databases

Grakn and Graql - not just graphs, or graph database knowledge graphs | Docs | Quick start | GitHub
- See example in the ../examples/data/databases/graph/grakn folder
Redis Graph | Blogs | Videos
Neo4j
Gun: A realtime, decentralized, offline-first, mutable graph database engine
Cayley: An open-source graph database

References

Contributing

Contributions are very welcome. Please share back with the wider community (and get credited for it)!

Please have a look at the CONTRIBUTING guidelines, and also read our licensing policy.

Back to main page (table of contents)

