Data

Page dedicated to exploratory data analysis, data preparation, cleaning, pre-processing / wrangling, generation, feature engineering and other related topics.

Ethics / altruistic motives

See Ethics / altruistic motives

Datasets and sources of raw data

Data Exploratory Analysis

Data preparation

Data cleaning

Data cleaning
Spend Less Time Cleaning Data with Machine Learning
Helpful Python Code Snippets for Data Exploration in Pandas - lots of python snippets to select / clean / prepare
Working with missing data
Journal of Statistical Software - TidyData

Data preprocessing / Data Wrangling

Misc

See discussion on how data cleaning/preprocessing went wrong, resulting in a poorly performing model

Data Generation

Generate numeric data fitting a model/distribution (to fit linear model / ring / etc)
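As a minimal sketch of what this heading covers, the snippet below generates noisy points around a straight line and around a ring using only the Python standard library. The function names (`make_linear`, `make_ring`) and all default parameters are hypothetical choices made for illustration; they do not come from any of the linked resources.

```python
import math
import random


def make_linear(n, slope=2.0, intercept=1.0, noise_sd=0.5, seed=0):
    """Return n (x, y) points scattered around y = slope * x + intercept."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    points = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        y = slope * x + intercept + rng.gauss(0.0, noise_sd)
        points.append((x, y))
    return points


def make_ring(n, radius=1.0, noise_sd=0.05, seed=0):
    """Return n (x, y) points scattered around a circle of the given radius."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        r = radius + rng.gauss(0.0, noise_sd)  # radial noise only
        points.append((r * math.cos(angle), r * math.sin(angle)))
    return points
```

Calling `make_linear(1000)` gives a sample to which a linear model can be fitted; `make_ring(1000)` gives a sample that a linear model cannot separate from its centre, which is a common reason to generate ring-shaped data.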

Generate random data matching a rule or type (people’s names / phone numbers / etc, financial data, etc)

Random database/dataframe generator
MyRiad Toolkit (Paper: http://vldb.org/pvldb/vol5/p1890_alexanderalexandrov_vldb2012.pdf) - focuses on how to generate massive amounts of data following a database schema (create data for your relational db with users, orders, etc)
Generating Synthetic Data to Match Data Mining Patterns
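For small cases, rule-based random records can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not the approach of any tool listed above: the name pools, the `NNN-NNN-NNNN` phone format, the `balance` field and the function names are all hypothetical.

```python
import random
import string

# Hypothetical name pools; a real generator would draw from larger lists.
FIRST_NAMES = ["Alice", "Bob", "Chen", "Dana"]
LAST_NAMES = ["Ng", "Patel", "Smith", "Okafor"]


def fake_phone(rng):
    """Return a string shaped like NNN-NNN-NNNN (an assumed format rule)."""
    digits = "".join(rng.choice(string.digits) for _ in range(10))
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


def fake_person(rng):
    """Return one record with a name, a phone number and an account balance."""
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "phone": fake_phone(rng),
        "balance": round(rng.uniform(0.0, 10000.0), 2),
    }


def fake_people(n, seed=0):
    """Return n records; the same seed always yields the same records."""
    rng = random.Random(seed)
    return [fake_person(rng) for _ in range(n)]
```

Seeding the generator keeps the fake dataset reproducible, which helps when the records feed tests or demos.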

Generate data from existing

Generate fake images

Computer Vision example: see code

Generate data using GAN

Feature engineering / selection

Basic Feature Engineering With Time Series Data in Python
Zillow Prize - EDA, Data Cleaning & Feature Engineering
Feature-wise transformations
tsfresh - tsfresh is used to extract characteristics from time series
featuretools - an open source python framework for automated feature engineering

Statistics

Visualisation

See Visualisation

Common mistakes when training models (data related)

Having a lot more training examples of one type of object than the other types
Accidentally testing the neural network using images that were in the training set
Training the neural network on data that is easier to recognize or more consistent than the real-world data it will be used to classify later on
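The first two mistakes above can be checked mechanically before training. The following is a minimal sketch, assuming labels are hashable values and each example carries a unique identifier (such as a file name or a content hash); the function names are hypothetical. The third mistake, a mismatch between training data and real-world data, cannot be caught by a check this simple.

```python
from collections import Counter


def class_balance(labels):
    """Return each label's share of the dataset (assumes labels is non-empty)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Return the most common class count divided by the least common one."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def leaked_examples(train_ids, test_ids):
    """Return identifiers that appear in both the training and the test set."""
    return set(train_ids) & set(test_ids)
```

For example, `imbalance_ratio(["cat", "cat", "cat", "dog"])` is `3.0`, and any non-empty result from `leaked_examples` means some test items were also seen during training.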

Cheatsheets

See under Cheatsheets

Course / books

Best practices / rules / an unordered list of high level or low level guidelines

Framework(s) / checklist(s)

Data Science Primer
How to Prepare Data For Machine Learning
What is Data Mining and KDD
The KDD process for extracting useful knowledge from volumes of data
Data Mining: Practical Machine Learning Tools and Techniques by Witten, Frank and Hall, 3rd edition
Foundational Methodology for Data Science - IBM Analytics Whitepaper
Coursera Data Science Methodology course
- From Problem to Approach and From Requirements to Collection
  - Business Understanding
  - Analytic Approach
  - Data Requirements
  - Data Collection
- From Understanding to Preparation and From Modeling to Evaluation
  - Data Understanding
  - Data Preparation
  - Modeling
  - Model Evaluation

Notebooks

Programs and Tools

See Programs and Tools

Databases

Grakn and Graql - not just graphs, or graph database knowledge graphs | Docs | Quick start | GitHub
- See example in the ../examples/data/databases/graph/grakn folder
Redis Graph | Blogs | Videos | Skillsmatter: how Redis Enterprise made Redis highly available, scalable, durable and cloud-native
Neo4j
Gun: A realtime, decentralized, offline-first, mutable graph database engine
Cayley: An open-source graph database

References

Contributing

Contributions are very welcome; please share back with the wider community (and get credited for it)!

Please have a look at the CONTRIBUTING guidelines, and also read about our licensing policy.

Back to main page (table of contents)

