This page is dedicated to exploratory data analysis, data preparation, cleaning, pre-processing / wrangling, generation, feature engineering and other related topics.
The question to ask ourselves: Do we know our data...?
- Ethics / altruistic motives
- Data Science
- Datasets and sources of raw data
- Data Exploratory Analysis
- Data preparation
- Data Generation
- Feature Selection
- Feature Engineering
- Post model-creation analysis, ML interpretation/explainability
- Statistics
- Visualisation
- Common mistakes when training models (data related)
- Presentations
- Cheatsheets
- Course / books
- Best practices / rules / an unordered list of high level or low level guidelines
- Framework(s) / checklist(s)
- Notebooks
- Programs and Tools
- Databases
- References
- Credits
- Contributing
See Ethics / altruistic motives
See Datasets
See Data Generation
See Post model-creation analysis, ML interpretation/explainability
See Statistics.md
- Having a lot more training examples of one type of object than the other types
- Accidentally testing the neural network using images that were in the training set
- Training the neural network on data that is easier to recognize or more consistent than the real-world data it will be used to classify later on
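The first two mistakes above can be checked mechanically before training. The snippet below is a minimal sketch, not a definitive implementation: the helper names `class_imbalance_ratio` and `leaked_examples` and the toy file names are hypothetical and do not come from any of the linked resources. It assumes examples can be compared by an identifier such as a file name, so it only catches exact duplicates. The third mistake (training data that is cleaner than real-world data) is not covered here, since detecting it needs samples from the real-world data.

```python
from collections import Counter


def class_imbalance_ratio(labels):
    """Return the count of the most common class divided by the count of the least common one."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def leaked_examples(train_items, test_items):
    """Return the set of items that appear in both the training and the test set."""
    return set(train_items) & set(test_items)


# Hypothetical toy data: file names stand in for images.
train_files = ["cat_01.png", "cat_02.png", "cat_03.png", "dog_01.png"]
train_labels = ["cat", "cat", "cat", "dog"]
test_files = ["cat_03.png", "dog_02.png"]

imbalance = class_imbalance_ratio(train_labels)
leaks = leaked_examples(train_files, test_files)

print(f"Imbalance ratio (largest/smallest class): {imbalance:.1f}")
print(f"Examples present in both train and test: {sorted(leaks)}")
```

A ratio far above 1 suggests one class dominates the training set, and a non-empty overlap means the test score is likely to be optimistic. Comparing a hash of each file's contents instead of its name would be a more robust way to find duplicates.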
- A Rubric for ML Production Readiness - by Jiameng Gao from Applied Deep Learning Meetup in Feb 2019 (Paper: https://ai.google/research/pubs/pub46555)
- Introduction to Data Analysis and Cleaning presentation by Mark Bell
- Do we know our data, as good as we know our tools by Jeremie Charlet and Mani Sarkar
See under Cheatsheets
See Courses / books
- 12 Best Practices for Modern Data Ingestion
- A Rubric for ML Production Readiness - by Jiameng Gao from Applied Deep Learning Meetup in Feb 2019 (Paper: https://ai.google/research/pubs/pub46555)
- Rules of Machine Learning: Best Practices for ML Engineering
See Framework(s) / checklist(s)
See Notebooks
See Databases
- How to build a data science project from scratch
- Common mistakes when carrying out machine learning and data science
- A Rubric for ML Production Readiness - Breck et al. 2017 by Jiameng Gao (28 rules to follow, suggested by Google) | Original Paper by Google
- Understanding Data Science Problems - template of questions to ask
- eBook: How to Succeed in Data Science [deadlink]
- Data Fallacies by Nabih Bawazir
Big thanks to Jeremie Charlet for his contributions to many of the resources on this page, and to everyone else who has helped build the resources above.
Contributions are very welcome. Please share back with the wider community (and get credited for it)!
Please have a look at the CONTRIBUTING guidelines, and also read about our licensing policy.
Back to main page (table of contents)