Page dedicated to exploratory data analysis, data preparation, cleaning, pre-processing / wrangling, data generation, feature engineering and other related topics
The question to ask ourselves: Do we know our data...?
- Ethics / altruistic motives
- Data Science
- Datasets and sources of raw data
- Data Collection
- Hypothesis
- Data Exploratory Analysis
- Data preparation
- Data Generation
- Feature Selection
- Feature Importance
- Feature Engineering
- Hyperparameter tuning
- Post model-creation analysis, ML interpretation/explainability
- Model deployment
- Statistics
- Visualisation
- Common mistakes when training models (data related)
- Presentations
- Cheatsheets
- Courses / books
- Best practices / rules / an unordered list of high level or low level guidelines
- Framework(s) / checklist(s)
- Notebooks
- Programs and Tools
- Databases
- References
- Credits
- Contributing
See Ethics / altruistic motives
- The Data Science Process
- JustCause package/framework - framework to foster good scientific practice in the research of causality methods | PyPI | GitHub
- “Metaflow is a human-friendly Python library” LinkedIn Post
- 5 free books for learning Python for DS
- 7 advanced tricks in pandas for data science
- The Ultimate NumPy Tutorial for Data Science Beginners
- Top 10 Data science podcast must follow for learn new things
- Top 20 Youtube Channels for Data Science
- Advanced Data Science from IBM
- 12 Steps to Production-Quality Data Science Code
- Top 10 Popular GitHub Repositories to learn about Data Science
- The difference between Statistics and Data Science: Big Data and Inferential Statistics
- DataScience resources (in the form of a book) from Eric
- Data Exploration and API First Design: Deep Learning Hands-On Series with Eric Schles
- Augmented Analytics Engine
- Putting an end to Unreliable Analytics by David Yaffe
See Datasets
- Correlation, causation, multicollinearity, confounding features or variables
- How to approach Hypothesis Testing
- Does Your Hypothesis Development Canvas Tell a Story?
- A Complete Guide to Hypothesis Testing
- An introduction to Statistical Inference and Hypothesis testing
- A set of descriptive statistics and hypothesis tests across different types of data
- The statistical analysis t-test explained for beginners and experts
See Data Generation
- Example: Feature Importance implementation (python)
- How to Calculate Feature Importance With Python
- RFPimp:
  - Catboost model and W&B
  - LightGBM model and W&B
- The 4 types of additive Feature Importances
- The Math of Random Forests and Feature Importance in Scikit-learn and Spark
- Path Explain - toolkit for feature attributions: GitHub | PyPI | Path Explain on MWML
- Pruning: DL models
- [Pruning models](https://app.wandb.ai/authors/pruning/reports/Plunging-into-Model-Pruning-in-Deep-Learning--VmlldzoxMzcyMDg)
- Poor Man’s BERT • Exploring Pruning as an Alternative to Knowledge Distillation.

See Post model-creation analysis, ML interpretation/explainability
- Model Deployment Methods and Techniques - Part 1
- Model Deployment Methods and Techniques - Part 2
- Model Deployment Methods and Techniques - Part 3
- Model Deployment Methods and Techniques - Part 4
- Model Deployment Methods and Techniques - Part 5
- Having a lot more training examples of one type of object than the other types
- Accidentally testing the neural network using images that were in the training set
- Training the neural network on data that is easier to recognize or more consistent than the real-world data it will be used to classify later on
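The first two mistakes above (class imbalance, and test items that also appear in the training set) can be checked before any training starts. The following is a minimal sketch using only the Python standard library; the function names, the file names and the toy labels are hypothetical and chosen purely for illustration. The third mistake (training data that is cleaner than real-world data) cannot be detected this way and needs a comparison against samples from the target environment.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the dataset as a fraction of all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Return the count of the most common class divided by the least common one."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


def train_test_overlap(train_ids, test_ids):
    """Return the identifiers that appear in both the training and the test split."""
    return set(train_ids) & set(test_ids)


# Hypothetical toy data: file names stand in for images, labels for object types.
train = [
    ("img_001.png", "cat"),
    ("img_002.png", "cat"),
    ("img_003.png", "cat"),
    ("img_004.png", "dog"),
]
test = [("img_004.png", "dog"), ("img_005.png", "cat")]

train_labels = [label for _, label in train]
balance = class_balance(train_labels)
ratio = imbalance_ratio(train_labels)
leaked = train_test_overlap(
    (name for name, _ in train),
    (name for name, _ in test),
)

print("Class shares in training set:", balance)
print("Imbalance ratio (largest / smallest class):", ratio)
print("Items present in both splits:", leaked)
```

A high imbalance ratio or a non-empty overlap set is a prompt to investigate, not a verdict. Comparing file names only catches exact duplicates; near-duplicates (resized or re-encoded images, for example) would need content hashing or similarity checks.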
- A Rubric for ML Production Readiness - by Jiameng Gao from Applied Deep Learning Meetup in Feb 2019 (Paper: https://ai.google/research/pubs/pub46555)
- Introduction to Data Analysis and Cleaning presentation by Mark Bell
- Do we know our data, as good as we know our tools by Jeremie Charlet and Mani Sarkar
See under Cheatsheets
See Courses / books
- 12 Best Practices for Modern Data Ingestion
- A Rubric for ML Production Readiness - by Jiameng Gao from Applied Deep Learning Meetup in Feb 2019 (Paper: https://ai.google/research/pubs/pub46555)
- Rules of Machine Learning: Best Practices for ML Engineering
See Framework(s) / checklist(s)
See Notebooks
See Databases
- How to build a data science project from scratch
- Common mistakes when carrying out machine learning and data science
- A Rubric for ML Production Readiness - Breck et al. 2017 by Jiameng Gao (28 rules to follow, suggested by Google) | Original Paper by Google
- Understanding Data Science Problems - template of questions to ask
- eBook: How to Succeed in Data Science [dead link]
- Data Fallacies by Nabih Bawazir
Big thanks to Jeremie Charlet for his contributions to many of the resources on this page, and to everyone else who has helped build the resources above.
Contributions are very welcome; please share back with the wider community (and get credited for it)!
Please have a look at the CONTRIBUTING guidelines, and also read our licensing policy.
Back to main page (table of contents)