Data: updating data related links, created a new topic Model Creation
neomatrix369 committed Apr 12, 2020
1 parent 23b3703 commit 5123b8b
Showing 10 changed files with 114 additions and 2 deletions.
10 changes: 10 additions & 0 deletions data/README.md
@@ -17,6 +17,7 @@ The question to ask ourselves: _Do we know our data...?_
+ [Data preprocessing / Data Wrangling](./data-preparation.md#data-preprocessing--data-wrangling)
- [Data Generation](./README.md#data-generation)
- [Feature Selection](./README.md#feature-selection)
- [Feature Importance](./README.md#feature-importance)
- [Feature Engineering](./README.md#feature-engineering)
- [Post model-creation analysis, ML interpretation/explainability](./README.md#post-model-creation-analysis-ml-interpretationexplainability)
- [Model deployment](./README.md#model-deployment)
@@ -49,6 +50,7 @@ See [Ethics / altruistic motives](../README-details.md#ethics--altruistic-motive
- [Starting a Data Project](https://github.com/virgili0/Virgilio/blob/master/serving/purgatorio/define-the-scope-and-ask-questions/starting-a-data-project/starting-a-data-project.md)
- [WorkSpace Setup and Cloud Computing](https://github.com/virgili0/Virgilio/blob/master/serving/purgatorio/define-the-scope-and-ask-questions/workspace-setup-and-cloud-computing/workspace-setup-and-cloud-computing.md)
- [JustCause package/framework - framework to foster good scientific practice in the research of causality methods](https://www.linkedin.com/posts/florianwilhelm_introduction-activity-6624318058347405312-fdBa) | [PyPI](https://pypi.org/project/JustCause/) | [GitHub](https://github.com/inovex/justcause)
- [“Metaflow is a human-friendly Python library”](https://github.com/Netflix/metaflow) [LinkedIn Post](https://www.linkedin.com/posts/eric-feuilleaubois-ph-d-43ab0925_netflixmetaflow-activity-6638658912201527296-1QqW)

## Datasets and sources of raw data

@@ -80,6 +82,14 @@ See [Data Generation](./data-generation.md#data-generation)

See [Feature Selection](./feature-selection.md)

## Feature Importance

- [Example: Feature Importance implementation (python)](../examples/data/feature-importance-filtering)
- [How to Calculate Feature Importance With Python](https://machinelearningmastery.com/calculate-feature-importance-with-python/)
- RFPimp:
  - [RF Importance](https://explained.ai/rf-importance/index.html)
  - [Explaining Feature Importance by example of a Random Forest](https://towardsdatascience.com/explaining-feature-importance-by-example-of-a-random-forest-d9166011959e)
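
Permutation importance is one of the techniques these feature-importance resources commonly cover. The block below is a minimal, standard-library-only sketch of the idea, not the implementation used by any linked article or library. The function name `permutation_importance`, the toy `model`, and the synthetic data are all hypothetical and exist only for illustration.

```python
import random


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Return, per feature, the mean increase in MSE when that column is shuffled.

    `predict` is any callable mapping one row (a list of floats) to a number.
    """
    rng = random.Random(seed)

    def mse(rows):
        return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)

    baseline = mse(X)
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle one column to break its link with the target.
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            permuted = [row[:col] + [v] + row[col + 1:]
                        for row, v in zip(X, shuffled)]
            increases.append(mse(permuted) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy data (assumption): the target depends only on the first feature.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [3.0 * row[0] for row in X]


def model(row):
    # Stand-in for a fitted model; it ignores the second feature entirely.
    return 3.0 * row[0]


scores = permutation_importance(model, X, y)
print(scores)
```

A feature the model relies on should show a clearly positive score, while a feature it ignores should score zero. With a real estimator, the scores are best computed on held-out data rather than on the training set.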

## Feature engineering

See [Feature engineering](./feature-engineering.md)
34 changes: 34 additions & 0 deletions data/data-exploratory-analysis.md
@@ -20,6 +20,22 @@ aka *_Exploratory Data Analysis_*
- [Introduction of Python for Data Analysis](https://www.linkedin.com/posts/nabihbawazir_python-for-data-analysis-activity-6605350672881721344-aE0_)
- [Data Analysis Method: Mathematics Optimization to Build Decision Making](https://www.linkedin.com/posts/data-science-central_data-analysis-method-mathematics-optimization-activity-6606212768343347201-vzat)

### Tools

- [Pandas Profiling](https://pandas-profiling.github.io/pandas-profiling/)
- [Dabl: Data Analysis Baseline Library (Pandas profiling like tool)](https://dabl.github.io/dev/) | [GitHub](https://github.com/dabl/dabl)
- [Bamboolib](./bamboolib.md)
- [CleverCSV](https://github.com/alan-turing-institute/CleverCSV)
- [edaviz - Python Library for Data Exploration and Visualization in Jupyter Notebook](https://www.youtube.com/watch?v=eYEeYv11YrQ)
- [Getting Started with pandas a powerful Python data analysis toolkit](https://www.datasciencecentral.com/profiles/blogs/getting-started-with-pandas-a-powerful-python-data-analysis) [LinkedIn Post](https://www.linkedin.com/posts/data-science-central_getting-started-with-pandas-a-powerful-python-activity-6651512263251410944-5bSl)
- [A Complete Tutorial to Learn Data Analysis with Python From Scratch](https://www.linkedin.com/posts/iamsivab_introduction-to-programming-in-pythonpdf-activity-6640574471667318784-yt-g)

- See [Data > Programs and Tools](./programs-and-tools.md#programs-and-tools) and [Things to know: Primary tools to analyse data](../things-to-know.md#primary-tools-to-analyse-data)

### Missing values

- [How to Treat Missing Values in Your Data](https://www.linkedin.com/posts/data-science-central_how-to-treat-missing-values-in-your-data-activity-6627609785242046464-A_69)

### Correlations

- [How to detect spurious correlations, and how to find the real ones](https://www.linkedin.com/posts/data-science-central_how-to-detect-spurious-correlations-and-activity-6623713080754913280-dU8f)
@@ -28,6 +44,24 @@ aka *_Exploratory Data Analysis_*
- [13 Great Articles and Tutorials about Correlation](https://www.linkedin.com/posts/data-science-central_13-great-articles-and-tutorials-about-correlation-activity-6622173938812280832-Fa4a)
- [Testing for Normality using Skewness and Kurtosis](https://www.linkedin.com/posts/ashishpatel2604_artificialintelligence-deeplearning-datascience-activity-6603851612719026176-zx0u)
- [Variable Reduction: An art as well as Science](https://www.linkedin.com/posts/data-science-central_variable-reduction-an-art-as-well-as-science-activity-6607678425375342592-xrSp)
- [[Discussion] How to see relation between categorical columns?](https://www.facebook.com/groups/AnalyticsEdge/permalink/2578728952342061/)

### Clustering

- [Clustering with non numeric data](https://www.linkedin.com/posts/data-science-central_clustering-with-non-numeric-data-activity-6607783116335534080-aWRV)
- [Clustering Python](https://github.com/ACFaul/Clustering-Python)
- [Clustering Matlab](https://github.com/ACFaul/Clustering-Matlab)
- [Clustering with non numeric data](https://www.linkedin.com/posts/data-science-central_clustering-with-non-numeric-data-activity-6607783116335534080-aWRV)
- [Have u ever heard about Bounded Clustering?](https://towardsdatascience.com/bounded-clustering-7ac02128c893) [LinkedIn Post](https://www.linkedin.com/posts/ashishpatel2604_bounded-clustering-activity-6604231470691217408-Fhyn)
- [Spectral Clustering : How Math is Redefining Decision Making](https://www.datasciencecentral.com/profiles/blogs/spectral-clustering-how-math-is-redefining-decision-making) [LinkedIn Post](https://www.linkedin.com/posts/data-science-central_spectral-clustering-how-math-is-redefining-activity-6644369189828120576-R50H)
- [Python: Implementing a k-means algorithm with sklearn](https://www.datasciencecentral.com/profiles/blogs/python-implementing-a-k-means-algorithm-with-sklearn) [LinkedIn Post](https://www.linkedin.com/posts/vincentg_python-implementing-a-k-means-algorithm-activity-6646407378474450944-wDzH)
- [Journey to Machine Learning – K-Means Clustering](https://www.linkedin.com/pulse/all-cheatsheets-one-place-vipul-patel/) [LinkedIn Post](https://www.linkedin.com/posts/vipulppatel_data-analytics-businessintelligence-activity-6640085732100710400-oGp7)
- [Comparison of Segmentation Approaches using Clustering (9 pages)](https://www.linkedin.com/feed/update/urn:li:activity:6540091805428518912?lipi=urn%3Ali%3Apage%3Ad_flagship3_pulse_read%3BmoauZl5XRFyXpGV91RiG2w%3D%3D)
- [Guide to HIERARCHICAL Clustering (23 pages) and how to Perform it in Python](https://www.linkedin.com/feed/update/urn:li:activity:6539263090955997184/)
- [K-Means Clustering — One rule to group them all](https://www.linkedin.com/posts/towards-data-science_k-means-clusteringone-rule-to-group-them-activity-6654401590067245056-9k9d)
- [10 Clustering Algorithms With Python](https://machinelearningmastery.com/clustering-algorithms-with-python/)
- [Finding organic clusters in complex data-networks](https://www.datasciencecentral.com/profiles/blogs/finding-organic-clusters-in-complex-data-networks) [LinkedIn Post](https://www.linkedin.com/posts/data-science-central_finding-organic-clusters-in-complex-data-networks-activity-6650907272413274112-p-H7)
- [How to Develop an Encoder-Decoder Model for Sequence-to-Sequence Prediction in Keras](https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/)

### Outliers

5 changes: 5 additions & 0 deletions data/data-generation.md
@@ -30,6 +30,11 @@
### Generate data from existing

- [SMOTE with Imbalance Data](https://www.kaggle.com/qianchao/smote-with-imbalance-data)
- SMOTE library:
  - [PyPI](https://pypi.org/search/?q=smote&o=-zscore)
  - [Docs](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html)
- [SMOTE explained](http://rikunert.com/SMOTE_explained)
- [ML | Handling Imbalanced Data with SMOTE and Near Miss Algorithm in Python](https://www.geeksforgeeks.org/ml-handling-imbalanced-data-with-smote-and-near-miss-algorithm-in-python/)
- [imbalanced-learn library](https://imbalanced-learn.readthedocs.io/en/stable/introduction.html)
- [SO discussion on using Python libraries](https://stackoverflow.com/questions/51322554/smote-with-missing-values)
- [Simple example of how stock prices can be generated](https://stackoverflow.com/questions/8597731/are-there-known-techniques-to-generate-realistic-looking-fake-stock-data)
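
The SMOTE entries in this list describe oversampling a minority class by interpolating between existing samples. The block below is a simplified sketch of that core idea using only the standard library. It is not the imbalanced-learn implementation: the function name `smote`, its parameters, and the sample points are hypothetical, and it assumes Python 3.8+ for `math.dist`.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create `n_new` synthetic points from a list of minority-class points.

    Each synthetic point lies on the line segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Exclude the base point itself (identity check, not value equality).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class samples with two features each.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6)
print(new_points)
```

Because every synthetic point is an interpolation, it always falls inside the bounding box of the original minority samples. For real projects, the `SMOTE` class documented in the imbalanced-learn links above is the safer choice.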
7 changes: 7 additions & 0 deletions data/data-preparation.md
@@ -19,6 +19,11 @@
- [Working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)
- [Missing values in a dataset](https://www.datasciencecentral.com/profiles/blogs/how-to-treat-missing-values-in-your-data-1)

### Imbalanced data

- [Develop a Model for the Imbalanced Classification of Good and Bad Credit](https://machinelearningmastery.com/imbalanced-classification-of-good-and-bad-credit/)
- [One-Class Classification Algorithms for Imbalanced Datasets](https://machinelearningmastery.com/one-class-classification-algorithms/)
- [Step-By-Step Framework for Imbalanced Classification Projects](https://machinelearningmastery.com/framework-for-imbalanced-classification-projects/)

## Data preprocessing / Data wrangling / Data manipulation

@@ -33,6 +38,8 @@
- [SQLAlchemy](https://lnkd.in/gjvbm7h)
- [Extract online data (code in Java)](https://www.datasciencecentral.com/profiles/blogs/java-coding-sample-to-extract-online-data)
- [Data preparation for factor analysis](https://www.linkedin.com/posts/data-science-central_data-preparation-for-factor-analysis-activity-6608507889915092992-T1Vv)
- [7 Pandas Functions to Reduce Your Data Manipulation Stress by Andre Ye](https://towardsdatascience.com/7-pandas-functions-to-reduce-your-data-manipulation-stress-25981e44cc7d) [LinkedIn Post](https://www.linkedin.com/posts/towards-data-science_7-pandas-functions-to-reduce-your-data-manipulation-activity-6655006784069214208-R9Zn)


### Scaling and normalisation

11 changes: 11 additions & 0 deletions data/databases.md
@@ -26,6 +26,17 @@
- [Time-scale](https://www.timescale.com/)
- [kdb+](https://en.wikipedia.org/wiki/Kdb%2B) - is a column-based relational time series database (TSDB) with in-memory (IMDB) abilities, developed and marketed by [Kx Systems](https://kx.com/)

### Posts on Graph/Graph Networks/Graph Databases

- Technical comparison between the two most popular knowledge bases, #Grakn and #neo4j: [Part 1/2](https://towardsdatascience.com/neo4j-vs-grakn-part-i-basics-f2fe3511ce88) [Part 2/2](https://towardsdatascience.com/neo4j-vs-grakn-part-ii-semantics-11a0847ae7a2) [LinkedIn Post](https://www.linkedin.com/posts/duygu-altinok-4021389a_neo4j-vs-grakn-part-i-basics-activity-6638014291217793024-pdnV)
- [Information Extraction from Receipts with Graph Convolutional Networks and how to implement it](https://www.linkedin.com/posts/philipvollet_machinelearning-python-dataengineer-activity-6636513160427786240-Bkk1)
- [Intro to Graph Representation Learning](https://www.linkedin.com/posts/montrealai_pytorch-graph-representationlearning-activity-6637936272298033152-MXlA)
- [Memory-Based Graph Networks](https://deepai.org/publication/memory-based-graph-networks)
- [Universal Invariant and Equivariant Graph Neural Networks](https://www.linkedin.com/posts/eric-feuilleaubois-ph-d-43ab0925_universal-invariant-and-equivariant-graph-activity-6636212749133246464-_xqf)
- [Auto-Generated KG](https://www.linkedin.com/posts/bo-li-8503b896_auto-generated-knowledge-graphs-activity-6637543428051828736-jVdT)
- [Graph Convolutional Neural Networks for Molecule Generation | NTU Graph Deep Learning Lab](https://www.linkedin.com/posts/eric-feuilleaubois-ph-d-43ab0925_graph-convolutional-neural-networks-for-molecule-activity-6640244313009737728-IdCP)


## Misc.
- [Difference between JOIN and UNION in SQL](https://www.geeksforgeeks.org/difference-between-join-and-union-in-sql/)
- [Difference between COMMIT and ROLLBACK in SQL](https://www.geeksforgeeks.org/difference-between-commit-and-rollback-in-sql/)
6 changes: 5 additions & 1 deletion data/datasets.md
@@ -26,7 +26,10 @@
## Clean / ready-to-use datasets

- [Boston Housing Dataset (archive contains clean dataset)](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/tag/v0.1) | [Download](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/download/v0.1/boston_housing_dataset.zip)
- [Google Dataset Search](https://toolbox.google.com/datasetsearch)
- Google Dataset Search:
  - https://toolbox.google.com/datasetsearch
  - https://blog.google/products/search/discovering-millions-datasets-web
  - https://datasetsearch.research.google.com
- [Kaggle Datasets](https://www.kaggle.com/datasets) | Blogs: [1](https://towardsdatascience.com/interesting-datasets-on-kaggle-com-3a4a250b0b85) [2](http://blog.kaggle.com/2016/01/19/introducing-kaggle-datasets/) [3](https://medium.com/@benhamner/introducing-kaggle-datasets-a935f9f76f5) [4](https://stackoverflow.com/questions/52681196/kaggle-datasets-into-jupyter-notebook)
- [Carnegie Mellon University Datasets](http://lib.stat.cmu.edu/datasets/)
- [GeoPlatform Data.gov Search ](https://data.geoplatform.gov/)
@@ -42,6 +45,7 @@
- [Surprising Uses of Synthetic Random Data Sets](https://www.linkedin.com/posts/data-science-central_surprising-uses-of-synthetic-random-data-activity-6612404601515765760-J0AY)
- [How to find datasets for Artificial Intelligence training](https://medium.com/shallow-thoughts-about-deep-learning/how-to-find-datasets-for-artificial-intelligence-9131b2e72e88?fbclid=IwAR1up1xYvKUX4-7DJFs62hTqrfhfLuY9TdNXK56mnmTiUocvv0hgPj6vf4k)
- [Great Github list of public data sets](https://www.linkedin.com/posts/data-science-central_great-github-list-of-public-data-sets-activity-6620739516317646849-YMxO)
- [Sklearn provides direct access to openml datasets which hosts around 20,000 datasets and you can access it directly in your python code](https://lnkd.in/g-YYFay) [LinkedIn Post](https://www.linkedin.com/posts/srivatsan-srinivasan-b8131b_datascience-machinelearning-ml-activity-6653512803644768256-w1mM)

## Courses

3 changes: 3 additions & 0 deletions data/feature-engineering.md
@@ -11,10 +11,13 @@
- Chi2 test: Feature selection: [Quora](https://www.quora.com/How-is-chi-test-used-for-feature-selection-in-machine-learning) | [NLP Stanford Group](https://nlp.stanford.edu/IR-book/html/htmledition/feature-selectionchi2-feature-selection-1.html) | [Learn for Master](http://www.learn4master.com/machine-learning/chi-square-test-for-feature-selection)
- [Feature engineering and Dimensionality reduction](https://towardsdatascience.com/dimensionality-reduction-for-machine-learning-80a46c2ebb7e)
- [Seven Techniques for Data Dimensionality Reduction](https://www.kdnuggets.com/2015/05/7-methods-data-dimensionality-reduction.html)
- [Accelerating TSNE with GPUs: From hours to seconds](https://www.linkedin.com/posts/montrealai_machinelearning-datavisualization-datascience-activity-6628828524566331392-Cua_)
- [Feature Engineering and Feature Selection](https://media.licdn.com/dms/document/C511FAQF45u2wk4WYKQ/feedshare-document-pdf-analyzed/0?e=1570834800&v=beta&t=lNVqtm3JJYvvPHpsl0uc6mZJjVGWgJ8Toz29tNJA4GI) [deadlink]
- [Hands-on Guide to Automated Feature Engineering - Prateek Joshi](https://www.linkedin.com/posts/vipulppatel_hands-on-guide-to-automated-feature-engineering-ugcPost-6612564773705924608-Utyb)
- [Feature Engineering and Selection](https://www.linkedin.com/posts/nabihbawazir_feature-engineering-and-selection-ugcPost-6603534412548280320-XTIX)
- [What is feature engineering and why do we need it?](https://www.linkedin.com/posts/srivatsan-srinivasan-b8131b_datascience-machinelearning-ml-activity-6623556433189363712-O7c4)
- [FEATURE-ENGINE: AN OPEN SOURCE PYTHON PACKAGE TO CREATE REPRODUCIBLE FEATURE ENGINEERING STEPS AND SMOOTH MODEL DEPLOYMENT](https://www.trainindata.com/feature-engine)
- [Feature Engineering with Tidyverse](https://www.datasciencecentral.com/profiles/blogs/feature-engineering-with-tidyverse) [LinkedIn Post](https://www.linkedin.com/posts/data-science-central_feature-engineering-with-tidyverse-activity-6645714064209166337-4szB)
- [ML topics expanded by Chris Albon](https://chrisalbon.com/#machine_learning) - look for topics: Feature Engineering • Feature Selection

# Contributing
2 changes: 2 additions & 0 deletions data/feature-selection.md
@@ -5,6 +5,8 @@
- [What is dimensionality reduction? What is the difference between feature selection and extraction?](https://datascience.stackexchange.com/questions/130/what-is-dimensionality-reduction-what-is-the-difference-between-feature-selecti)
- [Feature Engineering and Feature Selection](https://media.licdn.com/dms/document/C511FAQF45u2wk4WYKQ/feedshare-document-pdf-analyzed/0?e=1570834800&v=beta&t=lNVqtm3JJYvvPHpsl0uc6mZJjVGWgJ8Toz29tNJA4GI) [deadlink]
- [Feature Selection Techniques in Machine Learning with Python - Raheel Shaikh](https://www.linkedin.com/posts/vipulppatel_feature-selection-techniques-in-ml-with-python-ugcPost-6603482535081062400-3ZH9)
- [Fast Combinatorial Feature Selection with New Definition of Predictive Power](https://www.datasciencecentral.com/profiles/blogs/feature-selection-based-on-predictive-power) [Tweet](https://twitter.com/analyticbridge/status/1237759942544822272)


# Contributing
