Splitting data/README into datasets.md and data-preparation.md

UmaGunturi · Nov 3, 2019 · 33a0e89
1 parent 242d67d

Showing 3 changed files with 72 additions and 43 deletions.
diff --git a/data/README.md b/data/README.md
@@ -41,31 +41,7 @@ See [Ethics / altruistic motives](../README-details.md#ethics--altruistic-motive
 
 ## Datasets and sources of raw data
 
-### Raw / unclean datasets
-
-- [Boston Housing Dataset (archive contains unclean dataset)](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/tag/v0.1) | [Download](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/download/v0.1/boston_housing_dataset.zip)
-- [Datasets for Data cleaning practise](https://makingnoiseandhearingthings.com/2018/04/19/datasets-for-data-cleaning-practice/)
-- [Cleaning up and combining data, a dataset for practice](https://www.r-bloggers.com/cleaning-up-and-combining-data-a-dataset-for-practice/)
-- [Datasets from various themes and domains: retail, government. Datasets with a good mix of incorrect, wrongly-input, missing data](https://www.ud-intl.com/dataset)
-- [(Specific) Web Traffic Time Series Forecasting](https://www.kaggle.com/c/web-traffic-time-series-forecasting)
-- [Quora: What are some dirty/untidy datasets to clean for data analysis? I have just finished datacamp's course on cleaning data using R. I want to practice](https://www.quora.com/What-are-some-dirty-untidy-datasets-to-clean-for-data-analysis-I-have-just-finished-datacamps-course-on-cleaning-data-using-R-I-want-to-practice)
-
-### Clean / ready-to-use datasets
-
-- [Boston Housing Dataset (archive contains clean dataset)](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/tag/v0.1) | [Download](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/download/v0.1/boston_housing_dataset.zip)
-- [Google Dataset Search](https://toolbox.google.com/datasetsearch)
-- [Kaggle Datasets](https://www.kaggle.com/datasets) | Blogs: [1](https://towardsdatascience.com/interesting-datasets-on-kaggle-com-3a4a250b0b85) [2](http://blog.kaggle.com/2016/01/19/introducing-kaggle-datasets/) [3](https://medium.com/@benhamner/introducing-kaggle-datasets-a935f9f76f5) [4](https://stackoverflow.com/questions/52681196/kaggle-datasets-into-jupyter-notebook)
-- [Carnegie Mellon University Datasets](http://lib.stat.cmu.edu/datasets/)
-- [GeoPlatform Data.gov Search ](https://data.geoplatform.gov/)
-- [Data.gov - Data Catalog](https://catalog.data.gov/dataset)
-- [TidyTuesday projects on GitHub](https://github.com/rfordatascience/tidytuesday)
-- [Enron Email Digest Dataset](https://www.cs.cmu.edu/~enron/)
-- [Mathematics Datasets](https://github.com/deepmind/mathematics_dataset)
-- [Data.world datasets](https://data.world)
-- [Microsoft Research Open Data](https://msropendata.com/)
-- [Free Datasets recommended by r-directory](https://r-dir.com/reference/datasets.html)
-- [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
-- [Google Research: A large-scale dataset of manually annotated audio events](https://research.google.com/audioset/index.html)
+See [Datasets](./datasets.md)
 
 ## Data Exploratory Analysis
 
@@ -84,24 +60,7 @@ See [Ethics / altruistic motives](../README-details.md#ethics--altruistic-motive
 
 ## Data preparation
 
-### Data cleaning
-
-- [Data cleaning](https://elitedatascience.com/data-cleaning)
-- [Spend Less Time Cleaning Data with Machine Learning](https://www.dataversity.net/spend-less-time-cleaning-data-with-machine-learning/#)
-- [Helpful Python Code Snippets for Data Exploration in Pandas](https://medium.com/@msalmon00/helpful-python-code-snippets-for-data-exploration-in-pandas-b7c5aed5ecb9) - lots of python snippets to select / clean / prepare
-- [Working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)
-- [Journal of Statistical Software - TidyData](https://www.jstatsoft.org/article/view/v059i10/)
-- [5 Steps to correctly prepare your data for your machine learning model](https://towardsdatascience.com/5-steps-to-correctly-prep-your-data-for-your-machine-learning-model-c06c24762b73?gi=6b4a6895ab1)
-- [Introduction to Data Analysis and Cleaning presentation](../presentations/data/Introduction_to_Data_Analysis_and_Cleaning.pdf) by [Mark Bell](http://www.nationalarchives.gov.uk/about/our-research-and-academic-collaboration/our-research-and-people/staff-profiles/mark-bell/)
-
-### Data preprocessing / Data wrangling / Data manipulation
-
-- [Data Preprocessing vs. Data Wrangling in Machine Learning Projects](https://www.infoq.com/articles/ml-data-processing)
-- [Improve Model Accuracy with Data Pre-Processing](https://machinelearningmastery.com/improve-model-accuracy-with-data-pre-processing/)
-- [5 Steps to correctly prepare your data for your machine learning model](https://towardsdatascience.com/5-steps-to-correctly-prep-your-data-for-your-machine-learning-model-c06c24762b73?gi=6b4a6895ab1)
-- [Introduction to Data Analysis and Cleaning presentation](../presentations/data/Introduction_to_Data_Analysis_and_Cleaning.pdf) by [Mark Bell](http://www.nationalarchives.gov.uk/about/our-research-and-academic-collaboration/our-research-and-people/staff-profiles/mark-bell/)
-- [Pandas](https://lnkd.in/gxSgfuQ)
-- [SQLAlchemy](https://lnkd.in/gjvbm7h)
+See [Data preparation](./data-preparation.md)
 
 ### Misc
 

diff --git a/data/data-preparation.md b/data/data-preparation.md
@@ -0,0 +1,31 @@
+# Data preparation
+
+## Data cleaning
+
+- [Data cleaning](https://elitedatascience.com/data-cleaning)
+- [Spend Less Time Cleaning Data with Machine Learning](https://www.dataversity.net/spend-less-time-cleaning-data-with-machine-learning/#)
+- [Helpful Python Code Snippets for Data Exploration in Pandas](https://medium.com/@msalmon00/helpful-python-code-snippets-for-data-exploration-in-pandas-b7c5aed5ecb9) - lots of python snippets to select / clean / prepare
+- [Working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)
+- [Journal of Statistical Software - TidyData](https://www.jstatsoft.org/article/view/v059i10/)
+- [5 Steps to correctly prepare your data for your machine learning model](https://towardsdatascience.com/5-steps-to-correctly-prep-your-data-for-your-machine-learning-model-c06c24762b73?gi=6b4a6895ab1)
+- [Introduction to Data Analysis and Cleaning presentation](../presentations/data/Introduction_to_Data_Analysis_and_Cleaning.pdf) by [Mark Bell](http://www.nationalarchives.gov.uk/about/our-research-and-academic-collaboration/our-research-and-people/staff-profiles/mark-bell/)
+
+## Data preprocessing / Data wrangling / Data manipulation
+
+- [Data Preprocessing vs. Data Wrangling in Machine Learning Projects](https://www.infoq.com/articles/ml-data-processing)
+- [Improve Model Accuracy with Data Pre-Processing](https://machinelearningmastery.com/improve-model-accuracy-with-data-pre-processing/)
+- [5 Steps to correctly prepare your data for your machine learning model](https://towardsdatascience.com/5-steps-to-correctly-prep-your-data-for-your-machine-learning-model-c06c24762b73?gi=6b4a6895ab1)
+- [Introduction to Data Analysis and Cleaning presentation](../presentations/data/Introduction_to_Data_Analysis_and_Cleaning.pdf) by [Mark Bell](http://www.nationalarchives.gov.uk/about/our-research-and-academic-collaboration/our-research-and-people/staff-profiles/mark-bell/)
+- [Pandas](https://lnkd.in/gxSgfuQ)
+- [SQLAlchemy](https://lnkd.in/gjvbm7h)
+
+# Contributing
+
+Contributions are very welcome, please share back with the wider community (and get credited for it)!
+
+Please have a look at the [CONTRIBUTING](CONTRIBUTING.md) guidelines, also have a read about our [licensing](LICENSE.md) policy.
+
+---
+
+Back to [Data page (table of contents)](README.md)</br>
+Back to [main page (table of contents)](../README.md)
diff --git a/data/datasets.md b/data/datasets.md
@@ -0,0 +1,39 @@
+# Datasets and sources of raw data
+
+## Raw / unclean datasets
+
+- [Boston Housing Dataset (archive contains unclean dataset)](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/tag/v0.1) | [Download](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/download/v0.1/boston_housing_dataset.zip)
+- [Datasets for Data cleaning practise](https://makingnoiseandhearingthings.com/2018/04/19/datasets-for-data-cleaning-practice/)
+- [Cleaning up and combining data, a dataset for practice](https://www.r-bloggers.com/cleaning-up-and-combining-data-a-dataset-for-practice/)
+- [Datasets from various themes and domains: retail, government. Datasets with a good mix of incorrect, wrongly-input, missing data](https://www.ud-intl.com/dataset)
+- [(Specific) Web Traffic Time Series Forecasting](https://www.kaggle.com/c/web-traffic-time-series-forecasting)
+- [Quora: What are some dirty/untidy datasets to clean for data analysis? I have just finished datacamp's course on cleaning data using R. I want to practice](https://www.quora.com/What-are-some-dirty-untidy-datasets-to-clean-for-data-analysis-I-have-just-finished-datacamps-course-on-cleaning-data-using-R-I-want-to-practice)
+
+## Clean / ready-to-use datasets
+
+- [Boston Housing Dataset (archive contains clean dataset)](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/tag/v0.1) | [Download](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/download/v0.1/boston_housing_dataset.zip)
+- [Google Dataset Search](https://toolbox.google.com/datasetsearch)
+- [Kaggle Datasets](https://www.kaggle.com/datasets) | Blogs: [1](https://towardsdatascience.com/interesting-datasets-on-kaggle-com-3a4a250b0b85) [2](http://blog.kaggle.com/2016/01/19/introducing-kaggle-datasets/) [3](https://medium.com/@benhamner/introducing-kaggle-datasets-a935f9f76f5) [4](https://stackoverflow.com/questions/52681196/kaggle-datasets-into-jupyter-notebook)
+- [Carnegie Mellon University Datasets](http://lib.stat.cmu.edu/datasets/)
+- [GeoPlatform Data.gov Search ](https://data.geoplatform.gov/)
+- [Data.gov - Data Catalog](https://catalog.data.gov/dataset)
+- [TidyTuesday projects on GitHub](https://github.com/rfordatascience/tidytuesday)
+- [Enron Email Digest Dataset](https://www.cs.cmu.edu/~enron/)
+- [Mathematics Datasets](https://github.com/deepmind/mathematics_dataset)
+- [Data.world datasets](https://data.world)
+- [Microsoft Research Open Data](https://msropendata.com/)
+- [Free Datasets recommended by r-directory](https://r-dir.com/reference/datasets.html)
+- [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
+- [Google Research: A large-scale dataset of manually annotated audio events](https://research.google.com/audioset/index.html)
+
+
+# Contributing
+
+Contributions are very welcome, please share back with the wider community (and get credited for it)!
+
+Please have a look at the [CONTRIBUTING](CONTRIBUTING.md) guidelines, also have a read about our [licensing](LICENSE.md) policy.
+
+---
+
+Back to [Data page (table of contents)](README.md)</br>
+Back to [main page (table of contents)](../README.md)