Splitting data/README into datasets.md and data-preparation.md
neomatrix369 committed Nov 3, 2019
1 parent 242d67d commit 33a0e89
Showing 3 changed files with 72 additions and 43 deletions.
45 changes: 2 additions & 43 deletions data/README.md
@@ -41,31 +41,7 @@ See [Ethics / altruistic motives](../README-details.md#ethics--altruistic-motive

## Datasets and sources of raw data

### Raw / unclean datasets

- [Boston Housing Dataset (archive contains unclean dataset)](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/tag/v0.1) | [Download](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/download/v0.1/boston_housing_dataset.zip)
- [Datasets for Data cleaning practise](https://makingnoiseandhearingthings.com/2018/04/19/datasets-for-data-cleaning-practice/)
- [Cleaning up and combining data, a dataset for practice](https://www.r-bloggers.com/cleaning-up-and-combining-data-a-dataset-for-practice/)
- [Datasets from various themes and domains: retail, government. Datasets with a good mix of incorrect, wrongly-input, missing data](https://www.ud-intl.com/dataset)
- [(Specific) Web Traffic Time Series Forecasting](https://www.kaggle.com/c/web-traffic-time-series-forecasting)
- [Quora: What are some dirty/untidy datasets to clean for data analysis? I have just finished datacamp's course on cleaning data using R. I want to practice](https://www.quora.com/What-are-some-dirty-untidy-datasets-to-clean-for-data-analysis-I-have-just-finished-datacamps-course-on-cleaning-data-using-R-I-want-to-practice)

### Clean / ready-to-use datasets

- [Boston Housing Dataset (archive contains clean dataset)](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/tag/v0.1) | [Download](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/download/v0.1/boston_housing_dataset.zip)
- [Google Dataset Search](https://toolbox.google.com/datasetsearch)
- [Kaggle Datasets](https://www.kaggle.com/datasets) | Blogs: [1](https://towardsdatascience.com/interesting-datasets-on-kaggle-com-3a4a250b0b85) [2](http://blog.kaggle.com/2016/01/19/introducing-kaggle-datasets/) [3](https://medium.com/@benhamner/introducing-kaggle-datasets-a935f9f76f5) [4](https://stackoverflow.com/questions/52681196/kaggle-datasets-into-jupyter-notebook)
- [Carnegie Mellon University Datasets](http://lib.stat.cmu.edu/datasets/)
- [GeoPlatform Data.gov Search ](https://data.geoplatform.gov/)
- [Data.gov - Data Catalog](https://catalog.data.gov/dataset)
- [TidyTuesday projects on GitHub](https://github.com/rfordatascience/tidytuesday)
- [Enron Email Digest Dataset](https://www.cs.cmu.edu/~enron/)
- [Mathematics Datasets](https://github.com/deepmind/mathematics_dataset)
- [Data.world datasets](https://data.world)
- [Microsoft Research Open Data](https://msropendata.com/)
- [Free Datasets recommended by r-directory](https://r-dir.com/reference/datasets.html)
- [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
- [Google Research: A large-scale dataset of manually annotated audio events](https://research.google.com/audioset/index.html)
See [Datasets](./datasets.md)

## Data Exploratory Analysis

@@ -84,24 +60,7 @@ See [Ethics / altruistic motives](../README-details.md#ethics--altruistic-motive

## Data preparation

### Data cleaning

- [Data cleaning](https://elitedatascience.com/data-cleaning)
- [Spend Less Time Cleaning Data with Machine Learning](https://www.dataversity.net/spend-less-time-cleaning-data-with-machine-learning/#)
- [Helpful Python Code Snippets for Data Exploration in Pandas](https://medium.com/@msalmon00/helpful-python-code-snippets-for-data-exploration-in-pandas-b7c5aed5ecb9) - lots of python snippets to select / clean / prepare
- [Working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)
- [Journal of Statistical Software - TidyData](https://www.jstatsoft.org/article/view/v059i10/)
- [5 Steps to correctly prepare your data for your machine learning model](https://towardsdatascience.com/5-steps-to-correctly-prep-your-data-for-your-machine-learning-model-c06c24762b73?gi=6b4a6895ab1)
- [Introduction to Data Analysis and Cleaning presentation](../presentations/data/Introduction_to_Data_Analysis_and_Cleaning.pdf) by [Mark Bell](http://www.nationalarchives.gov.uk/about/our-research-and-academic-collaboration/our-research-and-people/staff-profiles/mark-bell/)

### Data preprocessing / Data wrangling / Data manipulation

- [Data Preprocessing vs. Data Wrangling in Machine Learning Projects](https://www.infoq.com/articles/ml-data-processing)
- [Improve Model Accuracy with Data Pre-Processing](https://machinelearningmastery.com/improve-model-accuracy-with-data-pre-processing/)
- [5 Steps to correctly prepare your data for your machine learning model](https://towardsdatascience.com/5-steps-to-correctly-prep-your-data-for-your-machine-learning-model-c06c24762b73?gi=6b4a6895ab1)
- [Introduction to Data Analysis and Cleaning presentation](../presentations/data/Introduction_to_Data_Analysis_and_Cleaning.pdf) by [Mark Bell](http://www.nationalarchives.gov.uk/about/our-research-and-academic-collaboration/our-research-and-people/staff-profiles/mark-bell/)
- [Pandas](https://lnkd.in/gxSgfuQ)
- [SQLAlchemy](https://lnkd.in/gjvbm7h)
See [Data preparation](./data-preparation.md)

### Misc

31 changes: 31 additions & 0 deletions data/data-preparation.md
@@ -0,0 +1,31 @@
# Data preparation

## Data cleaning

- [Data cleaning](https://elitedatascience.com/data-cleaning)
- [Spend Less Time Cleaning Data with Machine Learning](https://www.dataversity.net/spend-less-time-cleaning-data-with-machine-learning/#)
- [Helpful Python Code Snippets for Data Exploration in Pandas](https://medium.com/@msalmon00/helpful-python-code-snippets-for-data-exploration-in-pandas-b7c5aed5ecb9) - lots of python snippets to select / clean / prepare
- [Working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)
- [Journal of Statistical Software - TidyData](https://www.jstatsoft.org/article/view/v059i10/)
- [5 Steps to correctly prepare your data for your machine learning model](https://towardsdatascience.com/5-steps-to-correctly-prep-your-data-for-your-machine-learning-model-c06c24762b73?gi=6b4a6895ab1)
- [Introduction to Data Analysis and Cleaning presentation](../presentations/data/Introduction_to_Data_Analysis_and_Cleaning.pdf) by [Mark Bell](http://www.nationalarchives.gov.uk/about/our-research-and-academic-collaboration/our-research-and-people/staff-profiles/mark-bell/)

## Data preprocessing / Data wrangling / Data manipulation

- [Data Preprocessing vs. Data Wrangling in Machine Learning Projects](https://www.infoq.com/articles/ml-data-processing)
- [Improve Model Accuracy with Data Pre-Processing](https://machinelearningmastery.com/improve-model-accuracy-with-data-pre-processing/)
- [5 Steps to correctly prepare your data for your machine learning model](https://towardsdatascience.com/5-steps-to-correctly-prep-your-data-for-your-machine-learning-model-c06c24762b73?gi=6b4a6895ab1)
- [Introduction to Data Analysis and Cleaning presentation](../presentations/data/Introduction_to_Data_Analysis_and_Cleaning.pdf) by [Mark Bell](http://www.nationalarchives.gov.uk/about/our-research-and-academic-collaboration/our-research-and-people/staff-profiles/mark-bell/)
- [Pandas](https://lnkd.in/gxSgfuQ)
- [SQLAlchemy](https://lnkd.in/gjvbm7h)

# Contributing

Contributions are very welcome, please share back with the wider community (and get credited for it)!

Please have a look at the [CONTRIBUTING](CONTRIBUTING.md) guidelines, also have a read about our [licensing](LICENSE.md) policy.

---

Back to [Data page (table of contents)](README.md)</br>
Back to [main page (table of contents)](../README.md)
39 changes: 39 additions & 0 deletions data/datasets.md
@@ -0,0 +1,39 @@
# Datasets and sources of raw data

## Raw / unclean datasets

- [Boston Housing Dataset (archive contains unclean dataset)](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/tag/v0.1) | [Download](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/download/v0.1/boston_housing_dataset.zip)
- [Datasets for Data cleaning practise](https://makingnoiseandhearingthings.com/2018/04/19/datasets-for-data-cleaning-practice/)
- [Cleaning up and combining data, a dataset for practice](https://www.r-bloggers.com/cleaning-up-and-combining-data-a-dataset-for-practice/)
- [Datasets from various themes and domains: retail, government. Datasets with a good mix of incorrect, wrongly-input, missing data](https://www.ud-intl.com/dataset)
- [(Specific) Web Traffic Time Series Forecasting](https://www.kaggle.com/c/web-traffic-time-series-forecasting)
- [Quora: What are some dirty/untidy datasets to clean for data analysis? I have just finished datacamp's course on cleaning data using R. I want to practice](https://www.quora.com/What-are-some-dirty-untidy-datasets-to-clean-for-data-analysis-I-have-just-finished-datacamps-course-on-cleaning-data-using-R-I-want-to-practice)

## Clean / ready-to-use datasets

- [Boston Housing Dataset (archive contains clean dataset)](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/tag/v0.1) | [Download](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/download/v0.1/boston_housing_dataset.zip)
- [Google Dataset Search](https://toolbox.google.com/datasetsearch)
- [Kaggle Datasets](https://www.kaggle.com/datasets) | Blogs: [1](https://towardsdatascience.com/interesting-datasets-on-kaggle-com-3a4a250b0b85) [2](http://blog.kaggle.com/2016/01/19/introducing-kaggle-datasets/) [3](https://medium.com/@benhamner/introducing-kaggle-datasets-a935f9f76f5) [4](https://stackoverflow.com/questions/52681196/kaggle-datasets-into-jupyter-notebook)
- [Carnegie Mellon University Datasets](http://lib.stat.cmu.edu/datasets/)
- [GeoPlatform Data.gov Search ](https://data.geoplatform.gov/)
- [Data.gov - Data Catalog](https://catalog.data.gov/dataset)
- [TidyTuesday projects on GitHub](https://github.com/rfordatascience/tidytuesday)
- [Enron Email Digest Dataset](https://www.cs.cmu.edu/~enron/)
- [Mathematics Datasets](https://github.com/deepmind/mathematics_dataset)
- [Data.world datasets](https://data.world)
- [Microsoft Research Open Data](https://msropendata.com/)
- [Free Datasets recommended by r-directory](https://r-dir.com/reference/datasets.html)
- [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
- [Google Research: A large-scale dataset of manually annotated audio events](https://research.google.com/audioset/index.html)


# Contributing

Contributions are very welcome, please share back with the wider community (and get credited for it)!

Please have a look at the [CONTRIBUTING](CONTRIBUTING.md) guidelines, also have a read about our [licensing](LICENSE.md) policy.

---

Back to [Data page (table of contents)](README.md)</br>
Back to [main page (table of contents)](../README.md)
