forked from neomatrix369/awesome-ai-ml-dl
Splitting data/README into datasets.md and data-preparation.md
1 parent 242d67d, commit 33a0e89
Showing 3 changed files with 72 additions and 43 deletions.
@@ -0,0 +1,31 @@
# Data preparation

## Data cleaning

- [Data cleaning](https://elitedatascience.com/data-cleaning)
- [Spend Less Time Cleaning Data with Machine Learning](https://www.dataversity.net/spend-less-time-cleaning-data-with-machine-learning/#)
- [Helpful Python Code Snippets for Data Exploration in Pandas](https://medium.com/@msalmon00/helpful-python-code-snippets-for-data-exploration-in-pandas-b7c5aed5ecb9) - lots of Python snippets to select, clean, and prepare data
- [Working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)
- [Journal of Statistical Software - Tidy Data](https://www.jstatsoft.org/article/view/v059i10/)
- [5 Steps to correctly prepare your data for your machine learning model](https://towardsdatascience.com/5-steps-to-correctly-prep-your-data-for-your-machine-learning-model-c06c24762b73?gi=6b4a6895ab1)
- [Introduction to Data Analysis and Cleaning presentation](../presentations/data/Introduction_to_Data_Analysis_and_Cleaning.pdf) by [Mark Bell](http://www.nationalarchives.gov.uk/about/our-research-and-academic-collaboration/our-research-and-people/staff-profiles/mark-bell/)

## Data preprocessing / Data wrangling / Data manipulation

- [Data Preprocessing vs. Data Wrangling in Machine Learning Projects](https://www.infoq.com/articles/ml-data-processing)
- [Improve Model Accuracy with Data Pre-Processing](https://machinelearningmastery.com/improve-model-accuracy-with-data-pre-processing/)
- [5 Steps to correctly prepare your data for your machine learning model](https://towardsdatascience.com/5-steps-to-correctly-prep-your-data-for-your-machine-learning-model-c06c24762b73?gi=6b4a6895ab1)
- [Introduction to Data Analysis and Cleaning presentation](../presentations/data/Introduction_to_Data_Analysis_and_Cleaning.pdf) by [Mark Bell](http://www.nationalarchives.gov.uk/about/our-research-and-academic-collaboration/our-research-and-people/staff-profiles/mark-bell/)
- [Pandas](https://lnkd.in/gxSgfuQ)
- [SQLAlchemy](https://lnkd.in/gjvbm7h)

# Contributing

Contributions are very welcome; please share back with the wider community (and get credited for it)!

Please have a look at the [CONTRIBUTING](CONTRIBUTING.md) guidelines, and also read about our [licensing](LICENSE.md) policy.

---

Back to [Data page (table of contents)](README.md)<br/>
Back to [main page (table of contents)](../README.md)
@@ -0,0 +1,39 @@
# Datasets and sources of raw data

## Raw / unclean datasets

- [Boston Housing Dataset (archive contains unclean dataset)](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/tag/v0.1) | [Download](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/download/v0.1/boston_housing_dataset.zip)
- [Datasets for Data cleaning practice](https://makingnoiseandhearingthings.com/2018/04/19/datasets-for-data-cleaning-practice/)
- [Cleaning up and combining data, a dataset for practice](https://www.r-bloggers.com/cleaning-up-and-combining-data-a-dataset-for-practice/)
- [Datasets from various themes and domains (retail, government), with a good mix of incorrect, wrongly entered, and missing data](https://www.ud-intl.com/dataset)
- [(Specific) Web Traffic Time Series Forecasting](https://www.kaggle.com/c/web-traffic-time-series-forecasting)
- [Quora: What are some dirty/untidy datasets to clean for data analysis? I have just finished datacamp's course on cleaning data using R. I want to practice](https://www.quora.com/What-are-some-dirty-untidy-datasets-to-clean-for-data-analysis-I-have-just-finished-datacamps-course-on-cleaning-data-using-R-I-want-to-practice)

## Clean / ready-to-use datasets

- [Boston Housing Dataset (archive contains clean dataset)](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/tag/v0.1) | [Download](https://github.com/neomatrix369/awesome-ai-ml-dl/releases/download/v0.1/boston_housing_dataset.zip)
- [Google Dataset Search](https://toolbox.google.com/datasetsearch)
- [Kaggle Datasets](https://www.kaggle.com/datasets) | Blogs: [1](https://towardsdatascience.com/interesting-datasets-on-kaggle-com-3a4a250b0b85) [2](http://blog.kaggle.com/2016/01/19/introducing-kaggle-datasets/) [3](https://medium.com/@benhamner/introducing-kaggle-datasets-a935f9f76f5) [4](https://stackoverflow.com/questions/52681196/kaggle-datasets-into-jupyter-notebook)
- [Carnegie Mellon University Datasets](http://lib.stat.cmu.edu/datasets/)
- [GeoPlatform Data.gov Search](https://data.geoplatform.gov/)
- [Data.gov - Data Catalog](https://catalog.data.gov/dataset)
- [TidyTuesday projects on GitHub](https://github.com/rfordatascience/tidytuesday)
- [Enron Email Digest Dataset](https://www.cs.cmu.edu/~enron/)
- [Mathematics Datasets](https://github.com/deepmind/mathematics_dataset)
- [Data.world datasets](https://data.world)
- [Microsoft Research Open Data](https://msropendata.com/)
- [Free Datasets recommended by r-directory](https://r-dir.com/reference/datasets.html)
- [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
- [Google Research: A large-scale dataset of manually annotated audio events](https://research.google.com/audioset/index.html)


# Contributing

Contributions are very welcome; please share back with the wider community (and get credited for it)!

Please have a look at the [CONTRIBUTING](CONTRIBUTING.md) guidelines, and also read about our [licensing](LICENSE.md) policy.

---

Back to [Data page (table of contents)](README.md)<br/>
Back to [main page (table of contents)](../README.md)