You may set up the scrubber either with conda (recommended) or manually.
- If you haven't already, install Miniconda.
- Run `git clone https://github.com/people3k/p3k14c-data-scrubbing && cd p3k14c-data-scrubbing`
- Run `conda env create -f environment.yml`
- Activate the environment with `conda activate c14scrub`
Alternatively, to set up manually:

- Install Python 3.7.6 along with pip.
- Install the required packages by running `pip install numpy pandas ftfy tqdm pyshp shapely matplotlib pyproj` in your command line. For more info on installing packages, see this tutorial.
To run the scrubber:

- Ensure that your raw records file is saved using UTF-8 encoding. Most CSV-handling programs can do this; a short re-encoding sketch also follows this list.
- Execute the program by running `python scrub.py in_file.csv out_file.csv` in the command line, where the input file is the name of the raw records file. The cleaned records will be saved to your specified filename, a list of unknown lab codes will be saved to `unknown_codes.csv`, and a list of all deleted records with their reason for removal will be saved to `graveyard.csv`.
Optionally, the graveyard path may be specified as the third parameter, e.g. `python scrub.py in_file.csv out_file.csv myGraveyard.csv`.
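If you are unsure of your raw file's encoding, re-encoding it from Python is straightforward. The following is a minimal sketch; the file names and the latin-1 source encoding are assumptions to adapt to your data:

```python
# Minimal sketch: re-encode a CSV to UTF-8 before scrubbing.
# "raw_records.csv" and the latin-1 source encoding are assumptions;
# substitute your actual file name and encoding.
with open("raw_records.csv", "r", encoding="latin-1") as src:
    text = src.read()

with open("raw_records_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(text)
```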
Fuzzing is required for all dates in the USA, Canada, and the GuedesBocinsky2018 dataset. We utilize the geoBoundaries 25% shapefile to obscure all date coordinates to Admin2 centroids (county centroids in the US, census division centroids in Canada, etc.). Run the fuzzer with `python fuzz/fuzz.py scrubbed_data.csv scrubbed_and_fuzzed_data.csv`. Additionally, you may visually verify the correctness of the fuzzing process by plotting the results with `python fuzz/visualize.py scrubbed_and_fuzzed_data.csv`.
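For intuition, the fuzzing step boils down to a point-in-polygon lookup followed by a centroid substitution. Below is an illustrative sketch using shapely and pyshp (both already in the dependency list); the shapefile path and the lookup structure are assumptions, not the exact logic of fuzz/fuzz.py:

```python
# Illustrative sketch of coordinate fuzzing: snap each point to the
# centroid of its containing Admin2 polygon. The shapefile path is
# hypothetical; see fuzz/fuzz.py for the actual implementation.
import shapefile  # pyshp
from shapely.geometry import Point, shape

sf = shapefile.Reader("geoBoundaries-ADM2.shp")  # assumed path
polygons = [shape(s.__geo_interface__) for s in sf.shapes()]

def fuzz_point(lon, lat):
    """Return the centroid of the Admin2 polygon containing the point."""
    pt = Point(lon, lat)
    for poly in polygons:
        if poly.contains(pt):
            c = poly.centroid
            return c.x, c.y
    return lon, lat  # leave unmatched points unchanged
```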
If you need to remove duplicate records from a dataset without running the entire scrubbing process on it, use removeDuplicates.py. Simply run `python removeDuplicates.py infile_name.csv outfile_name.csv` and the program will run only the duplicate removal subroutine on `infile_name.csv` and save the resulting dataset to `outfile_name.csv`.
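For a sense of what the subroutine does, the sketch below shows the general pandas approach to deduplication. Keying on a LabID column is an assumption; removeDuplicates.py may use different or additional matching criteria:

```python
# Rough sketch of the idea behind duplicate removal using pandas.
# Matching on a "LabID" column is an assumption, not necessarily
# the criteria used by removeDuplicates.py.
import sys
import pandas as pd

records = pd.read_csv(sys.argv[1])
deduped = records.drop_duplicates(subset=["LabID"], keep="first")
deduped.to_csv(sys.argv[2], index=False)
```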
Included in this package is a series of tools for fixing Unicode errors within the `SiteName` column. These tools utilize the GeoNames dataset to suggest and make substitutions for detected anomalous SiteNames, turning a once-onerous manual process into one that is largely automated.
Located within the `charfix` directory, the `correct.py` script may be used to begin the SiteName correction process. This script will output a table of substitutions to make, which `applyFixes.py` can then apply to a particular file.
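Since ftfy is among the installed dependencies, the mojibake-repair half of this process can be illustrated in a few lines. This is only a sketch of the general technique; correct.py additionally cross-references the GeoNames dataset, which is not shown here:

```python
# Sketch of Unicode repair on a SiteName column with ftfy.
# This shows only the mojibake-fixing half of the process; matching
# suggestions against GeoNames (as correct.py does) is not shown.
# The input file name is hypothetical.
import ftfy
import pandas as pd

df = pd.read_csv("scrubbed_data.csv")
fixes = {
    name: ftfy.fix_text(name)
    for name in df["SiteName"].dropna().unique()
    if ftfy.fix_text(name) != name
}
for broken, fixed in fixes.items():
    print(f"{broken!r} -> {fixed!r}")
```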