Mapping LINK IDs to PATIDs
When anonlink matches records across data owners / partners, it identifies each record by its position in the file: the line number in the extracted PII file serves as the identifier for the record. The results returned by the linkage agent therefore assign each LINK_ID to a line number in the pii-timestamp.csv file.
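The position-based lookup described above can be illustrated with a minimal sketch. This is not the implementation in linkid_to_patid.py; the column name `patid`, the sample rows, the `LINKS` structure, and the zero-based row numbering are all assumptions made for illustration, and the real linkage agent output may be laid out differently.

```python
import csv
import io

# Hypothetical stand-in for a pii-timestamp.csv file; column names are assumptions.
PII_CSV = """patid,given_name,family_name
P001,Ada,Lovelace
P002,Alan,Turing
P003,Grace,Hopper
"""

# Hypothetical linkage result: LINK_ID -> zero-based row position in the PII file
# (the header line is not counted). The real result format may differ.
LINKS = {"link-a": 0, "link-b": 2}


def map_link_ids_to_patids(pii_text, links):
    """Return {LINK_ID: PATID} by looking up each row position in the PII data."""
    rows = list(csv.DictReader(io.StringIO(pii_text)))
    mapping = {}
    for link_id, position in links.items():
        # A position outside the file means the links do not belong to this PII file.
        if not 0 <= position < len(rows):
            raise IndexError(
                f"{link_id} points at row {position}, but the file has {len(rows)} rows"
            )
        mapping[link_id] = rows[position]["patid"]
    return mapping


if __name__ == "__main__":
    print(map_link_ids_to_patids(PII_CSV, LINKS))
```

Because the row position is the only identifier, the lookup is only meaningful against the exact PII file that was garbled and sent for linkage; a reordered or regenerated file would silently map LINK_IDs to the wrong patients.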
To map the LINK_IDs back to PATIDs, use the linkid_to_patid.py script. The script takes the following arguments:
usage: linkid_to_patid.py [-h] [--sourcefile SOURCEFILE] [--linkszip LINKSZIP] [--hhsourcefile HHSOURCEFILE] [--hhlinkszip HHLINKSZIP] [-o OUTPUTDIR] [--force]
Tool for translating LINK_IDs back into PATIDs
optional arguments:
-h, --help show this help message and exit
--sourcefile SOURCEFILE
Source pii-TIMESTAMP.csv file
--linkszip LINKSZIP LINK_ID ZIP file from linkage agent
--hhsourcefile HHSOURCEFILE
Household PII csv, either inferred by households.py or provided by data owner
--hhlinkszip HHLINKSZIP
HOUSEHOLD_ID zip file from linkage agent
-o OUTPUTDIR, --outputdir OUTPUTDIR
Specify an output directory for links. Default is './output'
--force, -f Attempt resolution of patids from linkids even if issues are foundin metadata file. USE ONLY AS LAST RESORT
Both --sourcefile and --linkszip, or --hhsourcefile and --hhlinkszip, must be provided together, but it is not necessary to provide all 4 at once.
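The pairing rule above can be sketched with argparse as follows. This is an illustrative sketch only, not the script's actual argument handling: the `parse_args` helper and its error messages are hypothetical, and the check that at least one pair is supplied is an assumption.

```python
import argparse


def parse_args(argv=None):
    """Parse the four file options and enforce that each pair is given together."""
    parser = argparse.ArgumentParser(
        description="Tool for translating LINK_IDs back into PATIDs"
    )
    parser.add_argument("--sourcefile")
    parser.add_argument("--linkszip")
    parser.add_argument("--hhsourcefile")
    parser.add_argument("--hhlinkszip")
    args = parser.parse_args(argv)

    # Each source file only makes sense with its matching links ZIP, and vice versa.
    pairs = (("sourcefile", "linkszip"), ("hhsourcefile", "hhlinkszip"))
    for first, second in pairs:
        if (getattr(args, first) is None) != (getattr(args, second) is None):
            parser.error(f"--{first} and --{second} must be provided together")

    # Assumption: running with neither pair is treated as an error.
    if args.sourcefile is None and args.hhsourcefile is None:
        parser.error("provide at least one source file / links ZIP pair")
    return args


if __name__ == "__main__":
    print(parse_args())
```

`parser.error` prints the usage message and exits with status 2, so an incomplete pair stops the run before any files are read.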
- If both the pii-timestamp.csv file and the LINK_ID ZIP file are provided as arguments, the script will create a file called linkid_to_patid.csv with the mapping of LINK_IDs to PATIDs in the output/ folder by default. If both the household PII CSV file and the HOUSEHOLD_ID ZIP file are provided as arguments, the script will also create a householdid_to_patid.csv file in the output/ folder.
- linkid_to_patid.py also supports an option (--force or -f) to ignore the results of the metadata validation should they yield any issues. WARNING: THIS SHOULD ONLY BE USED AS A LAST RESORT SHOULD DATA OWNERS BE CERTAIN THEY WISH TO PERFORM PATIENT MAPPING ON TWO SETS OF DATA DEEMED INVALID BY THE VALIDATION SCRIPTS.
Examples:
Mapping LINKIDs to PATIDs
python linkid_to_patid.py --sourcefile pii-20220304.csv --linkszip sitename.zip
# one output file: output/linkid_to_patid.csv
Mapping HOUSEHOLDIDs to PATIDs
python linkid_to_patid.py --hhsourcefile households_pii-20220304.csv --hhlinkszip sitename_households.zip
# one output file: output/householdid_to_patid.csv
Mapping both LINKIDs and HOUSEHOLDIDs to PATIDs, together
python linkid_to_patid.py --sourcefile pii-20220304.csv --linkszip sitename.zip --hhsourcefile households_pii-20220304.csv --hhlinkszip sitename_households.zip
# two output files:
# - output/householdid_to_patid.csv
# - output/linkid_to_patid.csv
The metadata created by the garbling process is used to validate the metadata returned by the linkage agent within the linkid_to_patid.py script. Additionally, the metadata returned by the linkage agents can be validated outside of the linkid_to_patid.py script using the validate_metadata.py script in the utils directory. The syntax from the root directory is:
python utils/validate_metadata.py <path-to-garbled.zip> <path-to-result.zip>
So, assuming that the output of garble.py is a file, garble.zip, located in the output directory, and that the results from the linkage agent are received as a zip archive named results.zip located in the inbox directory, the syntax would be:
python utils/validate_metadata.py output/garble.zip inbox/results.zip
By default, the script will only return the number of issues found during the validation process. Use the -v flag to print detailed information about each of the issues encountered during validation.