Mapping LINK IDs to PATIDs

Dylan Hall edited this page Dec 21, 2022 · 2 revisions

When anonlink matches records across data owners / partners, it identifies each record by its position in the file: in effect, the line number in the extracted PII file serves as the record's identifier. The results returned from the linkage agent therefore associate each LINK_ID with a line number in the pii-timestamp.csv file.
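The idea of using a row position as the record identifier can be illustrated with a minimal sketch. This is not the implementation in linkid_to_patid.py: the column name `patid`, the sample rows, the `link-a`/`link-b` values, and the choice of 0-based positions that skip the header row are all assumptions made for illustration only.

```python
import csv
import io


def map_link_ids_to_patids(pii_csv_text, link_id_by_row, patid_column="patid"):
    """Return {LINK_ID: PATID} by looking up each row position in the PII data.

    Assumptions (hypothetical, not taken from the real tool):
      - the PII CSV has a header row and a column named `patid_column`
      - row positions are 0-based and do not count the header row
    """
    rows = list(csv.DictReader(io.StringIO(pii_csv_text)))
    mapping = {}
    for row_index, link_id in link_id_by_row.items():
        mapping[link_id] = rows[row_index][patid_column]
    return mapping


# Hypothetical PII extract: three records, identified only by their position.
pii_csv_text = "patid,given_name\nP100,Ann\nP200,Bob\nP300,Cy\n"

# Hypothetical linkage result: row position -> LINK_ID.
link_id_by_row = {0: "link-a", 2: "link-b"}

result = map_link_ids_to_patids(pii_csv_text, link_id_by_row)
print(result)  # {'link-a': 'P100', 'link-b': 'P300'}
```

Because the row position is the only key, the PII file passed to the script has to be the same file, in the same order, that was used to produce the garbled data.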

To map the LINK_IDs back to PATIDs, use the linkid_to_patid.py script. The script takes the following arguments:

usage: linkid_to_patid.py [-h] [--sourcefile SOURCEFILE] [--linkszip LINKSZIP] [--hhsourcefile HHSOURCEFILE] [--hhlinkszip HHLINKSZIP] [-o OUTPUTDIR] [--force]

Tool for translating LINK_IDs back into PATIDs

optional arguments:
  -h, --help            show this help message and exit
  --sourcefile SOURCEFILE
                        Source pii-TIMESTAMP.csv file
  --linkszip LINKSZIP   LINK_ID ZIP file from linkage agent
  --hhsourcefile HHSOURCEFILE
                        Household PII csv, either inferred by households.py or provided by data owner
  --hhlinkszip HHLINKSZIP
                        HOUSEHOLD_ID zip file from linkage agent
  -o OUTPUTDIR, --outputdir OUTPUTDIR
                        Specify an output directory for links. Default is './output'
  --force, -f           Attempt resolution of patids from linkids even if issues are found in metadata file. USE ONLY AS LAST RESORT

Both --sourcefile and --linkszip, or --hhsourcefile and --hhlinkszip must be provided together, but it is not necessary to provide all 4 at once.
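The pairing rule can be sketched with argparse as follows. This is a hedged illustration of the rule as stated, not the script's actual validation code; `build_parser` and `check_pairs` are hypothetical helper names, and only the four option names come from the usage text.

```python
import argparse


def build_parser():
    """Parser with only the four paired options from the usage text."""
    parser = argparse.ArgumentParser(description="Sketch of paired-argument checking")
    parser.add_argument("--sourcefile")
    parser.add_argument("--linkszip")
    parser.add_argument("--hhsourcefile")
    parser.add_argument("--hhlinkszip")
    return parser


def check_pairs(args):
    """Return a list of problems; an empty list means the arguments are valid."""
    problems = []
    # Each pair must be supplied as a whole or omitted as a whole.
    if bool(args.sourcefile) != bool(args.linkszip):
        problems.append("--sourcefile and --linkszip must be provided together")
    if bool(args.hhsourcefile) != bool(args.hhlinkszip):
        problems.append("--hhsourcefile and --hhlinkszip must be provided together")
    # At least one pair is needed for the script to have anything to map.
    if not any([args.sourcefile, args.linkszip, args.hhsourcefile, args.hhlinkszip]):
        problems.append("provide at least one complete pair of arguments")
    return problems


parser = build_parser()
ok = parser.parse_args(["--sourcefile", "pii-20220304.csv", "--linkszip", "sitename.zip"])
incomplete = parser.parse_args(["--sourcefile", "pii-20220304.csv"])
print(check_pairs(ok))          # []
print(check_pairs(incomplete))  # one problem: the individual pair is incomplete
```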

  • If both the pii-timestamp.csv file and the LINK_ID ZIP file are provided as arguments, the script creates a file called linkid_to_patid.csv, containing the mapping of LINK_IDs to PATIDs, in the output/ folder by default. If both the household PII CSV file and the HOUSEHOLD_ID ZIP file are provided as arguments, the script also creates a householdid_to_patid.csv file in the output/ folder.

  • linkid_to_patid.py also supports an option (--force or -f) to ignore the results of the metadata validation, should they yield any issues. WARNING: THIS SHOULD ONLY BE USED AS A LAST RESORT, WHEN DATA OWNERS ARE CERTAIN THEY WISH TO PERFORM PATIENT MAPPING ON TWO SETS OF DATA DEEMED INVALID BY THE VALIDATION SCRIPTS.

Examples:

Mapping LINK_IDs to PATIDs

python linkid_to_patid.py --sourcefile pii-20220304.csv --linkszip sitename.zip
# one output file: output/linkid_to_patid.csv

Mapping HOUSEHOLD_IDs to PATIDs

python linkid_to_patid.py --hhsourcefile households_pii-20220304.csv --hhlinkszip sitename_households.zip
# one output file: output/householdid_to_patid.csv

Mapping both LINK_IDs and HOUSEHOLD_IDs to PATIDs, together

python linkid_to_patid.py --sourcefile pii-20220304.csv --linkszip sitename.zip --hhsourcefile households_pii-20220304.csv --hhlinkszip sitename_households.zip
# two output files:
#  - output/householdid_to_patid.csv
#  - output/linkid_to_patid.csv

[Optional] Independently Validate Result Metadata

Within the linkid_to_patid.py script, the metadata created by the garbling process is used to validate the metadata returned by the linkage agent. The metadata returned by the linkage agent can also be validated independently of linkid_to_patid.py, using the validate_metadata.py script in the utils directory. The syntax from the root directory is:

python utils/validate_metadata.py <path-to-garbled.zip> <path-to-result.zip>

For example, assuming that the output of garble.py is a file named garble.zip located in the output directory, and that the results from the linkage agent are received as a zip archive named results.zip located in the inbox directory, the command would be:

python utils/validate_metadata.py output/garble.zip inbox/results.zip

By default, the script only returns the number of issues found during the validation process. Use the -v flag to print detailed information about each issue encountered during validation.
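When the validation step is called from another Python script, the command line can be assembled programmatically. The helper below is a hypothetical sketch: `build_validate_command` is not part of the repository, and placing -v after the two paths assumes the script accepts the flag in that position. Only the script path, the two positional arguments, and the -v flag come from the text above.

```python
import sys


def build_validate_command(garbled_zip, result_zip, verbose=False):
    """Build the argument list for utils/validate_metadata.py.

    The list can be passed to subprocess.run(...) from the repository root.
    """
    command = [sys.executable, "utils/validate_metadata.py", garbled_zip, result_zip]
    if verbose:
        # -v asks for details about each issue instead of only the issue count.
        command.append("-v")
    return command


command = build_validate_command("output/garble.zip", "inbox/results.zip", verbose=True)
print(" ".join(command))
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.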
