The goal of a pQTL analysis is to identify genomic variants (SNPs) that influence protein expression. It is similar to GWAS, which identifies genomic variants associated with a phenotype of interest; in a pQTL analysis, the phenotype is protein expression. Typically, researchers first use GWAS to identify variants that influence a trait of interest, such as a disease. By layering on the results of a pQTL analysis, they can then look for co-localization, or overlap, between the two sets of variants. Variants that influence both the trait and protein expression could indicate possible mechanisms of action, so a pQTL analysis can help prioritize candidate causal variants.
NOTE: This folder contains example code showing how to run a pQTL analysis on the UKB RAP. This tutorial does not use UKB proteomics data; it uses simulated data based on the Kivisakk et al. publication.
- Run 1_simulate_input_data.ipynb to generate simulated proteomic expression data. The output of this notebook is a samples x proteins matrix, which is passed as the "Phenotypes file" parameter to REGENIE in step 2 below.
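As a rough illustration of the expected output shape, the sketch below builds a small samples x proteins table in the whitespace-delimited layout REGENIE reads for phenotypes (a header of FID and IID, followed by one column per trait). This is not the code from 1_simulate_input_data.ipynb: the sample IDs, the protein column names, and the standard-normal values are placeholders made up for the example.

```python
import random


def simulate_pheno_rows(sample_ids, n_proteins, seed=0):
    """Return (header, rows) for a samples x proteins phenotype table."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    # Hypothetical column names; real protein identifiers would go here.
    header = ["FID", "IID"] + [f"protein_{j + 1}" for j in range(n_proteins)]
    rows = []
    for sid in sample_ids:
        # Placeholder expression values drawn from a standard normal.
        values = [f"{rng.gauss(0.0, 1.0):.4f}" for _ in range(n_proteins)]
        rows.append([str(sid), str(sid)] + values)
    return header, rows


def write_pheno(path, header, rows):
    """Write the table as a space-delimited text file with a header line."""
    with open(path, "w") as fh:
        fh.write(" ".join(header) + "\n")
        for row in rows:
            fh.write(" ".join(row) + "\n")


# Three made-up samples and four made-up proteins.
header, rows = simulate_pheno_rows(sample_ids=[1001, 1002, 1003], n_proteins=4)
print(" ".join(header))
```

Calling write_pheno("pheno_example.txt", header, rows) would then write the table to disk; each row corresponds to one sample and each column after IID to one protein trait.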
Note: Before running REGENIE, you will need to have performed the pre-processing steps outlined in the End-to-end genomic target discovery tutorial:
- Perform sample QC to remove technical noise that could lead to spurious associations.
- Perform liftover to map the array data to the same reference as the imputed data.
- Perform variant QC to remove low-quality variants and those that deviate from Hardy-Weinberg equilibrium.
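To make the Hardy-Weinberg check in the variant QC step more concrete, here is a minimal sketch of a chi-square test (1 degree of freedom) on the genotype counts of a single biallelic variant. It is only an illustration of the idea, not the procedure used in the target discovery tutorial; QC tools often use an exact test instead, and the p-value cutoff below is a placeholder rather than a recommended setting.

```python
import math


def hwe_chi_square(n_hom_ref, n_het, n_hom_alt):
    """Chi-square test for deviation from Hardy-Weinberg equilibrium.

    Returns (chi2, p_value) for one biallelic variant.
    """
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 0.0, 1.0  # monomorphic variant: nothing to test
    observed = [n_hom_ref, n_het, n_hom_alt]
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


HWE_P_CUTOFF = 1e-6  # placeholder threshold, an assumption for this example

chi2, p_value = hwe_chi_square(n_hom_ref=50, n_het=0, n_hom_alt=50)
keep_variant = p_value >= HWE_P_CUTOFF
print(chi2, keep_variant)
```

A variant with no heterozygotes at an allele frequency of 0.5, as in the call above, deviates strongly from the expected genotype proportions and would be dropped under this cutoff.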
- Run REGENIE (v1.0.5) using the generated proteomic expression data. There are two ways to do this:
Using the UI
We used the following input options to run the regenie app using the UI. If an input option isn’t specified, then the default option was used.
- Genotype BED for Step 1: path to array data after liftOver (see target discovery tutorial)
- Genotype BIM for Step 1: path to array data after liftOver (see target discovery tutorial)
- Genotype FAM for Step 1: path to array data after liftOver (see target discovery tutorial)
- Genotype BGEN files for Step 2: /Bulk/Imputation/Imputation from genotype (GEL)/ukb21008_c22_b0_v1.bgen
- Genotype BGI index files for Step 2: /Bulk/Imputation/Imputation from genotype (GEL)/ukb21008_c22_b0_v1.bgen.bgi
- Sample files for Step 2: /Bulk/Imputation/Imputation from genotype (GEL)/ukb21008_c22_b0_v1.sample (Note: You will need to correct the sample file header following the steps here)
- Phenotypes file: pheno_200.txt (File generated by notebook 1)
- Variant IDs to extract (Step 1): snplist file generated in the Array data QC step (see target discovery tutorial)
- Variant IDs to extract (Step 2): snplist file generated in the Impute data QC step (see target discovery tutorial)
- Sample ID file: pheno_200.txt (File generated by notebook 1)
Using the CLI
dx run regenie \
-iwgr_genotype_bed="path to array data after liftOver .bed" \
-iwgr_genotype_bim="path to array data after liftOver .bim" \
-iwgr_genotype_fam="path to array data after liftOver .fam" \
-igenotype_bgens="/Bulk/Imputation/Imputation from genotype (GEL)/ukb21008_c22_b0_v1.bgen" \
-igenotype_bgis="/Bulk/Imputation/Imputation from genotype (GEL)/ukb21008_c22_b0_v1.bgen.bgi" \
-igenotype_samples="/Bulk/Imputation/Imputation from genotype (GEL)/ukb21008_c22_b0_v1.sample" \
-ipheno_txt="pheno_200.txt" \
-istep1_extract_txt="snplist file generated in the Array data QC step" \
-istep2_extract_txt="snplist file generated in the Impute data QC step" \
-isample_ids_txt="pheno_200.txt" \
--destination "/output/"
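If you want to launch the same job from a notebook or script, the command can be assembled programmatically. The sketch below only builds the argument list, using the input names from the command above. The helper name build_regenie_command is hypothetical, the array and snplist paths are left as placeholders exactly as in the command above, and the assumption that other chromosomes follow the same ukb21008_c<chrom>_b0_v1 naming pattern should be checked against your project before use.

```python
import shlex

IMPUTED_DIR = "/Bulk/Imputation/Imputation from genotype (GEL)"


def build_regenie_command(chrom, bed, bim, fam, pheno,
                          step1_snplist, step2_snplist,
                          destination="/output/"):
    """Build the `dx run regenie` argument list for one chromosome."""
    # Assumed naming pattern, generalized from the chromosome 22 file names.
    prefix = f"{IMPUTED_DIR}/ukb21008_c{chrom}_b0_v1"
    inputs = {
        "wgr_genotype_bed": bed,
        "wgr_genotype_bim": bim,
        "wgr_genotype_fam": fam,
        "genotype_bgens": prefix + ".bgen",
        "genotype_bgis": prefix + ".bgen.bgi",
        "genotype_samples": prefix + ".sample",
        "pheno_txt": pheno,
        "step1_extract_txt": step1_snplist,
        "step2_extract_txt": step2_snplist,
        "sample_ids_txt": pheno,  # same file is used for both, as above
    }
    args = ["dx", "run", "regenie"]
    args += [f"-i{name}={value}" for name, value in inputs.items()]
    args += ["--destination", destination]
    return args


cmd = build_regenie_command(
    chrom=22,
    bed="path to array data after liftOver .bed",
    bim="path to array data after liftOver .bim",
    fam="path to array data after liftOver .fam",
    pheno="pheno_200.txt",
    step1_snplist="snplist file generated in the Array data QC step",
    step2_snplist="snplist file generated in the Impute data QC step",
)
# shlex.join quotes the paths that contain spaces and parentheses.
print(shlex.join(cmd))
```

Keeping the command as a list means it can be passed directly to subprocess.run(cmd, check=True) in an environment where the dx toolkit is installed and logged in, without any extra shell quoting.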
Running REGENIE on one chromosome (chromosome 22) with 200 protein traits took 19.5 hours and cost 1.04 pounds. The job used mem3_ssd2_v2_x4 (4500 MB/30000 MB) for step 1 and mem1_ssd1_v2_x2 (800 MB/47000 MB) for step 2. These were the default instance types that REGENIE selected automatically using the formula described below. Based on this run, you may consider switching to a smaller, cheaper instance type depending on your needs.
Instance type:
The REGENIE app internally estimates the instance type to use based on the following formula: RAM (Storage in MB) = (#samples * (#variants / block_size) / 1048576) * 40 * #traits + 7000. You can use this formula to approximate the cost of running this analysis.
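The formula above is easy to evaluate before submitting a job. The following sketch transcribes it directly; the function name is hypothetical, the variant-count term is interpreted as the number of variants in the analysis (an assumption), and the input numbers in the example call are made up to show the arithmetic rather than taken from a real run.

```python
def estimate_ram_mb(n_samples, n_variants, block_size, n_traits):
    """Estimated memory in MB, transcribed from the formula above."""
    n_blocks = n_variants / block_size
    return (n_samples * n_blocks / 1048576) * 40 * n_traits + 7000


# Made-up inputs, chosen so the arithmetic is easy to follow:
# 1048576 samples * 1 block / 1048576 = 1, times 40, times 1 trait, plus 7000.
print(estimate_ram_mb(n_samples=1048576, n_variants=1000,
                      block_size=1000, n_traits=1))  # 7040.0
```

Because the estimate grows linearly with the number of traits, splitting a large set of proteins into smaller batches is one way to see how the requested memory, and therefore the instance type, would change.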
We would like to thank Ondrej Klempir, Anastazie Sedlakova, and Arkarachai Fungtammasan for insightful discussions, testing, and code review.