The pipeline for analyzing cooperative genomic context protein binding microarray (cooperative gcPBM) data.
All input and processed files, including figures, are available in: https://duke.box.com/s/cnbo6gjg223mtdun3cnemycwep414wgd
The inputs required for the pipeline are in data/probe_files/raw and data/sitemodels
Python version >= 3.6.0
Make sure that you have R in your path in order to install and import rpy2
To install R on macOS, run brew install R
Optionally, before installing the requirements, create a virtual environment
python3.9 -m venv coop-gcpbm-venv
Activate the virtual environment
source coop-gcpbm-venv/bin/activate
Then run pip install -r requirements.txt
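Before installing the requirements, it can help to confirm that the interpreter is recent enough and that R is discoverable. The following is a minimal sketch of such a check; the function name check_environment is hypothetical and not part of this repository.

```python
import shutil
import sys

MIN_PYTHON = (3, 6, 0)  # minimum version stated in this README


def check_environment(min_python=MIN_PYTHON):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:3] < min_python:
        problems.append("Python %d.%d.%d or newer is required" % min_python)
    # rpy2 needs to locate the R executable, so look it up on PATH.
    if shutil.which("R") is None:
        problems.append("R was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

If the script prints nothing, both prerequisites were found.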
Code: clean_file.py
Takes the raw probe files as input and generates CSV files containing the information required by the pipeline:
- Name: Probe name
- Sequence: Probe sequence
- intensity: TF binding levels
- type: Mutation type: wt (wild type), m1/m2 (one site mutated), m3 (two sites mutated), or negctrl for negative controls
- rep: Replicates
- ori: Orientation of the sequence
Output files:
- A clean probe file with the fields mentioned above
- A negative control probe file
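As a quick sanity check on a clean probe file, one might verify the header and count the rows per mutation type. This is a sketch only: the column names are taken from the list above, while the function name summarize_probe_file and the example rows (including the rep and ori values) are made up for illustration.

```python
import csv
import io
from collections import Counter

# Column names as listed in this README.
REQUIRED_COLUMNS = ["Name", "Sequence", "intensity", "type", "rep", "ori"]


def summarize_probe_file(handle):
    """Check the header and count rows per mutation type."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    return Counter(row["type"] for row in reader)


# Tiny made-up example; real files are produced by clean_file.py.
example = io.StringIO(
    "Name,Sequence,intensity,type,rep,ori\n"
    "probe_a,ACGT,1200.5,wt,r1,o1\n"
    "probe_a,ACGT,800.0,m1,r1,o1\n"
    "probe_a,ACGT,300.0,m3,r1,o1\n"
)
counts = summarize_probe_file(example)
print(dict(counts))
```

For a real file, pass an open file handle instead of the in-memory example.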
Run:
- ETS1-ETS1:
python3 clean_file.py data/probe_files/raw/ETS1_ETS1.txt -k "ets1" -e "dist|weak" -g -o "data/probe_files/clean"
- ETS1-RUNX1:
- ETS1 only chamber:
python3 clean_file.py data/probe_files/raw/ETS1_only.txt -k "all_clean_seqs" -n "negative_controls" -f -o "data/probe_files/clean"
- ETS1-RUNX1 chamber:
python3 clean_file.py data/probe_files/raw/ETS1_RUNX1.txt -k "all_clean_seqs" -n "negative_controls" -f -o "data/probe_files/clean"
- RUNX1-ETS1:
- RUNX1 only chamber:
python3 clean_file.py data/probe_files/raw/RUNX1_only.txt -k "all_clean_seqs" -n "negative_controls" -f -o "data/probe_files/clean"
- RUNX1-ETS1 chamber:
python3 clean_file.py data/probe_files/raw/RUNX1_ETS1.txt -k "all_clean_seqs" -n "negative_controls" -f -o "data/probe_files/clean"
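The four chamber files above are cleaned with identical flags, so the commands can be generated in a loop. The sketch below only builds and prints the commands; the helper name build_clean_command is hypothetical, and the flags are copied from the commands listed above.

```python
import shlex

RAW_DIR = "data/probe_files/raw"
CLEAN_DIR = "data/probe_files/clean"
CHAMBER_FILES = [
    "ETS1_only.txt",
    "ETS1_RUNX1.txt",
    "RUNX1_only.txt",
    "RUNX1_ETS1.txt",
]


def build_clean_command(filename):
    """Build the clean_file.py argument list for one chamber file."""
    return [
        "python3", "clean_file.py", RAW_DIR + "/" + filename,
        "-k", "all_clean_seqs",
        "-n", "negative_controls",
        "-f",
        "-o", CLEAN_DIR,
    ]


commands = [build_clean_command(name) for name in CHAMBER_FILES]
for cmd in commands:
    # Print a shell-safe version of each command for inspection.
    print(" ".join(shlex.quote(part) for part in cmd))
```

To execute a command rather than print it, each list can be passed to subprocess.run(cmd, check=True).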
Description: Label each sequence as cooperative/ambiguous/independent
Code: label_pr_ets_ets.py
Run: python3 label_pr_ets_ets.py data/probe_files/clean/ETS1_ETS1_pr_clean.csv -n data/probe_files/clean/ETS1_ETS1_neg_clean.csv -f -o "data/analysis_files/ETS1_ETS1/labeled"
Additional arguments: python3 label_pr_ets_ets.py -h
Output files:
- ets_ets_seqlabeled.csv: Sequences with labels; this file is used as the main input for the subsequent analysis
- ETS1_ETS1_indiv.csv: Intensity for each combination of ETS1 binding to individual sites (i.e. m1+m2)
- ETS1_ETS1_two.csv: Intensity for each combination of ETS1 binding to two sites (i.e. wt)
- labeled_ets_ets_scatter.pdf: Scatter plot for cooperative vs. independent sequences
Example outputs, see: data/analysis_files/ETS1-ETS1/labeled
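To illustrate the idea behind the labels, the sketch below compares the two-site intensity (wt) with the combined individual-site intensity (m1+m2) for one sequence. It is a toy rule for intuition only: the median comparison, the margin parameter, and the function name label_sequence are assumptions and do not reproduce the statistical procedure in label_pr_ets_ets.py.

```python
from statistics import median


def label_sequence(two_site, individual_sum, margin=0.1):
    """Toy labeling rule (an assumption, not the script's actual test).

    two_site: replicate intensities for binding to both sites (wt).
    individual_sum: replicate intensities for the individual sites (m1+m2).
    Assumes positive intensities.
    """
    two = median(two_site)
    indiv = median(individual_sum)
    if two > indiv * (1 + margin):
        return "cooperative"   # clearly more than the sum of the parts
    if two <= indiv:
        return "independent"   # no gain over the sum of the parts
    return "ambiguous"         # small gain, within the margin


print(label_sequence([150, 160, 155], [100, 105, 95]))
print(label_sequence([100, 98, 102], [100, 105, 95]))
print(label_sequence([105, 104, 106], [100, 105, 95]))
```

The three calls exercise each of the three labels in turn.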
Description: Generate training data with all the features and labels for the sequences containing two ETS1 sites
Code: traingen_ets_ets.py
Run: python3 traingen_ets_ets.py data/analysis_files/ETS1_ETS1/labeled/ets_ets_seqlabeled.csv -p data/sitemodels/ETS1.txt -k data/sitemodels/ETS1_kmer_alignment.txt -o "data/analysis_files/ETS1_ETS1/training"
Output files:
- train_ets1_ets1.tsv: Training data for ETS1-ETS1
- Three figure files with the distributions for distance, orientation, and strength features
Example outputs, see: data/analysis_files/ETS1-ETS1/training
Code: genmodel_ets_ets.py
Run: python3 genmodel_ets_ets.py data/analysis_files/ETS1_ETS1/training/train_ets1_ets1.tsv -o "data/analysis_files/ETS1_ETS1/model"
Note: rf_param_grid is currently hardcoded; please change the parameters directly in the code as needed
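For readers unfamiliar with parameter grids, the sketch below shows how a dictionary of candidate values expands into the list of combinations a grid search would try. The parameter names follow scikit-learn's random forest conventions, which is an assumption about the script, and the values are hypothetical rather than those in genmodel_ets_ets.py.

```python
from itertools import product

# Hypothetical grid; the real rf_param_grid lives in the script.
rf_param_grid = {
    "n_estimators": [500, 1000],
    "max_depth": [5, 10, 15],
    "min_samples_leaf": [5, 10],
}


def expand_grid(grid):
    """Return one dict per combination of the candidate values."""
    keys = sorted(grid)
    value_lists = [grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


combinations = expand_grid(rf_param_grid)
print(len(combinations))  # 2 * 3 * 2 = 12 combinations
```

Adding a value to any list multiplies the number of models to train, so the grid size is worth checking before a long run.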
Output files:
- ETS1_ETS1_rfmodel.sav: Pickle file with the random forest model trained on ETS1-ETS1 data using distance, orientation, and strength features
- auc_all.png: ROC curve showing the model performances
- auc_all.log: A text file with the mean accuracy, mean AUC, and confusion matrices for all the models tested
Example outputs, see: data/analysis_files/ETS1-ETS1/model
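A saved model can be restored for later use. The sketch below assumes the .sav file was written with Python's standard pickle module, as the description above suggests; the helper name load_model is hypothetical, and the demonstration uses a stand-in object so that it runs without the real model file.

```python
import pickle
import tempfile
from pathlib import Path


def load_model(path):
    """Load a pickled object from disk. Only unpickle files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Self-contained demonstration with a stand-in object instead of a real model.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_rfmodel.sav"
    with open(demo_path, "wb") as handle:
        pickle.dump({"kind": "stand-in for a trained model"}, handle)
    restored = load_model(demo_path)

print(restored["kind"])
```

Loading a real model generally requires the same library versions that were used to train and save it.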
Description: Label each sequence as cooperative/ambiguous/independent
Code: label_pr_ets_runx.py
Run: python3 label_pr_ets_runx.py
Note: the script has many parameters, which are currently hardcoded; please check the header in main. To switch between ETS1-RUNX1 and RUNX1-ETS1, use the relevant commented section provided in the code.
Output files using ETS1 as the main TF and RUNX1 as the cooperator TF (i.e. ETS1-RUNX1):
- ets1_runx1_seqlbled.tsv: Sequences with labels; this file is used as the main input for the subsequent analysis.
- ets1_runx1_main.csv: Intensity for the chamber with the main TF alone.
- ets1_runx1_main_cooperator.csv: Intensity for the chamber with the main TF in the presence of the cooperator TF.
- normalized_ets1_runx1.pdf: Scatter plot for cooperative vs. independent sequences using the normalized data.
- both_ori_plt_ets1_runx1.csv: Each column represents the value used to plot (4).
- seq_er_intensity.csv: Median binding intensity for the main TF alone and for the main + cooperator TFs, both normalized and unnormalized.
Description: Generate training data with all the features and labels for the sequences containing ETS1 and RUNX1 sites
Code: traingen_ets_runx.py
Run: python3 traingen_ets_runx.py
Output files (for ETS1-RUNX1):
- train_ets1_runx1.tsv: Training data for ETS1-RUNX1
- Three figure files with the distributions for distance, orientation, and strength features
Example outputs, see: data/analysis_files/ETS1_RUNX1/training
Code: genmodel_ets_runx.py
Run:
- ETS1-RUNX1:
python3 genmodel_ets_runx.py data/analysis_files/ETS1_RUNX1/training/train_ets1_runx1.tsv -o "data/analysis_files/ETS1_RUNX1/model"
- RUNX1-ETS1:
python3 genmodel_ets_runx.py data/analysis_files/RUNX1_ETS1/training/train_runx1_ets1.tsv -o "data/analysis_files/RUNX1_ETS1/model"
Output files:
- ETS1_RUNX1_rfmodel.sav: Pickle file with the random forest model trained on ETS1-RUNX1 data using distance, orientation, and strength features
- auc.png: ROC curve showing the model performances
- auc.log: A text file with the mean accuracy, mean AUC, and confusion matrices for all the models tested
Example outputs, see: data/analysis_files/ETS1-RUNX1/model
The code requires the DNAshapeR R package, which is imported using rpy2. Please install the package as described in: https://bioconductor.org/packages/release/bioc/html/DNAshapeR.html
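To confirm that the R package is visible from Python before running the script, a check along the following lines may be useful. It is a sketch that relies on rpy2's isinstalled helper; the function name dnashaper_available is hypothetical, and any failure (rpy2 missing, R not found) is reported simply as False.

```python
def dnashaper_available():
    """Return True if rpy2 can find the DNAshapeR R package, else False."""
    try:
        from rpy2.robjects.packages import isinstalled
        return bool(isinstalled("DNAshapeR"))
    except Exception:
        # rpy2 is not installed, or R itself could not be initialized.
        return False


print("DNAshapeR available:", dnashaper_available())
```

A False result means either rpy2, R, or the DNAshapeR package still needs to be installed.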
Code: gen_posmdl.py
Run:
- ETS1-ETS1:
python3 gen_posmdl.py data/analysis_files/ETS1_ETS1/training/train_ets1_ets1.tsv -a site_str -b site_wk -s relative -r -oh -o "data/analysis_files/ETS1_ETS1/model"
- ETS1-RUNX1:
python3 gen_posmdl.py data/analysis_files/ETS1_RUNX1/training/train_ets1_runx1.tsv -a ets1 -b runx1 -s positional -o "data/analysis_files/ETS1_RUNX1/model"
- RUNX1-ETS1:
python3 gen_posmdl.py data/analysis_files/RUNX1_ETS1/training/train_runx1_ets1.tsv -a runx1 -b ets1 -s positional -o "data/analysis_files/RUNX1_ETS1/model"
Output files:
- rfposmodel.sav: A pickle file with the random forest model trained on ETS1-ETS1 data using distance, orientation, shape, and sequence features
- auc_posfeatures.pdf: A figure with the ROC curve showing the model performances
- auc_all.log: A text file with the mean accuracy, mean AUC, and confusion matrices for all the models tested
Creates summary motif and shape figures for all sequences in the training data and outputs the list of sequences for each configuration.
Code: shape_analysis.py
Run:
- ETS1-ETS1:
python3 shape_analysis.py data/analysis_files/ETS1_ETS1/training/train_ets1_ets1.tsv -p site_str_pos,site_wk_pos -o "data/analysis_files/ETS1_ETS1/shape_out"
- ETS1-RUNX1:
python3 shape_analysis.py data/analysis_files/ETS1_RUNX1/training/train_ets1_runx1.tsv -p ets1_pos,runx1_pos -o "data/analysis_files/ETS1_RUNX1/shape_out"
- RUNX1-ETS1:
python3 shape_analysis.py data/analysis_files/RUNX1_ETS1/training/train_runx1_ets1.tsv -p runx1_pos,ets1_pos -o "data/analysis_files/RUNX1_ETS1/shape_out"
Example outputs, see:
- ETS1-ETS1:
data/analysis_files/ETS1_ETS1/shape_out
- ETS1-RUNX1:
data/analysis_files/ETS1_RUNX1/shape_out
- RUNX1-ETS1:
data/analysis_files/RUNX1_ETS1/shape_out