CaryoAnthoBlock: Investigation of the anthocyanin biosynthesis block in the Caryophyllales

The scripts in this repository are the basis for the transcript abundance comparison of the anthocyanin biosynthesis pathways in anthocyanin and betalain producing species. While re-use of most scripts is possible, they are customized for this analysis and might require modifications for other applications. The purposes of this repository is mainly documentation. All scripts are intended to be used with Python 2.7. Running these scripts with Python 3 might work, but was not continuously evaluated. Several (standard) scientific Python packages are required including numpy, scipy, and matplotlib.

1) After downloading datasets from SRA/ENA, `kallisto_pipeline.py` is run to generate count tables per dataset.

Usage:
  python kallisto_pipeline.py --cds <FILE> --reads <DIR>] --out <DIR> --tmp <DIR>

Mandatory:
  --cds STR          Reference file (FASTA)     
  --reads STR        Path to folder containing FASTQ files in subfolder per sample.
  --out STR          Output directory
  --tmp STR          Directory for temporary files

Optional:
  --kallisto STR     Full path to kallisto including the actual file name
  --cpus INT         Number of cores to run kallisto in parallel

--cds specifies a multiple FASTA file containing the coding sequences or mRNA sequences that will be used for calculation of transcript abundances. All sequence IDs in this file need to be unique.

--reads specifies a directory which contains one subfolder per RNAseq sample. Each subfolder is checked for gzip compressed FASTQ files. The identification of paired-end datasets is based on the filenames containing some frequently used tags like 'R1'/'pass_1' and 'R2'/'pass_2'. If no match of paired FASTQ files is possible, the identificaiton of single FASTQ files is performed. An average fragment size of 200bp with a standard deviation of 100bp is assumed for single end datasets.

--out specifies the output folder. A full path should be given. The directory will be created if it does not exist. One TSV file per RNAseq sample will be stored in this folder. See merge_counttables_per_assembly.py for combination of these single files.

--tmp specifies a folder for all temporary files. This folder can be deleted once the run is complete.

--kallisto specifies the location of kallisto. This is required if kallisto is not in the PATH. Default: "kallisto".

--cpus specifies the number of CPUs to run kallisto. Default: 10.

2) After producing single count tables per dataset, `merge_counttables_per_assembly.py` is run to combine all count tables belonging to the same assembly.

Usage:
  python merge_counttables_per_assembly.py --in <DIR> --tpms <FILE> --counts <FILE> --spec <FILE>

Mandatory:
  --in STR          Input folder
  --tpms STR        Output TPM table
  --counts STR      Output count table
  --spec STR        Path to species file

--in specifies the folder of all single TSV files produced by running kallisto.

--tmps specifies the TPM output file.

--counts specifies the raw counts output file.

--spec specifies a species info file which links the individual RNAseq sample (run) IDs to the corresponding assembly ID (the transcriptome assembly).

3) KIPEs is used for the identification of anthocyanin biosynthesis genes: https://github.com/bpucker/KIPEs

4) Run `process_KIPEs_results.py` to extract the relevant sequences from the KIPEs results.

Usage:
  python process_KIPEs_results.py --in <FOLDER> --out <FOLDER> --genes <STR> --roi <STR>

Mandatory:
  --in STR         Input folder(s)
  --out STR        Output folder
  --genes STR      Genes of interest
  --roi STR        Residues of interest

--in specifies the input folder or a comma-separated list of input folders.

--out specifies the output folder for all result files.

--genes specifies the gene(s) of interest. A comma-separated list of all gene IDs of interest should be supplied.

--roi specifies residues of interest in each gene. Comma-separated list of residues for each gene is expected. Underscore is used to separate the residues of different genes. The order of residue groups need to match the order of genes supplied via --genes.

5) Run `filter_samples.py` to clean the combined count tables per assembly.

This removes RNAseq samples which might not be bona fide quantitative analyses (e.g. normalized libraries or mislabeled DNA sequencing experiments). The proportion of all reads/TPMs that fall on the 100 most abundant transcripts (top100) is calculated and used to filter outliers.

Usage:
  python filter_samples.py [--in <FILE> | --indir <FOLDER>] [--out <FILE>|--outdir <FOLDER>] --min <INT> --max <INT>

Mandatory:
  --in STR         Input folder(s)
  --out STR        Output folder
  --min INT        Minimal percentage of transcription on top100 transcripts
  --max INT        Maximal percentage of transcription on top100 transcripts

--in specifies the input file. --indir specifies the output folder.

--out specifies the input file. --outdir specifies the output folder.

--min specifies the minimal percentage of reads/TPMs assigned to the top100 transcripts. An integer (percentage) is expected. Default: 10.

--max specifies the minimal percentage of reads/TPMs assigned to the top100 transcripts. An integer (percentage) is expected. Default: 80.

6) Analysis of the gene expression per gene across two species is based on these scripts.

a) prep_TPM_summary_pairwise.py connects the results of phylogenetic analyses with transcript abundance values. The result are summary tables showing the transcript abundance per species, per sample, and per step in the pathway.

Usage:
  python prep_TPM_summary_pairwise.py --exp1 <FILE> --config1 <FILE> --exp2 <FILE> --config2 <FILE> --out <FOLDER>

Mandatory:
  --exp1    STR         Expression file of species1
  --config1 STR         Config file of species1
  --exp2    STR         Expression file of species2
  --config2 STR         Config file of species2
  --out     STR         Output folder

--exp1 specifies the expression file of species1. This file contains gene IDs in the first columns and the transcript abundance assigned to these genes for samples in the following columns. The first row is the header containing the sample run IDs.

--config1 config file of species1. This file has three columns separated by tabs: pigment status, gene names, and comma-separated list of gene IDs. Example:

B	CHS	Sico@28748,Sico@28746
B	CHI	Sico@179
B	F3H	Sico@31711,Sico@36327
B	DFR	Sico@2593,Sico@9871
B	LAR	Sico@23899
B	ANR	Sico@28328

--exp2 specifies the expression file of species2. See explanations of the file structure above.

--config2 config file of species1. See explanations of the file structure above.

--out specifies the output file. --outdir specifies the output folder.

b) summarize_single_gene_exp.py combines the transcript abundances per species and gene into a summary file and construct a figure to visualize the result.

Usage:
  python summarize_single_gene_exp.py --data <STR> --species <STR> --out <FOLDER>

Mandatory:
  --data    STR         Expression file of species1
  --species STR         Config file of species1
  --out     STR         Output folder

Optional:
  --break   INT         Break point of the Y axis to accommodate all data points
  --cut     INT         Cutoff value to exclude outliers from visualization
  --genes   STR         List of genes

--data specifies all data files that should be included in the analysis. A comma-separated list of file names (full path recommended).

--species specifies the species labels for the output materials (figure and summary file). The order of species names need to match the order of data files supplied via --data.

--out specifies the output folder. This folder will be created if it does not exist.

--break specifies break point of the Y axis. This allows the accommodation of values over a large scale in one figure. Default: 100.

--cut specifies an upper cutoff to exclude large values from the visualization. However, these values are included in the data output file for further statistical analyses. Default: 10000.

--genes specifies the genes to be analysed. A comma-separated list is expected. Default: CHS,DFR,ANS.

7) Analysis of gene expression per gene across evolutionary lineages is based on these scripts.

a) prep_TPM_summary.py connects the results of phylogenetic analyses with transcript abundance values. The result are summary tables showing the transcript abundance per species, per sample, and per step in the pathway.

Usage:
  python prep_TPM_summary.py --exp <FOLDER> --config <FILE> --out <FOLDER>

Mandatory:
  --exp    STR         Folder containing all expression files per assembly
  --config STR         Config file
  --out    STR         Output folder

--exp specifies the folder containing the expression files per transcriptome assembly. Each file contains gene IDs in the first columns and the transcript abundances assigned to these genes for all analysed samples in the following columns with one value per column. The first row is the header containing the sample run IDs.

--config config file of specifies the location of per gene config files which are required for this analysis. This file has three columns separated by tabs: pigment status, gene names, and gene specific config file. Example:

B	CHS	CHS_CONFIG_FILE.txt
B	CHI	CHI_CONFIG_FILE.txt
B	F3H	F3H_CONFIG_FILE.txt

Each per gene config file contains the IDs of all orthologs of the respective gene in all analyzed Caryophyllales species. One ID is given per line.

--out specifies the output folder. This folder is created if it does not exist.

b) summarize_data_for_plots.py combines the transcript abundances for each steps in the pathway across all species in one lineage. The result is used to identify systematic differences between species. analyze_exp_pattern.py is an outdated version of the same script and generates plots with broken Y axis to visualize the results.

Usage:
  python summarize_data_for_plots.py --in <FILE> --out <FOLDER> --taxon <FILE> --genes <STR>

Mandatory:
  --in    STR         Input file
  --taxon STR         Taxon table
  --genes STR         List of genes
  --out   STR         Output folder

--in specifies the input file which is the summary file produced by prep_TPM_summary.py.

--taxon specifies a table connecting species ID to species names and to the RNAseq data sets analysed.

--genes specifies the genes of interest. A comma-separated list of gene names is expected.

--out specifies the output folder. This folder is created if it does not exist.

c) qPCR_ref_gene_exp.py combines the transcript abundances for each qPCR reference gene across all species in one lineage. The result is used to check for systematic differences between species.

Usage:
  python qPCR_ref_gene_exp.py --in <FILE> --out <FOLDER> --taxon <FILE> --genes <STR>

Mandatory:
  --in    STR         Input file
  --taxon STR         Taxon table
  --genes STR         List of genes
  --out   STR         Output folder

--in specifies the input file which is the summary file produced by prep_TPM_summary.py.

--taxon specifies a table connecting species ID to species names and to the RNAseq data sets analysed.

--genes specifies the genes of interest. A comma-separated list of gene names is expected.

--out specifies the output folder. This folder is created if it does not exist.

8) Subfunctionalization of gene copies could be due to divergent expression

coexp_analysis.py was applied to compare the co-expression of different gene copies (e.g. CHS) with other steps in the pathway (e.g. FLS, DFR, and ANS) to investigate sub-functionalization in form of differential transcript abundances.

Usage:
  python coexp_analysis.py --exp <FILE> --refgene <STR> --gene1 <STR> --gene2 <STR> --gene3 <STR> --out <FOLDER>

Mandatory:
  --exp     STR         Expression input file
  --refgene STR         ID of reference gene
  --gene1   STR         ID of gene of interest 1
  --gene2   STR         ID of gene of interest 2
  --gene3   STR         ID of gene of interest 3
  --out     STR         Output folder
  
  optional:
  --refname STR         Displayed label of reference gene
  --name1   STR         Displayed label of gene 1
  --name2   STR         Displayed label of gene 2
  --name3   STR         Displayed label of gene 3

--exp specifies the expression input file. Please see explanations above about the expression file structure.

--refgene specifies the ID of a gene that should be used as reference for the co-expression analysis. Example: CHS gene ID if FLS and DFR co-expression with CHS should be compared.

--gene1 specifies the ID of the first gene of interest to check against the reference gene.

--gene2 specifies the ID of the second gene of interest to check against the reference gene.

--gene3 specifies the ID of the second gene of interest to check against the reference gene.

--out specifies the output folder. This folder will be created if it does not exist.

--refname specifies the label of the reference gene to display in figures.

--name1 specifies the label of gene1 to display in figures.

--name2 specifies the label of gene2 to display in figures.

--name3 specifies the label of gene3 to display in figures.

9) Support for gene loss through synteny analysis

This script support the contruction of synteny plots via the Python package JCVI.

Usage:
  python jcvi_wrapper_genes.py

Mandatory:
  --gff         STR         GFF3 input files
  --cds         STR         CDS input files
  --feature     STR         mRNA features
  --ID          STR         IDs
  --specs       STR         Species names
  --chromosomes STR         Chromosome names

--gff specifies the GFF input files. A comma-separated list of file names is expected. The order of file names need to match the order of the other arguments. Position in the list is used to connect the GFF file to the CDS file and the other parameters/arguments provided for this particular species.

--cds specifies the CDS input files. A comma-separated list of file names is expected. Differences between the CDS and the GFF file cause the exclusion of the effected genes.

--feature specifies the feature names of mRNA that are used in the respective GFF3 file. Usual options are 'mRNA' or 'transcript'.

--ID specifies the ID flag that is used for the mRNA ID in the mRNA row. Usually, this is 'ID'.

--specs specifies names for all analyzed species. These names will be used to name files and to label results in the final figure. Spaces and other special characters should be avoided.

--chromosomes specifies the chromosome names that should be displayed.

Additional scripts

Run get_best_seq_per_spec.py to extract the best candidate sequences of a species without a perfect match for the KIPEs results.

Usage:
  python get_best_seq_per_spec.py --in <FOLDER> --out <FOLDER>

Mandatory:
  --in     STR         Folder of expression input file
  --out    STR         Output folder
  
  optional:
  --ratio  STR         Minimal proportion of matching residues
  --cutoff STR         Minimal number of matching residues

--in specifies the input folder which is the KIPEs output folder.

--out specifies the output folder. This folder will be created if it does not exist.

--ratio specifies the proportion of known amion acid residues that must be matched by a sequence. This value can bet between 0 and 1. Default: 0.

--cutoff specifies the minimal number of perfectly matched amino acid residues. Default: 0.

sample_classifier.py can be used to classify a new RNAseq sample based on the transcript abundances of specific marker genes. It is possible to distinguish between tissues with sufficient contrast e.g. flower, leaf, and root. This classificaiton requires datasets of all these tissue for a related species.

Usage:
  python sample_classifier.py --pep <FILE> --exp <FILE> --ref <FILE> --out <FOLDER>

Mandatory:
  --pep    STR         Peptide input file
  --exp    STR         Expression input file
  --ref    STR         Peptide reference file
  --out    STR         Output folder

--pep specifies the peptide input file (FASTA).

--exp specifies the expression input file. Please have a look at the file structure discription above.

--ref specifies the peptide reference file (FASTA).

--out specifies the output folder. This folder will be created if it does not exist.

get_matching_CDS.py can be used to collect the corresponding coding sequences of a set of peptide sequences.

Usage:
  python get_matching_CDS.py --in <FILE> --out <FILE> --data <FOLDER>

Mandatory:
  --in    STR         Peptide input file
  --out   STR         CDS output file
  --data  STR         List of folders with CDS files

--in specifies the peptide input file (FASTA). Corresponding coding sequences of these peptide sequences will be extracted by this script.

--out specifies the CDS output file (FASTA). All corresponding coding sequences will be stored in this file under the same name as the corresponding peptide sequence.

--data specifies the data folders where the CDS files are located. This can be a comma-separated list to provide multiple folder names.

black_list_cleaning.py can be used to remove a set of sequences from the CDS and PEP file based on their IDs.

Usage:
  python black_list_cleaning.py --cdsin <FILE> --cdsout <FILE> --black <FILE> --pepin <FILE> --pepout <FILE>

Mandatory:
  --cdsin    STR    CDS input file
  --cdsout   STR    CDS output file
  --black    STR    List of IDs to remove
  --pepin    STR    Peptide input file
  --pepout   STR    Peptide output file

--cdsin specifies the CDS input file (FASTA).

--cdsout specifies the CDS output file (FASTA).

--black specifies the black list file. This file contains ID of sequences that will not be transferred to the respective output file.

--pepin specifies the peptide input file (FASTA).

--pepout specifies the peptide output file (FASTA).

check_paralog_exp.py can be used to check the divergence of paralog gene expression. This supports the assumption that paralogs are due to spatial/temporal changes in gene expression rather than neofunctionalization.

Usage:
  python check_paralog_exp.py --data <FILE> --genes <STR> --out <FILE>

Mandatory:
  --data   STR    gene expression input file
  --genes  STR    comma-separated gene list
  --out    STR    output file

--data specifies the gene expression file. One gene per row and one sample per column.

--genes specifies the gene IDs of paralogs. Expression of these genes is compared against each other to show divergence.

--out specifies the figure output file.

Datasets

BPX1 = Halostachys caspica

BPX2 = Myosoton aquaticum (Stellaria aquatica)

BPX3 = Oxyria digyna

BPX4 = Achyranthes bidentata

BPX6 = Dysphania schraderiana

BPX7 = Hammada scoparia

BPX8 = Hololachna songarica (Reaumuria songarica)

BPX9 = Gymnocarpos przewalskii

References

Pucker, B., Walker-Hale, N., Dzurlic, J., Yim, W.C., Cushman, J.C., Crum, A., Yang, Y. and Brockington, S.F. (2023), Multiple mechanisms explain loss of anthocyanins from betalain-pigmented Caryophyllales, including repeated wholesale loss of a key anthocyanidin synthesis enzyme. New Phytol. doi: 10.1111/nph.19341.

Pucker B., Walker-Hale N., Yim W.C., Cushman J.C., Crumm A., Yang Y., Brockington S. (2022). Evolutionary blocks to anthocyanin accumulation and the loss of an anthocyanin carrier protein in betalain-pigmented Caryophyllales. bioRxiv 2022.10.19.512958; doi: 10.1101/2022.10.19.512958.

KIPEs: https://github.com/bpucker/KIPEs

Pucker, B.; Reiher, F.; Schilbert, H.M. Automatic Identification of Players in the Flavonoid Biosynthesis with Application on the Biomedicinal Plant Croton tiglium. Plants 2020, 9, 1103. doi:10.3390/plants9091103

Ya Yang, Michael J. Moore, Samuel F. Brockington, Douglas E. Soltis, Gane Ka-Shu Wong, Eric J. Carpenter, Yong Zhang, Li Chen, Zhixiang Yan, Yinlong Xie, Rowan F. Sage, Sarah Covshoff, Julian M. Hibberd, Matthew N. Nelson, Stephen A. Smith, Dissecting Molecular Evolution in the Highly Diverse Plant Clade Caryophyllales Using Transcriptome Sequencing, Molecular Biology and Evolution, Volume 32, Issue 8, August 2015, Pages 2001–2014, doi:10.1093/molbev/msv081

Name		Name	Last commit message	Last commit date
Latest commit History 71 Commits
BPX1.cds.fa.gz		BPX1.cds.fa.gz
BPX2.cds.fa.gz		BPX2.cds.fa.gz
BPX3.cds.fa.gz		BPX3.cds.fa.gz
BPX4.cds.fa.gz		BPX4.cds.fa.gz
BPX6.cds.fa.gz		BPX6.cds.fa.gz
BPX7.cds.fa.gz		BPX7.cds.fa.gz
BPX8.cds.fa.gz		BPX8.cds.fa.gz
BPX9.cds.fa.gz		BPX9.cds.fa.gz
BUSCO_completeness_of_Caryophyllales_datasets.txt		BUSCO_completeness_of_Caryophyllales_datasets.txt
CDS_PEP_sequence_collection.zip		CDS_PEP_sequence_collection.zip
README.md		README.md
SAMPLE_OVERVIEW.txt		SAMPLE_OVERVIEW.txt
analyze_exp_pattern.py		analyze_exp_pattern.py
analyze_metadata.py		analyze_metadata.py
black_list_cleaning.py		black_list_cleaning.py
coexp_analysis.py		coexp_analysis.py
filter_samples.py		filter_samples.py
gap_counter.py		gap_counter.py
get_best_seq_per_spec.py		get_best_seq_per_spec.py
get_matching_CDS.py		get_matching_CDS.py
initial_RNAseq_selection.txt		initial_RNAseq_selection.txt
jcvi_wrapper_genes.py		jcvi_wrapper_genes.py
kallisto_pipeline.py		kallisto_pipeline.py
merge_countables_per_assembly.py		merge_countables_per_assembly.py
phytozome_references.txt		phytozome_references.txt
prep_TPM_summary.py		prep_TPM_summary.py
prep_TPM_summary_pairwise.py		prep_TPM_summary_pairwise.py
process_KIPEs_results.py		process_KIPEs_results.py
qPCR_ref_gene_exp.py		qPCR_ref_gene_exp.py
sample_classifier.py		sample_classifier.py
selection_CDS_and_trees.tar.gz		selection_CDS_and_trees.tar.gz
summarize_data_for_plots.py		summarize_data_for_plots.py
summarize_single_gene_exp.py		summarize_single_gene_exp.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CaryoAnthoBlock: Investigation of the anthocyanin biosynthesis block in the Caryophyllales

1) After downloading datasets from SRA/ENA, `kallisto_pipeline.py` is run to generate count tables per dataset.

2) After producing single count tables per dataset, `merge_counttables_per_assembly.py` is run to combine all count tables belonging to the same assembly.

3) KIPEs is used for the identification of anthocyanin biosynthesis genes: https://github.com/bpucker/KIPEs

4) Run `process_KIPEs_results.py` to extract the relevant sequences from the KIPEs results.

5) Run `filter_samples.py` to clean the combined count tables per assembly.

6) Analysis of the gene expression per gene across two species is based on these scripts.

7) Analysis of gene expression per gene across evolutionary lineages is based on these scripts.

8) Subfunctionalization of gene copies could be due to divergent expression

9) Support for gene loss through synteny analysis

Additional scripts

Datasets

References

About

Releases

Packages

Contributors 2

Languages

bpucker/CaryoAnthoBlock

Folders and files

Latest commit

History

Repository files navigation

CaryoAnthoBlock: Investigation of the anthocyanin biosynthesis block in the Caryophyllales

1) After downloading datasets from SRA/ENA, kallisto_pipeline.py is run to generate count tables per dataset.

2) After producing single count tables per dataset, merge_counttables_per_assembly.py is run to combine all count tables belonging to the same assembly.

3) KIPEs is used for the identification of anthocyanin biosynthesis genes: https://github.com/bpucker/KIPEs

4) Run process_KIPEs_results.py to extract the relevant sequences from the KIPEs results.

5) Run filter_samples.py to clean the combined count tables per assembly.

6) Analysis of the gene expression per gene across two species is based on these scripts.

7) Analysis of gene expression per gene across evolutionary lineages is based on these scripts.

8) Subfunctionalization of gene copies could be due to divergent expression

9) Support for gene loss through synteny analysis

Additional scripts

Datasets

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

1) After downloading datasets from SRA/ENA, `kallisto_pipeline.py` is run to generate count tables per dataset.

2) After producing single count tables per dataset, `merge_counttables_per_assembly.py` is run to combine all count tables belonging to the same assembly.

4) Run `process_KIPEs_results.py` to extract the relevant sequences from the KIPEs results.

5) Run `filter_samples.py` to clean the combined count tables per assembly.

Packages