scripts for RNA-Seq analysis
This script combines all count tables found in the provided folders. The count tables are assumed to have been generated by featureCounts at the gene level. Multiple folder names can be provided, separated by commas. The gene names in all count tables need to match. Raw counts and TPMs (= tags per million assigned tags) can be calculated independently of any annotation. The calculation of FPKMs requires the total exon length of each gene, which is taken from a GFF3 annotation. The current version assumes that the GFF3 file was downloaded from the NCBI and contains gene, transcript, and exon features. Overlapping exon features of different transcripts are merged to get the final exon length within a gene.
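The exon merging and the two normalizations described above can be illustrated with a minimal sketch. This is not the code of combine_count_tables.py; the function names (`merged_exon_length`, `per_million`, `fpkm`) and the dictionary-based data layout are assumptions made for illustration, and exon coordinates are taken to be 1-based and inclusive as in GFF3.

```python
def merged_exon_length(exons):
    """Return the number of bases covered by a gene's exons after merging overlaps.

    exons: list of (start, end) tuples, 1-based inclusive (GFF3 convention).
    """
    total = 0
    block_start, block_end = None, None
    for start, end in sorted(exons):
        if block_end is None or start > block_end:
            # No overlap with the running block: close it and open a new one.
            if block_end is not None:
                total += block_end - block_start + 1
            block_start, block_end = start, end
        else:
            # Overlap: extend the running block if this exon reaches further.
            block_end = max(block_end, end)
    if block_end is not None:
        total += block_end - block_start + 1
    return total


def per_million(counts):
    """Scale raw counts to tags per million assigned tags (no annotation needed)."""
    assigned = float(sum(counts.values()))
    return {gene: value / assigned * 1e6 for gene, value in counts.items()}


def fpkm(counts, exon_lengths):
    """FPKM = count * 1e9 / (merged exon length in bp * total assigned tags)."""
    assigned = float(sum(counts.values()))
    return {
        gene: value * 1e9 / (exon_lengths[gene] * assigned)
        for gene, value in counts.items()
    }
```

For example, exons (1, 100) and (50, 150) from two transcripts of the same gene merge into a single block of 150 bp, so the overlap is counted only once. The sketch assumes that at least one tag was assigned in the sample; a sample with zero assigned tags would need separate handling.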
Requirements:
- Python 2.7.x (other Python 2 versions should work as well)
Usage:
python combine_count_tables.py
--in <FULL_PATH_INPUT_FILE(S)> (multiple paths can be provided comma-separated)
--gff <FULL_PATH_TO_GFF3_FILE>
--out <FULL_PATH_TO_OUTPUT_FILE>
Suggested citation:
this repository
This script identifies genes with very small variation in expression across multiple samples. Such genes could be used as reference genes for qRT-PCR experiments. The gene expression file used as input should be in the output format of combine_count_tables.py: a header line with the sample names, followed by one row per gene that starts with the gene name and lists the expression values of the different samples. An annotation file can be provided to add a functional description to each gene in the output file.
Note: qPCR_gene_finder.py is an alternative script for this function.
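One way to rank genes by expression stability is the coefficient of variation (standard deviation divided by mean). The sketch below is an illustration under that assumption; the metric actually used by find_housekeeping_genes.py is not stated here, and the function name `rank_stable_genes` as well as the default cutoff of 10 are hypothetical.

```python
import math


def rank_stable_genes(expression, cutoff=10):
    """Rank genes by coefficient of variation, lowest (most stable) first.

    expression: dict mapping gene name -> list of expression values (one per sample).
    cutoff: minimal expression required in every sample (assumed default).
    """
    ranked = []
    for gene, values in expression.items():
        if min(values) < cutoff:
            continue  # gene is too weakly expressed in at least one sample
        mean = sum(values) / float(len(values))
        if mean == 0:
            continue  # avoid division by zero when cutoff is 0
        variance = sum((v - mean) ** 2 for v in values) / float(len(values))
        ranked.append((math.sqrt(variance) / mean, gene))
    ranked.sort()
    return [(gene, cv) for cv, gene in ranked]
```

Applying the cutoff to the minimum across samples mirrors the `--cutoff` description (minimal expression per sample): a gene that drops out in a single sample is a poor reference even if its average looks stable.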
Requirements:
- Python 2.7.x (other Python 2 versions should work as well)
Usage:
python find_housekeeping_genes.py
--in <FULL_PATH_TO_EXPRESSION_FILE>
--out <FULL_PATH_TO_OUTPUT_FILE>
optional:
--anno <FULL_PATH_TO_ANNOTATION_FILE>
--cutoff <MINIMAL_EXPRESSION_PER_SAMPLE(INTEGER)>
Suggested citation:
this repository
This script produces figures for the expression of selected genes across multiple samples. Plots can be generated for raw counts, TPMs, and FPKMs. Resulting plots can be used to analyze the expression of reference genes after normalization.
Requirements:
- Python 2.7.x (other Python 2 versions should work as well) including matplotlib and NumPy
Usage:
python construct_gene_expression_plots.py
--candidates <FULL_PATH_TO_CANDIDATE_FILE>
--out <FULL_PATH_TO_OUTPUT_DIRECTORY>
at least one expression data file is required:
--counts <FULL_PATH_TO_RAW_COUNT_TABLE>
--tpms <FULL_PATH_TO_TPM_FILE>
--fpkms <FULL_PATH_TO_FPKM_FILE>
optional:
--samples <FULL_PATH_TO_SAMPLE_ORDER_FILE>
Suggested citation:
this repository
This script produces figures for differentially expressed genes sorted by the adjusted p-value to illustrate the log2FC. DESeq2 output can be used as input for this script. It is possible to add an additional column to customize gene names. Otherwise, gene IDs will be used as labels.
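The selection and ordering step can be sketched as follows. This is an illustration, not the code of DEG_plot.py: it assumes a comma-separated DESeq2 result table with the standard `log2FoldChange` and `padj` columns and the gene ID in the first column; the function name `load_sorted_degs` and the 0.05 cutoff are hypothetical.

```python
import csv


def load_sorted_degs(lines, padj_cutoff=0.05, delimiter=","):
    """Return (gene_id, log2FC, padj) tuples sorted by adjusted p-value.

    lines: an open file handle or any iterable of text lines.
    """
    reader = csv.DictReader(lines, delimiter=delimiter)
    id_column = reader.fieldnames[0]  # first column holds the gene ID
    degs = []
    for row in reader:
        if row["padj"] in ("NA", ""):
            continue  # DESeq2 reports NA for genes removed by its filtering
        padj = float(row["padj"])
        if padj <= padj_cutoff:
            degs.append((row[id_column], float(row["log2FoldChange"]), padj))
    return sorted(degs, key=lambda entry: entry[2])
```

The resulting list can be passed to a plotting routine directly, with the gene ID (or a custom name from an additional column) as the label and the log2FC as the plotted value.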
Requirements:
- Python 2.7.x (other Python 2 versions should work as well) including matplotlib
Usage:
python DEG_plot.py
--in <FULL_PATH_TO_INPUT_DIRECTORY>
--out <FULL_PATH_TO_OUTPUT_DIRECTORY>
Suggested citation:
this repository
All reports of a STAR mapping in one folder are processed. Relevant read numbers are collected and summarized in a single table.
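As an illustration of how read numbers can be collected from a STAR report, the sketch below parses the `key | value` lines of a `Log.final.out` file. Which fields get_mapping_stats.py actually collects is not stated here, so the three fields selected in `summarize` are an assumption, and both function names are hypothetical.

```python
def parse_star_log(lines):
    """Turn the 'key | value' lines of a STAR Log.final.out into a dict of strings."""
    stats = {}
    for line in lines:
        if "|" not in line:
            continue  # section headers and blank lines carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def summarize(stats):
    """Pick a few read numbers (assumed selection) and convert them to integers."""
    return {
        "input_reads": int(stats["Number of input reads"]),
        "uniquely_mapped": int(stats["Uniquely mapped reads number"]),
        "multi_mapped": int(stats["Number of reads mapped to multiple loci"]),
    }
```

Running this over every report in a folder and writing one row per sample would give the kind of summary table described above.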
Requirements:
- Python 2.7
Usage:
python get_mapping_stats.py
--in <FULL_PATH_TO_INPUT_DIRECTORY>
--out <FULL_PATH_TO_OUTPUT_FILE>
Suggested citation:
this repository
All FASTQ files in a given folder are subjected to STAR for read mapping against a given reference sequence. Based on a provided GFF3 file the expression of genes is quantified via featureCounts.
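The two steps can be pictured as one STAR call followed by one featureCounts call per FASTQ file. The sketch below only builds the command lines and is not the code of reads2counts2.py: it assumes uncompressed single-end reads, an already generated STAR genome index, and NCBI-style exon features that carry a `gene` attribute. The function name `build_commands`, the output file naming, and the thread count are hypothetical.

```python
import os


def build_commands(fastq, genome_index_dir, gff, out_dir, threads=4, gene_attribute="gene"):
    """Build STAR and featureCounts command lines for one FASTQ file."""
    sample = os.path.basename(fastq).split(".")[0]
    prefix = os.path.join(out_dir, sample + "_")
    star_cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_index_dir,
        "--readFilesIn", fastq,
        "--outFileNamePrefix", prefix,
        "--outSAMtype", "BAM", "SortedByCoordinate",
    ]
    # STAR appends this fixed suffix to the prefix for coordinate-sorted BAM output.
    bam = prefix + "Aligned.sortedByCoord.out.bam"
    counts_file = os.path.join(out_dir, sample + ".counts.txt")
    fc_cmd = [
        "featureCounts",
        "-T", str(threads),
        "-t", "exon",          # count reads overlapping exon features
        "-g", gene_attribute,  # attribute used to group exons per gene (assumption)
        "-a", gff,
        "-o", counts_file,
        bam,
    ]
    return star_cmd, fc_cmd
```

Each list can be handed to `subprocess.call` or written into a job script for the cluster. Returning argument lists instead of shell strings avoids quoting problems with paths that contain spaces.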
Requirements:
- Python 2.7
- STAR (Dobin, 2013)
- featureCounts (Liao, 2014)
Usage:
python reads2counts2.py
--fastq_file_dir <FULL_PATH_TO_DIRECTORY>
--tmp_cluster_dir <FULL_PATH_TO_TEMPORARY_DIRECTORY_ON_CLUSTER_VOLUME>
--result_dir <FULL_PATH_TO_RESULT_DIRECTORY>
--ref_gff_file <FULL_PATH_TO_GFF_FILE_MATCHING_THE_PROVIDED_GENOME>
--ref_genome_file <FULL_PATH_TO_GENOME_FILE_MATCHING_THE_PROVIDED_GFF_FILE>
optional:
--dissimilarity <FLOAT, 1-identity, value between 0.0 and 1.0>[0.05]
--length_fraction <FLOAT, value between 0.0 and 1.0>[0.9]
--para_jobs <INTEGER, number of jobs to be processed at the compute cluster at the same time>[50]
Suggested citation:
this repository