Skip to content

ChipsXu/RNA-Seq_analysis

 
 

Repository files navigation

RNA-Seq_analysis

scripts for RNA-Seq analysis

combine_count_tables.py

This script combines all count tables given in the provided folders. It is assumed that these count tables were generated by featureCounts on the gene level. Multiple folder names can be provided separated by comma. The gene names in all count tables need to match. Raw counts and TPMs (=tags per million assigned tags) can be calculated independent of any annotation. The calculation of FPKMs is based on information about total exon length of a gene. GFF3 annotation is used for this step. The current versions assumes that the GFF3 file was downloaded from the NCBI and contains gene, transcript, and exon features. Overlapping exon features of different transcripts are merged to get the final exon length within a gene.

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well)

Usage:

python combine_count_tables.py
--in <FULL_PATH_INPUT_FILE(S)> (multiple paths can be provided comma-separated)
--gff <FULL_PATH_TO_GFF3_FILE>
--out <FULL_PATH_TO_OUTPUT_FILE>

Suggested citation:

this repository

find_housekeeping_genes.py

This script identifies genes with very small variation in expression across multiple samples. Suche genes could be used as reference genes for qRT-PCR experiments. The gene expression file used as input should be in the output format of combine_count_tables.py: header line with different samples names, one row per gene starting with the gene name followed by expression values of the different samples. An annotation file can be provided to add a functional description to each gene in the output file.

Note: qPCR_gene_finder.py is an alterantive script for this function.

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well)

Usage:

python find_housekeeping_genes.py
--in <FULL_PATH_TO_EXPRESSION_FILE>
--out <FULL_PATH__TO_OUTPUT_FILE>

optional: --anno <FULL_PATH__TO_ANNOTATION_FILE>
--cutoff <MINIMAL_EXPRESSION_PER_SAMPLE(INTEGER)>

Suggested citation:

this repository

construct_gene_expression_plot.py

This script produces figures for the expression of selected genes across multiple samples. Plots can be generated for raw counts, TPMs, and FPKMs. Resulting plots can be used to analyze the expression of reference genes after normalization.

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well) including matplotlib and NumPy

Usage:

python construct_gene_expression_plots.py
--candidates <FULL_PATH_TO_CANDIDATE_FILE>
--out <FULL_PATH_TO_OUTPUT_DIRECTORY>

at least one expression data file is required:

--counts <FULL_PATH_TO_RAW_COUNT_TABLE>
--tpms <FULL_PATH_TO_TPM_FILE>
--fpkms <FULL_PATH_TO_FPKM_FILE>

optional:

--samples <FULL_PATH_TO_SAMPLE_ORDER_FILE>

Suggested citation:

this repository

DEG_plot.py

This script produces figures for differentially expressed genes sorted by the adjusted p-value to illustrate the log2FC. DESeq2 output can be used as input for this script. It is possible to add an additional column to customize gene names. Otherwise, gene IDs will be used as labels.

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well) including matplotlib

Usage:

python DEG_plot.py
--in <FULL_PATH_TO_INPUT_DIRECTORY>
--out <FULL_PATH_TO_OUTPUT_DIRECTORY>

Suggested citation:

this repository

get_mapping_stats.py

All reports of a STAR mapping in one folder are processed. Relevant read numbers are collected and summarized in a single table.

Requirements:

  1. python 2.7

Usage: python get_mapping_stats.py
--in <FULL_PATH_TO_INPUT_DIRECTORY>
--out <FULL_PATH_TO_OUTPUT_FILE>

Suggested citation:

this repository

reads2counts.py

All FASTQ files in a given folder are subjected to STAR for read mapping against a given reference sequence. Based on a provided GFF3 file the expression of genes is quantified via featureCounts.

Requirements:

  1. python 2.7
  2. STAR (Dobin, 2013)
  3. featureCounts (Liao, 2014)

Usage: python reads2counts2.py
--fastq_file_dir <FULL_PATH_TO_DIRECTORY>
--tmp_cluster_dir <FULL_PATH_TO_TEMPORARY_DIRECTORY_ON_CLUSTER_VOLUME>
--result_dir <FULL_PATH_TO_RESULT_DIRECTORY>
--ref_gff_file <FULL_PATH_TO_GFF_FILE_MATCHING_THE_PROVIDED_GENOME>
--ref_genome_file <FULL_PATH_TO_GENOME_FILE_MATCHING_THE_PROVIDED_GFF_FILE> \

ptional: --dissimilarity <FLOAT, 1-identity, value between 0.0 and 1.0>[0.05]
--length_fraction <FLOAT, value between 0.0 and 1.0>[0.9]
--para_jobs <INTEGER, number of jobs to be processed at the compute cluster at the same time>[50]

Suggested citation:

this repository

About

scripts for RNA-Seq analysis

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 87.1%
  • R 12.9%