This repository contains scripts and data sets associated with the 'Data Literacy in Plant Sciences' course.
Please document all analyses properly. This includes the versions of the tools used, the parameters, the input files, and the output files.
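One possible way to keep such a record is to write a small JSON file next to each analysis. The following is only a minimal sketch: the function name `write_analysis_record` and the field names are hypothetical and not part of this repository.

```python
import json


def write_analysis_record(record_path, tool, version, parameters,
                          input_files, output_files):
    """Store tool, version, parameters, inputs, and outputs as a JSON file.

    All field names below are illustrative assumptions, not a fixed standard.
    """
    record = {
        "tool": tool,
        "version": version,
        "parameters": parameters,
        "input_files": list(input_files),
        "output_files": list(output_files),
    }
    with open(record_path, "w") as handle:
        json.dump(record, handle, indent=2)
    return record
```

Calling the function once per analysis step, with the tool name, the version reported by the tool itself, and the exact file names, gives one small record file per step that can be kept together with the results.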
File names should not include spaces or other special characters. Important commands are:
cd <FOLDER>
changes the working directory to the specified folder
mkdir <FOLDER>
creates a new folder in the working directory
ls
lists the contents of the working directory
scp
this command can be used to transfer files between computers. Please see the documentation for details.
wget
this command can be used to download files. Please see the documentation for details.
md5sum <FILENAME>
calculates a hash (fingerprint) for a given file. This hash can be used to check that a file was transferred completely.
Helpful data sets are available here: https://lnk.tu-bs.de/pE3yxC.
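The same check can also be done from Python. This is a minimal sketch using the standard library module hashlib. The function names `md5_of_file` and `transfer_is_complete` are hypothetical helpers, not part of this repository.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so that large files do not fill the memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_complete(source_hash, local_path):
    """Compare the hash reported for the source file with the local copy."""
    return md5_of_file(local_path) == source_hash.strip().lower()
```

The hash of the original file is calculated on the source machine (e.g. with md5sum) and passed as `source_hash`. If the function returns False, the transfer should be repeated.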
Here you can find an online tool for co-expression: http://pbb.bot.nat.tu-bs.de/CoExp/
MAFFT can be used for global alignments.
Tools for phylogenetic tree construction: FastTree2, RAxML.
iTOL is helpful for the visualization of phylogenetic trees.
fastq-dump of the SRA toolkit can be used to download FASTQ files from the Sequence Read Archive.
GBIF is a valuable source of species distribution data sets. A tutorial explains how to download and clean GBIF data sets. Additional cleaning with CoordinateCleaner is recommended.
Running these cleaning steps requires the installation of several tools:
sudo apt-get install r-base
sudo apt-get install gdebi-core
wget https://download2.rstudio.org/server/bionic/amd64/rstudio-server-2022.07.0-548-amd64.deb
sudo gdebi rstudio-server-2022.07.0-548-amd64.deb
sudo apt install liblapack-dev libopenblas-dev
sudo apt-get install libcurl4-openssl-dev
sudo apt-get install libxml2-dev
sudo apt-get install libssl-dev
sudo apt-get install libpng-dev
sudo apt install libgdal-dev
sudo apt-get install -y libudunits2-dev
Several R packages are required to run these cleaning analyses:
"rgbif", "remotes", "slam", "qlcMatrix", "curl", "crul", "ropensci/scrubr", "openssl", "httr", "maps", "CoordinateCleaner"
The script statistic_test.py contains functions to run some basic statistical tests.
Usage
python3 statistic_test.py --in <FILE>
Mandatory:
--in STR Input data file
Optional:
--test STR Statistical test
--paired NONE Indicates paired samples
--in
or --input
specifies the input data file that contains the values for statistical tests. The first row must contain the sample names. All following rows are considered data rows. Two columns with one value each are expected. If the samples are unpaired and one sample is larger than the other, the larger sample must be in the first column.
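The expected layout can be illustrated with a small reader. This is only a sketch and not the code used in statistic_test.py: the function name `load_two_samples` is hypothetical, and a tab as column separator is an assumption, because the separator is not stated here.

```python
def load_two_samples(path, sep="\t"):
    """Read sample names (first row) and two columns of values.

    Assumption: columns are tab-separated. For unpaired samples of different
    size, rows near the end may only hold a value in the first column.
    """
    with open(path) as handle:
        lines = [line.rstrip("\r\n") for line in handle if line.strip()]
    names = [name.strip() for name in lines[0].split(sep)]
    if len(names) != 2:
        raise ValueError("expected exactly two sample names in the first row")
    first, second = [], []
    for line in lines[1:]:
        parts = line.split(sep)
        first.append(float(parts[0]))
        # The second column may be empty once the smaller sample has ended.
        if len(parts) > 1 and parts[1].strip():
            second.append(float(parts[1]))
    return names, first, second
```

With this layout, the larger sample has to be in the first column, because a row with a single value is always assigned to the first sample.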
--test
specifies the statistical test to run.
--paired
indicates that the samples are paired. This will trigger paired tests if available for the selection.
The script construct_DESeq2_input.py generates two text files that are required for the DEG identification with DESeq2.
Usage
python construct_DESeq2_input.py --counts <FILE> --info <FILE> --out <DIR>
Mandatory:
--counts FILE Input data file
--info FILE Sample info file
--out DIR Output folder
--counts
specifies a count table that contains the number of reads assigned to each gene in each sample. This table is generated by kallisto.
--info
specifies a text file that contains metadata about the data sets that should be analyzed in the next step. Three columns are required to compare the gene expression between tissues. The first column needs to contain the sample ID as it is given in the counts table mentioned above. The second column contains the tissue type. The third column indicates the replicate (e.g. rep1, rep2, rep3).
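The three-column layout can be checked before running the script. The following sketch is not taken from construct_DESeq2_input.py. The function name `group_samples_by_tissue` is hypothetical, and a tab-separated file without a header line is an assumption.

```python
from collections import defaultdict


def group_samples_by_tissue(info_path, sep="\t"):
    """Group (sample ID, replicate) pairs by tissue type.

    Assumptions: tab-separated columns, no header line, and the column order
    sample ID, tissue, replicate as described above.
    """
    groups = defaultdict(list)
    with open(info_path) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip empty lines
            fields = [field.strip() for field in line.rstrip("\r\n").split(sep)]
            if len(fields) < 3:
                raise ValueError("expected three columns: " + line.strip())
            sample_id, tissue, replicate = fields[:3]
            groups[tissue].append((sample_id, replicate))
    return dict(groups)
```

The returned dictionary shows how many replicates are available per tissue, which helps to spot misspelled tissue names or missing replicates early.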
--out
specifies the output folder. This folder will be created if it does not exist already. Output files will be stored in this folder.
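This create-if-missing behaviour can be reproduced in Python with os.makedirs. The snippet is a minimal sketch with a hypothetical helper name, not the code of the script itself.

```python
import os


def prepare_output_folder(out_dir):
    """Create the output folder if it does not exist yet and return its path."""
    # exist_ok=True makes the call safe when the folder is already present.
    os.makedirs(out_dir, exist_ok=True)
    return out_dir
```

Existing files in the folder are not touched by this call, so the same folder can be reused across runs.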
RNA-seq analysis: https://doi.org/10.1038/s42255-022-00561-5