bpucker/DataLiteracyInPlantSciences


DataLiteracyInPlantSciences

This repository contains scripts and data sets associated with the 'Data Literacy in Plant Sciences' course.

(1) Introduction

(1.1) Establishing access to de.NBI cloud

(1.2) Documentation

Please document all analyses properly. This includes the versions of the tools used, the parameters, the input files, and the output files.
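One possible way to keep such records is to append one entry per analysis step to a plain-text log. The following is only a minimal sketch: the function names, the JSON-lines layout, and the log file name are assumptions made for illustration and are not part of this repository.

```python
import json
import sys
from datetime import datetime, timezone


def build_run_record(tool, version, parameters, input_files, output_files):
    """Collect the details of one analysis step in a dictionary."""
    return {
        "tool": tool,
        "version": version,
        "parameters": parameters,
        "input_files": input_files,
        "output_files": output_files,
        # Version of the Python interpreter that wrote this record
        "python_version": sys.version.split()[0],
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def append_run_record(record, log_path):
    """Append the record as one JSON line to a plain-text log file."""
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    # All values below are placeholders; replace them with the real ones.
    record = build_run_record(
        tool="<TOOL>",
        version="<VERSION>",
        parameters={"<PARAMETER>": "<VALUE>"},
        input_files=["<INPUT_FILE>"],
        output_files=["<OUTPUT_FILE>"],
    )
    append_run_record(record, "analysis_log.jsonl")
```

Each line of the resulting log can later be read back with `json.loads`, which makes it easy to check which version and parameters produced a given output file.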

(1.3) Introduction to Linux

File names should not include spaces or other special characters. Important commands are:

cd <FOLDER> changes the working directory to the specified folder

mkdir <FOLDER> creates a new folder in the working directory

ls lists the content of the working directory
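To check file names before starting an analysis, a small helper along the following lines could be used. This is a sketch under an assumption: "safe" is taken here to mean only letters, digits, dots, underscores, and hyphens, which is one common convention and not a rule defined by the course. Both function names are hypothetical.

```python
import re

# Assumption: only these characters are considered safe in file names.
SAFE_NAME = re.compile(r"^[A-Za-z0-9._-]+$")


def is_safe_filename(name):
    """Return True if the name contains no spaces or special characters."""
    return bool(SAFE_NAME.match(name))


def make_safe_filename(name):
    """Replace each run of unsafe characters with a single underscore."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", name.strip())
    return cleaned.strip("_")


if __name__ == "__main__":
    for name in ["my data (v2).txt", "reads_rep1.fastq.gz"]:
        print(name, is_safe_filename(name), make_safe_filename(name))
```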

(1.4) Downloading and transferring files

scp can be used to transfer files between machines. Please see the documentation for details.

wget can be used to download files. Please see the documentation for details.

md5sum <FILENAME> calculates a hash (fingerprint) for a given file. Comparing the hash before and after a transfer shows whether the file was transferred completely.
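The same check can be scripted. The sketch below uses `hashlib` from the Python standard library to compute the MD5 hash of a file and compare it with the value reported by `md5sum` on the source machine. The function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so that large files do not fill the memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_complete(path, expected_md5):
    """Compare the local hash with the hash computed before the transfer."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Here `expected_md5` is the first field of the `md5sum` output obtained for the original file.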

(2) Understanding the mysteries of plant pigmentation

Helpful data sets are available here: https://lnk.tu-bs.de/pE3yxC.

Here you can find an online tool for co-expression: http://pbb.bot.nat.tu-bs.de/CoExp/

MAFFT can be used for global alignments.

Tools for phylogenetic tree construction: FastTree2, RAxML.

iTOL is helpful for the visualization of phylogenetic trees.

(3) Re-using gene expression data sets

fastq-dump of the SRA toolkit can be used to download FASTQ files from the Sequence Read Archive.
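For several data sets, the download can be wrapped in a short Python script. The sketch below assumes that `fastq-dump` is installed and available on the PATH. The options `--split-files`, `--gzip`, and `--outdir` are regular fastq-dump options, but whether they are appropriate depends on the data set. The accession shown is a placeholder, and the function names are hypothetical.

```python
import shutil
import subprocess


def build_fastq_dump_command(accession, out_dir):
    """Build the fastq-dump call for one run accession."""
    return [
        "fastq-dump",
        "--split-files",  # write paired reads into separate files
        "--gzip",         # compress the output files
        "--outdir", out_dir,
        accession,
    ]


def download_fastq(accession, out_dir):
    """Run fastq-dump and raise an error if it is missing or fails."""
    if shutil.which("fastq-dump") is None:
        raise RuntimeError("fastq-dump was not found on the PATH")
    subprocess.run(build_fastq_dump_command(accession, out_dir), check=True)


if __name__ == "__main__":
    # "SRR0000000" is a placeholder, not a real accession.
    print(" ".join(build_fastq_dump_command("SRR0000000", "fastq_files")))
```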

(4) Distribution of Caryophyllales

GBIF is a valuable source of species distribution data sets. A tutorial explains how to download and clean GBIF data sets. Additional cleaning with CoordinateCleaner is recommended.
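To illustrate what such cleaning involves, the following Python sketch applies three very basic coordinate checks to occurrence records: missing values, values outside the valid range, and the 0/0 point. It is not the procedure of the tutorial and does not replace CoordinateCleaner. It assumes that records are available as dictionaries with the GBIF fields `decimalLatitude` and `decimalLongitude`. The function name is hypothetical.

```python
def has_plausible_coordinates(record):
    """Return True if a record passes three basic coordinate checks."""
    try:
        lat = float(record["decimalLatitude"])
        lon = float(record["decimalLongitude"])
    except (KeyError, TypeError, ValueError):
        # Missing or non-numeric coordinates
        return False
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        # Outside the valid range (also rejects NaN)
        return False
    if lat == 0 and lon == 0:
        # 0/0 usually indicates a placeholder, not a real location
        return False
    return True


def keep_plausible(records):
    """Keep only the records that pass the basic checks."""
    return [record for record in records if has_plausible_coordinates(record)]
```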

Running these cleaning steps requires the installation of several tools:

sudo apt-get install r-base

sudo apt-get install gdebi-core

wget https://download2.rstudio.org/server/bionic/amd64/rstudio-server-2022.07.0-548-amd64.deb

sudo gdebi rstudio-server-2022.07.0-548-amd64.deb

sudo apt install liblapack-dev libopenblas-dev

sudo apt-get install libcurl4-openssl-dev

sudo apt-get install libxml2-dev

sudo apt-get install libssl-dev

sudo apt-get install libpng-dev

sudo apt install libgdal-dev

sudo apt-get install -y libudunits2-dev

Several R packages are required to run these cleaning analyses:

"rgbif", "remotes", "slam", "qlcMatrix", "curl", "crul", "ropensci/scrubr", "openssl", "httr", "maps", "CoordinateCleaner"

(5) Finding flaws in publications

(6) Statistical tests

The script statistic_test.py contains functions to run some basic statistical tests.

Usage
python3 statistic_test.py --in <FILE>
Mandatory:
--in      STR   Input data file

Optional:
--test    STR   Statistic test
--paired  NONE  Indicates paired samples

--in or --input specifies the input data file that contains the values for statistical tests. The first row must contain the sample names. All following rows are considered data rows. Two columns with one value each are expected. If the samples are unpaired and one sample is larger than the other, the larger sample must be in the first column.
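The expected layout can be illustrated with a small reader. This is a sketch and not the parser used in statistic_test.py. It assumes a tab-separated file, which is not stated above, and it assumes that missing values of the smaller sample appear as empty or absent cells in the second column. The function name is hypothetical.

```python
import csv


def load_two_samples(path, delimiter="\t"):
    """Read sample names from the first row and values from all other rows."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle, delimiter=delimiter))
    names = rows[0][:2]
    first, second = [], []
    for row in rows[1:]:
        # The first column holds the larger sample for unpaired data
        if len(row) > 0 and row[0].strip():
            first.append(float(row[0]))
        # The second column may run out of values earlier
        if len(row) > 1 and row[1].strip():
            second.append(float(row[1]))
    return names, first, second
```

For paired samples, both lists returned by this function should have the same length.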

--test specifies the statistical test to run.

--paired indicates that the samples are paired. This will trigger paired tests if available for the selection.

The script construct_DESeq2_input.py generates two text files that are required for the DEG identification with DESeq2.

Usage
python construct_DESeq2_input.py --counts <FILE> --info <FILE> --out <DIR>
Mandatory:
--counts  FILE   Input data file
--info    FILE   Sample info file
--out     DIR    Output folder

--counts specifies a count table that contains the number of reads assigned to each gene in each sample. This table is generated by kallisto.

--info specifies a text file that contains metadata about the data sets that should be analyzed in the next step. Three columns are required to compare the gene expression between tissues. The first column needs to contain the sample ID as it is given in the counts table mentioned above. The second column contains the tissue type. The third column indicates the replicate (e.g. rep1, rep2, rep3).
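A sample info file with these three columns could be written as follows. This is a sketch under assumptions: the columns are taken to be tab-separated and the file is written without a header line, neither of which is stated above. The sample IDs, tissue names, and function name are made up for illustration.

```python
def write_sample_info(samples, path):
    """Write one line per sample: sample ID, tissue type, replicate."""
    with open(path, "w", encoding="utf-8") as handle:
        for sample_id, tissue, replicate in samples:
            handle.write(f"{sample_id}\t{tissue}\t{replicate}\n")


if __name__ == "__main__":
    # Hypothetical sample IDs; they must match the IDs in the counts table.
    samples = [
        ("sampleA", "leaf", "rep1"),
        ("sampleB", "leaf", "rep2"),
        ("sampleC", "root", "rep1"),
        ("sampleD", "root", "rep2"),
    ]
    write_sample_info(samples, "sample_info.txt")
```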

--out specifies the output folder. This folder will be created if it does not exist already. Output files will be stored in this folder.

References

RNA-seq analysis: https://doi.org/10.1038/s42255-022-00561-5
