Matt Gibson (gibsomat@indiana.edu)
Gibson, MJS & Moyle, LC. (2020). Regional differences in the abiotic environment contribute to genomic divergence within a wild tomato species. Molecular Ecology https://doi.org/10.1111/mec.15477
Stacks pipeline. 140 Solanum pimpinellifolium individuals (assembled into two ddRAD libraries using PstI and EcoRI) were sequenced across two NextSeq flowcells (mid output; PE; 150bp).
-
Seems easiest to filter adapters prior to running process radtags. So I'm doing that with
fastp
first, but not trimming for quality. -
See
./pimpGEA/scripts/adapter_trimming
#PBS -k oe #Keep the output and errors from the run
#PBS -l nodes=1:ppn=6,vmem=50gb,walltime=20:00:00
#PBS -M gibsomat@iu.edu
#PBS -m abe # Get notifications for the job
#PBS -N trim_adapted_pimp_poolA
cd /N/dc2/projects/gibsonTomato/pimpGEA/data/rawdata/poolA
#ACAGTG
./N/dc2/projects/gibsonTomato/fastp -i GSF1870-Moyle-PoolA-Index-ACAGTG_S5_R1_001.fastq -I GSF1870-Moyle-PoolA-Index-ACAGTG_S5_R2_001.fastq -o ../../adapter_clean_rawdata/GSF1870-Moyle-PoolA-Index-ACAGTG_S5_R1_001_noadapters.fastq -O ../../adapter_clean_rawdata/GSF1870-Moyle-PoolA-Index-ACAGTG_S5_R2_001_noadapters.fastq -Q -L -w 6
#ATCACG
./N/dc2/projects/gibsonTomato/fastp -i GSF1870-Moyle-PoolA-Index-ATCACG_S1_R1_001.fastq -I GSF1870-Moyle-PoolA-Index-ATCACG_S1_R2_001.fastq -o ../../adapter_clean_rawdata/GSF1870-Moyle-PoolA-Index-ATCACG_S1_R1_001_noadapters.fastq -O ../../adapter_clean_rawdata/GSF1870-Moyle-PoolA-Index-ATCACG_S1_R2_001_noadapters.fastq -Q -L -w 6
#CGATGT
./N/dc2/projects/gibsonTomato/fastp -i GSF1870-Moyle-PoolA-Index-CGATGT_S2_R1_001.fastq -I GSF1870-Moyle-PoolA-Index-CGATGT_S2_R2_001.fastq -o ../../adapter_clean_rawdata/GSF1870-Moyle-PoolA-Index-CGATGT_S2_R1_001_noadapters.fastq -O ../../adapter_clean_rawdata/GSF1870-Moyle-PoolA-Index-CGATGT_S2_R2_001_noadapters.fastq -Q -L -w 6
#TGACCA
./N/dc2/projects/gibsonTomato/fastp -i GSF1870-Moyle-PoolA-Index-TGACCA_S4_R1_001.fastq -I GSF1870-Moyle-PoolA-Index-TGACCA_S4_R2_001.fastq -o ../../adapter_clean_rawdata/GSF1870-Moyle-PoolA-Index-TGACCA_S4_R1_001_noadapters.fastq -O ../../adapter_clean_rawdata/GSF1870-Moyle-PoolA-Index-TGACCA_S4_R2_001_noadapters.fastq -Q -L -w 6
#TTAGGC
./N/dc2/projects/gibsonTomato/fastp -i GSF1870-Moyle-PoolA-Index-TTAGGC_S3_R1_001.fastq -I GSF1870-Moyle-PoolA-Index-TTAGGC_S3_R2_001.fastq -o ../../adapter_clean_rawdata/GSF1870-Moyle-PoolA-Index-TTAGGC_S3_R1_001_noadapters.fastq -O ../../adapter_clean_rawdata/GSF1870-Moyle-PoolA-Index-TTAGGC_S3_R2_001_noadapters.fastq -Q -L -w 6
#Same done for poolB
- See
./pimpGEA/scripts/recoverREsite
#Read 1
python recoverRE.py --truesite TGCAG --startpos 6 --endpos 10 --out /N/dc2/projects/gibsonTomato/pimpGEA/data/recoverREsite/poolA/GSF1870-Moyle-PoolA-Index-ACAGTG_S5_R1_001_noadapters_refix.fastq /N/dc2/projects/gibsonTomato/pimpGEA/data/adapter_clean_rawdata/poolA/GSF1870-Moyle-PoolA-Index-ACAGTG_S5_R1_001_noadapters.fastq
#Read 2
python recoverRE.py --truesite AATTC --startpos 1 --endpos 5 --out /N/dc2/projects/gibsonTomato/pimpGEA/data/recoverREsite/poolA/GSF1870-Moyle-PoolA-Index-ACAGTG_S5_R2_001_noadapters_refix.fastq /N/dc2/projects/gibsonTomato/pimpGEA/data/adapter_clean_rawdata/poolA/GSF1870-Moyle-PoolA-Index-ACAGTG_S5_R2_001_noadapters.fastq
#...done for all index libraries
At this point, I continue here with the stacks pipeline that I ran. See README_pyrad_refix.md
for the new ipyrad pipeline that incorporates the RE-site recovered reads + more relaxed parameters.
- See
./pimpGEA/STACKS
#PBS -k oe #Keep the output and errors from the run
#PBS -l nodes=1:ppn=2,vmem=20gb,walltime=20:00:00
#PBS -M gibsomat@iu.edu
#PBS -m abe # Get notifications for the job
#PBS -N process_radtags_poolA
cd /N/dc2/projects/gibsonTomato/pimpGEA/STACKS/
module load stacks
process_radtags -P -p /N/dc2/projects/gibsonTomato/pimpGEA/data/recoverREsite/poolA/ -b /N/dc2/projects/gibsonTomato/pimpGEA/data/barcodes/poolA/poolA_stacks_barcodes.txt -o ./demultiplexed/poolA/ -r --inline_index --renz_1 pstI --renz_2 ecoRI -y gzfastq -s 5 --len_limit 30
#PBS -k oe #Keep the output and errors from the run
#PBS -l nodes=1:ppn=4,vmem=50gb,walltime=08:00:00
#PBS -M gibsomat@iu.edu
#PBS -m abe # Get notifications for the job
#PBS -N process_radtags_poolB
cd /N/dc2/projects/gibsonTomato/pimpGEA/STACKS/
module load stacks
process_radtags -P -p /N/dc2/projects/gibsonTomato/pimpGEA/data/recoverREsite/poolB/ -b /N/dc2/projects/gibsonTomato/pimpGEA/data/barcodes/poolB/poolB_stacks_barcodes.txt -o ./demultiplexed/poolB/ -r --inline_index --renz_1 pstI --renz_2 ecoRI -y gzfastq -s 5 --len_limit 30
See ./pimpGEA/STACKS/map/
for scripts and mapped reads
#pimpGEA-stacks-map_no_rem.sh
#PBS -k oe #Keep the output and errors from the run
#PBS -l nodes=1:ppn=8,vmem=140gb,walltime=08:00:00
#PBS -M gibsomat@iu.edu
#PBS -m abe # Get notifications for the job
#PBS -N pimpGEA-stacks-map-bwa-no_rem
cd /N/dc2/projects/gibsonTomato/pimpGEA/STACKS/demultiplexed/demultiplexed_all/
module load bwa
for r1 in *.1.fq.gz
do
echo "Mapping..."
echo "$r1"
toremove='.1.'
newsubstr='.2.'
r2="${r1/$toremove/$newsubstr}"
echo "$r2"
outname="/N/dc2/projects/gibsonTomato/pimpGEA/STACKS/map/mapped_reads_no_rem/${r1::-8}.sam"
echo "$outname"
bwa mem -t 8 /N/dc2/projects/gibsonTomato/pimpGEA/data/genome/S_lycopersicum_chromosomes.3.00.fa $r1 $r2 > $outname
done
wait
- See
./pimpGEA/STACKS/filter_mapped/
for scripts
#filter-bams-stacks-assembly_no_rem.sh
#PBS -k oe #Keep the output and errors from the run
#PBS -l nodes=1:ppn=6,vmem=30gb,walltime=05:00:00
#PBS -M gibsomat@iu.edu
#PBS -m abe # Get notifications for the job
#PBS -N filterBams-STACKS-assembly_no_rem
module load samtools
cd /N/dc2/projects/gibsonTomato/pimpGEA/STACKS/filter_mapped/
./filter_bams.sh /N/dc2/projects/gibsonTomato/pimpGEA/STACKS/map/mapped_reads_no_rem/
#filter_bams.sh
#!/usr/bin/env bash
#Filters bams (sams in this case actually) based on specific criteria. In this case MQ > 3, no supplementary reads, and proper PE sequencing
cd $1
for f in *.sam
do
echo "$f" 1>&2
outname="/N/dc2/projects/gibsonTomato/pimpGEA/STACKS/filter_mapped/filtered_bams_no_rem/${f::-4}.filtered.bam"
samtools view -f 2 -q 4 -h -b -o $outname $f &
done
wait
cd /N/dc2/projects/gibsonTomato/pimpGEA/STACKS/filter_mapped/filtered_bams_no_rem/
for f in *.filtered.bam
do
outname="/N/dc2/projects/gibsonTomato/pimpGEA/STACKS/filter_mapped/filtered_bams_no_rem/${f::-4}.filtered.bam"
samtools flagstat $f > "$outname.stats" &
done
wait
- See
./pimpGEA/STACKS/sort_bams
for scripts
Code for sorting and indexing--I didnt index these bams, though.
#sort_index_bams.sh
#PBS -k oe #Keep the output and errors from the run
#PBS -l nodes=1:ppn=1,vmem=60gb,walltime=08:00:00
#PBS -M gibsomat@iu.edu
#PBS -m abe # Get notifications for the job
#PBS -N sortNindexbams
cd /N/dc2/projects/gibsonTomato/pimpGEA/STACKS/filter_mapped/filtered_bams_no_rem
module load samtools
echo "Sorting..."
for f in *.bam
do
echo "$f"
outname="/N/dc2/projects/gibsonTomato/pimpGEA/STACKS/sorted_bams/${f::-4}.sorted.bam"
samtools sort $f > $outname
done
#echo "Indexing..."
#cd /N/dc2/projects/gibsonTomato/pimpGEA/data/sorted_bams/
#for f in *.sorted.bam
#do
# samtools index $f &
#done
#wait
- See
./pimpGEA/STACKS/gstacks/
#pimpGEA-stacks-refmap.sh
#PBS -k oe #Keep the output and errors from the run
#PBS -l nodes=1:ppn=6,vmem=100gb,walltime=10:00:00
#PBS -M gibsomat@iu.edu
#PBS -m abe # Get notifications for the job
#PBS -N STACKS_refmap_no_rem
cd /N/dc2/projects/gibsonTomato/pimpGEA/STACKS/gstacks
#module load stacks
/N/dc2/projects/gibsonTomato/bin/ref_map.pl -T 6 --samples /N/dc2/projects/gibsonTomato/pimpGEA/STACKS/sorted_bams/ --popmap /N/dc2/projects/gibsonTomato/pimpGEA/STACKS/gstacks/popmap.txt -o /N/dc2/projects/gibsonTomato/pimpGEA/STACKS/gstacks/outfiles/
- See
./pimpGEA/STACKS/populations/
I pretty much requested all output files with minimal filters applied:
- Site needs to be in at least 1 population
- Within a population, the site much be typed in at least 20% of individuals.
#PBS -k oe #Keep the output and errors from the run
#PBS -l nodes=1:ppn=6,vmem=60gb,walltime=2:00:00
#PBS -M gibsomat@iu.edu
#PBS -m abe # Get notifications for the job
#PBS -N STACKS_populations
cd /N/dc2/projects/gibsonTomato/pimpGEA/STACKS/populations
#module load stacks
/N/dc2/projects/gibsonTomato/bin/populations -P /N/dc2/projects/gibsonTomato/pimpGEA/STACKS/gstacks/outfiles/ -M /N/dc2/projects/gibsonTomato/pimpGEA/STACKS/gstacks/popmap.txt -t 6 -O /N/dc2/projects/gibsonTomato/pimpGEA/STACKS/populations/outfiles1/ -p 1 -r 0.20 --hwe --fstats --fasta_loci --fasta_samples --vcf --vcf_haplotypes --structure --fasta_samples_raw
First, I changed 0772-101 to 0722-101, since it was a mislabeled plant from the get-go
sed -i '' 's/.filtered.sorted//g' populations.snps.vcf
Total # snps: 366,466 All biallelic
Total # snps: 359,796
grep -Ev 'SL3.0ch00' populations.snps.vcf > populations.snps.filter1.vcf
Total # of snps: 359,796
Doesn't remove sites, but changes genotypes to missing if DP < 4
vcftools --vcf populations.snps.filter1.vcf --recode --recode-INFO-all --out populations.snps.filter2 --minDP 4
- With all species
- With just arc
- With just chl
- With just chm
- With just hab
- With just pimp
Remove sites when fewer than 20% of individuals are typed.
Total # of snps full dataset: 109,184 Total # of snps just pimp: 112,252
#For all
vcftools --vcf populations.snps.filter2.alltaxa.vcf --recode --recode-INFO-all --out populations.snps.filter3.alltaxa --max-missing-count 128
#For pimp
vcftools --vcf populations.snps.filter2.pimp.vcf --recode --recode-INFO-all --out populations.snps.filter3.pimp --max-missing-count 120
Total # of snps full dataset: 106,440 Total # of snps just pimp: 44,533
vcftools --vcf populations.snps.filter3.alltaxa.recode.vcf --non-ref-ac-any 1 --recode --recode-INFO-all --out populations.snps.filter4.alltaxa.recode
vcftools --vcf populations.snps.filter3.pimp.recode.vcf --non-ref-ac-any 1 --recode --recode-INFO-all --out populations.snps.filter4.pimp
Total # of snps: 44,243
Done in TASSEL Max heterozygous = 0.6
142 individuals remaining. Total # of SNPs: 44,064
vcftools --vcf populations.snps.filter5.pimp.vcf --remove-indv 365911-101 --remove-indv 1601-101 --remove-indv 1416-101 --remove-indv 1987-101 --remove-indv 2649-101 --remove-indv 2933-101 --remove-indv 390695-101 --remove-indv 503517-101 --recode --recode-INFO-all --out populations.snps.filter6.pimp
#Remove invariant sites post indv filter
vcftools --vcf populations.snps.filter6.pimp.recode.vcf --non-ref-ac-any 1 --recode --recode-INFO-all --out populations.snps.filter7.pimp
Total # of SNPs: 17,358
bcftools +prune -l 0.9 -w 1000 populations.snps.filter7.pimp.recode.vcf -Ov -o populations.snps.filter8.pimp.vcf