More proper build (#42)
* Building taxonomy (#38)

* building taxonomy files but this script will be deprecated right away

* deprecated

* script to build taxonomy with src files

* m

* move old taxonomy to deprecated

* remove old 'versioned' files outside of git versioning

* filter taxonomy script

* complete the taxonomy

* updated scripts for compiling databases

* dev branch testing

* fix lmono test a bit

* .

* Fix the taxonomy tests (#39)

* building taxonomy files but this script will be deprecated right away

* deprecated

* script to build taxonomy with src files

* m

* move old taxonomy to deprecated

* remove old 'versioned' files outside of git versioning

* filter taxonomy script

* complete the taxonomy

* updated scripts for compiling databases

* dev branch testing

* fix lmono test a bit

* .

* fix paths

* updated PATH

* updated PATH

* troubleshooting

* fix PATH again

* fix ls path

* remove that step

* updated tests to reflect build-taxonomy (#40)

* fix path to taxonomy files

* download and build taxonomy

* merge Listeria into Yersinia matrix

* m

* updated output directory as matrix.GENUS

* kraken1 tests patches

* m

* Fixed two more tests (#41)

* update yml

* query fallback

* debugging msg

* fix path to taxonomydb

* print first two lines of fasta files

* helpful cut statement

* remove head statement in last step

* bump version
lskatz authored May 9, 2024
1 parent 51fea40 commit 8bdf873
Showing 21 changed files with 273 additions and 445 deletions.
19 changes: 11 additions & 8 deletions .github/workflows/unit-testing.Listeria.Kraken1.yml
@@ -1,7 +1,7 @@
# This is a subsampling unit test to get early results
on:
push:
branches: [master]
branches: [master, dev, build-taxonomy]
name: Listeria-with-Kraken1

env:
@@ -23,7 +23,7 @@ jobs:
perl-version: ${{ matrix.perl }}
multi-thread: "true"
- name: checkout my repo
uses: actions/checkout@v2
uses: actions/checkout@v4
with:
path: Kalamari

@@ -48,7 +48,9 @@ jobs:
perl -MNet::FTP -e '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1); $ftp->login; $ftp->binary; $ftp->get("/entrez/entrezdirect/edirect.tar.gz");'
gunzip -cv edirect.tar.gz | tar xf -
rm -v edirect.tar.gz
export PATH=${PATH}:$HOME/edirect >& /dev/null || setenv PATH "${PATH}:$HOME/edirect"
echo $GITHUB_WORKSPACE/edirect >> $GITHUB_PATH
echo $GITHUB_WORKSPACE/Kalamari/bin >> $GITHUB_PATH
#export PATH=${PATH}:$HOME/edirect >& /dev/null || setenv PATH "${PATH}:$HOME/edirect"
yes Y | ./edirect/setup.sh
tree edirect
- name: check-env
@@ -64,10 +66,6 @@ jobs:
run: perl Kalamari/bin/downloadKalamari.pl --outdir ${{ env.OUTDIR }} ${{ env.TSV }}
- name: check-results
run: tree ${{ env.OUTDIR }}
#- name: download-more
# run: perl Kalamari/bin/downloadKalamari.pl --outdir ${{ env.OUTDIR }} ${{ env.TSV }} --and protein --and nucleotide
#- name: check-results
# run: tree ${{ env.OUTDIR }}
- name: install kraken
run: |
wget https://github.com/DerrickWood/kraken/archive/refs/tags/v1.1.1.tar.gz -O kraken-v1.1.1.tar.gz
@@ -76,12 +74,17 @@ jobs:
chmod -v +x kraken-1.1.1/kraken-src/*
echo $(realpath kraken-1.1.1/kraken-src) >> $GITHUB_PATH
tree $(realpath kraken-1.1.1)
- name: build taxonomy
run: |
export PATH=$PATH:Kalamari/bin
buildTaxonomy.sh
ls -lh Kalamari/share
- name: Kraken1 database
run: |
echo $PATH
which kraken-build
mkdir -pv kraken
cp -rv Kalamari/src/taxonomy kraken/taxonomy
cp -rv Kalamari/share/kalamari-*/taxonomy kraken/taxonomy
find ${{ env.OUTDIR }} -name '*.fasta' -exec kraken-build --db kraken --add-to-library {} \;
tree kraken
# Some super debugging here with -x
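
Note on the PATH change in this workflow: the hunk above replaces an in-step `export PATH=...` with appends to `$GITHUB_PATH`. As documented for GitHub Actions, a directory written to that file is prepended to `PATH` for the following steps only, so commands run later in the same step do not see it yet. The sketch below is one way to cover both cases. `add_to_path` is a hypothetical helper name, not something this commit defines.

```shell
# Hypothetical helper (not part of this commit): make a directory available
# on PATH in the current step and in later steps of the same job.
add_to_path() {
  local dir=$1
  # Later steps: the runner reads this file after the step finishes.
  # Outside GitHub Actions GITHUB_PATH is unset, so fall back to /dev/null.
  echo "$dir" >> "${GITHUB_PATH:-/dev/null}"
  # Current step: update this shell's PATH right away.
  export PATH="$dir:$PATH"
}
```

In a workflow step this could be called as `add_to_path "$GITHUB_WORKSPACE/Kalamari/bin"`, using the same absolute path the commit already writes to `$GITHUB_PATH`.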
79 changes: 0 additions & 79 deletions .github/workflows/unit-testing.Listeria.Kraken2.yml

This file was deleted.

38 changes: 22 additions & 16 deletions .github/workflows/unit-testing.Yersinia.Kraken2.yml
@@ -1,14 +1,12 @@
# This is a subsampling unit test to get early results
on:
push:
branches: [master]
name: Yersinia-with-Kraken2
branches: [master, dev, build-taxonomy]
name: Genera-with-Kraken2

env:
TSV: "Kalamari/src/genus.tsv"
OUTDIR: "Yersinia.out"
DB: "kraken2"
SRC_TAX: "Kalamari/src/taxonomy"
SRC_CHR: "Kalamari/src/chromosomes.tsv"
SRC_PLD: "Kalamari/src/plasmids.tsv"
GENUS: Yersinia
@@ -20,15 +18,16 @@ jobs:
matrix:
os: ['ubuntu-20.04' ]
perl: [ '5.32' ]
name: Perl ${{ matrix.perl }} on ${{ matrix.os }}
GENUS: [ 'Yersinia', 'Listeria']
name: ${{ matrix.GENUS }} Perl ${{ matrix.perl }} on ${{ matrix.os }}
steps:
- name: Set up perl
uses: shogo82148/actions-setup-perl@v1
with:
perl-version: ${{ matrix.perl }}
multi-thread: "true"
- name: checkout my repo
uses: actions/checkout@v2
uses: actions/checkout@v4
with:
path: Kalamari

@@ -40,29 +39,37 @@ jobs:
- name: select for only for this genus
run: |
head -n 1 ${{ env.SRC_CHR }} > ${{ env.TSV }}
grep -m 2 ${{ env.GENUS }} ${{ env.SRC_CHR }} >> ${{ env.TSV }}
grep -m 2 ${{ env.GENUS }} ${{ env.SRC_PLD }} >> ${{ env.TSV }}
echo "These are the ${{ env.GENUS }} genomes for downstream tests"
grep -m 2 ${{ matrix.GENUS }} ${{ env.SRC_CHR }} >> ${{ env.TSV }}
grep -m 2 ${{ matrix.GENUS }} ${{ env.SRC_PLD }} >> ${{ env.TSV }}
echo "These are the ${{ matrix.GENUS }} genomes for downstream tests"
column -ts $'\t' ${{ env.TSV }}
hexdump -c ${{ env.TSV }}
- name: download
run: perl Kalamari/bin/downloadKalamari.pl --outdir ${{ env.OUTDIR }} ${{ env.TSV }}
run: perl Kalamari/bin/downloadKalamari.pl --outdir ${{ matrix.GENUS }} ${{ env.TSV }}
- name: check-results
run: tree ${{ env.OUTDIR }}
run: |
tree ${{ matrix.GENUS }}
echo "First two lines of each fasta file:"
find ${{ matrix.GENUS }} -name '*.fasta' | xargs head -n 2 | cut -c 1-60
- name: install kraken
run: |
wget https://github.com/DerrickWood/kraken2/archive/refs/tags/v2.1.2.tar.gz -O kraken-v2.1.2.tar.gz
tar zxvf kraken-v2.1.2.tar.gz
cd kraken2-2.1.2 && bash install_kraken2.sh target && cd -
ls -lhS kraken2-2.1.2/target
chmod +x kraken2-2.1.2/target/*
- name: build taxonomy
run: |
export PATH=$PATH:Kalamari/bin
buildTaxonomy.sh
ls -lh Kalamari/share
- name: Kraken2 database
run: |
export PATH=$PATH:kraken2-2.1.2/target
which kraken2-build
mkdir -pv ${{ env.DB }}
cp -rv ${{ env.SRC_TAX }} ${{ env.DB }}/taxonomy
find ${{ env.OUTDIR }} -name '*.fasta' -exec kraken2-build --db ${{ env.DB }} --add-to-library {} \;
cp -rv Kalamari/share/kalamari-*/taxonomy ${{ env.DB }}/taxonomy
find ${{ matrix.GENUS }} -name '*.fasta' -exec kraken2-build --db ${{ env.DB }} --add-to-library {} \;
tree ${{ env.DB }}
echo ".....Building the database....."
kraken2-build --build --db ${{ env.DB }} --threads 2
@@ -71,10 +78,9 @@ jobs:
export PATH=$PATH:kraken2-2.1.2/target
tree ${{ env.DB }}
ls -lhSR ${{ env.DB }}
QUERY=$(find ${{ env.OUTDIR }} -name '*.fasta' | head -n 1)
QUERY=$(find ${{ matrix.GENUS }} -name '*.fasta' | head -n 1)
echo "QUERY is $QUERY"
head -n 2 $QUERY
kraken2 --db ${{ env.DB }} --report kraken2.report --use-mpa-style --output kraken2.raw $QUERY
set -x; kraken2 --db ${{ env.DB }} --report kraken2.report --use-mpa-style --output kraken2.raw $QUERY; set +x;
head kraken2.report kraken2.raw
2 changes: 1 addition & 1 deletion .github/workflows/unit-testing.yml
@@ -1,6 +1,6 @@
on:
push:
branches: [master]
branches: [master, dev]
name: Pull-down-all-accessions

jobs:
25 changes: 18 additions & 7 deletions .github/workflows/validateTaxonomy.yml
@@ -1,6 +1,6 @@
on:
push:
branches: [fix-CI, master]
branches: [master, dev, build-taxonomy]
name: Validate taxonomy

jobs:
@@ -18,15 +18,26 @@ jobs:
perl-version: ${{ matrix.perl }}
multi-thread: "true"
- name: checkout my repo
uses: actions/checkout@v2
uses: actions/checkout@v4
with:
path: Kalamari

- name: validate taxonomy
- name: update PATH
run: |
echo $GITHUB_WORKSPACE/Kalamari/bin >> $GITHUB_PATH
echo $PATH
echo ""
cat $GITHUB_PATH
- name: build taxonomy
run: |
perl Kalamari/bin/validateTaxonomy.pl Kalamari/src
echo $PATH
bash Kalamari/bin/buildTaxonomy.sh
ls -lhR Kalamari/share/kalamari-*/taxonomy
#- name: validate taxonomy
# run: |
# perl Kalamari/bin/validateTaxonomy.pl Kalamari/share/kalamari-*/taxonomy/nodes.dmp Kalamari/share/kalamari-*/taxonomy/names.dmp
- name: matching taxids
run: |
export taxdir=$(\ls -d Kalamari/share/kalamari-*/taxonomy)
echo "Making sure that all taxids in chromosomes.tsv and plasmids.tsv are present in nodes.dmp and names.dmp"
tail -n +2 Kalamari/src/chromosomes.tsv Kalamari/src/plasmids.tsv -q | perl -F'\t' -lane 'BEGIN{@node=`cat Kalamari/src/taxonomy/nodes.dmp`; for $n(@node){($taxid)=split(/\t/, $n); $taxid{$taxid}++; } } for my $t($F[2], $F[3]){ if(!$taxid{$t}){ print "Could not find $t taxid";} }'
tail -n +2 Kalamari/src/chromosomes.tsv Kalamari/src/plasmids.tsv -q | perl -F'\t' -lane 'BEGIN{@name=`cat Kalamari/src/taxonomy/names.dmp`; for $n(@name){($taxid)=split(/\t/, $n); $taxid{$taxid}++; } } for my $t($F[2], $F[3]){ if(!$taxid{$t}){ print "Could not find $t taxid";} }'
tail -n +2 Kalamari/src/chromosomes.tsv Kalamari/src/plasmids.tsv -q | perl -F'\t' -lane 'BEGIN{@node=`cat $ENV{taxdir}/nodes.dmp`; for $n(@node){($taxid)=split(/\t/, $n); $taxid{$taxid}++; } } for my $t($F[2], $F[3]){ if(!$taxid{$t}){ print "Could not find $t taxid";} }'
tail -n +2 Kalamari/src/chromosomes.tsv Kalamari/src/plasmids.tsv -q | perl -F'\t' -lane 'BEGIN{@name=`cat $ENV{taxdir}/names.dmp`; for $n(@name){($taxid)=split(/\t/, $n); $taxid{$taxid}++; } } for my $t($F[2], $F[3]){ if(!$taxid{$t}){ print "Could not find $t taxid";} }'
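
Note on the "matching taxids" step: the Perl one-liners above load column 1 of a `.dmp` file, then report any value in the third and fourth TSV columns (`$F[2]`, `$F[3]`) that is missing from it. The same check can be sketched as a single reusable function. `missing_taxids` is a hypothetical name, and the awk version is a substitute for the Perl, not what the workflow runs.

```shell
# Hypothetical helper (not part of this commit): print a message for every
# taxid in TSV columns 3 and 4 that is absent from column 1 of a .dmp file.
# Usage: missing_taxids DUMP_FILE TSV [TSV...]
missing_taxids() {
  local dump=$1
  shift
  awk -F'\t' '
    # First file (the .dmp): remember every taxid found in column 1.
    FILENAME == ARGV[1] { seen[$1] = 1; next }
    # Remaining files (the .tsv): skip the header line of each file.
    FNR == 1 { next }
    # Check the two taxid columns against the remembered set.
    {
      for (i = 3; i <= 4; i++)
        if (!($i in seen)) print "Could not find " $i " taxid"
    }
  ' "$dump" "$@"
}
```

Called as `missing_taxids "$taxdir/nodes.dmp" Kalamari/src/chromosomes.tsv Kalamari/src/plasmids.tsv`, and again with `names.dmp`, it would cover both checks. Like the original, it only prints; a step that should fail on a missing taxid would also need to test whether the output is empty.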
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
edirect
share
26 changes: 14 additions & 12 deletions README.md
@@ -39,20 +39,22 @@ using your own email address instead of `my@email.address`.

## Download instructions

For usage, run `perl bin/downloadKalamari.pl --help`
First, build the taxonomy.
The script `buildTaxonomy.sh` uses the diffs in Kalamari to enhance the default NCBI taxonomy.
Next, `filterTaxonomy.sh` reduces the taxonomy files to just those found in Kalamari.
`filterTaxonomy.sh` uses `taxonkit` and so this needs to be in your
environment before starting.

SRC=Kalamari
perl bin/downloadKalamari.pl -o $SRC src/chromosomes.tsv
bash bin/buildTaxonomy.sh
bash bin/filterTaxonomy.sh

### ...with plasmids
To download the chromosomes and plasmids, pass the corresponding `.tsv` files (`src/chromosomes.tsv` and `src/plasmids.tsv`) to `downloadKalamari.pl`.
Run `downloadKalamari.pl --help` for usage.
However, to download the files to a standard location,
please simply use `downloadKalamari.sh` which uses
`downloadKalamari.pl` internally.

SRC=Kalamari
perl bin/downloadKalamari.pl -o $SRC src/chromosomes.tsv src/plasmids.tsv

### taxonomy

The taxonomy files `nodes.dmp` and `names.dmp` are under `src/taxonomy-VER`
where `VER` is the version of Kalamari.
bash bin/downloadKalamari.sh

## Database formatting instructions

@@ -80,4 +82,4 @@ Please see [CONTRIBUTING.md](CONTRIBUTING.md)

## Citation

Please refer to the ASM 2018 poster under docs
Please refer to the ASM 2018 poster under docs.
22 changes: 22 additions & 0 deletions bin/buildKraken1.sh
@@ -0,0 +1,22 @@
#!/bin/bash

set -eu

thisdir=$(dirname $0)
KALAMARI_VER=$(downloadKalamari.pl --version)

sharedir=$thisdir/../share/kalamari-$KALAMARI_VER
SRC="$sharedir/kalamari"
TAXDIR="$sharedir/taxonomy/filtered"

# Test prereqs
which kraken-build
which jellyfish

DB="$sharedir/kalamari-kraken1"
mkdir -pv $DB
cp -rv $TAXDIR $DB/taxonomy
find $SRC -name '*.fasta' \
-exec kraken-build --db $DB --add-to-library {} \;
kraken-build --db $DB --build --threads 1
kraken-build --db $DB --clean
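
Note on the prerequisite check: with `set -eu`, a bare `which kraken-build` does stop the script when the tool is missing, but the only hint is whatever `which` prints. A possible alternative, assuming a clearer message is wanted, is sketched below. `require_cmd` is a hypothetical helper, not part of this commit.

```shell
# Hypothetical helper (not part of this commit): fail early with a readable
# message when a required executable is missing from PATH.
require_cmd() {
  local cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" > /dev/null 2>&1; then
      echo "ERROR: required command not found in PATH: $cmd" >&2
      return 1
    fi
  done
}
```

In this script the two `which` lines could then become `require_cmd kraken-build jellyfish`.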
22 changes: 22 additions & 0 deletions bin/buildKraken2.sh
@@ -0,0 +1,22 @@
#!/bin/bash

set -eu

thisdir=$(dirname $0)
KALAMARI_VER=$(downloadKalamari.pl --version)

sharedir=$thisdir/../share/kalamari-$KALAMARI_VER
SRC="$sharedir/kalamari"
TAXDIR="$sharedir/taxonomy/filtered"

# Test prereqs
which kraken2-build
which jellyfish

DB="$sharedir/kalamari-kraken2"
mkdir -pv $DB
cp -rv $TAXDIR $DB/taxonomy
find $SRC -name '*.fasta' \
-exec kraken2-build --db $DB --add-to-library {} \;
kraken2-build --db $DB --build --threads 1
kraken2-build --db $DB --clean
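
Note on duplication: `buildKraken1.sh` and `buildKraken2.sh` differ only in the builder command and the database directory name. A possible consolidation is sketched below, using only the flags that appear in the two scripts. `build_kraken_db` is a hypothetical function name, and the loop assumes FASTA file names contain no newlines.

```shell
# Hypothetical consolidation (not part of this commit): build either database,
# given the builder command (kraken-build or kraken2-build).
# Usage: build_kraken_db BUILDER SRC_DIR TAXONOMY_DIR DB_DIR
build_kraken_db() {
  local builder=$1 src=$2 taxdir=$3 db=$4
  mkdir -p "$db"
  cp -r "$taxdir" "$db/taxonomy"
  # Add every FASTA file under $src to the library, one at a time.
  find "$src" -name '*.fasta' |
    while IFS= read -r fasta; do
      "$builder" --db "$db" --add-to-library "$fasta"
    done
  "$builder" --db "$db" --build --threads 1
  "$builder" --db "$db" --clean
}
```

With the variables already defined in the scripts, the Kraken2 case would read `build_kraken_db kraken2-build "$SRC" "$TAXDIR" "$sharedir/kalamari-kraken2"`.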