Commit

Provenance2 into paper (#56)
* GitHub actions (#13)

* unit-testing actions

* unit-testing actions

* unit-testing actions

* unit-testing actions

* installing edirect

* installing edirect

* installing edirect

* installing edirect

* installing edirect

* rm travis

* edirect through apt

* edirect through apt

* Add files via upload

* adding taxonomy_v3.5.1

* More formats (#17)

* new files for individual genes and coordinates

* m

* new flag to include optional files with --and

* Listeria unit testing (#18)

* Listeria unit testing draft

* m

* debug

* debug

* debug

* update kalamari script; add --and flags

* kraken1 db

* m

* m

* m

* m

* editing PATH

* editing PATH

* fixing src path

* m

* fixing installation dir

* jellyfish1

* jellyfish1

* m

* just two genomes

* tree kraken

* added threads 2

* added threads 2

* build kraken -x

* work on disk in kraken

* debug

* trying out kraken2

* m

* removed rebuild and work-on-disk

* kraken report

* kraken report

* more inspection of kraken output

* more inspection of kraken output

* done with unit testing for now

Co-authored-by: Lee Katz - Aspen <gzu2@cdc.gov>

* new parent id

* a get taxonomy script for a reduced set of dmp files

* reduced taxonomy

* testing v3.9.2

* added parentid to plasmids

* Updating some Yersinia taxid (#16)

* Add files via upload

* adding taxonomy_v3.5.1

* adding v3.9.3 taxonomy

* m

* adding in Scott's Yersinia genomes

* cleanup

* updated to correct src tax dir

* Update unit-testing.yml

* Create CITATION.cff (#20)

* Create CITATION.cff

* Update CITATION.cff

* Kraken1 unit test (#21)

* with fixed taxonomy, unit test kraken1

* shortened the minimizer length to 9

* kraken1 query

* m

* adding a query is $query statement

Co-authored-by: Lee Katz - Aspen <gzu2@cdc.gov>

* Database doc update (#22)

* with fixed taxonomy, unit test kraken1

* shortened the minimizer length to 9

* kraken1 query

* m

* adding a query is $query statement

* Update DATABASES.md

* added blast and ANI instructions

* updated docs to reflect more comprehensive DATABASES.md

* m

Co-authored-by: Lee Katz - Aspen <gzu2@cdc.gov>

* mash database

* Define contributions (#23)

* validate taxonomy script

* unit testing for taxonomy

* unit testing for taxonomy

* moved XXXXXX entries to a todo file

* validating names.dmp and added new entries to make taxonomy more complete

* Contributing.md doc

* link to contributing.md

* more description under contributions

Co-authored-by: Lee Katz - Aspen <gzu2@cdc.gov>

* mmseqs2 just for fun

* m

* Sepia

* fixed bacillus genus back to bacteria in the plasmids (#24)

Co-authored-by: Lee Katz - Aspen <gzu2@cdc.gov>

* Build sepia (#25)

* fixed bacillus genus back to bacteria in the plasmids

* sepia building v1

* m

* sepia documentation and reference generation script

* m

Co-authored-by: Lee Katz - Aspen <gzu2@cdc.gov>

* fixed a bug where the same fasta file would be downloaded twice and given the parent taxid in addition to its own

* validate a kraken database better

* MIDAS

* m

* m

* Update README.md with reqs and recs (#29)

* Update chromosomes.tsv

* using GITHUB_PATH to solve CI problems

* m

* m

* limit tests to target branches

* jellyfish now in path

* m

* remove -x statement

* allow this workflow to work on master

* trying out taxonomy validator workflow

* remove kraken1 from testing on this branch

* fix path to taxonomy

* Fix ci (#31)

* using GITHUB_PATH to solve CI problems

* m

* m

* limit tests to target branches

* jellyfish now in path

* m

* remove -x statement

* allow this workflow to work on master

* trying out taxonomy validator workflow

* remove kraken1 from testing on this branch

* fix path to taxonomy

* check file sizes after pulling down accessions

* more debugging in the ci just in case

* change cryptosporidium parent taxids to cryptosporidium the genus

* merged new kalamari download script

* upped the version

* getExactTaxonomy.pl: better error messages

* downloadKalamari.pl: add in retmax 1

* only accept one sequence per insdc accession

* script to download kalamari from source

* numcpus option added; new bash script to download and format

* bash downloadKalamari.sh

* update to ubuntu 20

* 2 cpus in test

* add spreadsheet as a strategy variable

* m

* m

* split jobs between runners

* fix math

* adding more retries

* switch to 1 cpu for testing

* bump tag to v5.3.0

* std output for downloadKalamari.sh

* removed bioperl

* bump version; add more standard conda db location

* trying to speed up downloads

* vast speed increase with batch downloads; cleaned up chromosomes.tsv

* moved version information to the script from Makefile.PL; removed --and; won't make kraken db in shell script

* m

* remove edirect setup unit test

* update unit tests

* just two chunks of tests

* batch more

* fix file sizes check

* just make the damn thing work

* bash file uses local repo files instead of curl; default buffer size 100

* More proper build (#42)

* Building taxonomy (#38)

* building taxonomy files but this script will be deprecated right away

* deprecated

* script to build taxonomy with src files

* m

* move old taxonomy to deprecated

* remove old 'versioned' files outside of git versioning

* filter taxonomy script

* complete the taxonomy

* updated scripts for compiling databases

* dev branch testing

* fix lmono test a bit

* .

* Fix the taxonomy tests (#39)

* building taxonomy files but this script will be deprecated right away

* deprecated

* script to build taxonomy with src files

* m

* move old taxonomy to deprecated

* remove old 'versioned' files outside of git versioning

* filter taxonomy script

* complete the taxonomy

* updated scripts for compiling databases

* dev branch testing

* fix lmono test a bit

* .

* fix paths

* updated PATH

* updated PATH

* troubleshooting

* fix PATH again

* fix ls path

* remove that step

* updated tests to reflect build-taxonomy (#40)

* fix path to taxonomy files

* download and build taxonomy

* merge Listeria into Yersinia matrix

* m

* updated output directory as matrix.GENUS

* kraken1 tests patches

* m

* Fixed two more tests (#41)

* update yml

* query fallback

* debugging msg

* fix path to taxonomydb

* print first two lines of fasta files

* helpful cut statement

* remove head statement in last step

* bump version

* fix a downloading bug where sed stalls

* update for compressed kalamari library and more efficient kraken builds

* update download script

* Validate taxonomy (#43)

* validateTaxonomy update for just taxdirs; add 1 for filtered taxonomy; added DEBUG option for downloadKalamari.sh

* updated unit tests

* updated unit tests

* remove taxonomy stuff from downloadKalamari.sh

* fix validateTaxonomy syscall

* check on filtered tax in unit test

* Add genomes (#45) (#46)

* Corynebacterium diphtheriae

* added Bifidobacterium adolescentis

* replaced S. enterica IIIa; Added hops (Humulus lupulus)

* added a Citrobacter species

* m

* replaced repressed genome accession for B. faecium

* init paper

* some revisions; taxonomy; downloading

* swap example

* references

* stole Joe's draft-pdf.yml

* update to version 4 of artifacts

* plasmids description

* ignore rendered manuscripts

* some minor fixes; author affiliations; code examples

* added Shatavia; updated example

* m

* revisions from Jess

* refs

* fix list that became italics

* updated Andrew's affiliation

* plasmid defined species

* gave a name to the JOSS rendering

* try experimental docx file creation

* try 2 with container

* correct artifact Action

* m

* upload artifact v4

* branch agnostic

* try multiple formats; multiple uploads

* fix some citations

* fixed Dr. Lauer's info

* remove format arg

* shatavia's orcid

* added Rebecca's and Jess's orcids

* updated DOIs

* fixed comment line

* added Entrez Edirect URL

* more Entrez citation with help from CoPilot

* Andrew's orcid

* misc

* remove random single quotes

* bump version

* helpful log messages

* v5.6.3

* updated revisions from coauthors

* entered Taylor's revisions

* move Katie to acknowledgements due to her request

* update genome list; stable efetching (#49)

* Add genomes (#45)

* Corynebacterium diphtheriae

* added Bifidobacterium adolescentis

* replaced S. enterica IIIa; Added hops (Humulus lupulus)

* added a Citrobacter species

* m

* replaced repressed genome accession for B. faecium

* Esearch input (#47)

* Add genomes (#45) (#46)

* Corynebacterium diphtheriae

* added Bifidobacterium adolescentis

* replaced S. enterica IIIa; Added hops (Humulus lupulus)

* added a Citrobacter species

* m

* replaced repressed genome accession for B. faecium

* remove random single quotes

* bump version

* helpful log messages

* v5.6.3

* make symlink to avoid naming mistakes

* check whether taxonkit is loaded

* use efetch -input

* fix tr bug

* Esearch input flag (#48)

* Add genomes (#45) (#46)

* Corynebacterium diphtheriae

* added Bifidobacterium adolescentis

* replaced S. enterica IIIa; Added hops (Humulus lupulus)

* added a Citrobacter species

* m

* replaced repressed genome accession for B. faecium

* remove random single quotes

* bump version

* helpful log messages

* v5.6.3

* make symlink to avoid naming mistakes

* check whether taxonkit is loaded

* use efetch -input

* fix tr bug

* get latest edirect

* update installation instructions

* update installation instructions: fix PATH

* bring in other tests

* update installation method for search with unit-testing

* update installation method for search with kraken2

* debug the ls statement

* debug the ls statement

* debug the ls statement

* debug building taxonomy

* exclusive unit testing for taxonomy for right now

* install taxonkit

* changes from cdc clearance process

* disable buggy docx creation

* fix blast+ formatting typo

* Change to MIT license

* Update README.md: remove CC license sticker

* update entrez ref

* MRA

* MRA

* misc

* 500 words or less

* nix example

* abstract

* abbreviate genera

* another paper revision

* added asm pandoc template

* provenance

* Leptospira interrogans => CP020414

* some progress

* downloadKalamari.sh: nuccleotideAcc bug fixed

* v5.7.2

* another round of provenance

* cleared out the unknowns list

* fixed chromosomes with sources

* chromosomes

* try to run CI

* fix wildcard

* better named sources for each assembly

* polish this directory

* assembly-complete.gz

---------

Co-authored-by: Scott Nguyen <svn.phd@gmail.com>
Co-authored-by: Scott Nguyen <15314705+SVN-PhD@users.noreply.github.com>
Co-authored-by: Curtis Kapsak <kapsakcj@gmail.com>
4 people authored Nov 5, 2024
1 parent b841e88 commit 684bec6
Showing 36 changed files with 120,940 additions and 335 deletions.
15 changes: 4 additions & 11 deletions .github/workflows/unit-testing.Listeria.Kraken1.yml
@@ -2,6 +2,7 @@
on:
push:
branches: [master, dev, validate-taxonomy]
+pull_request:
name: Listeria-with-Kraken1

env:
@@ -41,18 +42,10 @@ jobs:
tree $(realpath .)
- name: install-edirect
run: |
-sudo apt-get install ncbi-entrez-direct
-echo "installed edirect the apt way"
-exit
-cd $HOME
-perl -MNet::FTP -e '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1); $ftp->login; $ftp->binary; $ftp->get("/entrez/entrezdirect/edirect.tar.gz");'
-gunzip -cv edirect.tar.gz | tar xf -
-rm -v edirect.tar.gz
-echo $GITHUB_WORKSPACE/edirect >> $GITHUB_PATH
+sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
+echo $HOME/edirect >> $GITHUB_PATH
echo $GITHUB_WORKSPACE/Kalamari/bin >> $GITHUB_PATH
-#export PATH=${PATH}:$HOME/edirect >& /dev/null || setenv PATH "${PATH}:$HOME/edirect"
-yes Y | ./edirect/setup.sh
-tree edirect
+tree $HOME/edirect
- name: check-env
run: echo "$PATH"
- name: select for only Listeria
7 changes: 7 additions & 0 deletions .github/workflows/unit-testing.Yersinia.Kraken2.yml
@@ -2,6 +2,7 @@
on:
push:
branches: [master, dev, validate-taxonomy]
+pull_request:
name: Genera-with-Kraken2

env:
@@ -34,6 +35,12 @@ jobs:
- name: env check
run: |
echo $PATH | tr ':' '\n' | sort
+- name: install-edirect
+run: |
+sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
+echo $HOME/edirect >> $GITHUB_PATH
+echo $GITHUB_WORKSPACE/Kalamari/bin >> $GITHUB_PATH
+tree $HOME/edirect
- name: apt-get install
run: sudo apt-get install ca-certificates tree jellyfish ncbi-entrez-direct
- name: select for only for this genus
15 changes: 5 additions & 10 deletions .github/workflows/unit-testing.yml
@@ -1,6 +1,7 @@
on:
push:
branches: [master, dev, validate-taxonomy]
+pull_request:
name: Pull-down-all-accessions

jobs:
@@ -27,16 +28,10 @@ jobs:
run: sudo apt-get install ca-certificates tree
- name: install-edirect
run: |
-sudo apt-get install ncbi-entrez-direct
-echo "installed edirect the apt way"
-exit
-cd $HOME
-perl -MNet::FTP -e '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1); $ftp->login; $ftp->binary; $ftp->get("/entrez/entrezdirect/edirect.tar.gz");'
-gunzip -cv edirect.tar.gz | tar xf -
-rm -v edirect.tar.gz
-export PATH=${PATH}:$HOME/edirect >& /dev/null || setenv PATH "${PATH}:$HOME/edirect"
-yes Y | ./edirect/setup.sh
-tree edirect
+sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
+echo $HOME/edirect >> $GITHUB_PATH
+echo $GITHUB_WORKSPACE/Kalamari/bin >> $GITHUB_PATH
+tree $HOME/edirect
- name: check-env
run: echo "$PATH"
- name: download
14 changes: 11 additions & 3 deletions .github/workflows/validateTaxonomy.yml
@@ -1,6 +1,7 @@
on:
push:
-branches: [master, dev, validate-taxonomy]
+branches: [master, dev, esearch-input]
+pull_request:
name: Validate taxonomy

jobs:
@@ -27,11 +28,18 @@ jobs:
echo $PATH
echo ""
cat $GITHUB_PATH
+- name: install taxonkit
+run: |
+wget https://github.com/shenwei356/taxonkit/releases/download/v0.16.0/taxonkit_linux_amd64.tar.gz
+tar -xvf taxonkit_linux_amd64.tar.gz
+rm -v taxonkit_linux_amd64.tar.gz
+chmod +x taxonkit
+echo $(realpath .) >> $GITHUB_PATH
- name: build taxonomy
run: |
echo $PATH
-bash Kalamari/bin/buildTaxonomy.sh
-bash Kalamari/bin/filterTaxonomy.sh
+bash -x Kalamari/bin/buildTaxonomy.sh
+bash -x Kalamari/bin/filterTaxonomy.sh
ls -lhR Kalamari/share/kalamari-*/taxonomy
- name: validate taxonomy
run: |
3 changes: 3 additions & 0 deletions .gitignore
@@ -2,3 +2,6 @@ edirect
share
paper/paper.html
paper/paper.doc
+# pixi environments
+.pixi
+*.egg-info
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 Lee Katz

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
3 changes: 1 addition & 2 deletions README.md
@@ -1,7 +1,6 @@
# Kalamari
-A database of completed assemblies for metagenomics-related tasks

-[![Creative Commons License v4](https://licensebuttons.net/l/by-sa/4.0/88x31.png)](LICENSE.md)
+A database of completed assemblies for metagenomics-related tasks

## Synopsis

8 changes: 8 additions & 0 deletions bin/buildKraken1.sh
@@ -22,11 +22,16 @@ cp -rv $TAXDIR $DB/taxonomy

# Make --add-to-library more efficient with
# concatenated fasta files
+export nl=$'\n'
find $SRC -name '*.fasta.gz' | \
xargs -n 100 -P 1 bash -c '
for i in "$@"; do
gzip -cd $i
done > $tmpfile
+echo -ne "ADDING to library:\n "
+zgrep "^>" $tmpfile | sed "s/^>//" | tr "$nl" " "
+echo
+echo "^^ contents of $tmpfile ^^"
kraken-build --db $DB --add-to-library $tmpfile
'

@@ -35,3 +40,6 @@ kraken-build --db $DB --build --threads 1
# Reduce the size of the database
kraken-build --db $DB --clean

+if [ ! -e "$sharedir/kalamari-kraken1" ]; then
+ln -sv kalamari-kraken "$sharedir/kalamari-kraken1"
+fi
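
The batching pattern in this hunk can be sketched in isolation. This is a minimal sketch and not the script itself: the directory `demo_src`, the sequence names, and `headers.txt` are hypothetical, the batch size is lowered to 2 for the demo, and `kraken-build --add-to-library` is replaced by a header listing so the sketch runs without Kraken installed. It also makes one detail explicit: `bash -c` assigns the first extra argument to `$0`, so the sketch passes a placeholder `_` to ensure every file name lands in `"$@"`. Whether the real script passes such a placeholder is not visible in this diff.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical scratch area standing in for $SRC and $tmpfile
workdir=$(mktemp -d)
src="$workdir/demo_src"
mkdir -p "$src"
export tmpfile="$workdir/batch.fasta"
export headers="$workdir/headers.txt"

# Three tiny gzipped FASTA files as stand-ins for downloaded genomes
for name in seqA seqB seqC; do
  printf '>%s\nACGT\n' "$name" | gzip -c > "$src/$name.fasta.gz"
done

# Decompress up to 2 files per batch into one FASTA, then record its headers.
# The trailing "_" fills $0 so that every file name is part of "$@".
find "$src" -name '*.fasta.gz' | sort | \
  xargs -n 2 -P 1 bash -c '
    for i in "$@"; do
      gzip -cd "$i"
    done > "$tmpfile"
    # The real script would call kraken-build --add-to-library here
    grep "^>" "$tmpfile" | sed "s/^>//" >> "$headers"
  ' _

num_headers=$(wc -l < "$headers" | tr -d ' ')
echo "headers recorded: $num_headers"
```

Concatenating each batch into one file means the library step runs once per batch rather than once per genome, which is the efficiency gain the script comment refers to.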
26 changes: 12 additions & 14 deletions bin/downloadKalamari.pl
@@ -11,7 +11,7 @@
use IO::Compress::Gzip;
use version 0.77;

-our $VERSION = version->parse("5.6.0");
+our $VERSION = version->parse("5.7.2");

use threads;

@@ -167,27 +167,25 @@ sub downloadEntries{
my $numEntries = scalar(@$entries);
my @acc = map{$$_{nuccoreAcc}} @$entries;
logmsg "Downloading ".scalar(@acc)." accessions";
-my $queryArg = join("[accession] OR ", sort(@acc))."[accession]";
my $dir = tempdir("download.XXXXXX", DIR=>$$settings{tempdir});

+# Make the input file for efetch
+my $inputAcc = "$dir/input.acc";
+open(my $fh, ">", $inputAcc) or die "ERROR: could not write to $inputAcc: $!";
+print $fh join("\n", @acc)."\n";
+close $fh;
+
# Accessions that had errors
my @err;

-# Get the esearch xml in place for at least one downstream query
-my $esearchXml = "$dir/esearch.xml";
-my $esearchCmd = "esearch -db nuccore -query '$queryArg' > $esearchXml";
-command($esearchCmd);
+# Get started on the comprehensive assembly file
+my $outfile = "$dir/all.fasta";
+logmsg "Downloading all accessions to $outfile using input accessions in $inputAcc";
+command("efetch -db nuccore -input $inputAcc -format fasta > $dir/all.fasta");
if($?){
-die "ERROR running: $esearchCmd: $!";
+die "ERROR: could not download all accessions";
}

-# Get started on the assembly file
-my $outfile = "$dir/all.fasta";
-
-# Main query: efetch
-my $efetchCmd = "cat $esearchXml | efetch -format fasta > $outfile";
-system($efetchCmd);
-
my $seqsWithVersion = readSeqs($outfile);
my $seqs = {};
while(my($acc, $seq) = each(%$seqsWithVersion)){
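
The change above replaces an `esearch` query string with an accession file passed to `efetch -input`. A minimal shell sketch of that flow follows, under stated assumptions: the accession `CP020414` is taken from the commit message, the file names mirror the Perl variables, and `RUN_EFETCH` is a hypothetical opt-in switch added only so the sketch runs without network access or EDirect installed.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)
inputAcc="$workdir/input.acc"
outfile="$workdir/all.fasta"

# Illustrative accession list; append more accessions to the array as needed
acc=(CP020414)

# One accession per line, matching what the Perl code writes
# with join("\n", @acc)."\n"
printf '%s\n' "${acc[@]}" > "$inputAcc"

num_acc=$(wc -l < "$inputAcc" | tr -d ' ')
echo "wrote $num_acc accession(s) to $inputAcc"

# RUN_EFETCH is a hypothetical switch for this sketch only; the download
# is skipped unless it is set to 1 and efetch is on PATH
if [ "${RUN_EFETCH:-0}" = "1" ] && command -v efetch >/dev/null 2>&1; then
  efetch -db nuccore -input "$inputAcc" -format fasta > "$outfile" \
    || { echo "ERROR: could not download all accessions" >&2; exit 1; }
fi
```

Reading accessions from a file avoids building one long `[accession] OR ...` query string on the command line, which is the part the diff removes.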
4 changes: 2 additions & 2 deletions bin/downloadKalamari.sh
@@ -23,8 +23,8 @@ echo "TEMPDIR is $tempdir" >&2
echo "OUTDIR is $outdir_prefix" >&2

TSV="$tempdir/in.tsv"
-cat $thisdir/../src/chromosomes.tsv > $TSV
-cat $thisdir/../src/plasmids.tsv >> $TSV
+cat $thisdir/../src/chromosomes.tsv > $TSV
+tail -n +2 $thisdir/../src/plasmids.tsv >> $TSV

cp -rv $thisdir/../src/taxonomy $tempdir/taxonomy
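
The switch from `cat` to `tail -n +2` for the second table keeps a single header row in the combined file. A small sketch of the effect, assuming both tables share the same header; the two-column layout and the row values here are made up for illustration and do not reflect the real `chromosomes.tsv` or `plasmids.tsv`.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)

# Hypothetical two-column tables with identical headers
printf 'name\tacc\nListeria\tA1\n' > "$workdir/chromosomes.tsv"
printf 'name\tacc\npXO1\tB1\npXO2\tB2\n' > "$workdir/plasmids.tsv"

TSV="$workdir/in.tsv"
# First table is copied whole, header included
cat "$workdir/chromosomes.tsv" > "$TSV"
# Second table is appended starting at line 2, which drops its header
tail -n +2 "$workdir/plasmids.tsv" >> "$TSV"

header_count=$(grep -c '^name' "$TSV")
row_count=$(wc -l < "$TSV" | tr -d ' ')
echo "header lines: $header_count, total lines: $row_count"
```

With plain `cat` for both files, the second header would be carried into the combined table as if it were a data row.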

5 changes: 5 additions & 0 deletions bin/filterTaxonomy.sh
@@ -2,6 +2,11 @@

set -eu

+# Check for dependencies
+echo "Check for dependencies"
+which taxonkit
+echo
+
thisdir=$(dirname $0)
thisfile=$(basename $0)
KALAMARI_VER=$(downloadKalamari.pl --version)
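
Because the script runs under `set -eu`, the added `which taxonkit` line stops execution early when taxonkit is not on `PATH`. The sketch below shows a variant of the same idea that names the missing program in an error message. It is an alternative, not the script's method: it swaps `which` for the shell builtin `command -v`, and the function name `check_deps` and the bogus program name are hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# Return 0 if every named program is on PATH; otherwise report each
# missing program on stderr and return 1
check_deps() {
  local missing=0
  local prog
  for prog in "$@"; do
    if ! command -v "$prog" >/dev/null 2>&1; then
      echo "ERROR: required program not found in PATH: $prog" >&2
      missing=1
    fi
  done
  return "$missing"
}

# "sh" should exist on any POSIX system; the second name is deliberately bogus
sh_status=0
check_deps sh || sh_status=$?
bogus_status=0
check_deps no-such-program-kalamari-demo 2>/dev/null || bogus_status=$?
echo "sh: $sh_status, bogus: $bogus_status"
```

In a real script the call would be `check_deps taxonkit` near the top, so that the failure is reported before any work is done.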
6 changes: 4 additions & 2 deletions paper/mra.md
@@ -47,7 +47,9 @@ Kalamari also contains a custom taxonomy and software for downloading and format

## Announcement

-Public Health laboratories sequence microbial pathogens daily for genomic epidemiology, i.e., to track pathogen spread [@armstrong2019pathogen].
+Public Health laboratories sequence microbial pathogens daily for many applications including genomic epidemiology [@armstrong2019pathogen],
+species identification [@lindsey2023rapid],
+and metagenomic analysis [@huang2017metagenomics].
Relevant databases exist such as RefSeq [@o2016reference] or The Genome Taxonomy Database (GTDB) [@parks2022gtdb].
However, due to their so comprehensive nature,
they are disadvantageous for our specific purposes:
@@ -64,7 +66,7 @@ All chromosomes and plasmids are complete, i.e., no contig breaks,
and obtained from trusted sources, e.g., FDA-ARGOS [@sichtig2019fda] or the NCTC 3000 collection [@dicks2023nctc3000], or provided and reviewed by a CDC subject matter expert.

We obtained the list of plasmids from the Mob-Suite project [@robertsonMobsuite]
-and clustered them at 97% average nucleotide identity (ANI) [@lindsey2023rapid].
+and clustered them at 97% average nucleotide identity using edlb_ani_mummer v1 with default options [@lindsey2023rapid].
For each cluster, the taxonomy identifier was raised to the lowest common tier of taxonomy.
For example, if a cluster of plasmids were identified in both _Escherichia coli_ and _Salmonella enterica_, then taxonomy identifiers for all the plasmids in the cluster were changed to their common family, _Enterobacteriaceae_.
As a result, any taxonomic signature from these plasmids