We use a segmentation model to extract traits from minnows (Family: Cyprinidae).
This repository serves as a case study of an automated workflow for extracting morphological traits from image data using machine learning.
We expand upon work already done by BGNN, including metadata collection by the Tulane Team and the Drexel Team (see Leipzig et al. 2021, Pepper et al. 2021, and Narnani et al. 2022), and a segmentation model developed by the Virginia Tech Team. We developed morphology extraction tools (Morphology-analysis) with the help of the Tulane Team. We incorporate these tools into BGNN_Core_Workflow.
Finally, with the help of the Duke Team, we create an automated workflow.
This repository aims to:
- Create a use case for an automated workflow
- Show best practices for interacting with other repositories
- Show the utility of a machine learning segmentation model in accelerating trait extraction from images of specimens
Scripts
- Data_Manipulation.R: code for manipulating and merging data files
- Minnow_Selection_Image_Quality_Metadata.R: code for image selection
- Presence_Absence_Analysis.R: code for analyzing machine learning outputs
- init.R: code to load functions in Functions
Files
- Previous_Measurements: a file of previously published measurements of minnow traits, taken from supplemental information. See Burress.md for more details.
Results
- a folder for the outputs from the workflow
- tables of results from analyses
- /Figures contains all figures created from analyses
Config
- contains the config.yaml file
- the user can change the file inputs or the number of images under limit_images
The Previous_Measurements file is included in this repository.
The Fish-AIR input files will be downloaded from the Fish-AIR API.
This requires a Fish-AIR API key to be added to Fish_AIR_API_Key in config/config.yaml.
Alternatively, you can download the Fish-AIR input files from Zenodo and place them in the Files/Fish-AIR/Tulane directory.
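For illustration, the API key entry in config/config.yaml might look like the excerpt below (the exact layout of the file may differ; only the Fish_AIR_API_Key name comes from this repository):

```yaml
# Illustrative excerpt only; the actual config/config.yaml contains additional
# keys (e.g. limit_images).
Fish_AIR_API_Key: "<your Fish-AIR API key>"
```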
The total size of the components is 5.6 GB (as of 5 May 2023).
All weights and dependencies for all components of the workflow are uploaded to Hugging Face or Zenodo.
- Metadata by Drexel Team
  - Object detection of fish and ruler from fish images
  - Repository
  - Model Archive
- Reformatting of metadata
  - Trims the metadata output from the Metadata step to only the values necessary for this project
  - Repository
  - Code Archive
- Crop Image
  - Extracts bounding box information from the metadata file
  - Resizes and crops the fish from the image
  - Repository
  - Code Archive
- Segmentation Model by Virginia Tech Team
  - Segments fish traits from fish images
  - Repository
  - Model Archive
- Morphology analysis by Tulane Team and Battelle Team
  - Tool to calculate the presence of traits
  - Repository
  - Code Archive
- Machine Learning Workflow by Battelle Team and Duke Team
  - Calls all of the above containers
  - Repository
  - Code Archive
The fish images are from the Great Lakes Invasives Network (GLIN) and stored on Fish-AIR. We are using images specifically from the Illinois Natural History Survey (INHS images).
R code (Minnow_Selection_Image_Quality_Metadata.R) was used to select high-quality minnow images using the IQM and IM metadata files.
All image metadata files are downloaded from Fish-AIR, and the version used is stored on the OSC data commons under the Fish Traits dataverse. The metadata files were generated using the Tulane workflow.
Criteria for the selection of an image were based on findings from Pepper et al. 2021; a sketch of how these criteria might be applied in R follows the list below.
Criteria chosen:
- family == "Cyprinidae"
- specimenView == "left"
- specimenCurved == "straight"
- allPartsVisible == "True"
- partsOverlapping == "True"
- partsFolded == "False"
- uniformBackground == "True"
- partsMissing == "False"
- brightness == "normal"
- onFocus == "True"
- colorIssues == "none"
- containsScaleBar == "True"
- from either INHS or UWZM institutions
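The actual selection code lives in Minnow_Selection_Image_Quality_Metadata.R; the sketch below only illustrates how criteria like these might be applied with dplyr. The file path and the institution column name are assumptions, not part of the repository.

```r
# Illustrative sketch only; see Minnow_Selection_Image_Quality_Metadata.R for
# the real selection code. The file path and the "institution" column are assumptions.
library(dplyr)

metadata <- read.csv("Files/Fish-AIR/Tulane/image_quality_metadata.csv")  # hypothetical path

selected <- metadata %>%
  filter(family == "Cyprinidae",
         specimenView == "left",
         specimenCurved == "straight",
         allPartsVisible == "True",
         partsOverlapping == "True",
         partsFolded == "False",
         uniformBackground == "True",
         partsMissing == "False",
         brightness == "normal",
         onFocus == "True",
         colorIssues == "none",
         containsScaleBar == "True",
         institution %in% c("INHS", "UWZM"))
```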
See more details in Morphology-analysis.
Each segmented image has the following traits: trunk, head, eye, dorsal fin, caudal fin, anal fin, pelvic fin, and pectoral fin. For each segmented trait, there may be more than one "blob", or group of pixels identified as that trait. We recorded these results in a presence/absence matrix (presence.absence.matrix.csv).
For each trait, we counted the number of "blobs" and calculated the size of the largest blob as a percentage of all blobs for that trait.
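As a minimal sketch of this calculation (the real code is in Presence_Absence_Analysis.R; the table layout and column names here are assumptions):

```r
# Sketch only; assumes one row per detected blob with illustrative column names.
library(dplyr)

# Example input: one row per blob found for a trait in an image.
blobs <- data.frame(
  file_name = c("example_image_1.jpg", "example_image_1.jpg", "example_image_1.jpg"),
  trait     = c("dorsal_fin", "dorsal_fin", "eye"),
  area      = c(1200, 300, 450)
)

blob_summary <- blobs %>%
  group_by(file_name, trait) %>%
  summarise(blob_count = n(),
            percent_largest = 100 * max(area) / sum(area),
            .groups = "drop")
```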
All intermediate tables will be saved in the folder "Results".
We created a heat map to show the success of the segmentation model in detecting traits from the images.
Figures are in the folder "Results".
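A minimal ggplot2 sketch of such a heat map, reusing the hypothetical blob_summary table from the sketch above (the actual figure code is in Presence_Absence_Analysis.R):

```r
# Sketch only; plots whether each trait was detected (at least one blob) per image.
library(ggplot2)

ggplot(blob_summary, aes(x = trait, y = file_name, fill = blob_count > 0)) +
  geom_tile() +
  labs(x = "Trait", y = "Image", fill = "Trait detected")
```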
Instructions are provided for running the workflow on a single computer or a SLURM cluster.
The run time for 20 images is about 45 minutes and the run time for all the images is about 2 hours.
To run the workflow, conda and Singularity (now Apptainer) must be installed.
This workflow automatically downloads and sets up the software dependencies required by the workflow components. These dependencies are provided using either Singularity containers or Conda environments. Singularity containers are used to provide the machine learning components essential to this workflow because they are highly reproducible and portable. However, Singularity containers can pose challenges for script development by domain scientists, so we use Conda environments for the domain scientist scripts included in this workflow.
At minimum, the workflow requires 1 CPU, 5 GB of memory, and 30 GB of disk space. A Linux machine is required to provide Singularity containerization.
To run the workflow, Snakemake v7 with mamba must be installed. (The workflow definition is not compatible with Snakemake v8+.) To handle this, we create a new conda environment named "snakemake".
If you are running the workflow on a cluster that provides a conda environment module, you should load that module (e.g. module load miniconda3).
Run the following command to create a conda environment named "snakemake" with the required workflow dependencies.
conda create -c conda-forge -c bioconda -n snakemake snakemake=7 mamba
Enter "Y" when prompted to install snakemake and mamba.
If you loaded an environment module, you should unload it (e.g. module purge).
See the official instructions for installing snakemake for more options.
In the config/config.yaml file, the user can limit the number of images for a test run by changing the integer under limit_images, or process them all by entering "". Be aware that downloading all the images and running the workflow takes a couple of hours.
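For example, a test run on 20 images versus a full run could be configured as follows (illustrative excerpt; the rest of the file is omitted):

```yaml
# Test run on 20 images:
limit_images: 20
# Full run over all images:
# limit_images: ""
```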
Run the following commands to activate the conda environment and run the workflow:
source activate snakemake
snakemake --jobs 1 --use-singularity --use-conda
The --jobs argument specifies how many processes snakemake can run at a time.
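For example, to allow snakemake to run up to four processes at a time:
snakemake --jobs 4 --use-singularity --use-conda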
Running the workflow on a SLURM cluster enables scaling beyond a single machine. The run-workflow.sh sbatch script is provided to run the workflow using sbatch and will process up to 20 jobs simultaneously.
If your SLURM cluster provides a conda environment module, you should load that module before running the next step (e.g. module load miniconda3).
Run the following command to activate the snakemake conda environment:
source activate snakemake
Run the workflow in the background:
sbatch run-workflow.sh
Then you can monitor the job progress as you would with any SLURM background job.
Some SLURM clusters require providing sbatch a SLURM account name via the --account command line argument.
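In that case, submit the workflow with, for example:
sbatch --account=<account_name> run-workflow.sh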
See the Run-on-OSC wiki article for the commands used to run the workflow on OSC.
In some cases it is possible to run the workflow using Docker. See the experimental Docker Instructions for more details.