Cyriac Kandoth ckandoth

@ckandoth
ckandoth / dgn_np10_errors.md
Last active March 7, 2025 15:42
Reproduce sporadic "double free" errors by Dragen on Azure NP10 VMs

Purpose

Reproduce the "double free or corruption (fasttop)" error from Dragen 4.3.6 on Illumina's CentOS 7.9 image on Azure NP10 VMs.

Prerequisites

  1. Sign up for an Azure subscription if you don't already have one.
  2. Visit Quotas in Azure Portal, log in if needed, and increase the Standard NPS Family vCPUs quota in your preferred region. These VMs are only available in some regions per the FAQs here. Depending on demand for these SKUs in your region, you may also need to submit a service request and justify your use case to a person before the quota is approved. Note that a quota of 40 vCPUs lets you run 4 NP10 VMs, 2 NP20 VMs, or 1 NP40 VM at a time.
  3. Visit [this page](https://portal.azure.com/#view/Microsoft_Azure_Marketplace/LegalTermsSkuProgrammaticA
ckandoth / az-dgn.md
Last active February 23, 2025 03:04
Test Illumina Dragen software on an Azure NP10 VM

Purpose

Test an NP-series VM Scale-Set (VMSS) on Azure with Dragen's pay-as-you-go (PAYG) license.

Prerequisites

  1. Sign up for an Azure subscription if you don't already have one.

  2. Visit Quotas in Azure Portal, log in if needed, and increase the NP series quota to 40 vCPUs, so we can operate up to 4 NP10 VMs or 2 NP20 VMs at a time. Depending on demand for these SKUs in your region, you may also need to submit a service request and justify your use case to a person before the quota is approved.
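
The quota arithmetic in the step above can be sketched in a few lines of Python. This is only an illustration: the names `NP_VCPUS` and `max_vms` are hypothetical, and the per-VM vCPU counts (10, 20, and 40) are the ones implied by the NP10/NP20/NP40 figures quoted in these notes.

```python
# Hypothetical helper: how many NP-series VMs of one size fit in a vCPU quota.
# vCPU counts per VM size are assumed from the figures quoted in the text
# (40 vCPUs = 4 NP10 VMs, 2 NP20 VMs, or 1 NP40 VM).
NP_VCPUS = {"NP10": 10, "NP20": 20, "NP40": 40}


def max_vms(quota_vcpus):
    """Return the max number of VMs per size, assuming only that size is used."""
    return {sku: quota_vcpus // vcpus for sku, vcpus in NP_VCPUS.items()}


if __name__ == "__main__":
    print(max_vms(40))  # {'NP10': 4, 'NP20': 2, 'NP40': 1}
```

Integer division is used because a VM cannot be partially allocated, so a quota of 25 vCPUs would still allow only 2 NP10 VMs.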

ckandoth / dx-ngs.md
Last active December 5, 2024 20:26
Clinical NGS bioinformatics server at reasonable cost and TAT

Purpose

A proof-of-concept high-performance server for primary and secondary NGS analyses with reasonable cost and turnaround time (TAT).

Hardware and OS

Acquired a Dell Precision 5820 tower workstation in mid-2018 with the following specs. At a minimum, you want fast single-thread performance, at least 64GB of RAM (preferably ECC), and very fast disks. To run Nvidia's Parabricks, you need a GPU with at least 16GB of VRAM for v4.4, or at least 12GB for v3.8.

  • Intel Xeon W-2145 (supports ECC memory and AVX-512; decent single-thread performance)
  • 208GB DDR4-2666 ECC Memory (ECC reduces odds of data corruption)
ckandoth / ensembl_vep_112_with_offline_cache.md
Created May 30, 2024 19:27
Install Ensembl's VEP v112 with local cache for running offline

Ensembl's VEP (Variant Effect Predictor) is popular for how it picks a single effect per gene as detailed here, its CLIA-compliant HGVS variant format, and Sequence Ontology nomenclature for variant effects.

The official instructions to install VEP have never worked well from the United States because of the flaky network connection to their FTP servers in the UK. So, we will instead use conda to install VEP and its dependencies and then manually download VEP caches and reference genomes using rsync.

If you don't already have conda, download and install it into $HOME/miniconda3:

curl -sL https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o miniconda.sh
bash miniconda.sh -bup $HOME/minic
ckandoth / gnomad_vcf_prep.txt
Created June 8, 2022 21:39
Smaller gnomAD 3.1.2 VCF
# Fetch the WGS gnomAD 3.1.2 per-chrom VCFs (the large size is mostly due to INFO fields):
mkdir gnomad
gsutil -m cp gs://gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr*.vcf.bgz gnomad
gsutil -m cp gs://gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr*.vcf.bgz.tbi gnomad
# Shortlist INFO fields we want to keep when merging these into a single VCF of reduced file size:
bcftools view -h gnomad/gnomad.genomes.v3.1.2.sites.chr21.vcf.bgz | grep ^##INFO | cut -f3- -d= | grep -Ev "controls|non_cancer|non_neuro|non_topmed|non_v2|vep" | sort | less -S
cadd_phred
cadd_raw_score
ckandoth / test_az_sdk_blob_upload.py
Created May 10, 2022 00:38
Test upload to Azure blob using Python SDK and MSAL tokens
#!/usr/bin/env python
# Prereqs: Run "az login" to get a refresh token at "~/.azure/msal_token_cache.json" which expires only if unused for 90 days
# Depends: pip install azure-identity azure-storage-blob
# Sources: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/storage/azure-storage-blob/samples/blob_samples_containers.py
STORAGE_ACCOUNT_URL = "https://blahdiblahdiblah.blob.core.windows.net"
CONTAINER_NAME = "mdlhot"
# Use the MSAL refresh token to get a temporary access token for use with blob storage libraries
ckandoth / ensembl_vep_106_with_offline_cache.md
Created April 12, 2022 20:34
Install Ensembl's VEP v106 with local cache for running offline

Ensembl's VEP (Variant Effect Predictor) is popular for how it picks a single effect per gene as detailed here, its CLIA-compliant HGVS variant format, and Sequence Ontology nomenclature for variant effects.

Instead of the official instructions, we will use mamba (conda, but faster) to install VEP and its dependencies. If you don't already have mamba, use these steps to download and install it into $HOME/mambaforge, then run a script that adds it to your $PATH:

curl -L https://github.com/conda-forge/miniforge/releases/download/4.12.0-0/Mambaforge-Linux-x86_64.sh -o /tmp/mambaforge.sh
sh /tmp/mambaforge.sh -bfp $HOME/mambaforge && rm -f /tmp/mambaforge.sh
. $HOME/mambaforge
ckandoth / prep_grch38_ref.txt
Created September 24, 2021 23:53
Download and prepare GRCh38 reference data useful in NGS analyses
# Prepare a conda environment with tools we will need:
mamba create -n ref; conda activate ref
mamba install -y -c bioconda htslib==1.13 bcftools==1.13 samtools==1.13 picard-slim==2.26.2 bwa-mem2==2.2.1 bwa==0.7.17 gsutil==4.68
# Fetch the alignment-ready human reference FASTA and index:
gsutil -m cp gs://genomics-public-data/references/GRCh38_Verily/GRCh38_Verily_v1.genome.fa{,.fai} .
# Index the reference FASTA for use with various tools:
picard CreateSequenceDictionary -R GRCh38_Verily_v1.genome.fa
bwa-mem2 index GRCh38_Verily_v1.genome.fa
ckandoth / install_nextflow_singularity.md
Last active October 30, 2024 22:08
Install conda and use it to install nextflow and singularity

This guide shows you how to install conda and then use it to install nextflow and singularity for executing popular bioinformatics workflows. Unfortunately, singularity is not available on Windows or macOS, so this guide only targets Linux environments. If you have to use Windows 10, try WSL2. If you have to use macOS, try a virtual machine.

Download the Miniconda3 installer for Linux environments:

curl -L https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o miniconda.sh

Install into a folder named miniconda3 under your home directory, and delete the installer:

bash miniconda.sh -bup $HOME/miniconda3 && rm -f miniconda.sh
ckandoth / ngs_test_data.sh
Last active February 17, 2025 09:10
Create test data for CI/CD of a FASTQ to gVCF bioinformatics pipeline
# GOAL: Create test data for CI/CD of a germline variant calling pipeline (FASTQ to gVCF)
# Steps below were performed on Ubuntu 24.04, but should be reproducible on any Linux distro
# Download and install micromamba under your home directory, then log out:
"${SHELL}" <(curl -L micro.mamba.pm/install.sh)
# Log back in to add micromamba to your $PATH, and then use it to install the tools we need:
micromamba create -y -n bio -c conda-forge -c bioconda htslib==1.21 samtools==1.21 bcftools==1.21 picard-slim==3.3.0
micromamba activate bio