10X single-cell Nanopore reads simulation workflow. Complete documentation available at: https://GenomiqueENS.github.io/AsaruSim/

GenomiqueENS/AsaruSim

version - AsaruSim Made with Docker Codacy Badge License: GPL v3

DOI Mastodon Follow

Documentation

AsaruSim is an automated Nextflow workflow designed for simulating 10x single-cell long-read data from the count matrix level to the sequence level. It aims to create a gold standard dataset for the assessment and optimization of single-cell long-read methods. Full documentation is available here.

$\textcolor{#FF7F00}{Requirements}$

This pipeline is powered by the Nextflow workflow manager. All dependencies are automatically managed by Nextflow through a preconfigured Docker container, ensuring a seamless and reproducible installation process.

Before starting, ensure that Nextflow and Docker are installed and properly set up on your system.

$\textcolor{#FF7F00}{Installation}$

Clone the AsaruSim GitHub repository:

git clone https://github.com/GenomiqueENS/AsaruSim.git
cd AsaruSim

$\textcolor{#FF7F00}{Test}$

To test your installation, we provide an automated script, run_test.sh, which downloads the reference annotations and simulates a subset of a human PBMC dataset.

bash run_test.sh

$\textcolor{#FF7F00}{Configuration}$

Customize runs by editing the nextflow.config file and/or specifying parameters at the command line.
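As an alternative to long command lines, parameters can also be collected in a file and passed with Nextflow's standard `-params-file` option. The sketch below is only an illustration: the file name `asarusim_params.yaml` is hypothetical, the parameter names and values are copied from the tables in this README, and it assumes the workflow reads them as regular Nextflow `params`.

```shell
# Hypothetical file name; any path works.
params_file="asarusim_params.yaml"

# Parameter names and default values are taken from the tables in this README.
# Adjust the paths and values to your own data.
cat > "$params_file" <<'EOF'
matrix: "test_data/matrix.csv"
transcriptome: "test_data/transcriptome.fa"
features: "transcript_id"
pcr_cycles: 2
pcr_dup_rate: 0.7
outdir: "results"
projectName: "test_project"
EOF

# Print the command instead of running it, so it can be reviewed first.
echo "nextflow run main.nf -params-file $params_file"
```

Values given on the command line (for example `--outdir other_results`) take precedence over those in the params file.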

Pipeline Input Parameters

Here are the primary input parameters for configuring the workflow:

Main Parameters

| Parameter | Description | Format | Default Value |
| --- | --- | --- | --- |
| matrix | Path to the count matrix .csv file (required) | CSV | test_data/matrix.csv |
| transcriptome | Path to the reference transcriptome file (required) | FASTA | test_data/transcriptome.fa |
| bc_counts | Path to the barcode count file (if no matrix is provided) | CSV | test_data/test_bc.csv |

Optional Parameters

| Parameter | Description | Format | Default Value |
| --- | --- | --- | --- |
| features | Type of feature counted in the matrix (e.g. transcript_id, gene_name) | STR | transcript_id |
| cell_types_annotation | Path to the cell type annotation .csv file | CSV | null |
| gtf | Path to the transcriptome annotation .gtf file | GTF | null |
| umi_duplication | UMI duplication | INT | 0 |
| intron_retention | Simulate the intron retention process | BOOL | false |
| ir_model | Intron retention Markov chain (MC) model .csv file | CSV | bin/models/SC3pv3_GEX_Human_IR_markov_model |
| unspliced_ratio | Percentage of transcripts to be unspliced | FLOAT | 0.0 |
| ref_genome | Reference genome .fasta file (if intron retention is simulated) | FASTA | null |
| full_length | Indicates whether transcripts are full length | BOOL | false |
| truncation_model | Path to the truncation probabilities .csv file | CSV | bin/models/truncation_default_model.csv |

PCR Parameters

| Parameter | Description | Format | Default Value |
| --- | --- | --- | --- |
| pcr_cycles | Number of PCR amplification cycles | INT | 0 |
| pcr_error_rate | PCR error rate | FLOAT | "0.0000001" |
| pcr_dup_rate | PCR duplication rate | FLOAT | 0.7 |
| pcr_total_reads | Total number of PCR reads | INT | 1000000 |

Error/Qscore Parameters

Configuration for the error model:

| Parameter | Description | Format | Default Value |
| --- | --- | --- | --- |
| trained_model | Badread pre-trained error/Q-score model name | STR | nanopore2023 |
| badread_identity | Comma-separated values for Badread identity parameters | STR | "98,2,99" |
| error_model | Custom error model file (optional) | TXT | null |
| qscore_model | Custom Q-score model file (optional) | TXT | null |
| build_model | Build your own error/Q-score model | STR | false |
| fastq_model | Reference real reads (.fastq) used to train the error model (optional) | FASTQ | false |

Additional Parameters

| Parameter | Description | Format | Default Value |
| --- | --- | --- | --- |
| amp | Amplification factor | INT | 1 |
| outdir | Output directory for results | PATH | "results" |
| projectName | Name of the project | STR | "test_project" |

Run Parameters

Configuration for running the workflow:

| Parameter | Description | Format | Default Value |
| --- | --- | --- | --- |
| threads | Number of threads to use | INT | 4 |
| container | Docker container for the workflow | STR | 'hamraouii/asarusim:0.1' |
| docker.runOptions | Docker run options to use | STR | '-u $(id -u):$(id -g)' |

For more details about workflow options see the Input parameters section in the documentation.

File format description

--bc_counts

To simulate specific UMI counts per cell barcode with random transcripts, set the --bc_counts parameter to the path of a UMI counts .CSV file. This parameter removes the need for an input matrix: the UMI counts are simulated for each cell barcode and the transcripts are chosen randomly.

Example of a UMI counts per cell barcode (CB) file:

| CB | counts |
| --- | --- |
| ACGGCGATCGCGAGCC | 1260 |
| ACGGCGATCGCGAGCC | 1104 |
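Before launching a run, it can help to check that the barcode count file is well formed. The following is a minimal sketch, not part of AsaruSim: the function name `validate_bc_counts` is hypothetical, and it assumes a comma-separated file with a `CB,counts` header, barcodes made of A/C/G/T, and non-negative integer counts. The second barcode in the example file is made up for illustration.

```shell
# Hypothetical helper: checks a barcode count file.
# Assumption: comma-separated, first line is the header "CB,counts".
validate_bc_counts() {
  awk -F',' '
    NR == 1 {
      if ($1 != "CB" || $2 != "counts") { print "unexpected header: " $0; bad = 1 }
      next
    }
    NF != 2 || $1 !~ /^[ACGT]+$/ || $2 !~ /^[0-9]+$/ {
      print "line " NR ": invalid record: " $0; bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Small example file (the second barcode is invented for this example).
printf 'CB,counts\nACGGCGATCGCGAGCC,1260\nTTGCATCGGATCCAGT,1104\n' > bc_counts_example.csv

if validate_bc_counts bc_counts_example.csv; then
  echo "bc_counts_example.csv looks valid"
fi
```

The function returns a non-zero exit status and prints the offending lines when a record does not match the assumed layout.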

--cell_types_annotation

AsaruSim can estimate the characteristics of each cell type from an existing count table. To do so, set the --sim_celltypes parameter to true and provide the list of cell barcodes of each group (.CSV file) using the --cell_types_annotation parameter:

| CB | cell_type |
| --- | --- |
| ACGGCGATCGCGAGCC | type 1 |
| ACGGCGATCGCGAGCC | type 2 |

AsaruSim will then use the provided matrix to estimate the characteristics of each cell group and generate a synthetic count matrix.
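A quick way to review an annotation file before using --sim_celltypes is to count the barcodes assigned to each cell type. This is only a sketch under assumptions: the helper name `cells_per_type` is hypothetical, the file is assumed to be comma-separated with a `CB,cell_type` header, and the barcodes in the example are invented.

```shell
# Hypothetical helper: prints "cell_type,number_of_barcodes" for each group.
# Assumption: comma-separated file whose first line is the header "CB,cell_type".
cells_per_type() {
  awk -F',' 'NR > 1 && NF >= 2 { n[$2]++ } END { for (t in n) print t "," n[t] }' "$1" | sort
}

# Small example file (barcodes are invented for this example).
printf 'CB,cell_type\nACGGCGATCGCGAGCC,type 1\nTTGCATCGGATCCAGT,type 1\nGGATCCAGTTTGCATC,type 2\n' \
  > cell_types_example.csv

cells_per_type cell_types_example.csv
```

Groups with very few barcodes are easy to spot in this summary before the workflow is started.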

$\textcolor{#FF7F00}{Usage}$

Users can choose among 4 ways to simulate template reads:

  • use a real count matrix
  • estimate the parameters from a real count matrix to simulate a synthetic count matrix
  • specify the input parameters themselves
  • use a combination of the above options

We use the SPARSim tool to simulate count matrices. For more information about synthetic count matrices, please read the SPARSim documentation.

EXAMPLES

Sample data

A demonstration dataset to get started with this workflow is available on Zenodo (DOI: 10.5281/zenodo.12731408). This dataset is a subsample from a Nanopore run of the 10X 5k human PBMCs.

The human GRCh38 reference transcriptome, GTF annotation and FASTA reference genome can be downloaded from Ensembl.

You can use the run_test.sh script to automatically download all required datasets.

BASIC WORKFLOW
 nextflow run main.nf --matrix dataset/sub_pbmc_matrice.csv \
                      --transcriptome dataset/Homo_sapiens.GRCh38.cdna.all.fa \
                      --features gene_name \
                      --gtf dataset/GRCh38-2020-A-genes.gtf
WITH PCR AMPLIFICATION
 nextflow run main.nf --matrix dataset/sub_pbmc_matrice.csv \
                      --transcriptome dataset/Homo_sapiens.GRCh38.cdna.all.fa \
                      --features gene_name \
                      --gtf dataset/GRCh38-2020-A-genes.gtf \
                      --pcr_cycles 2 \
                      --pcr_dup_rate 0.7 \
                      --pcr_error_rate 0.00003
WITH SIMULATED CELL TYPE COUNTS
 nextflow run main.nf --matrix dataset/sub_pbmc_matrice.csv \
                      --transcriptome dataset/Homo_sapiens.GRCh38.cdna.all.fa \
                      --features gene_name \
                      --gtf dataset/GRCh38-2020-A-genes.gtf \
                      --sim_celltypes true \
                      --cell_types_annotation dataset/sub_pbmc_cell_type.csv
USING A SPARSIM PRESET MATRIX (e.g. Chu et al. 10X Genomics datasets)
nextflow run main.nf --matrix Chu_param_preset \
                      --transcriptome datasets/Homo_sapiens.GRCh38.cdna.all.fa \
                      --features gene_name \
                      --gtf datasets/Homo_sapiens.GRCh38.112.gtf
WITH PERSONALIZED ERROR MODEL
nextflow run main.nf --matrix dataset/sub_pbmc_matrice.csv \
                     --transcriptome dataset/Homo_sapiens.GRCh38.cdna.all.fa \
                     --features gene_name \
                     --gtf dataset/GRCh38-2020-A-genes.gtf \
                     --build_model true \
                     --fastq_model dataset/sub_pbmc_reads.fq \
                     --ref_genome dataset/GRCh38-2020-A-genome.fa 
COMPLETE WORKFLOW
 nextflow run main.nf --matrix dataset/sub_pbmc_matrice.csv \
                      --transcriptome dataset/Homo_sapiens.GRCh38.cdna.all.fa \
                      --features gene_name \
                      --gtf dataset/GRCh38-2020-A-genes.gtf \
                      --sim_celltypes true \
                      --cell_types_annotation dataset/sub_pbmc_cell_type.csv \
                      --build_model true \
                      --fastq_model dataset/sub_pbmc_reads.fq \
                      --ref_genome dataset/GRCh38-2020-A-genome.fa \
                      --pcr_cycles 2 \
                      --pcr_dup_rate 0.7 \
                      --pcr_error_rate 0.00003

$\textcolor{#FF7F00}{Output}$

After execution, results are available in the directory specified by --outdir. They include the simulated Nanopore reads (simulated.fastq.gz), the simulated template, the pipeline execution information and a QC report.

QC_report.html                    # final QC report
pipeline_info                     # Pipeline execution trace, timeline and DAG
simulated.fastq.gz                # Simulated reads including sequencing errors
template.fa.gz                    # Simulated template
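
As a quick sanity check of these outputs, the number of simulated reads and templates can be counted from the shell. The helpers below are a sketch and not part of AsaruSim: the function names are hypothetical, the FASTQ file is assumed to use the usual 4 lines per record, the template file is assumed to be standard FASTA, and `results` is the default --outdir from the parameter table.

```shell
# Hypothetical helper: number of records in a gzipped FASTQ file
# (assumes 4 lines per record).
count_fastq_reads() {
  gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Hypothetical helper: number of sequences in a gzipped FASTA file
# (counts header lines starting with ">").
count_fasta_records() {
  gzip -cd "$1" | grep -c '^>'
}

# "results" is the default --outdir; change it if you used another directory.
outdir="results"
if [ -f "$outdir/simulated.fastq.gz" ] && [ -f "$outdir/template.fa.gz" ]; then
  echo "simulated reads: $(count_fastq_reads "$outdir/simulated.fastq.gz")"
  echo "templates:       $(count_fasta_records "$outdir/template.fa.gz")"
else
  echo "no AsaruSim output found in $outdir"
fi
```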

Cleaning Up

To clean up temporary files generated by Nextflow:

nextflow clean -f

$\textcolor{#FF7F00}{Workflow}$

Workflow Schema

$\textcolor{#FF7F00}{Acknowledgements}$

  • We would like to express our gratitude to Youyupei for the development of SLSim, which has been helpful to the AsaruSim workflow.
  • Additionally, our thanks go to the teams behind Badread, SPARSim and Trans-NanoSim whose tools are integral to the AsaruSim workflow.

$\textcolor{#FF7F00}{Support\ and\ Contributions}$

For support, please open an issue in the repository's "Issues" section. Contributions via Pull Requests are welcome. Follow the contribution guidelines specified in CONTRIBUTING.md.

$\textcolor{#FF7F00}{License}$

AsaruSim is distributed under the GPL v3 license. See the LICENSE file in the GitHub repository for details.

$\textcolor{#FF7F00}{Citation}$

If you use AsaruSim in your research, please cite this manuscript:

Ali Hamraoui, Laurent Jourdren and Morgane Thomas-Chollier. AsaruSim: a single-cell and spatial RNA-Seq Nanopore long-reads simulation workflow. bioRxiv 2024.09.20.613625; doi: https://doi.org/10.1101/2024.09.20.613625