cai584770/GeSeq

GeSeq

GeSeq is a gene sequence data type integrated into a graph database, covering both storage and management of sequence data. Sequences are compressed and stored using BBM (Bit-Byte Mapping), and sequence operations are adapted to work on this storage scheme. GeSeq also provides UDFs (User-Defined Functions) for Neo4j, so users can work with gene sequence data inside the graph database.

USE

Neo4j 4.4.11 Community Edition

BBM

BBM lives in the org.cai.bbm package and primarily handles the mapping of gene sequence strings to byte data. Each base maps to a two-bit code: "A" -> 00, "T" -> 11, "C" -> 01, "G" -> 10.
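The mapping rules above can be illustrated with a minimal sketch that packs four bases into each byte. The object name BbmSketch, the method names, and the bit order within a byte (first base in the two highest bits) are assumptions made for illustration; the actual layout used in org.cai.bbm may differ. The sketch accepts uppercase A, C, G, T only.

```scala
object BbmSketch {
  // Two-bit codes from the mapping rules: A=00, C=01, G=10, T=11.
  private val codes: Map[Char, Int] = Map('A' -> 0, 'C' -> 1, 'G' -> 2, 'T' -> 3)
  private val bases: Array[Char] = Array('A', 'C', 'G', 'T')

  // Pack four bases per byte. Assumed order: the first base of each
  // group of four occupies the two highest bits of the byte.
  def encode(seq: String): Array[Byte] = {
    val out = new Array[Byte]((seq.length + 3) / 4)
    for (i <- seq.indices) {
      val shift = 6 - 2 * (i % 4)
      out(i / 4) = (out(i / 4) | (codes(seq(i)) << shift)).toByte
    }
    out
  }

  // Unpack `length` bases. The length must be supplied because the
  // last byte may contain padding bits.
  def decode(bytes: Array[Byte], length: Int): String = {
    val sb = new StringBuilder(length)
    for (i <- 0 until length) {
      val shift = 6 - 2 * (i % 4)
      sb.append(bases((bytes(i / 4) >> shift) & 3))
    }
    sb.toString
  }
}
```

Under these assumptions, "ATCG" packs into the single byte 00110110, and decoding needs the original base count to ignore padding in the final byte.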

file

The org.cai.file package parses imported files. It handles every file imported into the graph database and defines the rules for export.

geseq

The org.cai.geseq package includes the definition of GeSeq and its commonly used methods.

case class GeSeq(
                  sequence: Array[Byte],
                  lowercase: List[(Int, Int)],
                  nBase: List[(Int, Int)],
                  otherBASE: List[(Int, String)],
                  sequenceLength: Long,
                  nucleotidesLength: Long
                )

Here, sequence contains the gene sequence data after BBM encoding. lowercase records the lowercase regions of the original sequence: each tuple (Int, Int) gives the starting position and length of a lowercase run, relative to the sequence. nBase records the runs of "N" characters, with each tuple (Int, Int) giving the relative starting position and length of a run. otherBASE records the remaining degenerate bases, with each tuple (Int, String) giving the relative position and the string of consecutive degenerate bases found there. sequenceLength is the length of the original sequence, and nucleotidesLength is the length of the nucleotide bases.
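The (start, length) tuples described above amount to a run-length scan over the input string. The following is a minimal sketch of such a scan; the names RunSketch, runs, lowercaseRuns, and nRuns are hypothetical, and it assumes positions are offsets into the original, unmodified string, which the description above does not state explicitly.

```scala
object RunSketch {
  // Collect (start, length) for each maximal run of characters that satisfy `p`.
  def runs(seq: String)(p: Char => Boolean): List[(Int, Int)] = {
    val out = List.newBuilder[(Int, Int)]
    var i = 0
    while (i < seq.length) {
      if (p(seq(i))) {
        val start = i
        while (i < seq.length && p(seq(i))) i += 1
        out += ((start, i - start))
      } else {
        i += 1
      }
    }
    out.result()
  }

  // Runs of lowercase characters, as candidates for the `lowercase` field.
  def lowercaseRuns(seq: String): List[(Int, Int)] = runs(seq)(_.isLower)

  // Runs of N characters in either case, as candidates for the `nBase` field.
  def nRuns(seq: String): List[(Int, Int)] = runs(seq)(c => c == 'N' || c == 'n')
}
```

For the input "ACgtnNAC", this scan reports one lowercase run starting at offset 2 with length 3, and one N run starting at offset 4 with length 2.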

tools

The tools package includes methods for processing byte data, currently focusing on operations such as convert, complement, and translate.
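One consequence of the BBM codes (A=00, T=11, C=01, G=10) is that complementary bases have bitwise-inverted codes, so a complement over packed data can be computed by flipping every bit. The sketch below shows this idea only; ComplementSketch is a hypothetical name, and whether the tools package implements complement this way is an assumption.

```scala
object ComplementSketch {
  // Flip every bit: 00 <-> 11 swaps A and T, 01 <-> 10 swaps C and G.
  // Padding bits in the last byte are flipped as well, so the base count
  // must be tracked separately when decoding.
  def complement(packed: Array[Byte]): Array[Byte] =
    packed.map(b => (~b).toByte)
}
```

For example, the byte 00110110 (A, T, C, G under a highest-bits-first layout) becomes 11001001 (T, A, G, C).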

udf

The udf package contains the UDFs available in Neo4j. The following functions can be called from Cypher:

geseq.fromFASTQ // Import sequences from a FASTQ file
geseq.fromFASTA // Import sequences from a FASTA file
geseq.translate // Translate sequences
geseq.rev // Reverse sequences
geseq.com // Complement sequences
geseq.rev_com // Reverse complement sequences
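As a plain-string reference for what the rev, com, and rev_com operations compute, here is a minimal sketch. It works on uncompressed strings rather than on GeSeq values, and the names SeqOpsSketch, rev, com, and revCom are hypothetical; the UDFs themselves operate on the BBM-encoded data, and their handling of characters other than A, T, C, G is not described here (this sketch leaves them unchanged).

```scala
object SeqOpsSketch {
  // Watson-Crick pairing for the four standard bases.
  private val pair: Map[Char, Char] =
    Map('A' -> 'T', 'T' -> 'A', 'C' -> 'G', 'G' -> 'C')

  // Reverse the order of bases.
  def rev(s: String): String = s.reverse

  // Replace each base with its complement; other characters pass through.
  def com(s: String): String = s.map(c => pair.getOrElse(c, c))

  // Reverse complement: reverse first, then complement each base.
  def revCom(s: String): String = com(rev(s))
}
```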
