Build Your Own Project
In this tutorial we will show how a GLUE project can be built up from scratch. This demo project will contain:
- Reference Sequence Annotated with Coding Features
- Sequences Imported from FASTA Files
- Sequence Metadata Imported from a Tab-Delimited Text File
- Alignment Imported from a File
We will use some data from the example project, but this will be an independent project. Although the demo project will be very basic, following the tutorial should demonstrate some of the patterns and conventions which can be applied to more complex projects.
The tutorial is organised into the following steps:
- The project directory structure
- The build file, project settings and schema extensions
- Module definitions
- Sequence nucleotide data and metadata
- A reference sequence, features and feature locations
- An unconstrained alignment
- Run a command in the project
GLUE does not mandate any particular directory structure but GLUE project developers find it useful to follow this convention. Create the directory structure outlined below:
demo/
    alignments/
    glue/
    modules/
    sources/
    tabular/
    trees/
- The demo/ directory will contain everything required to build the project in GLUE. You may consider making this directory a single repository in a version control system such as GitHub.
- The demo/alignments/ directory will contain any files for Alignments which are loaded into the project during the project build.
- The demo/glue/ directory will contain .glue or .js scripts which are used to execute different phases of the project build.
- The demo/modules/ directory will contain .xml files containing module configuration, and any associated resource files.
- The demo/sources/ directory will contain sets of sequence data files, organised into Sources.
- The demo/tabular/ directory will contain any tabular data required to build the project, for example sequence metadata.
- The demo/trees/ directory will contain any phylogenetic trees to be loaded into the project (not used in the demo project).
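The layout above can also be created programmatically. The following is a minimal Python sketch, not part of GLUE itself; the function name create_project_skeleton is hypothetical and introduced only for illustration.

```python
from pathlib import Path

# Sub-directories of the conventional project layout described above.
SUBDIRS = ["alignments", "glue", "modules", "sources", "tabular", "trees"]


def create_project_skeleton(root):
    """Create the root directory and its conventional sub-directories.

    Existing directories are left untouched. Returns the sub-directory paths.
    """
    root = Path(root)
    created = []
    for name in SUBDIRS:
        path = root / name
        # parents=True also creates the root directory on first use
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling create_project_skeleton("demo") from the parent directory produces the same structure as creating the directories by hand.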
In this step you will create three files. The first is the project build file, demoProject.glue, placed directly in the demo/ directory. It is simply a file containing GLUE commands that can be run from the GLUE command line. The build file deletes any previous version of the project, then builds the project from scratch.
# delete any previous version of the demo project which is in the database
delete project demo
# create the demo project, specifying its name and description
create project demo "A demonstration GLUE project based on hepatitis E virus"
# add schema extensions to the demo project
run file glue/demoSchemaExtensions.glue
# enter project mode
project demo
  # set any project-wide GLUE settings
  run file glue/demoProjectSettings.glue
  # validate the project objects
  validate
  exit
Note that at certain points the build file uses the run file command to invoke GLUE commands from another file. We suggest using this mechanism to partition GLUE scripts into separate files with distinct purposes. We will now add the two files which are run from the build file. One file, glue/demoSchemaExtensions.glue, defines some schema extensions.
schema-project demo
  # add some metadata columns to the sequence table
  table sequence
    create field collection_year INTEGER
    create field length INTEGER
    create field isolate VARCHAR
    create field country VARCHAR
    create field host_species VARCHAR
    exit
  exit
Another file, glue/demoProjectSettings.glue, sets up some project-wide GLUE settings.
# define any project-wide GLUE settings for this project
set setting ignore-nt-sequence-hyphens true
set setting translate-beyond-possible-stop true
set setting translate-beyond-definite-stop true
The validate command in Project mode is the final step in the project build. This is a check which ensures the consistency of many of the objects in the project. It can pick up various configuration problems.
After you have created the three files, you can check that your project build is working by running it from the GLUE command line. Launch GLUE from the demo/ directory and run the command shown below:
Mode path: /
GLUE> run file demoProject.glue
You can re-run this project build at various points while you are developing and modifying the project definition, to check that it is working as intended.
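Because the build file delegates to other scripts with run file, a missing script is a common cause of build failures. The following Python sketch, which is not part of GLUE, lists the paths named by run file commands and reports those that do not exist. The helper names and the simple pattern (unquoted paths without spaces) are assumptions made for illustration.

```python
import re
from pathlib import Path

# Matches lines such as "run file glue/demoModules.glue", allowing leading
# whitespace. Assumption: paths are unquoted and contain no spaces.
RUN_FILE = re.compile(r"^\s*run\s+file\s+(\S+)\s*$")


def referenced_files(script_text):
    """Return the paths named by run file commands, in order of appearance."""
    paths = []
    for line in script_text.splitlines():
        match = RUN_FILE.match(line)  # comment lines start with '#', so they never match
        if match:
            paths.append(match.group(1))
    return paths


def missing_files(script_text, base_dir):
    """Return referenced paths that are not files relative to base_dir."""
    base = Path(base_dir)
    return [p for p in referenced_files(script_text) if not (base / p).is_file()]


# The build file from this step, reproduced without comments.
BUILD_SCRIPT = """\
delete project demo
create project demo "A demonstration GLUE project based on hepatitis E virus"
run file glue/demoSchemaExtensions.glue
project demo
  run file glue/demoProjectSettings.glue
  validate
  exit
"""
```

Running missing_files(Path("demoProject.glue").read_text(), ".") from the demo/ directory should return an empty list once all the scripts are in place.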
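The demo project will contain two modules. Create these two module XML configuration files in the modules/ directory: demoAlignmentImporter.xml, which contains only the empty blastFastaAlignmentImporter element, and demoTextFilePopulator.xml, which contains the textFilePopulator element: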
<blastFastaAlignmentImporter/>
<textFilePopulator>
  <columnDelimiterRegex>\t</columnDelimiterRegex>
  <textFileColumn>
    <identifier>true</identifier>
    <header>Isolate</header>
    <property>sequenceID</property>
  </textFileColumn>
  <textFileColumn>
    <header>Isolate</header>
    <property>isolate</property>
  </textFileColumn>
  <textFileColumn>
    <header>Sequence length</header>
    <property>length</property>
  </textFileColumn>
  <textFileColumn>
    <header>Country</header>
    <property>country</property>
  </textFileColumn>
  <textFileColumn>
    <header>Host species</header>
    <property>host_species</property>
  </textFileColumn>
  <textFileColumn>
    <header>Collection year</header>
    <property>collection_year</property>
  </textFileColumn>
</textFilePopulator>
The project build needs to create the modules from the above configuration files, so we now add this file, demoModules.glue, in the demo/glue directory:
# create modules for the project based on XML module files
create module --fileName modules/demoTextFilePopulator.xml
create module --fileName modules/demoAlignmentImporter.xml
We invoke demoModules.glue from the main project build by adding these lines to demoProject.glue, just before the validate step:
# load the project modules
run file glue/demoModules.glue
You should re-run the project build to check that the changes worked.
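The populator configuration only works if each property it writes to exists on the sequence table. The following Python sketch, which is not part of GLUE, cross-checks the two. It assumes that every non-identifier property names a custom field created in demoSchemaExtensions.glue; the function name unmapped_properties is hypothetical.

```python
import xml.etree.ElementTree as ET

# Custom sequence fields created in glue/demoSchemaExtensions.glue.
CUSTOM_FIELDS = {"collection_year", "length", "isolate", "country", "host_species"}

# Contents of modules/demoTextFilePopulator.xml (raw string keeps "\t" literal).
POPULATOR_XML = r"""
<textFilePopulator>
  <columnDelimiterRegex>\t</columnDelimiterRegex>
  <textFileColumn>
    <identifier>true</identifier>
    <header>Isolate</header>
    <property>sequenceID</property>
  </textFileColumn>
  <textFileColumn><header>Isolate</header><property>isolate</property></textFileColumn>
  <textFileColumn><header>Sequence length</header><property>length</property></textFileColumn>
  <textFileColumn><header>Country</header><property>country</property></textFileColumn>
  <textFileColumn><header>Host species</header><property>host_species</property></textFileColumn>
  <textFileColumn><header>Collection year</header><property>collection_year</property></textFileColumn>
</textFilePopulator>
"""


def unmapped_properties(populator_xml, known_fields):
    """Return properties of non-identifier columns that are not in known_fields."""
    root = ET.fromstring(populator_xml)
    missing = []
    for column in root.findall("textFileColumn"):
        is_identifier = column.findtext("identifier", default="false").strip() == "true"
        prop = column.findtext("property", default="").strip()
        # Identifier columns select the sequence rather than setting a field.
        if not is_identifier and prop not in known_fields:
            missing.append(prop)
    return missing
```

An empty result from unmapped_properties(POPULATOR_XML, CUSTOM_FIELDS) indicates that the two files agree under this assumption.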
We will now add some sequence data and metadata to the project. Within the demo/sources directory create two new directories:
demo/
    sources/
        ncbi-refseqs/
        fasta-hev-examples/
We will take nucleotide sequence data from the example project:
- Copy all .fasta files from exampleProject/sources/fasta-hev-examples into demo/sources/fasta-hev-examples.
- Copy the single file exampleProject/sources/ncbi-refseqs/L08816.xml into demo/sources/ncbi-refseqs.
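Create this tab-delimited text file, fasta-hev-examples.txt, in demo/tabular/. Fields must be separated by tab characters, matching the \t delimiter configured in demoTextFilePopulator.xml: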
Isolate Sequence length Country Host species Collection year
IND-HEV-AVH1-1991 7206 India Homo sapiens 1991
IND-HEV-AVH2-1998 7215 India Homo sapiens 1998
IND-HEV-AVH3-2000 7215 India Homo sapiens 2000
IND-HEV-AVH4-2006 7206 India Homo sapiens 2006
IND-HEV-AVH5-2010 7217 India Homo sapiens 2010
IND-HEV-FHF1-2003 7206 India Homo sapiens 2003
IND-HEV-FHF2-2004 7211 India Homo sapiens 2004
IND-HEV-FHF3-2005 7206 India Homo sapiens 2005
IND-HEV-FHF4-2006 7201 India Homo sapiens 2006
IND-HEV-FHF5-2007 7226 India Homo sapiens 2007
We now add these three commands to demoProject.glue, just before the validate step:
# import the ncbi-refseqs Source containing a single sequence L08816, in GenBank XML format
import source sources/ncbi-refseqs
# import the fasta-hev-examples Source containing a set of 10 HEV example sequences, in FASTA format
import source sources/fasta-hev-examples
# populate metadata for the sequences in Source fasta-hev-examples
module demoTextFilePopulator populate -w "source.name = 'fasta-hev-examples'" -f tabular/fasta-hev-examples.txt
Now re-run the project build to check that the data loads correctly.
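If the metadata does not load as expected, it can help to check the tab-delimited file outside GLUE. The following Python sketch mirrors the column-to-field mapping of demoTextFilePopulator.xml and converts the two INTEGER fields. It is an illustration only and does not reproduce how GLUE itself parses the file; the function name read_metadata is hypothetical, and SAMPLE contains two of the rows listed above.

```python
import csv
import io

# Column header -> sequence field, as configured in demoTextFilePopulator.xml.
COLUMN_TO_FIELD = {
    "Isolate": "isolate",
    "Sequence length": "length",
    "Country": "country",
    "Host species": "host_species",
    "Collection year": "collection_year",
}
# Fields declared as INTEGER in demoSchemaExtensions.glue.
INTEGER_FIELDS = {"length", "collection_year"}


def read_metadata(text):
    """Parse tab-delimited metadata into {sequenceID: {field: value}}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    records = {}
    for row in reader:
        record = {}
        for header, field in COLUMN_TO_FIELD.items():
            value = row[header]
            # int() raises ValueError if an INTEGER column holds other text
            record[field] = int(value) if field in INTEGER_FIELDS else value
        # The Isolate column doubles as the sequence identifier.
        records[row["Isolate"]] = record
    return records


SAMPLE = (
    "Isolate\tSequence length\tCountry\tHost species\tCollection year\n"
    "IND-HEV-AVH1-1991\t7206\tIndia\tHomo sapiens\t1991\n"
    "IND-HEV-FHF1-2003\t7206\tIndia\tHomo sapiens\t2003\n"
)
```

A KeyError from this sketch usually means a header is misspelled or the file is delimited with spaces rather than tabs.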
In the next step we will define Features for the three coding regions of the virus genome, and create a ReferenceSequence on which these regions are annotated.
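Create these two .glue files in the demo/glue directory: demoFeatures.glue, which creates the three features, and demoReferenceSequences.glue, which creates the reference sequence and its feature locations: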
# create a feature for each ORF
# indicate that each feature is a protein coding region
# and that each has its own codon numbering scheme.
create feature ORF1
feature ORF1
  set field displayName "ORF 1"
  set metatag CODES_AMINO_ACIDS true
  set metatag OWN_CODON_NUMBERING true
  exit
create feature ORF2
feature ORF2
  set field displayName "ORF 2"
  set metatag CODES_AMINO_ACIDS true
  set metatag OWN_CODON_NUMBERING true
  exit
create feature ORF3
feature ORF3
  set field displayName "ORF 3"
  set metatag CODES_AMINO_ACIDS true
  set metatag OWN_CODON_NUMBERING true
  exit
# create the reference sequence object based on the sequence object
create reference REF_MASTER_L08816 ncbi-refseqs L08816
# enter reference sequence mode
reference REF_MASTER_L08816
  # add feature locations for each of ORF1, ORF2 and ORF3
  # in each case, enter feature location mode and add a segment specifying
  # where the feature is located on L08816
  add feature-location ORF1
  feature-location ORF1
    add segment 4 5085
    exit
  add feature-location ORF3
  feature-location ORF3
    add segment 5082 5453
    exit
  add feature-location ORF2
  feature-location ORF2
    add segment 5123 7105
    exit
  exit
We now add lines to demoProject.glue, just before the validate step, to invoke these GLUE files:
# define genome features for the project
run file glue/demoFeatures.glue
# define the reference sequence based on sequence L08816
run file glue/demoReferenceSequences.glue
Now re-run the project build to check that these changes work correctly.
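A quick sanity check on coding feature locations is that each segment spans a whole number of codons. The following Python sketch applies this check to the three segments defined on L08816. It assumes that segment coordinates are 1-based and inclusive; the function names are hypothetical.

```python
# Segments added to REF_MASTER_L08816 in demoReferenceSequences.glue.
FEATURE_SEGMENTS = {
    "ORF1": (4, 5085),
    "ORF3": (5082, 5453),
    "ORF2": (5123, 7105),
}


def segment_length(start, end):
    """Length in nucleotides, assuming 1-based inclusive coordinates."""
    if end < start:
        raise ValueError(f"segment end {end} precedes start {start}")
    return end - start + 1


def codon_count(start, end):
    """Number of codon positions in the segment; fails if not a multiple of 3."""
    length = segment_length(start, end)
    if length % 3 != 0:
        raise ValueError(f"segment {start}-{end} has length {length}, not a multiple of 3")
    return length // 3


def codon_counts(segments):
    """Map each feature name to the number of codon positions it spans."""
    return {name: codon_count(start, end) for name, (start, end) in segments.items()}
```

Under that assumption all three segments pass the check. The segments overlap on the reference sequence, which is why each feature is given its own codon numbering.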
We will now add an alignment into the project. In this case we will add an unconstrained alignment of all 11 sequences, using a blastFastaAlignmentImporter module to import the alignment from a file.
Copy the single file exampleProject/alignments/demoAlignment.fna into demo/alignments.
Now add lines to demoProject.glue, just before the validate step, to import the alignment:
# import an unconstrained alignment, relating the fasta-hev-examples sequences to the reference sequence
module demoAlignmentImporter import AL_UNCONSTRAINED --fileName alignments/demoAlignment.fna
Rebuild the project to check that this step works.
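If the import fails, one thing worth checking is that the alignment file is a well-formed FASTA alignment, with every row the same length. The following Python sketch performs that check. It is not part of GLUE and does not reproduce what the blastFastaAlignmentImporter module does; the function names and the toy alignment in SAMPLE_ALIGNMENT are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into {row name: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {row: "".join(parts) for row, parts in records.items()}


def alignment_width(records):
    """Return the common row length; fail if rows differ or there are none."""
    widths = {len(sequence) for sequence in records.values()}
    if len(widths) != 1:
        raise ValueError(f"expected one row length, found {sorted(widths)}")
    return widths.pop()


# Toy alignment for illustration only; not taken from demoAlignment.fna.
SAMPLE_ALIGNMENT = """\
>rowA
ATG-CC
GT
>rowB
ATGACC
GT
"""
```

For the demo project, the same functions could be applied to the contents of alignments/demoAlignment.fna, which should contain 11 rows.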
Here is an example command which makes use of the project data. It will translate the ORF2 coding region of the IND-HEV-FHF1-2003 example sequence.
Mode path: /
GLUE> project demo
OK
Mode path: /project/demo
GLUE> alignment AL_UNCONSTRAINED member fasta-hev-examples IND-HEV-FHF1-2003
OK
Mode path: /project/demo/alignment/AL_UNCONSTRAINED/member/fasta-hev-examples/IND-HEV-FHF1-2003
GLUE> amino-acid -r REF_MASTER_L08816 -f ORF2
+============+==========+==========+===========+
| codonLabel | memberNt | relRefNt | aminoAcid |
+============+==========+==========+===========+
| 1 | 5147 | 5123 | M |
| 2 | 5150 | 5126 | R |
| 3 | 5153 | 5129 | P |
| 4 | 5156 | 5132 | R |
| 5 | 5159 | 5135 | P |
| 6 | 5162 | 5138 | I |
| 7 | 5165 | 5141 | L |
| 8 | 5168 | 5144 | L |
| 9 | 5171 | 5147 | L |
| 10 | 5174 | 5150 | F |
| 11 | 5177 | 5153 | L |
| 12 | 5180 | 5156 | M |
| 13 | 5183 | 5159 | F |
| 14 | 5186 | 5162 | L |
| 15 | 5189 | 5165 | P |
| 16 | 5192 | 5168 | M |
| 17 | 5195 | 5171 | L |
| 18 | 5198 | 5174 | P |
| 19 | 5201 | 5177 | A |
+============+==========+==========+===========+
Rows 1 to 19 of 659 [F:first, L:last, P:prev, N:next, Q:quit]
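The relRefNt column in this output follows directly from the ORF2 feature location, which starts at nucleotide 5123 of the reference: each codon label advances the coordinate by three. The following Python sketch reproduces that arithmetic. It assumes a single contiguous segment and codon labels that are plain integers starting at 1; the function name is hypothetical.

```python
ORF2_START = 5123  # first nucleotide of the ORF2 segment on REF_MASTER_L08816


def codon_start_on_reference(feature_start, codon_label):
    """Reference coordinate of the first nucleotide of the given codon."""
    if codon_label < 1:
        raise ValueError("codon labels start at 1")
    return feature_start + 3 * (codon_label - 1)


# (codonLabel, memberNt, relRefNt) for three of the rows displayed above.
DISPLAYED_ROWS = [(1, 5147, 5123), (10, 5174, 5150), (19, 5201, 5177)]


def member_offsets(rows):
    """Distinct differences between member and reference coordinates."""
    return {member_nt - ref_nt for _, member_nt, ref_nt in rows}
```

In the displayed rows the member coordinate is always 24 nucleotides beyond the reference coordinate. That constant offset should only be expected to hold across regions where the member and the reference have no insertions or deletions relative to each other.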
GLUE by Robert J. Gifford Lab.
For questions, issues, or feedback, please open an issue on the GitHub repository.