
Build Your Own Project

Robert J. Gifford edited this page Nov 21, 2024 · 6 revisions

In this tutorial we will show how a GLUE project can be built up from scratch. This demo project will contain:

  • Reference Sequence Annotated with Coding Features
  • Sequences Imported from FASTA Files
  • Sequence Metadata Imported from a Tab-Delimited Text File
  • Alignment Imported from a File

We will use some data from the example project, but this will be an independent project. Although the demo project will be very basic, following the tutorial should demonstrate some of the patterns and conventions which can be applied to more complex projects.




The project directory structure

GLUE does not mandate any particular directory structure, but project developers generally find the following convention useful. Create the directory structure outlined below:

demo/
    alignments/
    glue/
    modules/
    sources/
    tabular/
    trees/

  • The demo/ directory will contain everything required to build the project in GLUE. You may consider placing this directory in a single version control repository, for example a Git repository hosted on GitHub.
  • The demo/alignments/ directory will contain any files for Alignments which are loaded into the project during the project build.
  • The demo/glue/ directory will contain .glue or .js scripts which are used to execute different phases of the project build.
  • The demo/modules/ directory will contain .xml files containing module configuration, and any associated resource files.
  • The demo/sources/ directory will contain sets of sequence data files, organised into Sources.
  • The demo/tabular/ directory will contain any tabular data required to build the project, for example sequence metadata.
  • The demo/trees/ directory will contain any phylogenetic trees to be loaded into the project (not used in the demo project).
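
If you prefer to script this step, the layout above can be created with a few lines of Python. This is only a convenience sketch and not part of GLUE; the function name `make_project_skeleton` is hypothetical.

```python
from pathlib import Path

# Sub-directories named in the convention described above.
SUBDIRS = ["alignments", "glue", "modules", "sources", "tabular", "trees"]


def make_project_skeleton(root: str = "demo") -> list[Path]:
    """Create the project root and its sub-directories; return the paths."""
    created = []
    for name in SUBDIRS:
        path = Path(root) / name
        # parents=True creates the root as needed; exist_ok=True makes re-runs safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_project_skeleton("demo")` from the parent directory creates the tree. Existing directories are left untouched, so it is safe to call more than once.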

The build file, project settings and schema extensions

In this step you will create three files. The project build file is simply a file containing GLUE commands that can be run from the GLUE command line. The build file deletes any previous version of the project, then builds the project from scratch.

demo/demoProject.glue

# delete any previous version of the demo project which is in the database
delete project demo

# create the demo project, specifying name and description
create project demo "A demonstration GLUE project based on hepatitis E virus"

# add schema extensions to the demo project
run file glue/demoSchemaExtensions.glue

# enter project mode
project demo

  # set any project-wide GLUE settings
  run file glue/demoProjectSettings.glue

  # validate the project objects
  validate

  exit

Note that at certain points, the build file uses run file to invoke GLUE commands from another file. Generally we suggest using this mechanism to partition GLUE scripts into different files with different purposes. We will now add the two files which are run from the build file. One file defines some schema extensions.

demo/glue/demoSchemaExtensions.glue

schema-project demo

  # add some metadata columns to the sequence table
  table sequence

    create field collection_year INTEGER
    create field length INTEGER
    create field isolate VARCHAR
    create field country VARCHAR
    create field host_species VARCHAR
    exit

  exit

Another file sets up some project-wide GLUE settings.

demo/glue/demoProjectSettings.glue

# define any project-wide GLUE settings for this project

set setting ignore-nt-sequence-hyphens true
set setting translate-beyond-possible-stop true
set setting translate-beyond-definite-stop true

The validate command in Project mode is the final step in the project build. This is a check which ensures the consistency of many of the objects in the project. It can pick up various configuration problems.

After you have created the three files you can check that your project build is working by running it from the GLUE command line. Launch GLUE from the demo/ directory and run the command shown below:

Mode path: /
GLUE> run file demoProject.glue

You can re-run this project build at various points while you are developing and modifying the project definition, to check that it is working as intended.


Module definitions

The demo project will contain two modules. Create these two module XML configuration files in the modules/ directory:

demo/modules/demoAlignmentImporter.xml

<blastFastaAlignmentImporter/>

demo/modules/demoTextFilePopulator.xml

<textFilePopulator>
	<columnDelimiterRegex>\t</columnDelimiterRegex>
	<textFileColumn>
		<identifier>true</identifier>
		<header>Isolate</header>
		<property>sequenceID</property>
	</textFileColumn>
	<textFileColumn>
		<header>Isolate</header>
		<property>isolate</property>
	</textFileColumn>
	<textFileColumn>
		<header>Sequence length</header>
		<property>length</property>
	</textFileColumn>
	<textFileColumn>
		<header>Country</header>
		<property>country</property>
	</textFileColumn>
	<textFileColumn>
		<header>Host species</header>
		<property>host_species</property>
	</textFileColumn>
	<textFileColumn>
		<header>Collection year</header>
		<property>collection_year</property>
	</textFileColumn>
</textFilePopulator>

The project build needs to create the modules based on the above configuration files, so we now add this .glue file in the demo/glue directory:

demo/glue/demoModules.glue

# create modules for the project based on XML module files

create module --fileName modules/demoTextFilePopulator.xml
create module --fileName modules/demoAlignmentImporter.xml

We invoke demoModules.glue from the main project build, by adding this line to demoProject.glue, just before the validate step:

  # load the project modules
  run file glue/demoModules.glue

You should re-run the project build to check that the changes worked.


Sequence nucleotide data and metadata

We will now add some sequence data and metadata to the project. Within the demo/sources directory create two new directories:

demo/
    sources/
        ncbi-refseqs/
        fasta-hev-examples/

We will take nucleotide sequence data from the example project:

  • Copy all .fasta files from exampleProject/sources/fasta-hev-examples into demo/sources/fasta-hev-examples.
  • Copy the single file exampleProject/sources/ncbi-refseqs/L08816.xml into demo/sources/ncbi-refseqs.

Create this tab-delimited text file in demo/tabular/:

demo/tabular/fasta-hev-examples.txt

Isolate	Sequence length	Country	Host species	Collection year
IND-HEV-AVH1-1991	7206	India	Homo sapiens	1991
IND-HEV-AVH2-1998	7215	India	Homo sapiens	1998
IND-HEV-AVH3-2000	7215	India	Homo sapiens	2000
IND-HEV-AVH4-2006	7206	India	Homo sapiens	2006
IND-HEV-AVH5-2010	7217	India	Homo sapiens	2010
IND-HEV-FHF1-2003	7206	India	Homo sapiens	2003
IND-HEV-FHF2-2004	7211	India	Homo sapiens	2004
IND-HEV-FHF3-2005	7206	India	Homo sapiens	2005
IND-HEV-FHF4-2006	7201	India	Homo sapiens	2006
IND-HEV-FHF5-2007	7226	India	Homo sapiens	2007
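
Each header named in demoTextFilePopulator.xml has to appear in the header row of this file, so it can be worth checking the two against each other before running the build. The Python sketch below does this outside GLUE. It is an illustrative helper, not a GLUE feature, and the function name `missing_headers` is hypothetical.

```python
import xml.etree.ElementTree as ET


def missing_headers(populator_xml: str, header_line: str) -> list[str]:
    """Return headers named in <textFileColumn> elements that are absent
    from the tab-delimited header row, in configuration order."""
    root = ET.fromstring(populator_xml)
    wanted = [col.findtext("header") for col in root.iter("textFileColumn")]
    present = set(header_line.rstrip("\n").split("\t"))
    seen, missing = set(), []
    for header in wanted:
        # "Isolate" is mapped twice in the config; report each header once.
        if header not in present and header not in seen:
            missing.append(header)
        seen.add(header)
    return missing


# Compact copy of demo/modules/demoTextFilePopulator.xml
config = r"""<textFilePopulator>
  <columnDelimiterRegex>\t</columnDelimiterRegex>
  <textFileColumn><identifier>true</identifier><header>Isolate</header><property>sequenceID</property></textFileColumn>
  <textFileColumn><header>Isolate</header><property>isolate</property></textFileColumn>
  <textFileColumn><header>Sequence length</header><property>length</property></textFileColumn>
  <textFileColumn><header>Country</header><property>country</property></textFileColumn>
  <textFileColumn><header>Host species</header><property>host_species</property></textFileColumn>
  <textFileColumn><header>Collection year</header><property>collection_year</property></textFileColumn>
</textFilePopulator>"""

# Header row of demo/tabular/fasta-hev-examples.txt
header = "Isolate\tSequence length\tCountry\tHost species\tCollection year"

print(missing_headers(config, header))  # prints []
```

An empty list means every configured header was found. A misspelled column name in the text file would show up here by name.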

We now add these three commands, with their comments, to demoProject.glue, just before the validate step:

  # import the ncbi-refseqs Source containing a single sequence L08816, in GenBank XML format
  import source sources/ncbi-refseqs

  # import the fasta-hev-examples Source containing a set of 10 HEV example sequences, in FASTA format
  import source sources/fasta-hev-examples

  # populate metadata for the sequences in Source fasta-hev-examples
  module demoTextFilePopulator populate -w "source.name = 'fasta-hev-examples'" -f tabular/fasta-hev-examples.txt

Now re-run the project build to check that the data loads correctly.


A reference sequence, features and feature locations

In the next step we will define Features for the three coding regions of the virus genome, and create a ReferenceSequence on which these regions are annotated.

Create these two .glue files in the demo/glue directory:

demo/glue/demoFeatures.glue

# create a feature for each ORF
# indicate that each feature is a protein coding region
# and that each has its own codon numbering scheme.

create feature ORF1
feature ORF1
  set field displayName "ORF 1"
  set metatag CODES_AMINO_ACIDS true
  set metatag OWN_CODON_NUMBERING true
  exit

create feature ORF2
feature ORF2
  set field displayName "ORF 2"
  set metatag CODES_AMINO_ACIDS true
  set metatag OWN_CODON_NUMBERING true
  exit

create feature ORF3
feature ORF3
  set field displayName "ORF 3"
  set metatag CODES_AMINO_ACIDS true
  set metatag OWN_CODON_NUMBERING true
  exit

demo/glue/demoReferenceSequences.glue

# create the reference sequence object based on the sequence object
create reference REF_MASTER_L08816 ncbi-refseqs L08816

# enter reference sequence mode
reference REF_MASTER_L08816

  # add feature locations for each of ORF1, ORF2 and ORF3
  # in each case, enter feature location mode and add a segment specifying
  # where the feature is located on L08816

  add feature-location ORF1
  feature-location ORF1
    add segment 4 5085
    exit

  add feature-location ORF3
  feature-location ORF3
    add segment 5082 5453
    exit

  add feature-location ORF2
  feature-location ORF2
    add segment 5123 7105
    exit

exit

We now add lines to demoProject.glue, to invoke these GLUE files, just before the validate step:

  # define genome features for the project
  run file glue/demoFeatures.glue

  # define the reference sequence based on sequence L08816
  run file glue/demoReferenceSequences.glue

Now re-run the project build to check that these changes work correctly.
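
Because all three features are protein-coding, each segment should span a whole number of codons. The Python sketch below checks this for the coordinates used in demoReferenceSequences.glue, assuming they are 1-based and inclusive. It is an external sanity check with a hypothetical function name, not a GLUE command, and it makes no assumption about whether a stop codon is included in each range.

```python
# Segment coordinates copied from demoReferenceSequences.glue
# (assumed 1-based, inclusive at both ends).
SEGMENTS = {
    "ORF1": (4, 5085),
    "ORF3": (5082, 5453),
    "ORF2": (5123, 7105),
}


def codon_count(start: int, end: int) -> int:
    """Return the number of whole codons in an inclusive segment.

    Raises ValueError if the segment length is not a multiple of 3.
    """
    length = end - start + 1
    if length % 3 != 0:
        raise ValueError(f"segment {start}-{end} has length {length}, not a multiple of 3")
    return length // 3


for name, (start, end) in SEGMENTS.items():
    print(name, codon_count(start, end))
# prints:
# ORF1 1694
# ORF3 124
# ORF2 661
```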


An unconstrained alignment

We will now add an alignment into the project. In this case we will add an unconstrained alignment of all 11 sequences, using a blastFastaAlignmentImporter module to import the alignment from a file.

Copy the single file exampleProject/alignments/demoAlignment.fna into demo/alignments.

Now add lines to demoProject.glue, to import the alignment, just before the validate step:

  # import an unconstrained alignment, relating the fasta-hev-examples sequences to the reference sequence
  module demoAlignmentImporter import AL_UNCONSTRAINED --fileName alignments/demoAlignment.fna

Rebuild the project to check that this step works.


Run a command in the project

Here is an example command which makes use of the project data. It will translate the ORF2 coding region of the IND-HEV-FHF1-2003 example sequence.

Mode path: /
GLUE> project demo
OK
Mode path: /project/demo
GLUE> alignment AL_UNCONSTRAINED member fasta-hev-examples IND-HEV-FHF1-2003
OK
Mode path: /project/demo/alignment/AL_UNCONSTRAINED/member/fasta-hev-examples/IND-HEV-FHF1-2003
GLUE> amino-acid -r REF_MASTER_L08816 -f ORF2
+============+==========+==========+===========+
| codonLabel | memberNt | relRefNt | aminoAcid |
+============+==========+==========+===========+
| 1          | 5147     | 5123     | M         |
| 2          | 5150     | 5126     | R         |
| 3          | 5153     | 5129     | P         |
| 4          | 5156     | 5132     | R         |
| 5          | 5159     | 5135     | P         |
| 6          | 5162     | 5138     | I         |
| 7          | 5165     | 5141     | L         |
| 8          | 5168     | 5144     | L         |
| 9          | 5171     | 5147     | L         |
| 10         | 5174     | 5150     | F         |
| 11         | 5177     | 5153     | L         |
| 12         | 5180     | 5156     | M         |
| 13         | 5183     | 5159     | F         |
| 14         | 5186     | 5162     | L         |
| 15         | 5189     | 5165     | P         |
| 16         | 5192     | 5168     | M         |
| 17         | 5195     | 5171     | L         |
| 18         | 5198     | 5174     | P         |
| 19         | 5201     | 5177     | A         |
+============+==========+==========+===========+
Rows 1 to 19 of 659 [F:first, L:last, P:prev, N:next, Q:quit]
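
The relRefNt column in this output follows directly from the ORF2 feature location: codon 1 starts at reference position 5123 and each later codon starts three nucleotides further on. The Python sketch below reproduces that arithmetic for a few rows of the table. It is only an illustration of how to read the output, with a hypothetical function name, and it assumes the numeric codon labels shown above.

```python
ORF2_START = 5123  # first nucleotide of ORF2 on REF_MASTER_L08816


def rel_ref_nt(codon_label: int, feature_start: int = ORF2_START) -> int:
    """Reference coordinate of the first nucleotide of a codon."""
    return feature_start + 3 * (codon_label - 1)


# (codonLabel, memberNt, relRefNt) rows copied from the output above
rows = [(1, 5147, 5123), (2, 5150, 5126), (10, 5174, 5150), (19, 5201, 5177)]

for label, member_nt, ref_nt in rows:
    assert rel_ref_nt(label) == ref_nt
    # Difference between member and reference coordinates for this row.
    print(label, member_nt - ref_nt)
# prints:
# 1 24
# 2 24
# 10 24
# 19 24
```

In the rows shown, the member coordinate is 24 nucleotides ahead of the reference coordinate. That offset comes from the alignment and need not stay constant across the rest of the region.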

