Dedal: Deep embedding and alignment of protein sequences.

This package contains all the necessary tools to reproduce the experiments presented in the dedal paper.

Install

To install dedal, it is necessary to clone the google-research repo:

git clone https://github.com/google-research/google-research.git

From the google_research folder, you may install the necessary requirements by executing:

pip install -r dedal/requirements.txt

Run

To start training a Dedal network, one can simply run, from the google_research folder, the following command line:

python3 -m dedal.main --base_dir /tmp/dedal/ --task train --gin_config dedal.gin

Note that the transformer architecture may be slow to train on a CPU; running on accelerators greatly improves training speed.

The --base_dir flag specifies the folder where checkpoints are written and metrics are logged; in the example above, it is /tmp/dedal/. To visualize the logged metrics, start a TensorBoard instance pointing to that folder, for example:

%tensorboard --logdir /tmp/dedal

If training is interrupted, re-running the same command resumes from the last available checkpoint rather than starting over from scratch. The checkpointing and logging frequencies can be changed in the gin config, in base.gin.

The task flag selects whether to run a training, an evaluation, or a downstream training with its own eval. In evaluation mode, training checkpoints are loaded on the fly until the last one has been reached, so an eval process can run alongside the training process without slowing the training down. Alternatively, one can set separate_eval=False in the training loop so that eval and train run alternately.
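
For example, assuming the evaluation mode is selected with --task eval (check the flag definition in dedal/main.py for the exact accepted values), a separate eval process pointing at the same base_dir could be launched as:

python3 -m dedal.main --base_dir /tmp/dedal/ --task eval --gin_config dedal.gin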

To experiment with the dedal configuration, for example to change a parameter, swap the encoder, or even add an extra head, take inspiration from the base.gin, dedal.gin and substitution_matrix.gin config files. The first one contains the configuration of the training loop, the data, metrics and losses, while the other two only contain the parts of the network specific to dedal or to substitution-matrix-based sequence alignment methods.

Pre-trained model

DEDAL is available on TensorFlow Hub.

The model expects a tf.Tensor<tf.int32>[2B, 512] as inputs, representing a batch of B sequence pairs to be aligned right-padded to a maximum length of 512, including the special EOS token. Pairs are expected to be arranged consecutively in this batch, that is, inputs[2*b] and inputs[2*b + 1] represent the b-th sequence pair with b ranging from 0 up to B - 1 (inclusive).
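
As a minimal sketch of this layout, assuming infer.preprocess (used in the full example below) returns the [2, 512] token batch for a single pair, two pairs could be batched as follows; the sequences here are arbitrary illustrations:

import tensorflow as tf
from dedal import infer  # Requires google_research/google-research.

# Rows 2*b and 2*b + 1 hold the b-th sequence pair.
pair_0 = infer.preprocess('SVCCRDYVRYRLPLRV', 'VKCKCSRKGPKIRFSN')  # [2, 512]
pair_1 = infer.preprocess('WTSDSCPRPGVVLLTF', 'EIKPRYPFCVEEMIIV')  # [2, 512]
inputs = tf.concat([pair_0, pair_1], axis=0)  # [2B, 512] with B = 2.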

By default, the model runs in "alignment" mode and its output consists of:

  • a tf.Tensor<tf.float32>[B, 512] with the alignment scores
  • a tf.Tensor<tf.float32>[B, 512, 512, 9] representing the predicted alignments
  • a tuple of three tf.Tensor<tf.float32>[B, 512, 512] containing the contextual Smith-Waterman parameters (substitution scores, gap open and gap extend penalties)

Additional signatures are provided to run the model in "embedding" mode, in which case it returns a single tf.Tensor<tf.float32>[2B, 512, 768] with the embeddings of each input sequence.

import tensorflow as tf
import tensorflow_hub as hub
from dedal import infer  # Requires google_research/google-research.

dedal_model = hub.load('https://tfhub.dev/google/dedal/1')

# "Gorilla" and "Mallard" sequences from [1, Figure 3].
protein_a = 'SVCCRDYVRYRLPLRVVKHFYWTSDSCPRPGVVLLTFRDKEICADPRVPWVKMILNKL'
protein_b = 'VKCKCSRKGPKIRFSNVRKLEIKPRYPFCVEEMIIVTLWTRVRGEQQHCLNPKRQNTVRLLKWY'
# Represents sequences as a `tf.Tensor<tf.int32>[2, 512]` batch of tokens.
inputs = infer.preprocess(protein_a, protein_b)

# Aligns `protein_a` to `protein_b`.
scores, path, sw_params = dedal_model(inputs)

# Retrieves per-position embeddings of both sequences.
embeddings = dedal_model.call(inputs, embeddings_only=True)

# Postprocesses output and displays alignment.
output = infer.expand([scores, path, sw_params])
output = infer.postprocess(output, len(protein_a), len(protein_b))
alignment = infer.Alignment(protein_a, protein_b, *output)
print(alignment)
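
As a follow-up to the snippet above, here is a minimal sketch for slicing per-sequence embeddings out of the [2B, 512, 768] tensor, assuming one token per residue followed by EOS and right-padding:

# `embeddings[0]` corresponds to `protein_a`, `embeddings[1]` to `protein_b`.
emb_a = embeddings[0, :len(protein_a)]  # tf.Tensor<tf.float32>[len_a, 768]
emb_b = embeddings[1, :len(protein_b)]  # tf.Tensor<tf.float32>[len_b, 768]

# Example: mean-pool over positions to get one vector per sequence.
pooled_a = tf.reduce_mean(emb_a, axis=0)  # [768]
pooled_b = tf.reduce_mean(emb_b, axis=0)  # [768]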

Data

This repo does not contain real-world data. For illustration purposes, training uses synthetic data sampled on the fly. However, the repo does contain the tools needed to build the datasets to be fed to Dedal for training or eval. Sequence identifiers to reproduce all Pfam-A seed splits can be downloaded here.

License

Licensed under the Apache 2.0 License.

Disclaimer

This is not an official Google product.