Add Metagenomics section #200

Merged
merged 12 commits into from
Feb 16, 2017

Conversation

gailrosen
Contributor

Put in some text that I wrote for my Metagenomics section. I need some feedback. I give thoughts as to why NNs may be better for some applications than others but don't really have any concluding remarks. Is this ok?

@cgreene
Member

cgreene commented Jan 18, 2017

@gailrosen : I have a quick request to facilitate review. Can you reformat to 80 chars/line? When using soft wrapping I can only comment on entire paragraphs. I use the atom text editor, which includes a "reflow selection" feature that will automatically reformat to fit this line length where possible. I think many editors can provide this functionality.

Reworded some of my famously awkward sentences that I tend to generate and reformatted 80 characters/line
@gailrosen
Contributor Author

i think i fixed the 80 char/line now

Member

@cgreene cgreene left a comment

I have a few thoughts but think we need to get @agitter's feedback to best integrate the section with the larger study component. I think it'll probably be this weekend at the earliest before we can get feedback.

downstream analyses. Newer methods hope to classify reads and estimate
relative abundances at faster rates [Vervier] and as of this writing, there
are more than 70 metagenomic taxonomic classifiers in existence. Besides
binning and classification of species, there is functional identification and
Member

What does functional identification mean in this context? Is this the functional potential of a microbial community, the function of a gene or single organism?

Contributor Author

I mean it more in the context of the function of a particular gene (protein family/what metabolic pathways that it participates in...)

appropriate features representing low/high pH, which can provide additional
useful information and new features for future metagenomic sample comparison.
Also, an initial study has show promise of these networks for diagnosing
disease [Faruqi]. However, deep neural networks are not ideal for such
Member

This is structured in a very problem-centric way. For this review, I wonder if organizing this section as something like

  • introductory paragraph to metagenomics
  • places where deep NNs have been successful [with some interpretation as to why]
  • places where deep NNs have not been successful [again maybe with some why]
  • places where deep NNs have not been applied, but you think they should be

would be the way to go. I think all of these materials are essentially in place, so this would involve moving things around and linking components together. @agitter - what's your intuition? I don't want to suggest organization that doesn't work for your section.

Collaborator

I agree with this proposed re-organization and that most of the components are already in place. I don't claim ownership of the Study section so you (@cgreene) and anyone else should feel free to make suggestions and merges.

For several methods presented here, we see that NNs have been applied to a metagenomics problem, but I am left wondering how well the approach worked or whether it was a good idea. The re-organization could help with that.

performed better. Due to the complexity of the problem, neural networks have
been applied more to gene annotation (e.g. Orphelia [Hoff]), which usually
have plenty of training examples. Representations (similar to Word2Vec [ref]
in natural language processing) for protein family classification has been
Member

note for @brettbj - think this connects to some of your thinking.

introduced and classified with a skip-gram neural network [Asgari].
Recurrent neural networks show good performance for homology and protein
family identification [Hochreiter, Sonderby]. Interestingly, Hochreiter, who
invented Long Short Term Memory, delved into homology/protein family
Member

cool!

Collaborator

Need Hochreiter's LSTM reference here?

examples (compared to several thousand fully-sequenced whole-genome
sequences), deep neural networks have been successfully applied to taxonomic
classification of 16S rRNA genes, with convolutional networks outperforming
RNNs and even random forests [Mrzelj].
Member

If you were going to speculate, is there any analysis that deep neural networks could enable that isn't yet in use?

Collaborator

@agitter agitter left a comment

Lots of good content here overall. Thanks for working on this section @gailrosen.

Most neural networks are being used for short sequence->taxa/function
classification, where there is a lot of data for training (and thus suitable
for NNs). And, as a short side note, recurrent neural networks are showing
Collaborator

This base-calling sentence seems out of place. Is there a specific link to metagenomics or should this be moved to our sub-section on sequencing?

examples (compared to several thousand fully-sequenced whole-genome
sequences), deep neural networks have been successfully applied to taxonomic
classification of 16S rRNA genes, with convolutional networks outperforming
RNNs and even random forests [Mrzelj].
Collaborator

Can we be stronger about the neural network performance here? Do the CNNs and RNNs marginally improve upon random forest or is it a more fundamental leap? We want to send a different message if NNs offer something truly novel for the problem versus NNs being one of several good options for taxonomic classification (including those not based on supervised learning).

(sequence composition->phenotype classification). Also, researchers have
looked into how feature selection can improve classification [Liu, Segata],
and techniques have been proposed that are classifier-independent
[Ditzler,Ditzler].
Collaborator

There are many metagenomics problems and methods presented in the first paragraph, and as someone who doesn't work in this area I wasn't sure which were going to be relevant to neural networks and which are presented to introduce the field. Following @cgreene's organization, we might discuss some of these problems and why NNs have been more successful than alternatives. If there are other tasks for which NNs haven't been applied (e.g. relative abundance estimation?), we could either ignore them, present them as an opportunity if we think NNs could work well, or discuss why NNs aren't the right approach.

abundance estimators, they can be useful for faster comparative and other
downstream analyses. Newer methods hope to classify reads and estimate
relative abundances at faster rates [Vervier] and as of this writing, there
are more than 70 metagenomic taxonomic classifiers in existence. Besides
Collaborator

Given 70 existing methods, is this problem solved? If not, are NNs well-posed to address the remaining challenges for some particular reason?

Contributor

i agree that this would be an important point to mention.

In general, I think this section nicely lays the groundwork establishing metagenomics as a study area rife with ML approaches to solve analysis problems. It does read like a "History of Metagenomics Methods" paragraph though, which is not necessarily the flavor of our guiding message for the review (see #88). Perhaps we can leave it for now and begin to condense/synthesize thoughts once more words are put on paper!

however, other methods based on interpolated Markov models [Salzberg] have
performed better. Due to the complexity of the problem, neural networks have
been applied more to gene annotation (e.g. Orphelia [Hoff]), which usually
have plenty of training examples. Representations (similar to Word2Vec [ref]
Collaborator

"has plenty"

measurement of a pore's current signal) for the relatively new Oxford
Nanopore sequencer [Boza]. However, due to small nubmers of metagenomic
samples in studies, neural network uses for classifying phenotype from
microbial composition are just beginning. A standard MLP was able to
Collaborator

Not sure whether we'll have defined MLP previously. @cgreene will that be in the intro?

Also, an initial study has show promise of these networks for diagnosing
disease [Faruqi]. However, deep neural networks are not ideal for such
problems since there are tens of samples (~20->40) available and
hundreds/thousands of features (aka species). Such underdetermined problems
Collaborator

I'll refer to the "wide data" discussion that has come up elsewhere #95. @brettbj is planning to write about this.

training examples than features to sufficiently converge the weights on the
hidden layers.

In fact, due to convergence issues of neural networks, one would think
Collaborator

What is meant by "convergence issues"?

@cgreene cgreene mentioned this pull request Jan 22, 2017
fixed have plenty-> has plenty
more specific about convergence issues
defined the improvement in performance of mrzelj
fixed organization?
@gailrosen
Contributor Author

I reorganized as best I could: I put where it works first, then a bunch of intermediary paragraphs where I don't think it is clear how well it's working and which are a point of contention. Then I end with problems and exciting challenges.

It's a little too difficult for me to fully decouple from the metagenomics problem focus, but I did the best I could.

Contributor

@gwaybio gwaybio left a comment

Three general comments:

  1. The first paragraph reads like a really well thought out history of metagenomics ML methods. Perhaps we could focus more on the problems they are solving and how DL is improving solutions or not? I go into a bit more detail in an inline comment.
  2. Definitely think the future perspective of using DL to study metagenomics is great! I am no expert in metagenomics, but from this it appears to be quite a positive outlook.
  3. Are there any areas where deep learning changes how we can study metagenomics (besides improving detection and classification)?

I am going to make a quick commit removing line 85 and adding this TODO but besides that I think it is good to merge. @agitter and @cgreene what do you think?

and techniques have been proposed that are classifier-independent
[Ditzler,Ditzler].

So, how have neural networks (NNs) been of use? Most neural networks are being
Contributor

How much improvement is gained with deep networks? Do they allow researchers any additional information that more traditional ML approaches do not?

Perhaps this paragraph and the next could be combined with the next two since it looks like they discuss method improvement in detail with specific examples.

Member

@gwaygenomics : I think these are important questions. Going to merge but then we will probably want to come back to this when we revise the draft.

@cgreene cgreene merged commit 67cef75 into greenelab:master Feb 16, 2017
4 participants