Add Metagenomics section #200
@gailrosen: I have a quick request to facilitate review. Can you reformat to 80 chars/line? When using soft wrapping I can only comment on entire paragraphs. I use the
Reworded some of the famously awkward sentences that I tend to generate and reformatted to 80 characters/line.
I think I fixed the 80 chars/line now.
I have a few thoughts but think we need to get @agitter's feedback to best integrate the section with the larger study component. I think it'll probably be this weekend at the earliest before we can get feedback.
downstream analyses. Newer methods hope to classify reads and estimate
relative abundances at faster rates [Vervier] and as of this writing, there
are more than 70 metagenomic taxonomic classifiers in existence. Besides
binning and classification of species, there is functional identification and
What does functional identification mean in this context? Is this the functional potential of a microbial community, the function of a gene or single organism?
I mean it more in the context of the function of a particular gene (protein family/what metabolic pathways that it participates in...)
sections/04_study.md
Outdated
appropriate features representing low/high pH, which can provide additional
useful information and new features for future metagenomic sample comparison.
Also, an initial study has show promise of these networks for diagnosing
disease [Faruqi]. However, deep neural networks are not ideal for such
This is structured in a very problem-centric way. For this review, I wonder if organizing this section as something like
- introductory paragraph to metagenomics
- places where deep NNs have been successful [with some interpretation as to why]
- places where deep NNs have not been successful [again maybe with some why]
- places where deep NNs have not been applied, but you think they should be
would be the way to go. I think all of these materials are essentially in place, so this would involve moving things around and linking components together. @agitter - what's your intuition? I don't want to suggest organization that doesn't work for your section.
I agree with this proposed re-organization and that most of the components are already in place. I don't claim ownership of the Study section so you (@cgreene) and anyone else should feel free to make suggestions and merges.
For several methods presented here, we see that NNs have been applied to a metagenomics problem, but I am left wondering how well the approach worked or whether it was a good idea. The re-organization could help with that.
sections/04_study.md
Outdated
performed better. Due to the complexity of the problem, neural networks have
been applied more to gene annotation (e.g. Orphelia [Hoff]), which usually
have plenty of training examples. Representations (similar to Word2Vec [ref]
in natural language processing) for protein family classification has been
note for @brettbj - think this connects to some of your thinking.
sections/04_study.md
Outdated
introduced and classified with a skip-gram neural network [Asgari].
Recurrent neural networks show good performance for homology and protein
family identification [Hochreiter, Sonderby]. Interestingly, Hochreiter, who
invented Long Short Term Memory, delved into homology/protein family
cool!
Need Hochreiter's LSTM reference here?
sections/04_study.md
Outdated
examples (compared to several thousand fully-sequenced whole-genome
sequences), deep neural networks have been successfully applied to taxonomic
classification of 16S rRNA genes, with convolutional networks outperforming
RNNs and even random forests [Mrzelj].
If you were going to speculate, is there any analysis that deep neural networks could enable that isn't yet in use?
Lots of good content here overall. Thanks for working on this section @gailrosen.
sections/04_study.md
Outdated
Most neural networks are being used for short sequence->taxa/function
classification, where there is a lot of data for training (and thus suitable
for NNs). And, as a short side note, recurrent neural networks are showing
This base-calling sentence seems out of place. Is there a specific link to metagenomics or should this be moved to our sub-section on sequencing?
sections/04_study.md
Outdated
examples (compared to several thousand fully-sequenced whole-genome
sequences), deep neural networks have been successfully applied to taxonomic
classification of 16S rRNA genes, with convolutional networks outperforming
RNNs and even random forests [Mrzelj].
Can we be stronger about the neural network performance here? Do the CNNs and RNNs marginally improve upon random forest or is it a more fundamental leap? We want to send a different message if NNs offer something truly novel for the problem versus NNs being one of several good options for taxonomic classification (including those not based on supervised learning).
sections/04_study.md
Outdated
(sequence composition->phenotype classification). Also, researchers have
looked into how feature selection can improve classification [Liu, Segata],
and techniques have been proposed that are classifier-independent
[Ditzler,Ditzler].
There are many metagenomics problems and methods presented in the first paragraph, and as someone who doesn't work in this area I wasn't sure which were going to be relevant to neural networks and which are presented to introduce the field. Following @cgreene's organization, we might discuss some of these problems and why NNs have been more successful than alternatives. If there are other tasks for which NNs haven't been applied (e.g. relative abundance estimation?), we could either ignore them, present them as an opportunity if we think NNs could work well, or discuss why NNs aren't the right approach.
abundance estimators, they can be useful for faster comparative and other
downstream analyses. Newer methods hope to classify reads and estimate
relative abundances at faster rates [Vervier] and as of this writing, there
are more than 70 metagenomic taxonomic classifiers in existence. Besides
Given 70 existing methods, is this problem solved? If not, are NNs well-posed to address the remaining challenges for some particular reason?
I agree that this would be an important point to mention.
In general, I think this section nicely lays the groundwork establishing metagenomics as a study area rife with ML approaches to solve analysis problems. It does read like a "History of Metagenomics Methods" paragraph though, which is not necessarily the flavor of our guiding message for the review (see #88). Perhaps we can leave it for now and begin to condense/synthesize thoughts once more words are put on paper!
sections/04_study.md
Outdated
however, other methods based on interpolated Markov models [Salzberg] have
performed better. Due to the complexity of the problem, neural networks have
been applied more to gene annotation (e.g. Orphelia [Hoff]), which usually
have plenty of training examples. Representations (similar to Word2Vec [ref]
"has plenty"
sections/04_study.md
Outdated
measurement of a pore's current signal) for the relatively new Oxford
Nanopore sequencer [Boza]. However, due to small nubmers of metagenomic
samples in studies, neural network uses for classifying phenotype from
microbial composition are just beginning. A standard MLP was able to
Not sure whether we'll have defined MLP previously. @cgreene will that be in the intro?
sections/04_study.md
Outdated
Also, an initial study has show promise of these networks for diagnosing
disease [Faruqi]. However, deep neural networks are not ideal for such
problems since there are tens of samples (~20->40) available and
hundreds/thousands of features (aka species). Such underdetermined problems
sections/04_study.md
Outdated
training examples than features to sufficiently converge the weights on the
hidden layers.

In fact, due to convergence issues of neural networks, one would think
What is meant by "convergence issues"?
Fixed "have plenty" -> "has plenty"; was more specific about convergence issues; defined the improvement in performance of Mrzelj.
fixed organization?
I tried the best I could to reorganize, so I put where it works first, then a bunch of intermediary paragraphs where I don't think it is too clear how well it's working and is a point of contention. Then I end with problems and then exciting challenges. It's a little too difficult for me to fully decouple the metagenomics problem focus, but I did the best I could.
Two general comments:
- The first paragraph reads like a really well thought out history of metagenomics ML methods. Perhaps we could focus more on the problems they are solving and how DL is improving solutions or not? I go into a bit more detail in an inline comment.
- Definitely think the future perspective of using DL to study metagenomics is great! I am no expert in metagenomics, but from this it appears to be quite a positive outlook.
- Are there any areas that deep learning changes how we can study metagenomics? (besides improving detection and classification).
I am going to make a quick commit removing line 85 and adding this TODO but besides that I think it is good to merge. @agitter and @cgreene what do you think?
and techniques have been proposed that are classifier-independent
[Ditzler,Ditzler].

So, how have neural networks (NNs) been of use? Most neural networks are being
How much improvement is gained with deep networks? Do they allow researchers any additional information that more traditional ML approaches do not?
Perhaps this paragraph and the next could be combined with the next two since it looks like they discuss method improvement in detail with specific examples.
@gwaygenomics : I think these are important questions. Going to merge but then we will probably want to come back to this when we revise the draft.
Put in some text that I wrote for my Metagenomics section. I need some feedback. I give thoughts as to why NNs may be better for some applications than others but don't really have any concluding remarks. Is this ok?