Add Metagenomics section #200

gailrosen · 2017-01-13T16:51:28Z

Put in some text that I wrote for my Metagenomics section. I need some feedback. I give thoughts as to why NNs may be better for some applications than others but don't really have any concluding remarks. Is this ok?

cgreene · 2017-01-18T20:41:05Z

@gailrosen : I have a quick request to facilitate review. Can you reformat to 80 chars/line? When using soft wrapping I can only comment on entire paragraphs. I use the Atom text editor, which includes a "reflow selection" feature that will automatically reformat to fit this line length where possible. I think many editors can provide this functionality.

Reworded some of my famously awkward sentences that I tend to generate and reformatted 80 characters/line

gailrosen · 2017-01-18T23:03:21Z

I think I fixed the 80 chars/line now.

cgreene

I have a few thoughts but think we need to get @agitter's feedback to best integrate the section with the larger study component. I think it'll probably be this weekend at the earliest before we can get feedback.

cgreene · 2017-01-19T12:58:37Z

sections/04_study.md

+downstream analyses.   Newer methods hope to classify reads and estimate
+relative abundances at faster rates [Vervier] and as of this writing, there
+are more than 70 metagenomic taxonomic classifiers in existence.  Besides
+binning and classification of species, there is functional identification and


What does functional identification mean in this context? Is this the functional potential of a microbial community, the function of a gene or single organism?

I mean it more in the context of the function of a particular gene (its protein family, which metabolic pathways it participates in...)

cgreene · 2017-01-19T13:11:20Z

sections/04_study.md

+appropriate features representing low/high pH, which can provide additional
+useful information and new features for future metagenomic sample comparison.
+ Also, an initial study has show promise of these networks for diagnosing
+disease [Faruqi].  However, deep neural networks are not ideal for such


This is structured in a very problem-centric way. For this review, I wonder if organizing this section as something like the following would be the way to go:

- introductory paragraph to metagenomics
- places where deep NNs have been successful [with some interpretation as to why]
- places where deep NNs have not been successful [again maybe with some why]
- places where deep NNs have not been applied, but you think they should be

I think all of these materials are essentially in place, so this would involve moving things around and linking components together. @agitter - what's your intuition? I don't want to suggest organization that doesn't work for your section.

I agree with this proposed re-organization and that most of the components are already in place. I don't claim ownership of the Study section so you (@cgreene) and anyone else should feel free to make suggestions and merges.

For several methods presented here, we see that NNs have been applied to a metagenomics problem, but I am left wondering how well the approach worked or whether it was a good idea. The re-organization could help with that.

cgreene · 2017-01-19T13:11:57Z

sections/04_study.md

+performed better.  Due to the complexity of the problem, neural networks have
+been applied more to gene annotation (e.g. Orphelia [Hoff]), which usually
+have plenty of training examples.  Representations (similar to Word2Vec [ref]
+in natural language processing) for protein family classification has been


note for @brettbj - think this connects to some of your thinking.

cgreene · 2017-01-19T13:12:18Z

sections/04_study.md

+introduced and classified with a skip-gram neural network [Asgari]. 
+Recurrent neural networks show good performance for homology and protein
+family identification [Hochreiter, Sonderby].  Interestingly, Hochreiter, who
+invented Long Short Term Memory, delved into homology/protein family


Need Hochreiter's LSTM reference here?

cgreene · 2017-01-19T13:12:27Z

sections/04_study.md

+examples (compared to several thousand fully-sequenced whole-genome
+sequences), deep neural networks have been successfully applied to taxonomic
+classification of 16S rRNA genes, with convolutional networks outperforming
+RNNs and even random forests [Mrzelj].


If you were going to speculate, is there any analysis that deep neural networks could enable that isn't yet in use?

agitter

Lots of good content here overall. Thanks for working on this section @gailrosen.

agitter · 2017-01-21T13:45:18Z

sections/04_study.md

+introduced and classified with a skip-gram neural network [Asgari]. 
+Recurrent neural networks show good performance for homology and protein
+family identification [Hochreiter, Sonderby].  Interestingly, Hochreiter, who
+invented Long Short Term Memory, delved into homology/protein family


Need Hochreiter's LSTM reference here?

agitter · 2017-01-21T13:46:34Z

sections/04_study.md

+
+Most neural networks are being used for short sequence->taxa/function
+classification, where there is a lot of data for training (and thus suitable
+for NNs).  And, as a short side note, recurrent neural networks are showing


This base-calling sentence seems out of place. Is there a specific link to metagenomics or should this be moved to our sub-section on sequencing?

agitter · 2017-01-21T13:55:39Z

sections/04_study.md

+appropriate features representing low/high pH, which can provide additional
+useful information and new features for future metagenomic sample comparison.
+ Also, an initial study has show promise of these networks for diagnosing
+disease [Faruqi].  However, deep neural networks are not ideal for such


I agree with this proposed re-organization and that most of the components are already in place. I don't claim ownership of the Study section so you (@cgreene) and anyone else should feel free to make suggestions and merges.

For several methods presented here, we see that NNs have been applied to a metagenomics problem, but I am left wondering how well the approach worked or whether it was a good idea. The re-organization could help with that.

agitter · 2017-01-21T14:01:53Z

sections/04_study.md

+examples (compared to several thousand fully-sequenced whole-genome
+sequences), deep neural networks have been successfully applied to taxonomic
+classification of 16S rRNA genes, with convolutional networks outperforming
+RNNs and even random forests [Mrzelj].


Can we be stronger about the neural network performance here? Do the CNNs and RNNs marginally improve upon random forest or is it a more fundamental leap? We want to send a different message if NNs offer something truly novel for the problem versus NNs being one of several good options for taxonomic classification (including those not based on supervised learning).

agitter · 2017-01-21T14:07:47Z

sections/04_study.md

+(sequence composition->phenotype classification).  Also, researchers have
+looked into how feature selection can improve classification [Liu, Segata],
+and techniques have been proposed that are classifier-independent
+[Ditzler,Ditzler].


There are many metagenomics problems and methods presented in the first paragraph, and as someone who doesn't work in this area I wasn't sure which were going to be relevant to neural networks and which are presented to introduce the field. Following @cgreene's organization, we might discuss some of these problems and why NNs have been more successful than alternatives. If there are other tasks for which NNs haven't been applied (e.g. relative abundance estimation?), we could either ignore them, present them as an opportunity if we think NNs could work well, or discuss why NNs aren't the right approach.

agitter · 2017-01-21T16:05:37Z

sections/04_study.md

+abundance estimators, they can be useful for faster comparative and other
+downstream analyses.   Newer methods hope to classify reads and estimate
+relative abundances at faster rates [Vervier] and as of this writing, there
+are more than 70 metagenomic taxonomic classifiers in existence.  Besides


Given 70 existing methods, is this problem solved? If not, are NNs well positioned to address the remaining challenges for some particular reason?

I agree that this would be an important point to mention.

In general, I think this section nicely lays the groundwork establishing metagenomics as a study area rife with ML approaches to solve analysis problems. It does read like a "History of Metagenomics Methods" paragraph though - which is not necessarily the flavor of our guiding message for the review (see #88). Perhaps we can leave it for now and begin to condense/synthesize thoughts once more words are put on paper!

agitter · 2017-01-21T16:08:32Z

sections/04_study.md

+however, other methods based on interpolated Markov models [Salzberg] have
+performed better.  Due to the complexity of the problem, neural networks have
+been applied more to gene annotation (e.g. Orphelia [Hoff]), which usually
+have plenty of training examples.  Representations (similar to Word2Vec [ref]


"has plenty"

agitter · 2017-01-21T16:11:40Z

sections/04_study.md

+measurement of a pore's current signal) for the relatively new Oxford
+Nanopore sequencer [Boza].  However, due to small nubmers of metagenomic
+samples in studies, neural network uses for classifying phenotype from
+microbial composition are just beginning.   A standard MLP was able to


Not sure whether we'll have defined MLP previously. @cgreene will that be in the intro?

agitter · 2017-01-21T16:17:06Z

sections/04_study.md

+ Also, an initial study has show promise of these networks for diagnosing
+disease [Faruqi].  However, deep neural networks are not ideal for such
+problems since there are tens of samples (~20->40) available and
+hundreds/thousands of features (aka species).  Such underdetermined problems


I'll refer to the "wide data" discussion that has come up elsewhere #95. @brettbj is planning to write about this.

agitter · 2017-01-21T16:18:03Z

sections/04_study.md

+training examples than features to sufficiently converge the weights on the
+hidden layers.
+
+In fact, due to convergence issues of neural networks, one would think


What is meant by "convergence issues"?

fixed have plenty -> has plenty; more specific about convergence issues; defined the improvement in performance of mrzelj

fixed organization?

gailrosen · 2017-02-07T20:00:41Z

I reorganized as best I could: I put where it works first, then a bunch of intermediary paragraphs where I don't think it is clear how well it's working and where that is a point of contention. Then I end with problems and exciting challenges.

It's a little too difficult for me to fully decouple this from the metagenomics problem focus, but I did the best I could.

gwaybio

Two general comments:

The first paragraph reads like a really well thought out history of metagenomics ML methods. Perhaps we could focus more on the problems they are solving and how DL is improving solutions or not? I go into a bit more detail in an inline comment.
I definitely think the future perspective of using DL to study metagenomics is great! I am no expert in metagenomics, but from this it appears to be quite a positive outlook.
Are there any areas where deep learning changes how we can study metagenomics (besides improving detection and classification)?

I am going to make a quick commit removing line 85 and adding this TODO but besides that I think it is good to merge. @agitter and @cgreene what do you think?

gwaybio · 2017-02-08T01:21:30Z

sections/04_study.md

+abundance estimators, they can be useful for faster comparative and other
+downstream analyses.   Newer methods hope to classify reads and estimate
+relative abundances at faster rates [Vervier] and as of this writing, there
+are more than 70 metagenomic taxonomic classifiers in existence.  Besides


I agree that this would be an important point to mention.

In general, I think this section nicely lays the groundwork establishing metagenomics as a study area rife with ML approaches to solve analysis problems. It does read like a "History of Metagenomics Methods" paragraph though - which is not necessarily the flavor of our guiding message for the review (see #88). Perhaps we can leave it for now and begin to condense/synthesize thoughts once more words are put on paper!

gwaybio · 2017-02-08T01:26:27Z

sections/04_study.md

+and techniques have been proposed that are classifier-independent
+[Ditzler,Ditzler].
+
+So, how have neural networks (NNs) been of use?    Most neural networks are being 


How much improvement is gained with deep networks? Do they allow researchers any additional information that more traditional ML approaches do not?

Perhaps this paragraph and the next could be combined with the next two since it looks like they discuss method improvement in detail with specific examples.

@gwaygenomics : I think these are important questions. Going to merge but then we will probably want to come back to this when we revise the draft.

gailrosen added 2 commits January 13, 2017 11:46

Update 04_study.md

26370e1

Update 04_study.md

15673ad

cgreene self-requested a review January 18, 2017 20:41

cgreene mentioned this pull request Jan 18, 2017

the first draft for protein structure prediction #191

Merged

gailrosen added 5 commits January 18, 2017 17:53

Update 04_study.md

e6ceb32

Update 04_study.md

b565c5a

Reworded some of my famously awkward sentences that I tend to generate and reformatted 80 characters/line

Update 04_study.md

ff9f3a6

Update 04_study.md

e2b2b0d

Update 04_study.md

37ac5fa

cgreene reviewed Jan 19, 2017

agitter requested changes Jan 21, 2017

cgreene mentioned this pull request Jan 22, 2017

Current Section Status #188

Closed

gailrosen added 4 commits February 7, 2017 11:59

Update 04_study.md

2087ce4

fixed have plenty -> has plenty; more specific about convergence issues; defined the improvement in performance of mrzelj

Update 04_study.md

4551df2

Update 04_study.md

b03dd39

Update 04_study.md

b069b91

fixed organization?

gwaybio approved these changes Feb 8, 2017

remove reference to author and add todo

897f1a1

cgreene merged commit 67cef75 into greenelab:master Feb 16, 2017
