Covariance-based sample selection for heterogeneous data: Applications to gene expression and autism risk gene detection

doi:10.1080/01621459.2020.1738234

J Am Stat Assoc. 2021;116(533):54-67.

doi: 10.1080/01621459.2020.1738234. Epub 2020 Apr 13.

Kevin Z Lin¹, Han Liu², Kathryn Roeder¹

Affiliations

¹ Carnegie Mellon University, Department of Statistics & Data Science, Pittsburgh, PA.
² Northwestern University, Department of Electrical Engineering and Computer Science, Evanston, IL.

PMID: 33731968
PMCID: PMC7958652
DOI: 10.1080/01621459.2020.1738234

Kevin Z Lin et al. J Am Stat Assoc. 2021.

Abstract

Risk for autism can be influenced by genetic mutations in hundreds of genes. Based on findings showing that genes with highly correlated expression are functionally interrelated, "guilt by association" methods such as DAWN have been developed to identify these autism risk genes. Previous research has analyzed the BrainSpan dataset, which contains gene expression of brain tissues from varying regions and developmental periods. Since the spatiotemporal properties of brain tissue are known to affect the gene expression's covariance, previous research has focused only on a specific subset of samples to avoid the issue of heterogeneity. This restriction leads to a potential loss of power when detecting risk genes. In this article, we develop a new method called COBS (COvariance-Based sample Selection) to find a larger and more homogeneous subset of samples that share the same population covariance matrix for the downstream DAWN analysis. To demonstrate COBS's effectiveness, we use genetic risk scores from two sequential data freezes obtained in 2014 and 2020. We show that COBS improves DAWN's ability to predict risk genes detected in the newer data freeze when using the risk scores of the older data freeze as input.

Keywords: Bootstrap covariance test; Microarray; Multiple testing with dependence.

Figures

**Fig. 1**
(A) 107 microarray samples grouped by the originating 10 brains. This forms 10 different partitions. Since all these partitions originate from the same brain region and developmental period, they are further grouped into the same window. (B) The 57 postmortem brains belong to 4 different developmental periods (columns). Here, PCW stands for post-conceptual weeks. Each brain is dissected and sampled at 4 different brain regions (rows). In total, over the 212 partitions, there are 1294 microarray samples, each measuring the expression of over 13,939 genes. Window 1B (outlined in black) is the window that previous work (Liu et al., 2015) focuses on, and the hierarchical tree from Willsey et al. (2013) is shown to the right. Additional details about the abbreviations are given in Appendix B.

**Fig. 2**
QQ-plots of the 250 p-values generated when applying our diagnostic to the BrainSpan dataset. (A) The diagnostic using only the partitions in Window 1B, showing a moderate amount of heterogeneity. (B) The diagnostic using all 125 partitions in the BrainSpan dataset, showing a larger amount of heterogeneity.

**Fig. 3**
(A) Visualization of an example adjacency matrix that can be formed using (4.3), where the ith row from the top and the ith column from the left denote the ith vertex. A red square in position (i, j) denotes an edge between vertices i and j, and a pale square denotes the lack of an edge. (B) Illustration of the desired goal. The rows and columns are reordered from panel (A), and the dotted box denotes the vertices that were found to form a γ-quasi-clique.

**Fig. 4**
Schematic of Algorithm 4’s implementation. Step 2 can leverage hash tables, which store previous calculations, to see if the union of vertices in a pair of children sets forms a γ-quasi-clique. This lookup has near-constant computational complexity. It can save tremendous computational time, since Step 3, which checks if the union of vertices in both parent sets forms a γ-quasi-clique, has a computational complexity of O(r²).

**Fig. 5**
(Top row) Heatmap visualizations of the empirical covariance matrix of the three partitions, each drawn from a different nonparanormal distribution when β = 0.3. The distributions using Σ⁽¹⁾, Σ⁽²⁾ and Σ⁽³⁾ are shown as the left, middle and right plots respectively. The darker shades of red denote a higher covariance. (Bottom row) Visualizations similar to the top row except for β = 1, so the dissimilarity comparing Σ⁽²⁾ or Σ⁽³⁾ to Σ⁽¹⁾ is increased.

**Fig. 6**
ROC curves for the accepted null hypotheses, for settings where β = (0, 0.3, 0.6, 1), where each curve traces out the results as α varies from 0 to 1. (A) The curves resulting from applying a Bonferroni correction to the $\binom{r}{2}$ individual hypothesis tests. (B) The curves resulting from using our Stepdown method.
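As a worked illustration of the Bonferroni correction in panel (A): with r partitions there is one test per pair, so each test is run at level α divided by the number of pairs. The value r = 125 below is borrowed from the BrainSpan partition count in Fig. 2 purely for illustration; the simulation behind this figure may use a different r.

$$
\alpha_{\text{per-test}} = \frac{\alpha}{\binom{r}{2}}, \qquad \binom{125}{2} = \frac{125 \cdot 124}{2} = 7750, \qquad \frac{0.1}{7750} \approx 1.29 \times 10^{-5}.
$$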

**Fig. 7**
Number of selected partitions for a particular simulated dataset as the number of accepted null hypotheses varies with the FWER level α. (A) Results using our clique-based selection method developed in Subsection 4.2 and spectral clustering. (B) Results using the methods developed in Tsourakakis et al. (2013) and Chen and Saad (2010). See Appendix D for more details of these methods.

**Fig. 8**
(A) ROC curves similar to Figure 6, but for the partitions selected by COBS. (B) The mean spectral error of each method’s downstream estimated covariance matrix for varying β over 25 trials. The four methods to select partitions shown are COBS for α = 0.1 (black), the method that selects all partitions (green), the method that selects a fixed set of 5 partitions (blue), and the method that selects exactly the partitions that contain samples drawn from a nonparanormal distribution with proxy covariance Σ⁽¹⁾ (red).
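The error metric in panel (B) can be sketched as below, under the assumption (not spelled out in the caption) that "spectral error" means the spectral norm of the difference between an estimated covariance matrix and the target covariance. The function names are hypothetical.

```python
import numpy as np

def spectral_error(estimate, target):
    """Spectral norm (largest singular value) of estimate - target."""
    return float(np.linalg.norm(estimate - target, ord=2))

def mean_spectral_error(estimates, target):
    """Average the spectral error over a list of per-trial estimates."""
    return float(np.mean([spectral_error(e, target) for e in estimates]))
```

For example, `spectral_error(np.diag([2.0, 1.0]), np.eye(2))` compares a diagonal estimate against the identity; the difference is `diag(1, 0)`, whose largest singular value is 1.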

**Fig. 9**
(A) The graph G containing all 125 nodes. The red nodes correspond to the 24 selected partitions, while the pale nodes correspond to partitions not selected. (B) The adjacency matrix of a connected component of G, where each row and corresponding column represents a different node, similar to Figure 3. The dotted box denotes the 24 selected nodes that form a γ-quasi-clique.

**Fig. 10**
(A) The number of partitions and samples (n) selected within each window. Partitions from 6 different windows are chosen, and the estimated $\hat{\gamma}_w$ is the empirical fraction of selected partitions within each window. The more vibrant colors display a higher value of $\hat{\gamma}_w$. (B) A QQ-plot of the 250 p-values generated when applying our diagnostic to the 24 selected partitions, similar to Figure 2. While these p-values are slightly left-skewed, the plot suggests that the selected partitions are more homogeneous when compared to their counterparts shown in Figure 2.

**Fig. 11**
Flowchart of how COBS (Stepdown method and clique-based selection method) is used downstream to find risk genes within the DAWN framework. Steps 2 and 3 are taken directly from Liu et al. (2015).

Cited by

Age, sex, and apolipoprotein E isoform alter contextual fear learning, neuronal activation, and baseline DNA damage in the hippocampus.
Boutros SW, Zimmerman B, Nagy SC, Unni VK, Raber J. Mol Psychiatry. 2023 Aug;28(8):3343-3354. doi: 10.1038/s41380-023-01966-8. Epub 2023 Feb 2. PMID: 36732588
