Accelerated cryo-EM structure determination with parallelisation using GPUs in RELION-2

doi:10.7554/eLife.18722

eLife. 2016 Nov 15:5:e18722.

Dari Kimanius¹, Björn O Forsberg¹, Sjors HW Scheres², Erik Lindahl¹,³

Affiliations

¹ Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Stockholm, Sweden.
² MRC Laboratory of Molecular Biology, Cambridge, United Kingdom.
³ Swedish e-Science Research Center, KTH Royal Institute of Technology, Stockholm, Sweden.

PMID: 27845625
PMCID: PMC5310839
DOI: 10.7554/eLife.18722

Dari Kimanius et al. Elife. 2016.

Abstract

By reaching near-atomic resolution for a wide range of specimens, single-particle cryo-EM structure determination is transforming structural biology. However, the necessary calculations come at a large computational cost, which has introduced a bottleneck that currently limits throughput and the development of new methods. Here, we present an implementation of the RELION image processing software that uses graphics processors (GPUs) to address the most computationally intensive steps of its cryo-EM structure determination workflow. Both image classification and high-resolution refinement have been accelerated by more than an order of magnitude, and template-based particle selection has been accelerated by well over two orders of magnitude on desktop hardware. Memory requirements on GPUs have been reduced to fit widely available hardware, and we show that the use of single-precision arithmetic does not adversely affect results. This enables high-resolution cryo-EM structure determination in a matter of days on a single workstation.

Keywords: GPU; biophysics; classification; cryo-EM; image reconstruction; micrograph; refinement; structural biology.

Conflict of interest statement

SHWS: Reviewing editor, eLife.

The other authors declare that no competing interests exist.

Figures

**Figure 1. High-level flowchart of RELION.**
(A) Operations and the real vs. Fourier spaces used during (B) image reconstruction in RELION. Micrograph input and model setup use the CPU, while most subsequent processing steps have been adapted for accelerator hardware. The highlighted orientation-dependent difference calculation is by far the most demanding task, and is fully accelerated. Taking 2D slices out of (and setting them back into) the reference transforms has also been accelerated, with a large gain. The inverse FFT operation has not yet been accelerated and still runs on the CPU. **DOI:** http://dx.doi.org/10.7554/eLife.18722.002

**Figure 2. Extensive task-level parallelism for accelerators.**
Whereas RELION previously exploited parallelism only over images (left), the new implementation expresses classes and all orientations of each class as tasks that can be scheduled independently on the accelerator hardware (e.g. GPUs). Even individual pixels for each orientation can be calculated in parallel, which makes the algorithm highly suited to GPUs. **DOI:** http://dx.doi.org/10.7554/eLife.18722.003

**Figure 3. Semi-automated particle picking in RELION-2.**
The low-pass filter applied to micrographs is a novel feature in RELION, aimed at reducing the size and execution time of the highlighted inverse FFTs, which account for most of the computational work. In addition to the inverse FFTs, all template- and rotation-dependent parallel steps have also been accelerated on GPUs. **DOI:** http://dx.doi.org/10.7554/eLife.18722.004

**Figure 4. RELION-2 enables desktop classification and refinement using GPUs.**
EMPIAR (Iudin et al., 2016) entry 10028 was used to assess performance, using refinements of 105k ribosomal particles in 360²-pixel images. (A) A quad-GPU workstation easily outperforms even a large cluster job in 3D classification. (B) In 2D classification, the GPU desktop performs slightly better in the first few iterations and then provides performance equivalent to the 280 CPU cores. (C) Total time for 25 iterations of 3D classification for a few different hardware configurations. (D) Additional classes are processed at reduced cost compared to CPU-only execution, due to faster execution and increased capacity for latency hiding. (E) With an increasing number of classes, the time spent in non-accelerated vs. accelerated execution increases. (F) The workstation also beats the cluster for single-class refinement to high resolution, despite the generally lower degree of parallelism. This is particularly striking for the finer exhaustive sampling at 3.8° due to the GPU's ability to parallelise the drastically increased number of tasks. **DOI:** http://dx.doi.org/10.7554/eLife.18722.005

**Figure 5. The GPU reconstruction is qualitatively identical to the CPU version.**
(A) A high-resolution refinement of the Plasmodium falciparum 80S ribosome using single-precision GPU arithmetic achieves a gold-standard Fourier shell correlation (FSC) indistinguishable from double-precision CPU-only refinement (previously deposited as EMD-2660). The FSC of full reconstructions comparing the two methods shows that their agreement far exceeds the recoverable signal (grey), and as shown in Figure 5—figure supplement 1 the variation in angle assignments matches the differences between CPU runs with different random seeds. (B) Partial snapshots of the final reconstruction following post-processing, superimposed on PDB ID 3J79 (Wong et al., 2014). **DOI:** http://dx.doi.org/10.7554/eLife.18722.006

**Figure 5—figure supplement 1. The CPU and GPU implementations provide qualitatively identical distributions of image orientations.**
For two CPU runs with different random seeds, 81% of images fall within 1°, and for a GPU vs. CPU run 82%. Note that the probability of observing small angles vanishes since the number of potentially available points is proportional to the sine of the angle, which approaches zero for identical orientations. Both distributions were aligned against the reference refinement by fitting reconstructed models. **DOI:** http://dx.doi.org/10.7554/eLife.18722.007
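The sine argument in this caption can be illustrated with a short sketch. It assumes only the viewing direction (a point on the unit sphere) is compared, ignoring in-plane rotation; the function names are hypothetical and are not taken from RELION.

```python
import math

def ring_weight(theta_deg):
    """Relative number of directions at angular distance theta from a fixed
    direction: the circumference of a ring on the unit sphere scales as sin(theta)."""
    return math.sin(math.radians(theta_deg))

def fraction_within(theta_deg):
    """Fraction of uniformly random directions lying within theta of a fixed
    direction (area of a spherical cap divided by the area of the sphere)."""
    return (1.0 - math.cos(math.radians(theta_deg))) / 2.0

# The weight vanishes as the angle approaches zero, as the caption notes.
for theta in (0.0, 0.1, 0.5, 1.0):
    print(theta, ring_weight(theta))

# For comparison: uniformly random directions would fall within 1 degree
# of a given direction only about 0.008% of the time.
print(fraction_within(1.0))
```

Under this simplification, the 81% and 82% of images within 1° reported above are far from what unrelated orientations would produce, which is consistent with the caption's claim that the two implementations agree.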

**Figure 6. GPU memory requirements.**
(A) The required GPU memory scales linearly with the number of classes. (B) The maximum required GPU memory occurs for single-class refinement to the Nyquist frequency, which increases rapidly with the image size. Horizontal grey lines indicate available GPU memory on different cards. **DOI:** http://dx.doi.org/10.7554/eLife.18722.008

**Figure 7. Low-pass filtering and acceleration of particle picking.**
(A) Ribosomal particles were auto-picked from representative 4096²-pixel micrographs collected at 1.62 Å/pixel using four template classes, showing near-identical picking with and without low-pass filtering to 20 Å. The only differing particle is indicated in orange, and likely does not depict a ribosomal particle. (**B–C**) Despite near-identical particle selection, performance is dramatically improved. (D) Filtering alone provides an almost 20-fold performance improvement on any hardware compared to previous versions of RELION, and when combined with GPU-accelerated particle picking the resulting performance gain is more than two orders of magnitude using only a single GPU (GTX 1080). **DOI:** http://dx.doi.org/10.7554/eLife.18722.009
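To see why low-pass filtering shrinks the FFTs, the following sketch works through the arithmetic for the numbers quoted in this caption. It assumes the filtered micrograph is cropped in Fourier space so that its new Nyquist limit equals the filter cutoff; the helper name and the rounding to an even size are illustrative assumptions, not RELION's documented behaviour.

```python
import math

def fourier_crop_size(box_pixels, pixel_size_angstrom, lowpass_angstrom):
    """Smallest even image size whose Nyquist limit still reaches the low-pass cutoff.

    Hypothetical helper: assumes Fourier cropping to the filter resolution.
    """
    nyquist = 2.0 * pixel_size_angstrom              # finest resolvable period, in Angstrom
    fraction = min(1.0, nyquist / lowpass_angstrom)  # share of the Fourier radius that is kept
    size = math.ceil(box_pixels * fraction)
    return size + (size % 2)                         # round up to an even size

original = 4096
cropped = fourier_crop_size(original, 1.62, 20.0)
print(cropped)                      # 664
print((original / cropped) ** 2)    # reduction in pixel count, roughly 38-fold
```

The pixel-count ratio is only a rough proxy for the work saved in the inverse FFTs; the measured speed-up reported in panel D also depends on steps that do not scale with image size.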

**Figure 8. High-resolution structure determination on a single desktop.**
(A) The resulting 2.2 Å map (deposited as EMD-4116) shows excellent high-resolution density throughout the complex. (B) The most time-consuming steps in the image processing workflow. GPU-accelerated steps are indicated in orange. The total time of image processing was less than that of downloading the data. (C) The resolution estimate is based on the gold-standard FSC after correcting for the convolution effects of a soft solvent mask (black). The FSC between the RELION map and the atomic model in PDB ID 5A1A is shown in orange. The FSC between EMD-2984 and the same atomic model is shown for comparison (dashed grey). **DOI:** http://dx.doi.org/10.7554/eLife.18722.010

**Appendix 1—figure 1. Computational flow in difference calculation kernel.**
The kernel is initiated with $\mathrm{ceil}(P / P_{0})$ thread-blocks and $N$ threads, where $P$ is the total number of projections. The work flow of a thread-block in each iteration $i$ is divided into two stages. In stage A the $N$ pixels of $P_{0}$ reference slices are fetched through texture memory, interpolated, and stored in shared memory. This data is then exhaustively reused in stage B, where groups of threads compute the differences to the corresponding translated image components. Individual threads within a group work with different image components, $n$, of each reference slice, $p$. Collectively all threads iterate through the $N$ components of each reference slice, for a total of $N \times P_{0}$ components for each iteration $i$. The final result is reduced back into shared memory through atomic reduction operations. All image components are covered as $i$ goes from 1 to $\mathrm{ceil}(C / N)$, where $C$ is the total number of Fourier components. A reduced sum of differences for each pair of orientation and translation is written to global memory prior to the kernel exiting. **DOI:** http://dx.doi.org/10.7554/eLife.18722.012
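The tiling described in this caption can be emulated on the CPU to check that every (orientation, translation) pair receives all of its components. This is a minimal sketch, not the RELION kernel: it uses real-valued components instead of complex Fourier components, plain Python loops in place of threads, a dictionary standing in for shared memory, and hypothetical function names.

```python
import math

def launch_geometry(num_projections, slices_per_block, num_components, threads_per_block):
    """Return (thread blocks, iterations per block) as ceil(P / P0) and ceil(C / N)."""
    blocks = math.ceil(num_projections / slices_per_block)
    iterations = math.ceil(num_components / threads_per_block)
    return blocks, iterations

def squared_differences(references, images, threads_per_block, slices_per_block):
    """Sum of squared differences for every (reference slice, translated image) pair.

    references: P lists of C floats; images: T lists of C floats.
    """
    num_proj, num_comp = len(references), len(references[0])
    blocks, iterations = launch_geometry(
        num_proj, slices_per_block, num_comp, threads_per_block)
    result = [[0.0] * len(images) for _ in range(num_proj)]
    for block in range(blocks):
        first = block * slices_per_block
        block_slices = range(first, min(first + slices_per_block, num_proj))
        for i in range(iterations):
            lo = i * threads_per_block
            hi = min(lo + threads_per_block, num_comp)
            # Stage A: stage up to N components of up to P0 slices (stand-in for shared memory).
            shared = {p: references[p][lo:hi] for p in block_slices}
            # Stage B: reuse the staged chunk for every translated image.
            for p in block_slices:
                for t, image in enumerate(images):
                    result[p][t] += sum(
                        (r - x) ** 2 for r, x in zip(shared[p], image[lo:hi]))
    return result

print(launch_geometry(10, 4, 100, 32))  # (3, 4)
```

The point of the two-stage layout is that each staged chunk of reference data is read once and then reused for all translations, which is what the caption means by exhaustive reuse in stage B.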

**Appendix 1—figure 2. A dedicated kernel function performs the targeted fine-grained examination of the most significantly matching regions during image alignment against a reference model.**
The oversampling of each of five fitting dimensions during fine-grained search renders storage of all possible weights intractable, so input and output data are stored with explicit mapping arrays. These are read by the kernel function thread-block, rather than inferred based on block ID. This creates overhead and possible latency of global memory access, which makes this kernel even further separated from the exhaustive kernel represented in Figure 9. Here, a pixel-chunk of a single projection is reused for a number of sequential translations, arranged contiguously if possible. Invoking separate thread blocks for non-contiguous translations allows some implicit indexing of them, which affords better access patterns for SIMD instructions and reduced latency. Due to the sparseness, shared memory can also be used for in-kernel summation of all pixels of each image, which despite some required explicit thread-level synchronisation increases throughput by avoiding the higher latency of atomic write operations during image summation in the coarse-search kernel. **DOI:** http://dx.doi.org/10.7554/eLife.18722.013

**Appendix 2—figure 1. Computational flow of fine-grained search kernel.**
(A) Weighted back-projection of a 2D image into three different planes. We explored two memory access approaches (B) for this task, namely gather and scatter. In the gather approach a process (marked with orange) is assigned to individual or groups of 3D voxels. The process reads from the input image and updates the data of the assigned voxel(s). **DOI:** http://dx.doi.org/10.7554/eLife.18722.014
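The difference between the two access patterns named in this caption can be shown with a toy example. It is a sketch under strong simplifying assumptions: the three planes are taken to be the axis-aligned central planes of a small cubic grid, no interpolation is used, and all function names are hypothetical rather than RELION's.

```python
def voxel_for_pixel(u, v, plane, centre):
    """Forward mapping used by scatter: image pixel -> voxel on the given plane."""
    if plane == "xy":
        return (u, v, centre)
    if plane == "xz":
        return (u, centre, v)
    if plane == "yz":
        return (centre, u, v)
    raise ValueError(plane)

def pixel_for_voxel(x, y, z, plane, centre):
    """Inverse mapping used by gather: voxel -> image pixel, or None if off-plane."""
    if plane == "xy":
        return (x, y) if z == centre else None
    if plane == "xz":
        return (x, z) if y == centre else None
    if plane == "yz":
        return (y, z) if x == centre else None
    raise ValueError(plane)

def empty_volume(n):
    return [[[0.0] * n for _ in range(n)] for _ in range(n)]

def backproject_scatter(image, planes, weight=1.0):
    """One task per input pixel; each task writes into the volume.
    Parallel writers may hit the same voxel, so a GPU version would need atomic adds."""
    n = len(image)
    volume = empty_volume(n)
    for plane in planes:
        for u in range(n):
            for v in range(n):
                x, y, z = voxel_for_pixel(u, v, plane, n // 2)
                volume[x][y][z] += weight * image[u][v]
    return volume

def backproject_gather(image, planes, weight=1.0):
    """One task per output voxel; each task only reads the image and
    updates its own voxel, so no two tasks write to the same location."""
    n = len(image)
    volume = empty_volume(n)
    for x in range(n):
        for y in range(n):
            for z in range(n):
                for plane in planes:
                    pixel = pixel_for_voxel(x, y, z, plane, n // 2)
                    if pixel is not None:
                        volume[x][y][z] += weight * image[pixel[0]][pixel[1]]
    return volume

image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
planes = ["xy", "xz", "yz"]
print(backproject_scatter(image, planes) == backproject_gather(image, planes))  # True
```

Both functions produce the same volume here; the trade-off is that gather visits every voxel, including those far from any plane, while scatter touches only contributing pixels but has to handle write conflicts.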

Cited by

Molecular basis of inhibition of the amino acid transporter B⁰AT1 (SLC6A19).
Xu J, Hu Z, Dai L, Yadav A, Jiang Y, Bröer A, Gardiner MG, McLeod M, Yan R, Bröer S. Nat Commun. 2024 Aug 22;15(1):7224. doi: 10.1038/s41467-024-51748-1. PMID: 39174516 Free PMC article.
Cryo-EM structure of ACE2-SIT1 in complex with tiagabine.
Bröer A, Hu Z, Kukułowicz J, Yadav A, Zhang T, Dai L, Bajda M, Yan R, Bröer S. J Biol Chem. 2024 Sep;300(9):107687. doi: 10.1016/j.jbc.2024.107687. Epub 2024 Aug 17. PMID: 39159813 Free PMC article.
Structures of Mature and Urea-Treated Empty Bacteriophage T5: Insights into Siphophage Infection and DNA Ejection.
Peng Y, Tang H, Xiao H, Chen W, Song J, Zheng J, Liu H. Int J Mol Sci. 2024 Aug 3;25(15):8479. doi: 10.3390/ijms25158479. PMID: 39126049 Free PMC article.
Comparison of structure and immunogenicity of CVB1-VLP and inactivated CVB1 vaccine candidates.
Soppela S, Plavec Z, Gröhn S, Jartti M, Oikarinen S, Laajala M, Marjomaki V, Butcher SJ, Hankaniemi MM. Res Sq [Preprint]. 2024 Jun 28:rs.3.rs-4545395. doi: 10.21203/rs.3.rs-4545395/v1. PMID: 38978565 Free PMC article. Preprint.
Scaffold-enabled high-resolution cryo-EM structure determination of RNA.
Haack DB, Rudolfs B, Jin S, Weeks KM, Toor N. bioRxiv [Preprint]. 2024 Jun 10:2024.06.10.598011. doi: 10.1101/2024.06.10.598011. PMID: 38915706 Free PMC article. Preprint.

References

1. Abraham MJ, Murtola T, Schulz R, Páll S, Smith JC, Hess B, Lindahl E. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX. 2015;1-2:19–25. doi: 10.1016/j.softx.2015.06.001.
2. Bai XC, Rajendra E, Yang G, Shi Y, Scheres SH. Sampling the conformational space of the catalytic subunit of human γ-secretase. eLife. 2015;4:e11182. doi: 10.7554/eLife.11182.
3. Bartesaghi A, Merk A, Banerjee S, Matthies D, Wu X, Milne JL, Subramaniam S. 2.2 Å resolution cryo-EM structure of β-galactosidase in complex with a cell-permeant inhibitor. Science. 2015;348:1147–1151. doi: 10.1126/science.aab1576.
4. Castaño-Díez D, Moser D, Schoenegger A, Pruggnaller S, Frangakis AS. Performance evaluation of image processing algorithms on the GPU. Journal of Structural Biology. 2008;164:153–160. doi: 10.1016/j.jsb.2008.07.006.
5. Chen S, McMullan G, Faruqi AR, Murshudov GN, Short JM, Scheres SH, Henderson R. High-resolution noise substitution to measure overfitting and validate resolution in 3D structure determination by single particle electron cryomicroscopy. Ultramicroscopy. 2013;135:24–35. doi: 10.1016/j.ultramic.2013.06.004.

Grants and funding

MC_UP_A025_1013/MRC_/Medical Research Council/United Kingdom
