Profiling performance #86

Closed
antoinewdg opened this issue Jan 20, 2017 · 26 comments

Comments

@antoinewdg
Contributor

Kind of related to #62 but not quite: I intend to do some serious profiling to see where the bottlenecks are; I'm opening this issue to keep you updated. I think I'll do it on some of the examples and some of the benchmarks.

I'll be starting from this, this and this. If you have any suggestions on what code would be interesting to profile, or on how to do it (this is the first time I'm profiling Rust code), don't hesitate. @Luthaf, I think you told me you already did some kind of profiling, is there anything I can reuse?

@Luthaf
Member

Luthaf commented Jan 20, 2017

I used Instruments.app on OS X, and KCachegrind on Linux with data collected using cachegrind. In both cases I compiled the code in release mode with debug symbols on. The profiling I did led to the changes in 3f2e9bc and 2a43eaf. Good luck with the profiling!

@g-bauer
Contributor

g-bauer commented Jan 20, 2017

Systematic profiling would be great.

MD and MC have different hotspots, so I'd recommend separating the two cases.
Concerning systems, it may be informative to separate topologies and potential (computation) types. By that I mean profiling the difference between atomic and molecular systems, and between exclusively short-ranged potentials (Lennard-Jones) and short-ranged plus Coulomb potentials.

@g-bauer
Contributor

g-bauer commented Jan 25, 2017

This might also be interesting (cachegrind).

@antoinewdg
Contributor Author

Adding this one to the list.

@g-bauer
Contributor

g-bauer commented Feb 1, 2017

Adding this one to the list.

FWIW, the comments in the reddit post concerning this article were very much in favor of perf over valgrind.

Edit: Added link to said post on reddit.

@antoinewdg
Contributor Author

I didn't catch that, thanks!

@antoinewdg
Contributor Author

Adding this to the list of good reads.

@antoinewdg
Contributor Author

I started doing some profiling using perf and flame graphs. The first thing to notice is that even big functions tend to be inlined; for example Ewald::kspace_forces is inlined by default. It is a little painful to have to put #[inline(never)] everywhere by hand.
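For anyone reproducing this, here is a minimal sketch of that trick (the function below is a stand-in, not actual Lumol code): keeping a hot function out-of-line makes it show up as its own frame in the flame graph. The bench also has to be built in release mode with debug symbols (e.g. debug = true under [profile.release] in Cargo.toml) so the symbol names are usable.

```rust
// Sketch only: `expensive_kernel` is a placeholder for a hot function like
// Ewald::kspace_forces. The attribute keeps it out-of-line so perf/flamegraph
// report it as a separate frame instead of folding it into its caller.
#[inline(never)]
fn expensive_kernel(data: &[f64]) -> f64 {
    // Enough work that the optimizer does not remove the call entirely.
    data.iter().map(|x| x.sin().exp()).sum()
}

fn main() {
    let data: Vec<f64> = (0..1_000_000).map(|i| i as f64 * 1e-3).collect();
    let mut total = 0.0;
    for _ in 0..100 {
        total += expensive_kernel(&data);
    }
    println!("{total}");
}
```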

I also see some weird stuff: for example, in the energy_ewald bench for water I get the following flamegraph. You can see that erfc takes a noticeable chunk of the execution time, but what's weird is that it appears outside of Ewald::energy, which should not be the case (right?).

@g-bauer
Contributor

g-bauer commented Mar 14, 2017

You can see that erfc takes a noticeable chunk of the execution time, but what's weird is that it appears outside of Ewald::energy, which should not be the case (right?).

No, AFAIK there should be no erfc outside of the energy computation. This is the CPU graph, right? Did you also check the memory? I found that Lumol currently has a massive memory footprint compared to other codes (for MC at least).

How do I read the graph? When I click a bar for a function, do I see above it the contributions to that function, i.e. which functions it calls?

@antoinewdg
Contributor Author

This is the CPU graph, right? Did you also check the memory?

Yes, this is the CPU graph. I didn't check the memory, do you know of any tool to do that?

How do I read the graph? When I click a bar for a function, do I see above it the contributions to that function, i.e. which functions it calls?

This is exactly it. This is explained in more detail here.

@Luthaf
Member

Luthaf commented Mar 14, 2017

You can see that erfc takes a noticeable chunk of the execution time, but what's weird is that it appears outside of Ewald::energy, which should not be the case (right?).

Yes, that makes no sense. Maybe there is an issue with the flamegraph generation?

I found that Lumol currently has a massive memory footprint compared to other codes (MC at least).

Do you have numbers to share? jemalloc (the default Rust allocator) is known to have a bigger memory footprint than the standard malloc, but to be faster. Also, if the footprint is not too big, I tend not to worry: most classical simulations are CPU bound (this is not the case for quantum calculations), so I happily trade an increase in memory for faster code.

We still need to be careful about memory usage to fit as much data as possible in the cache though.

Yes, this is the CPU graph. I didn't check the memory, do you know of any tool to do that?

I've heard of massif, which is a heap profiler that comes with valgrind.
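As a side note (not something that was on the table at the time): on current Rust, where jemalloc is no longer the default, the allocator can be selected explicitly with the stable #[global_allocator] attribute, which makes it easy to compare footprints under massif. A minimal sketch:

```rust
// Sketch: force the system allocator so a heap-profiling run measures plain
// malloc. To go the other way, the `jemallocator` crate exposes a `Jemalloc`
// type that can be used in the same position.
use std::alloc::System;

#[global_allocator]
static GLOBAL: System = System;

fn main() {
    // All allocations below go through the system allocator.
    let positions: Vec<[f64; 3]> = vec![[0.0; 3]; 1_000_000];
    println!("allocated {} positions", positions.len());
}
```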

@antoinewdg
Contributor Author

@Luthaf, in #114 you mentioned that forces_ewald takes much longer than energy_ewald. However, kspace_forces is clearly O(k^3 n^2) (with k = k_max and n = n_atoms) while the energy is O(n^2) in real space and O(k^3 n) (in density_fft), so I'm not that surprised by the difference. Where can I find the literature for the Ewald summation with forces? (I got the energy from "Understanding Molecular Simulation", but I can't find anything about the forces there.)
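For reference, and in one common convention that may differ from Lumol's (Gaussian units, screening parameter alpha), the textbook Ewald split behind those complexities is:

```latex
E_{\text{Coulomb}} = E_{\text{real}} + E_{\text{recip}} + E_{\text{self}}

E_{\text{real}}  = \frac{1}{2} \sum_{i \neq j} q_i q_j \,
                   \frac{\operatorname{erfc}(\alpha r_{ij})}{r_{ij}}

E_{\text{recip}} = \frac{1}{2V} \sum_{\mathbf{k} \neq 0}
                   \frac{4\pi}{k^2} \, e^{-k^2 / 4\alpha^2} \, |S(\mathbf{k})|^2,
\qquad
S(\mathbf{k}) = \sum_j q_j \, e^{i \mathbf{k} \cdot \mathbf{r}_j}

E_{\text{self}} = -\frac{\alpha}{\sqrt{\pi}} \sum_i q_i^2
```

The forces are the negative gradients of these terms with respect to the particle positions.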

@antoinewdg
Contributor Author

Yes, that makes no sense. Maybe there is an issue with the flamegraph generation?

I'll check the raw output from perf to try to find the origin of the problem. As for memory profiling, I don't think I'm going to do it at the moment, I'd rather tune the CPU perf first.

@Luthaf
Member

Luthaf commented Mar 14, 2017

I remember having a hard time finding expressions for the Ewald forces. I'll check if I can find my sources again (remind me tomorrow if I did not answer here).

I've read in a lot of places that one can tune the alpha parameter for Ewald to get O(n^(3/2)) behaviour (k is not really relevant here, as it will never be bigger than a few tens -- and less than 10 in most cases). I'll try to find something about it too!
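For context, my summary of the standard cost-balance argument behind the O(n^(3/2)) claim, assuming a roughly uniform particle density, a real-space cutoff r_c that is actually exploited, and a fixed target accuracy (so r_c scales as 1/alpha and the number of k vectors k_max^3 scales as alpha^3 V):

```latex
T_{\text{real}}  \propto \frac{n^2 r_c^3}{V} \propto \frac{n^2}{\alpha^3 V},
\qquad
T_{\text{recip}} \propto n \, k_{\max}^3 \propto n \, \alpha^3 V

T_{\text{real}} \approx T_{\text{recip}}
\;\Rightarrow\; \alpha^3 V \propto \sqrt{n}
\;\Rightarrow\; T_{\text{total}} \propto n^{3/2}
```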

@g-bauer
Contributor

g-bauer commented Mar 14, 2017

I'd rather tune the CPU perf first.

Sure, was just wondering if you also looked at memory.

...but I can't find anything about the forces here.

I currently only have a German source. The Fourier-space (k-space) part is equation 4.5, the real-space part is Eq. 4.4, and Eq. 4.6 is the dipole correction. But there are plenty of ways to write the equations, so you will find different formulations depending on where you look.

Edit: Here you go! (complete derivation)

@antoinewdg
Contributor Author

I've read in a lot of places that one can tune the alpha parameter for Ewald to get O(n^(3/2)) behaviour

"Understanding Molecular Simulation" mentions it in section 12.1.5, but as I understand it, this holds under the hypothesis of a uniform distribution of the particles (is that OK?) and when taking advantage of the cutoff in the real-space computation, which we don't really do.

k is not really relevant here, as it will never be bigger than a few tens

Well, 5^3 is already more than 100, which is approximately the performance ratio between forces and energy in your benchmark.

I currently only have a german source.

I'll look at it, thanks! If you find an English version in the meantime I'll take it, my German isn't that good ^^

@g-bauer
Contributor

g-bauer commented Mar 14, 2017

I edited my comment with a link to the complete derivation in English. Can you open the link? Not sure if one needs special access (I'm on the university network at the moment).

@antoinewdg
Contributor Author

Nope, I get a "403 Forbidden".

@g-bauer
Contributor

g-bauer commented Mar 14, 2017

You have mail 😄

@Luthaf
Member

Luthaf commented Mar 14, 2017

You have mail 😄

You beat me on this!

@antoinewdg
Contributor Author

From the looks of it, we'll never get to the O(n^(3/2)) dream for the forces. I even think that if we use the optimized alpha that gives the best complexity for the energy computation, we may end up with O(n^(5/2)) for the forces, which is not cool.

Is the Ewald summation usually used for MD, or is it more of an MC thing?
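A quick check of that estimate using the cost model sketched above, assuming the k-space force loop really stays O(k_max^3 n^2):

```latex
\alpha^3 V \propto \sqrt{n}
\;\Rightarrow\; k_{\max}^3 \propto \sqrt{n}
\;\Rightarrow\; T_{\text{forces}} \propto k_{\max}^3 \, n^2 \propto n^{5/2}
```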

@Luthaf
Member

Luthaf commented Mar 14, 2017

It is the first and crudest technique for coulombic interactions, and it is still used in MD (LAMMPS offers both Ewald and PPPM summations for coulombic interactions).

We also have other techniques for computing coulombic interactions: Particle-Mesh Ewald (PME), Smooth Particle-Mesh Ewald (SPME) and Particle-Particle Particle-Mesh summation (PPPM) are the most common ones. There is a current trend towards using the Wolf method, but it is only good for homogeneous systems.

I think we can still spend some time optimizing Ewald (even if it is just the constant in front of the n^(5/2)), because our implementation is correct (at least for the NIST tests) but very slow compared to other MD codes.

@antoinewdg
Contributor Author

our implementation is correct (at least for the NIST tests) but very slow compared to other MD codes.

Do the other codes use Verlet lists and such? Do we plan to use some?

@Luthaf
Member

Luthaf commented Mar 14, 2017

Do the other codes use Verlet lists and such? Do we plan to use some?

Yes and yes. But the speed difference is already perceptible for small systems (about the same size as the cutoff radius), where from my understanding Verlet lists do not give such a big improvement (every particle in the system is your neighbour!).
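To make that concrete, here is a minimal sketch of the binning step of a cell list for an orthorhombic periodic box (none of this is Lumol's actual API): particles are binned into cells of side >= cutoff, so a pair search only needs to visit the 27 neighbouring cells. If the box is only two or three cutoffs wide, those neighbouring cells already cover the whole box, so nothing is pruned and the bookkeeping is pure overhead.

```rust
// Sketch of a cell list for a periodic orthorhombic box.
// Assumes positions lie in [0, box_len) on each axis.

fn cell_index(cell: [i32; 3], n: [i32; 3]) -> usize {
    // Wrap the cell coordinates periodically and flatten to a single index.
    let wrap = |c: i32, n: i32| ((c % n) + n) % n;
    let (x, y, z) = (wrap(cell[0], n[0]), wrap(cell[1], n[1]), wrap(cell[2], n[2]));
    (x + n[0] * (y + n[1] * z)) as usize
}

/// Bin particles into cells of side >= cutoff.
/// Returns the number of cells per axis and the particle indices in each cell.
fn build_cell_list(
    positions: &[[f64; 3]],
    box_len: [f64; 3],
    cutoff: f64,
) -> ([i32; 3], Vec<Vec<usize>>) {
    let n = [
        (box_len[0] / cutoff).floor().max(1.0) as i32,
        (box_len[1] / cutoff).floor().max(1.0) as i32,
        (box_len[2] / cutoff).floor().max(1.0) as i32,
    ];
    let mut cells = vec![Vec::new(); (n[0] * n[1] * n[2]) as usize];
    for (i, r) in positions.iter().enumerate() {
        let c = [
            (r[0] / box_len[0] * n[0] as f64).floor() as i32,
            (r[1] / box_len[1] * n[1] as f64).floor() as i32,
            (r[2] / box_len[2] * n[2] as f64).floor() as i32,
        ];
        cells[cell_index(c, n)].push(i);
    }
    (n, cells)
}

fn main() {
    // A box only 2.5 cutoffs wide ends up with 2x2x2 cells: every cell is a
    // neighbour of every other one, so the list cannot exclude any pair.
    let positions = vec![[1.0, 1.0, 1.0], [5.0, 5.0, 5.0], [9.0, 2.0, 7.0]];
    let (n, cells) = build_cell_list(&positions, [10.0, 10.0, 10.0], 4.0);
    println!("cells per axis: {:?}", n);
    for (index, members) in cells.iter().enumerate() {
        println!("cell {index}: {members:?}");
    }
}
```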

@g-bauer
Contributor

g-bauer commented Mar 14, 2017

In my experience, cell/Verlet lists are good when systems are beyond a certain size. As Luthaf said, if the box is too small, we will get no advantage, or even a disadvantage due to the overhead.

I wrote this in the closed issue #109: I think there is a problem with the forces at the moment. At least the resulting pressure is weird (running NPT MC at 1 bar gives 8e4 bar as internal pressure). As soon as I know how to track what's going wrong there, I'll open an issue.

@Luthaf
Member

Luthaf commented Dec 10, 2019

Closing this issue since there is not much to be done here. We have #12 for Verlet/cell lists.

@Luthaf Luthaf closed this as completed Dec 10, 2019