Profiling performance #86

Closed
antoinewdg opened this issue Jan 20, 2017 · 26 comments

Comments

@antoinewdg
Contributor

Kind of related to #62 but not quite: I intend to do some serious profiling to see where the bottlenecks are; I'm opening this issue to keep you updated. I think I'll do it on some of the examples and some of the benchmarks.

I'll be starting from this, this and this. If you have any suggestions on what code would be interesting to profile, or on how to do it (this is the first time I'm profiling Rust code), don't hesitate. @Luthaf, I think you told me you already did some kind of profiling, is there anything I can reuse?

@Luthaf
Member

Luthaf commented Jan 20, 2017

I used Instruments.app on OS X, and KCachegrind on Linux with data collected using cachegrind. In both cases I compiled the code in release mode with debug symbols on. The profiling I did led to the changes in 3f2e9bc and 2a43eaf. Good luck with the profiling!

@g-bauer
Contributor

g-bauer commented Jan 20, 2017

Systematic profiling would be great.

MD and MC have different hotspots, so I'd recommend separating the two cases.
Concerning systems, it may be informative to separate topologies and potential (computation) types. By that I mean profiling the difference between atomic and molecular systems, and between exclusively short-ranged potentials (Lennard-Jones) and short-ranged plus Coulomb potentials.

@g-bauer
Contributor

g-bauer commented Jan 25, 2017

This might also be interesting (cachegrind).

@antoinewdg
Contributor Author

Adding this one to the list.

@g-bauer
Contributor

g-bauer commented Feb 1, 2017

Adding this one to the list.

FWIW, the comments in the reddit post concerning this article were very much in favor of perf over valgrind.

Edit: Added link to said post on reddit.

@antoinewdg
Contributor Author

I didn't catch that, thanks!

@antoinewdg
Contributor Author

Adding this to the list of good reads.

@antoinewdg
Contributor Author

I started doing some profiling using perf and flame graphs. The first thing to notice is that even big functions tend to be inlined; for example Ewald::kspace_forces is inlined by default. It is a little painful to have to put #[inline(never)] everywhere by hand.
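For anyone reproducing this, here is a minimal sketch of that trick (the function below is a stand-in, not actual Lumol code): keeping a hot function out-of-line makes it show up as its own frame in the flame graph. The bench also has to be built in release mode with debug symbols (e.g. debug = true under [profile.release] in Cargo.toml) so the symbol names are usable.

```rust
// Sketch only: `expensive_kernel` is a placeholder for a hot function like
// Ewald::kspace_forces. The attribute keeps it out-of-line so perf/flamegraph
// report it as a separate frame instead of folding it into its caller.
#[inline(never)]
fn expensive_kernel(data: &[f64]) -> f64 {
    // Enough work that the optimizer does not remove the call entirely.
    data.iter().map(|x| x.sin().exp()).sum()
}

fn main() {
    let data: Vec<f64> = (0..1_000_000).map(|i| i as f64 * 1e-3).collect();
    let mut total = 0.0;
    for _ in 0..100 {
        total += expensive_kernel(&data);
    }
    println!("{total}");
}
```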

I also see some weird stuff: for example, in the energy_ewald bench for water I get the following flamegraph. You can see that erfc takes a noticeable chunk of the execution time, but what's weird is that it appears outside of Ewald::energy, which should not be the case (right?).

@g-bauer
Contributor

g-bauer commented Mar 14, 2017

You can see that erfc takes a noticeable chunk of the execution time, but what's weird is that it appears outside of Ewald::energy, which should not be the case (right?).

No, AFAIK there should be no erfc outside of the energy computation. This is the CPU graph, right? Did you also check the memory? I found that Lumol currently has a massive memory footprint compared to other codes (for MC at least).

How do I read the graph? When I click a bar for a function, do I see above it the contributions to that function, i.e. which functions it calls?

@antoinewdg
Contributor Author

This is the CPU graph, right? Did you also check the memory?

Yes, this is the CPU graph. I didn't check the memory, do you know of any tool to do that?

How do I read the graph? When I click a bar for a function, do I see above it the contributions to that function, i.e. which functions it calls?

This is exactly it. This is explained in more detail here.

@Luthaf
Member

Luthaf commented Mar 14, 2017

You can see that erfc takes a noticeable chunk of the execution time, but what's weird is that it appears outside of Ewald::energy, which should not be the case (right?).

Yes, that makes no sense. Maybe there is an issue with the flamegraph generation?

I found that Lumol currently has a massive memory footprint compared to other codes (MC at least).

Do you have numbers to share? jemalloc (the default Rust allocator) is known to have a bigger memory footprint than the standard malloc, but to be faster. Also, if the footprint is not too big, I tend not to worry: most classical simulations are CPU bound (this is not the case for quantum calculations), so I happily trade an increase in memory for faster code.

We still need to be careful about memory usage to fit as much data as possible in the cache though.

Yes, this is the CPU graph. I didn't check the memory, do you know of any tool to do that?

I've heard of massif, which is a heap profiler that comes with valgrind.
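As a side note (not something that was on the table at the time): on current Rust, where jemalloc is no longer the default, the allocator can be selected explicitly with the stable #[global_allocator] attribute, which makes it easy to compare footprints under massif. A minimal sketch:

```rust
// Sketch: force the system allocator so a heap-profiling run measures plain
// malloc. To go the other way, the `jemallocator` crate exposes a `Jemalloc`
// type that can be used in the same position.
use std::alloc::System;

#[global_allocator]
static GLOBAL: System = System;

fn main() {
    // All allocations below go through the system allocator.
    let positions: Vec<[f64; 3]> = vec![[0.0; 3]; 1_000_000];
    println!("allocated {} positions", positions.len());
}
```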

@antoinewdg
Contributor Author

@Luthaf, in #114 you mentioned that forces_ewald takes much longer than energy_ewald. However, kspace_forces is clearly O(k^3 n^2) (with k = k_max and n = n_atoms) while the energy is O(n^2) in real space and O(k^3 n) (in density_fft), so I'm not that surprised by the difference. Where can I find the literature for the Ewald summation with forces? (I got the energy from "Understanding Molecular Simulation", but I can't find anything about the forces there.)
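For reference, and in one common convention that may differ from Lumol's (Gaussian units, screening parameter alpha), the textbook Ewald split behind those complexities is:

```latex
E_{\text{Coulomb}} = E_{\text{real}} + E_{\text{recip}} + E_{\text{self}}

E_{\text{real}}  = \frac{1}{2} \sum_{i \neq j} q_i q_j \,
                   \frac{\operatorname{erfc}(\alpha r_{ij})}{r_{ij}}

E_{\text{recip}} = \frac{1}{2V} \sum_{\mathbf{k} \neq 0}
                   \frac{4\pi}{k^2} \, e^{-k^2 / 4\alpha^2} \, |S(\mathbf{k})|^2,
\qquad
S(\mathbf{k}) = \sum_j q_j \, e^{i \mathbf{k} \cdot \mathbf{r}_j}

E_{\text{self}} = -\frac{\alpha}{\sqrt{\pi}} \sum_i q_i^2
```

The forces are the negative gradients of these terms with respect to the particle positions.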

@antoinewdg
Contributor Author

Yes, that makes no sense. Maybe there is an issue with the flamegraph generation?

I'll check the raw output from perf to try to find the origin of the problem. As for memory profiling, I don't think I'm going to do it at the moment, I'd rather tune the CPU perf first.

@Luthaf
Member

Luthaf commented Mar 14, 2017

I remember having a hard time finding expressions for the Ewald forces. I'll check if I can find my sources again (remind me tomorrow if I did not answer here).

I've read in a lot of places that one can tune the alpha parameter for Ewald to get O(n^(3/2)) behaviour (k is not really relevant here, as it will never be bigger than a few tens -- and less than 10 in most cases). I'll try to find something about it too!
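For context, my summary of the standard cost-balance argument behind the O(n^(3/2)) claim, assuming a roughly uniform particle density, a real-space cutoff r_c that is actually exploited, and a fixed target accuracy (so r_c scales as 1/alpha and the number of k vectors k_max^3 scales as alpha^3 V):

```latex
T_{\text{real}}  \propto \frac{n^2 r_c^3}{V} \propto \frac{n^2}{\alpha^3 V},
\qquad
T_{\text{recip}} \propto n \, k_{\max}^3 \propto n \, \alpha^3 V

T_{\text{real}} \approx T_{\text{recip}}
\;\Rightarrow\; \alpha^3 V \propto \sqrt{n}
\;\Rightarrow\; T_{\text{total}} \propto n^{3/2}
```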

@g-bauer
Contributor

g-bauer commented Mar 14, 2017

I'd rather tune the CPU perf first.

Sure, was just wondering if you also looked at memory.

...but I can't find anything about the forces here.

I currently only have a German source. The Fourier-space (k-space) part is equation 4.5, the real-space part is Eq. 4.4, and Eq. 4.6 is the dipole correction. But there are plenty of ways to write the equations, so you will find different formulations depending on where you look.

Edit: Here you go! (complete derivation)

@antoinewdg
Contributor Author

I've read in a lot of places that one can tune the alpha parameter for Ewald to get O(n^(3/2)) behaviour

"Understanding Molecular Simulation" mentions it in section 12.1.5, but as I understand it, this holds under the hypothesis of a uniform distribution of the particles (is that OK?) and when taking advantage of the cutoff in the real-space computation, which we don't really do.

k is not really relevant here, as it will never be bigger than a few tens

Well, 5^3 is already more than 100, which is approximately the performance ratio between forces and energy in your benchmark.

I currently only have a german source.

I'll look at it, thanks! If you find an English version in the meantime I'll take it, my German isn't that good ^^

@g-bauer
Contributor

g-bauer commented Mar 14, 2017

I edited my comment with a link to the complete derivation in English. Can you open the link? Not sure if one needs special access (I'm on the university network at the moment).

@antoinewdg
Contributor Author

Nope, I get a "403 Forbidden".

@g-bauer
Contributor

g-bauer commented Mar 14, 2017

You have mail 😄

@Luthaf
Member

Luthaf commented Mar 14, 2017

You have mail 😄

You beat me on this!

@antoinewdg
Contributor Author

From the looks of it, we'll never get to the O(n^(3/2)) dream for the forces. I even think that if we use the optimized alpha that gives the best complexity for the energy computation, we may end up with O(n^(5/2)) for the forces, which is not cool.

Is the Ewald summation usually used for MD, or is it more of an MC thing?
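A quick check of that estimate using the cost model sketched above, assuming the k-space force loop really stays O(k_max^3 n^2):

```latex
\alpha^3 V \propto \sqrt{n}
\;\Rightarrow\; k_{\max}^3 \propto \sqrt{n}
\;\Rightarrow\; T_{\text{forces}} \propto k_{\max}^3 \, n^2 \propto n^{5/2}
```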

@Luthaf
Member

Luthaf commented Mar 14, 2017

It is the first and crudest technique for coulombic interactions, and it is still used in MD (LAMMPS offers both Ewald and PPPM summations for coulombic interactions).

We also have other techniques for computing coulombic interactions: Particle-Mesh Ewald (PME), Smooth Particle-Mesh Ewald (SPME) and Particle-Particle Particle-Mesh summation (PPPM) are the most common ones. There is a current trend towards using the Wolf method, but it is only good for homogeneous systems.

I think we can still spend some time optimizing Ewald (even if it is just the constant in front of the n^(5/2)), because our implementation is correct (at least for the NIST tests) but very slow compared to other MD codes.

@antoinewdg
Contributor Author

our implementation is correct (at least for the NIST tests) but very slow compared to other MD codes.

Do the other codes use Verlet lists and such? Do we plan to use some?

@Luthaf
Member

Luthaf commented Mar 14, 2017

Do the other codes use Verlet lists and such? Do we plan to use some?

Yes and yes. But the speed difference is already perceptible for small systems (about the same size as the cutoff radius), where from my understanding Verlet lists do not give such a big improvement (every particle in the system is your neighbour!).
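To make that concrete, here is a minimal sketch of the binning step of a cell list for an orthorhombic periodic box (none of this is Lumol's actual API): particles are binned into cells of side >= cutoff, so a pair search only needs to visit the 27 neighbouring cells. If the box is only two or three cutoffs wide, those neighbouring cells already cover the whole box, so nothing is pruned and the bookkeeping is pure overhead.

```rust
// Sketch of a cell list for a periodic orthorhombic box.
// Assumes positions lie in [0, box_len) on each axis.

fn cell_index(cell: [i32; 3], n: [i32; 3]) -> usize {
    // Wrap the cell coordinates periodically and flatten to a single index.
    let wrap = |c: i32, n: i32| ((c % n) + n) % n;
    let (x, y, z) = (wrap(cell[0], n[0]), wrap(cell[1], n[1]), wrap(cell[2], n[2]));
    (x + n[0] * (y + n[1] * z)) as usize
}

/// Bin particles into cells of side >= cutoff.
/// Returns the number of cells per axis and the particle indices in each cell.
fn build_cell_list(
    positions: &[[f64; 3]],
    box_len: [f64; 3],
    cutoff: f64,
) -> ([i32; 3], Vec<Vec<usize>>) {
    let n = [
        (box_len[0] / cutoff).floor().max(1.0) as i32,
        (box_len[1] / cutoff).floor().max(1.0) as i32,
        (box_len[2] / cutoff).floor().max(1.0) as i32,
    ];
    let mut cells = vec![Vec::new(); (n[0] * n[1] * n[2]) as usize];
    for (i, r) in positions.iter().enumerate() {
        let c = [
            (r[0] / box_len[0] * n[0] as f64).floor() as i32,
            (r[1] / box_len[1] * n[1] as f64).floor() as i32,
            (r[2] / box_len[2] * n[2] as f64).floor() as i32,
        ];
        cells[cell_index(c, n)].push(i);
    }
    (n, cells)
}

fn main() {
    // A box only 2.5 cutoffs wide ends up with 2x2x2 cells: every cell is a
    // neighbour of every other one, so the list cannot exclude any pair.
    let positions = vec![[1.0, 1.0, 1.0], [5.0, 5.0, 5.0], [9.0, 2.0, 7.0]];
    let (n, cells) = build_cell_list(&positions, [10.0, 10.0, 10.0], 4.0);
    println!("cells per axis: {:?}", n);
    for (index, members) in cells.iter().enumerate() {
        println!("cell {index}: {members:?}");
    }
}
```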

@g-bauer
Contributor

g-bauer commented Mar 14, 2017

In my experience, cell/Verlet lists are good when systems are beyond a certain size. As Luthaf said, if the box is too small, we will get no advantage, or even a disadvantage due to the overhead.

I wrote this in the closed issue #109: I think there is a problem with the forces at the moment. At least the resulting pressure is weird (running NPT MC at 1 bar gives 8e4 bar as internal pressure). As soon as I know how to track what's going wrong there, I'll open an issue.

@Luthaf
Member

Luthaf commented Dec 10, 2019

Closing this issue since there is not much to be done here. We have #12 for Verlet/cell lists.

@Luthaf Luthaf closed this as completed Dec 10, 2019