Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better benchmark stability #69

Merged
merged 1 commit into from
Jun 13, 2023

Conversation

bclement-ocp
Copy link
Contributor

@bclement-ocp bclement-ocp commented Jun 7, 2023

This is configured with on the command-line with the new --cpus option, which replaces -j and allows to specify a list of cpus or cpu ranges in taskset format, such as 0-3,7-12,15 (strides are not supported).

When the --cpus option is provided, provers will be run in parallel on the provided cpus, with at most one prover running on a given cpu at once.

Note that the --cpus setting only concerns the prover runs (and some glue code around one specific prover run): it does not otherwise restrict the cpus used by benchpress itself. It is recommended to use an external method such as the taskset utility to assign to the toplevel benchpress process a CPU affinity that does not overlap with the prover CPUs provided in --cpus.

When using the --cpus setting, users should be aware that there may still be other processes on the system using these cpus (obviously)! It is thus recommended to use one of the existing techniques to isolate these CPUs from the rest of the system. I know of two ways to do this: the isolcpus kernel parameter, which is deprecated but is slightly easier to use, and cpusets, which are not deprecated but harder to use.

To use isolcpus, simply set the cpus to use for benchmarking as the isolated CPUs on the kernel cmdline and reboot (OK, maybe not that simple for one-shot benchmarking, but fairly easy for a machine that is mostly used for benchmarking). There is no need to set the CPU affinity for the benchpress binary: it will never be scheduled on the isolated CPUs, and neither will any other processes (unless manually required to).

To use cpusets (which is the solution I employed on our benchmark machine at OCamlPro), you should create a system cpuset that contains only the CPUs that will NOT be used by benchpress, and move all processes to that cpuset (this can be done on a running system, consult the cpuset documentation). Then, create another cpuset that contains the CPUs to use for benchpress, including the CPUs to use for the benchpress binary (in practice I use the root cpuset that contains all the CPUs), and run benchpress in that cpuset. You must not forget to use taskset to prevent the benchpress binary from using the CPUs destined for the provers. In practice, I move the shell that I use to run benchpress to that second cpuset.

@bclement-ocp bclement-ocp force-pushed the bclement/cpuaff branch 2 times, most recently from 19a6a1e to 9cafa5b Compare June 7, 2023 15:19
@c-cube
Copy link
Member

c-cube commented Jun 7, 2023

This looks very impressive, I'm not going to pretend I read it all in details but I've been aware of the issue for a while. @Gbury actually had something based on cgroups at one point.

Two points:

  • the ramdisk seems a bit more flimsy to me (see the comment about windows, but also if provers are very verbose, their output is going to start competing with the available RAM for the provers themselves). There must be an option for that I think.
  • You should install ocamlformat (0.24.1) and probably configure autoformat, I saw that the code is not formatted :)

@bclement-ocp
Copy link
Contributor Author

Ah yes I disabled autoformat in my editor because of unmentionable reasons but it is formatted now, apologies.

Regarding Windows, I am not sure benchpress is used on Windows atm (and also don't have access to a system to test), but it should just be a little bit of glue to make this work on Windows (well, and an additional argument, because I believe that there is no equivalent to /dev/shm on Windows and a ram disk must be manually mounted). Of course, by "work", I mean mostly do nothing, because there doesn't seem to be an OCaml library to set process affinity on Windows.

I am not sure I understand your point about the output of the provers competing for the available RAM with the provers themselves because I believe that is already an issue with the current implementation where the output is copied to benchpress using CCIO.read_all.

I agree that having the input and proof files in RAM should be an option however; would you rather I add it to this PR or separate the cpu affinity stuff from the ram stuff?

@bclement-ocp
Copy link
Contributor Author

By the way, cpu affinity is a nice bonus, but from my initial testing it seems like it is copying files to RAM prior to running the prover that eliminated the spurious 100x slowdowns.

@c-cube
Copy link
Member

c-cube commented Jun 7, 2023

the output is copied to benchpress using CCIO.read_all.

Ah! Touché 😁. That's partly due to limitations in the Unix API and threads, I'd say. We could also spawn processes with their output directly targeted at a file on disk, although it's an annoying dance to do from Unix.

it seems like it is copying files to RAM prior to running the prover that eliminated the spurious 100x slowdowns.

Ah, so it was the OS filesystem cache all along, I suppose. If you have the bandwidth, I think 2 PRs would be nice indeed.

This is configured with on the command-line with the new `--cpus`
option, which replaces `-j` and allows to specify a list of cpus or cpu
ranges in taskset format, such as `0-3,7-12,15` (strides are not
supported).

When the `--cpus` option is provided, provers will be run in parallel on
the provided cpus, with at most one prover running on a given cpu at
once.

Note that the `--cpus` setting *only* concerns the prover runs (and some
glue code around one specific prover run): it does *not* otherwise
restrict the cpus used by benchpress itself. It is recommended to use an
external method such as the `taskset` utility to assign to the toplevel
benchpress process a CPU affinity that does not overlap with the prover
CPUs provided in `--cpus`.

When using the `--cpus` setting, users should be aware that *there may
still be other processes on the system using these cpus* (obviously)! It
is thus recommended to use one of the existing techniques to isolate
these CPUs from the rest of the system. I know of two ways to do this:
the `isolcpus` kernel parameter, which is deprecated but is slightly
easier to use, and cpusets, which are not deprecated but harder to use.

To use `isolcpus`, simply set the cpus to use for benchmarking as the
isolated CPUs on the kernel cmdline and reboot (OK, maybe not that
simple for one-shot benchmarking, but fairly easy for a machine that is
mostly used for benchmarking). There is no need to set the CPU affinity
for the `benchpress` binary: it will never be scheduled on the isolated
CPUs, and neither will any other processes (unless manually required
to).

To use cpusets (which is the solution I employed on our benchmark
machine at OCamlPro), you should create a `system` cpuset that contains
only the CPUs that will *NOT* be used by benchpress, and move all
processes to that cpuset (this can be done on a running system, consult
the cpuset documentation). Then, create another cpuset that contains the
CPUs to use for benchpress, including the CPUs to use for the
`benchpress` binary (in practice I use the root cpuset that contains all
the CPUs), and run `benchpress` in that cpuset. You must not forget to
use `taskset` to prevent the `benchpress` binary from using the CPUs
destined for the provers. In practice, I move the shell that I use to
run benchpress to that second cpuset.

This helps with sneeuwballen#58
@bclement-ocp
Copy link
Contributor Author

I have pushed just the CPU affinity changes to this PR and will create a different PR for the ramdisk stuff, but GitHub is not working well currently and does not show the new commit in the PR right now :(

master...bclement-ocp:benchpress:bclement/cpuaff

@bclement-ocp
Copy link
Contributor Author

We could also spawn processes with their output directly targeted at a file on disk, although it's an annoying dance to do from Unix.

The "ramdisk" part of the PR (soon to be another PR) does this dance, but the files are in /dev/shm. It would be fairly easy to write on disk instead (in fact this is what will happen on Windows in the current implementation I have).

@c-cube
Copy link
Member

c-cube commented Jun 8, 2023

Now that we depend on processor, maybe we could have -j auto? Use the number of cores as the level of parallelism :)

@bclement-ocp
Copy link
Contributor Author

I think this is more or less what guess_cpu (I forgot what it is called) in Misc does already? But yes it would be cleaner to use processor for that. Is it OK with you if I do it in a separate PR?

@c-cube
Copy link
Member

c-cube commented Jun 8, 2023 via email

@c-cube
Copy link
Member

c-cube commented Jun 12, 2023

is this ready? :)

@bclement-ocp
Copy link
Contributor Author

Yes, it should be ready for review, sorry if I wasn't clear. I am waiting on this to be merged to do the processor-for--j auto one.

@c-cube c-cube merged commit 80b184f into sneeuwballen:master Jun 13, 2023
@c-cube
Copy link
Member

c-cube commented Jun 13, 2023

All good, thank you :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants